* Re: aic94xx: failing on high load (another data point)
       [not found] ` <20080130091403.GA14887@alaris.suse.cz>
@ 2008-01-30 10:59   ` Keith Hopkins
  2008-01-30 19:29     ` Darrick J. Wong
  0 siblings, 1 reply; 16+ messages in thread
From: Keith Hopkins @ 2008-01-30 10:59 UTC
To: Jan Sembera; +Cc: linux-scsi

On 01/30/2008 05:14 PM, Jan Sembera wrote:
>
> We tried firmware versions V28, V30, and even V32 that is, as
> far as I know, not yet available on adaptec website. All of them were
> unfortunately displaying exactly the same behaviour :-(. Did you get your
> SAS controller working? And if so, with which firmware was that?
>

V28. My controller functions well with a single drive (low-medium load). Unfortunately, all attempts to get the mirrors in sync fail and usually hang the whole box.

--Keith
* Re: aic94xx: failing on high load (another data point)
  2008-01-30 10:59   ` aic94xx: failing on high load (another data point) Keith Hopkins
@ 2008-01-30 19:29     ` Darrick J. Wong
  2008-02-14 16:11       ` Keith Hopkins
  0 siblings, 1 reply; 16+ messages in thread
From: Darrick J. Wong @ 2008-01-30 19:29 UTC
To: Keith Hopkins; +Cc: Jan Sembera, linux-scsi

On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
>
> V28. My controller functions well with a single drive (low-medium load). Unfortunately, all attempts to get the mirrors in sync fail and usually hang the whole box.

Adaptec posted a V30 sequencer on their website; does that fix the
problems?

http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm

--D
* Re: aic94xx: failing on high load (another data point)
  2008-01-30 19:29     ` Darrick J. Wong
@ 2008-02-14 16:11       ` Keith Hopkins
  2008-02-15 15:28         ` James Bottomley
  0 siblings, 1 reply; 16+ messages in thread
From: Keith Hopkins @ 2008-02-14 16:11 UTC
To: Darrick J. Wong, Jan Sembera; +Cc: linux-scsi

[-- Attachment #1: Type: text/plain, Size: 669 bytes --]

On 01/31/2008 03:29 AM, Darrick J. Wong wrote:
> On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
>> V28. My controller functions well with a single drive (low-medium load). Unfortunately, all attempts to get the mirrors in sync fail and usually hang the whole box.
>
> Adaptec posted a V30 sequencer on their website; does that fix the
> problems?
>
> http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm
>

I lost connectivity to the drive again, and had to reboot to recover the drive, so it seemed a good time to try out the V30 firmware. Unfortunately, it didn't work any better. Details are in the attachment.

--Keith

[-- Attachment #2: post-v30-update.txt --]
[-- Type: text/plain, Size: 17928 bytes --]

Running V28 Firmware
Feb 14 21:45:55 titan syslog-ng[28369]: STATS: dropped 60
Feb 14 21:47:59 titan kernel: sd 0:0:1:0: rejecting I/O to offline device
Feb 14 21:47:59 titan kernel: raid1: sdb2: rescheduling sector 0
Feb 14 21:47:59 titan kernel: sd 0:0:1:0: rejecting I/O to offline device
Feb 14 21:47:59 titan kernel: sd 0:0:1:0: rejecting I/O to offline device
Feb 14 21:47:59 titan kernel: raid1: Disk failure on sdb2, disabling device.
Feb 14 21:47:59 titan kernel: Operation continuing on 1 devices
Feb 14 21:47:59 titan kernel: raid1: sda2: redirecting sector 0 to another mirror
Feb 14 21:47:59 titan kernel: RAID1 conf printout:
Feb 14 21:47:59 titan kernel:  --- wd:1 rd:2
Feb 14 21:47:59 titan kernel:  disk 0, wo:1, o:0, dev:sdb2
Feb 14 21:47:59 titan kernel:  disk 1, wo:0, o:1, dev:sda2
Feb 14 21:47:59 titan kernel: RAID1 conf printout:
Feb 14 21:47:59 titan kernel:  --- wd:1 rd:2
Feb 14 21:47:59 titan kernel:  disk 1, wo:0, o:1, dev:sda2
Feb 14 21:50:08 titan smartd[28072]: Device: /dev/sdb, No such device or address, open() failed

V30 Firmware was installed in OS via rpm. Ran mkinitrd and...

== manually reboot to get drive back online ==
(lots of kruft removed)

Linux version 2.6.22.16-0.1-default (geeko@buildhost) (gcc version 4.2.1 (SUSE Linux)) #1 SMP 2008/01/23 14:28:52 UTC
Command line: root=/dev/vgtitan/lvroot vga=0x346 noresume splash=silent PROFILE=default profile=default crashkernel=128M@16M splash=off nosplash
SMP: Allowing 8 CPUs, 0 hotplug CPUs
Kernel command line: root=/dev/vgtitan/lvroot vga=0x346 noresume splash=silent PROFILE=default profile=default crashkernel=128M@16M splash=off nosplash
bootsplash: silent mode.
Initializing CPU#0
time.c: Detected 2327.500 MHz processor.
Memory: 14256200k/15728640k available (2053k kernel code, 422808k reserved, 1017k data, 316k init)
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 4096K
CPU 0/0 -> Node 0
using mwait in idle threads.
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
CPU0: Thermal monitoring handled by SMI
SMP alternatives: switching to UP code
Unpacking initramfs... done
Freeing initrd memory: 4931k freed
ACPI: Core revision 20070126
Brought up 8 CPUs
io scheduler cfq registered (default)
Boot video device is 0000:08:00.0
Freeing unused kernel memory: 316k freed
ACPI Error (dsopcode-0250): No pointer back to NS node in buffer obj ffff8103b0131c60 [20070126]
ACPI Error (psparse-0537): Method parse/execution failed [\_PR_.CPU0._PDC] (Node ffff8103b0ae0770), AE_AML_INTERNAL
ACPI: Processor [CPU0] (supports 8 throttling states)
md: raid1 personality registered for level 1
device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: dm-devel@redhat.com
BIOS EDD facility v0.16 2004-Jun-25, 4 devices found
SCSI subsystem initialized
aic94xx: Adaptec aic94xx SAS/SATA driver version 1.0.3 loaded
ACPI: PCI Interrupt 0000:05:02.0[A] -> GSI 16 (level, low) -> IRQ 16
aic94xx: found Adaptec AIC-9410W SAS/SATA Host Adapter, device 0000:05:02.0
scsi0 : aic94xx
aic94xx: BIOS present (1,1), 1918
aic94xx: ue num:2, ue size:88
aic94xx: manuf sect SAS_ADDR 50000d10002d9380
aic94xx: manuf sect PCBA SN 0BB0C54904VA
aic94xx: ms: num_phy_desc: 8
aic94xx: ms: phy0: ENABLED
aic94xx: ms: phy1: ENABLED
aic94xx: ms: phy2: ENABLED
aic94xx: ms: phy3: ENABLED
aic94xx: ms: phy4: ENABLED
aic94xx: ms: phy5: ENABLED
aic94xx: ms: phy6: ENABLED
aic94xx: ms: phy7: ENABLED
aic94xx: ms: max_phys:0x8, num_phys:0x8
aic94xx: ms: enabled_phys:0xff
aic94xx: ctrla: phy0: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy1: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy2: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy3: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy4: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy5: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy6: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: ctrla: phy7: sas_addr: 50000d10002d9380, sas rate:0x9-0x8, sata rate:0x0-0x0, flags:0x0
aic94xx: max_scbs:512, max_ddbs:128
aic94xx: setting phy0 addr to 50000d10002d9380
aic94xx: setting phy1 addr to 50000d10002d9380
aic94xx: setting phy2 addr to 50000d10002d9380
aic94xx: setting phy3 addr to 50000d10002d9380
aic94xx: setting phy4 addr to 50000d10002d9380
aic94xx: setting phy5 addr to 50000d10002d9380
aic94xx: setting phy6 addr to 50000d10002d9380
aic94xx: setting phy7 addr to 50000d10002d9380
aic94xx: num_edbs:21
aic94xx: num_escbs:3
aic94xx: Found sequencer Firmware version 1.1 (V30)
aic94xx: downloading CSEQ...
aic94xx: dma-ing 8192 bytes
aic94xx: verified 8192 bytes, passed
aic94xx: downloading LSEQs...
aic94xx: dma-ing 14336 bytes
aic94xx: LSEQ0 verified 14336 bytes, passed
aic94xx: LSEQ1 verified 14336 bytes, passed
aic94xx: LSEQ2 verified 14336 bytes, passed
aic94xx: LSEQ3 verified 14336 bytes, passed
aic94xx: LSEQ4 verified 14336 bytes, passed
aic94xx: LSEQ5 verified 14336 bytes, passed
aic94xx: LSEQ6 verified 14336 bytes, passed
aic94xx: LSEQ7 verified 14336 bytes, passed
aic94xx: max_scbs:446
aic94xx: first_scb_site_no:0x20
aic94xx: last_scb_site_no:0x1fe
aic94xx: First SCB dma_handle: 0x3ac6f6000
aic94xx: device 0000:05:02.0: SAS addr 50000d10002d9380, PCBA SN 0BB0C54904VA, 8 phys, 8 enabled phys, flash present, BIOS build 1918
aic94xx: posting 3 escbs
aic94xx: escbs posted
aic94xx: posting 8 control phy scbs
aic94xx: control_phy_tasklet_complete: phy4, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy4: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 10 00 00 08
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 00 c5 00
aic94xx: 10: 01 a5 20 c5
aic94xx: 14: 00 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: asd_form_port: updating phy_mask 0x10 for phy4
aic94xx: control_phy_tasklet_complete: phy6, lrate:0x9, proto:0xe
aic94xx: escb_tasklet_complete: phy6: BYTES_DMAED
aic94xx: SAS proto IDENTIFY:
aic94xx: 00: 10 00 00 08
aic94xx: 04: 00 00 00 00
aic94xx: 08: 00 00 00 00
aic94xx: 0c: 50 00 c5 00
aic94xx: 10: 01 a5 1d e9
aic94xx: 14: 00 00 00 00
aic94xx: 18: 00 00 00 00
aic94xx: asd_form_port: updating phy_mask 0x40 for phy6
aic94xx: control_phy_tasklet_complete: phy0: no device present: oob_status:0x0
aic94xx: control_phy_tasklet_complete: phy1: no device present: oob_status:0x0
aic94xx: control_phy_tasklet_complete: phy2: no device present: oob_status:0x0
aic94xx: control_phy_tasklet_complete: phy3: no device present: oob_status:0x0
aic94xx: control_phy_tasklet_complete: phy5: no device present: oob_status:0x0
aic94xx: control_phy_tasklet_complete: phy7: no device present: oob_status:0x0
sas: phy4 added to port0, phy_mask:0x10
sas: phy6 added to port1, phy_mask:0x40
sas: DOING DISCOVERY on port 0, pid:902
scsi 0:0:0:0: Direct-Access SEAGATE ST3146855SS S513 PQ: 0 ANSI: 5
sas: DONE DISCOVERY on port 0, pid:902, result:0
sas: DOING DISCOVERY on port 1, pid:902
scsi 0:0:1:0: Direct-Access SEAGATE ST3146855SS S513 PQ: 0 ANSI: 5
sas: DONE DISCOVERY on port 1, pid:902, result:0
sd 0:0:0:0: [sda] 286749480 512-byte hardware sectors (146816 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: b3 00 10 08
sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
sd 0:0:0:0: [sda] 286749480 512-byte hardware sectors (146816 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: b3 00 10 08
sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
 sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
sd 0:0:1:0: [sdb] 286749480 512-byte hardware sectors (146816 MB)
sd 0:0:1:0: [sdb] Write Protect is off
sd 0:0:1:0: [sdb] Mode Sense: b3 00 10 08
sd 0:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
sd 0:0:1:0: [sdb] 286749480 512-byte hardware sectors (146816 MB)
sd 0:0:1:0: [sdb] Write Protect is off
sd 0:0:1:0: [sdb] Mode Sense: b3 00 10 08
sd 0:0:1:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
 sdb: sdb1 sdb2 sdb3
sd 0:0:1:0: [sdb] Attached SCSI disk
md: raid0 personality registered for level 0
raid5: automatically using best checksumming function: generic_sse
   generic_sse: 8311.000 MB/sec
raid5: using function: generic_sse (8311.000 MB/sec)
raid6: int64x1 2185 MB/s
raid6: int64x2 2757 MB/s
raid6: int64x4 2672 MB/s
raid6: int64x8 2162 MB/s
raid6: sse2x1 2978 MB/s
raid6: sse2x2 6625 MB/s
raid6: sse2x4 7339 MB/s
raid6: using algorithm sse2x4 (7339 MB/s)
md: raid6 personality registered for level 6
md: raid5 personality registered for level 5
md: raid4 personality registered for level 4
md: md2 stopped.
md: bind<sdb3>
md: bind<sda3>
md: kicking non-fresh sdb3 from array!
md: unbind<sdb3>
md: export_rdev(sdb3)
raid1: raid set md2 active with 1 out of 2 mirrors
md: linear personality registered for level -1
sd 0:0:0:0: Attached scsi generic sg0 type 0
sd 0:0:1:0: Attached scsi generic sg1 type 0
Unable to find swap-space signature
md: md0 stopped.
md: bind<sdb1>
md: bind<sda1>
md: kicking non-fresh sdb1 from array!
md: unbind<sdb1>
md: export_rdev(sdb1)
raid1: raid set md0 active with 1 out of 2 mirrors
md: md1 stopped.
md: bind<sdb2>
md: bind<sda2>
md: kicking non-fresh sdb2 from array!
md: unbind<sdb2>
md: export_rdev(sdb2)
raid1: raid set md1 active with 1 out of 2 mirrors
md: bind<sdb1>
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:1, dev:sdb1
 disk 1, wo:0, o:1, dev:sda1
md: recovery of RAID array md0
md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
md: using 128k window, over a total of 530048 blocks.
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
md: md0: recovery done.
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sdb1
 disk 1, wo:0, o:1, dev:sda1
md: bind<sdb2>
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:1, dev:sdb2
 disk 1, wo:0, o:1, dev:sda2
md: recovery of RAID array md1
md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
md: using 128k window, over a total of 16779776 blocks.
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=4) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=28) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=14) to abort!
md: md1: recovery done.
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sdb2
 disk 1, wo:0, o:1, dev:sda2
md: bind<sdb3>
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:1, dev:sdb3
 disk 1, wo:0, o:1, dev:sda3
md: recovery of RAID array md2
md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
md: using 128k window, over a total of 126053952 blocks.
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x5
aic94xx: escb_tasklet_complete: Can't find task (tc=16) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=4) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=15) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=23) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=12) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=13) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=10) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=16) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=15) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=10) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=11) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=14) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=17) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=9) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=4) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=14) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=15) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=14) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=14) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=5) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=5) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=17) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=8) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=16) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=20) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=7) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=12) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=12) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=7) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=15) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=14) to abort!
aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
aic94xx: escb_tasklet_complete: Can't find task (tc=16) to abort!
md: md2: recovery done.
RAID1 conf printout:
 --- wd:2 rd:2
 disk 0, wo:0, o:1, dev:sdb3
 disk 1, wo:0, o:1, dev:sda3
==EOdmesg==

05:02.0 Serial Attached SCSI controller: Adaptec AIC-9410W SAS (Razor HBA non-RAID) (rev 08)
	Subsystem: Adaptec ASC-48300 (Spirit non-RAID)
	Flags: bus master, 66MHz, slow devsel, latency 32, IRQ 16
	Memory at d0200000 (64-bit, non-prefetchable) [size=256K]
	Memory at d0000000 (64-bit, prefetchable) [size=128K]
	I/O ports at 3000 [size=256]
	[virtual] Expansion ROM at d0080000 [disabled] [size=512K]
	Capabilities: [40] PCI-X non-bridge device
	Capabilities: [58] Power Management version 2
	Capabilities: [e0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/2 Enable-

ACPI: PCI Interrupt 0000:00:02.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:00:03.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:03:00.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:00:04.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:05:02.0[A] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:00:1d.3[D] -> GSI 16 (level, low) -> IRQ 16
ACPI: PCI Interrupt 0000:09:03.0[A] -> GSI 16 (level, low) -> IRQ 16

00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 2 (rev 31)
00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 3 (rev 31)
02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port (rev 01)
03:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 (rev 01)
00:04.0 PCI bridge: Intel Corporation 5000X Chipset PCI Express x16 Port 4-7 (rev 31)
05:02.0 Serial Attached SCSI controller: Adaptec AIC-9410W SAS (Razor HBA non-RAID) (rev 08) <- PCI-X Card / PCI-X/133 Slot
00:1d.3 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #4 (rev 09)
09:03.0 SCSI storage controller: Adaptec AHA-2944UW / AIC-7884U (rev 01) <- old 5v PCI Card
* Re: aic94xx: failing on high load (another data point)
  2008-02-14 16:11       ` Keith Hopkins
@ 2008-02-15 15:28         ` James Bottomley
  2008-02-15 16:28           ` Keith Hopkins
  2008-02-18 14:26           ` Keith Hopkins
  0 siblings, 2 replies; 16+ messages in thread
From: James Bottomley @ 2008-02-15 15:28 UTC
To: Keith Hopkins; +Cc: Darrick J. Wong, Jan Sembera, linux-scsi

On Fri, 2008-02-15 at 00:11 +0800, Keith Hopkins wrote:
> On 01/31/2008 03:29 AM, Darrick J. Wong wrote:
> > On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
> >> V28. My controller functions well with a single drive (low-medium load). Unfortunately, all attempts to get the mirrors in sync fail and usually hang the whole box.
> >
> > Adaptec posted a V30 sequencer on their website; does that fix the
> > problems?
> >
> > http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm
> >
>
> I lost connectivity to the drive again, and had to reboot to recover
> the drive, so it seemed a good time to try out the V30 firmware.
> Unfortunately, it didn't work any better. Details are in the
> attachment.

Well, I can offer some hope. The errors you report:

> aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
> aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!

Are requests by the sequencer to abort a task because of a protocol
error. IBM did some extensive testing with seagate drives and found
that the protocol errors were genuine and the result of drive firmware
problems. IBM released a version of seagate firmware (BA17) to correct
these. Unfortunately, your drive identifies its firmware as S513 which
is likely OEM firmware from another vendor ... however, that vendor may
have an update which corrects the problem.

Of course, the other issue is this:

> aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!

This is a bug in the driver. It's not finding the task in the
outstanding list. The problem seems to be that it's taking the task
from the escb which, by definition, is always NULL. It should be taking
the task from the ascb it finds by looping over the pending queue.

If you're willing, could you try this patch which may correct the
problem? It's sort of like falling off a cliff: if you never go near
the edge (i.e. you upgrade the drive fw) you never fall off;
alternatively, it would be nice if you could help me put up guard rails
just in case.

Thanks,

James

---

diff --git a/drivers/scsi/aic94xx/aic94xx_scb.c b/drivers/scsi/aic94xx/aic94xx_scb.c
index 0febad4..ab35050 100644
--- a/drivers/scsi/aic94xx/aic94xx_scb.c
+++ b/drivers/scsi/aic94xx/aic94xx_scb.c
@@ -458,13 +458,19 @@ static void escb_tasklet_complete(struct asd_ascb *ascb,
 		tc_abort = le16_to_cpu(tc_abort);
 
 		list_for_each_entry_safe(a, b, &asd_ha->seq.pend_q, list) {
-			struct sas_task *task = ascb->uldd_task;
+			struct sas_task *task = a->uldd_task;
+
+			if (a->tc_index != tc_abort)
+				continue;
 
-			if (task && a->tc_index == tc_abort) {
+			if (task) {
 				failed_dev = task->dev;
 				sas_task_abort(task);
-				break;
+			} else {
+				ASD_DPRINTK("R_T_A for non TASK scb 0x%x\n",
+					    a->scb->header.opcode);
 			}
+			break;
 		}
 
 		if (!failed_dev) {
@@ -478,7 +484,7 @@ static void escb_tasklet_complete(struct asd_ascb *ascb,
 		 * that the EH will wake up and do something.
 		 */
 		list_for_each_entry_safe(a, b, &asd_ha->seq.pend_q, list) {
-			struct sas_task *task = ascb->uldd_task;
+			struct sas_task *task = a->uldd_task;
 
 			if (task &&
 			    task->dev == failed_dev &&
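The lookup James describes (match on the transaction context index first, then take the task from the matching pending-queue entry rather than from the event SCB) can be sketched in plain user-space C. This is only an illustration under stated assumptions: `struct pending_scb`, `struct task`, and `find_task_to_abort` are hypothetical stand-ins, and a simple singly linked list replaces the kernel's list helpers.

```c
#include <stddef.h>

/* Hypothetical stand-in for a task owned by a pending entry. */
struct task {
	int id;
};

/* Hypothetical stand-in for one entry on the pending queue. */
struct pending_scb {
	int tc_index;              /* transaction context index */
	struct task *uldd_task;    /* NULL for entries that carry no task */
	struct pending_scb *next;
};

/*
 * Walk the pending queue and return the task owned by the entry whose
 * tc_index equals tc_abort. The task pointer is read from the matching
 * queue entry, never from the event that reported the abort request.
 * *found is set to 1 when an entry matched (even one without a task)
 * and to 0 when no entry matched.
 */
static struct task *find_task_to_abort(struct pending_scb *pend_q,
				       int tc_abort, int *found)
{
	struct pending_scb *a;

	*found = 0;
	for (a = pend_q; a != NULL; a = a->next) {
		if (a->tc_index != tc_abort)
			continue;
		*found = 1;
		return a->uldd_task;
	}
	return NULL;
}
```

Reporting "matched" separately from the returned task mirrors the distinction the patch draws between an entry that carries no task and an index that is not on the queue at all.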
* Re: aic94xx: failing on high load (another data point)
  2008-02-15 15:28         ` James Bottomley
@ 2008-02-15 16:28           ` Keith Hopkins
  2008-02-18 14:26           ` Keith Hopkins
  1 sibling, 0 replies; 16+ messages in thread
From: Keith Hopkins @ 2008-02-15 16:28 UTC
To: James Bottomley; +Cc: linux-scsi

On 02/15/2008 11:28 PM, James Bottomley wrote:
> If you're willing, could you try this patch which may correct the
> problem? It's sort of like falling off a cliff: if you never go near
> the edge (i.e. you upgrade the drive fw) you never fall off;
> alternatively, it would be nice if you could help me put up guard rails
> just in case.

Hi James,

Thanks for your feedback & suggestions. Yes, I'll give the patch a try. It might take a few days to get onto the system. The system/drive isn't IBM, but I'll also see if I can track down a firmware update for the protocol errors.

--Keith
* Re: aic94xx: failing on high load (another data point)
  2008-02-15 15:28         ` James Bottomley
  2008-02-15 16:28           ` Keith Hopkins
@ 2008-02-18 14:26           ` Keith Hopkins
  2008-02-18 16:18             ` James Bottomley
  2008-02-19 16:22             ` James Bottomley
  1 sibling, 2 replies; 16+ messages in thread
From: Keith Hopkins @ 2008-02-18 14:26 UTC
To: James Bottomley; +Cc: Darrick J. Wong, Jan Sembera, linux-scsi

On 02/15/2008 11:28 PM, James Bottomley wrote:
> On Fri, 2008-02-15 at 00:11 +0800, Keith Hopkins wrote:
>> On 01/31/2008 03:29 AM, Darrick J. Wong wrote:
>>> On Wed, Jan 30, 2008 at 06:59:34PM +0800, Keith Hopkins wrote:
>>>> V28. My controller functions well with a single drive (low-medium load). Unfortunately, all attempts to get the mirrors in sync fail and usually hang the whole box.
>>> Adaptec posted a V30 sequencer on their website; does that fix the
>>> problems?
>>>
>>> http://www.adaptec.com/en-US/speed/scsi/linux/aic94xx-seq-30-1_tar_gz.htm
>>>
>> I lost connectivity to the drive again, and had to reboot to recover
>> the drive, so it seemed a good time to try out the V30 firmware.
>> Unfortunately, it didn't work any better. Details are in the
>> attachment.
>
> Well, I can offer some hope. The errors you report:
>
>> aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
>> aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
>
> Are requests by the sequencer to abort a task because of a protocol
> error. IBM did some extensive testing with seagate drives and found
> that the protocol errors were genuine and the result of drive firmware
> problems. IBM released a version of seagate firmware (BA17) to correct
> these. Unfortunately, your drive identifies its firmware as S513 which
> is likely OEM firmware from another vendor ... however, that vendor may
> have an update which corrects the problem.
>
> Of course, the other issue is this:
>
>> aic94xx: escb_tasklet_complete: Can't find task (tc=6) to abort!
>
> This is a bug in the driver. It's not finding the task in the
> outstanding list. The problem seems to be that it's taking the task
> from the escb which, by definition, is always NULL. It should be taking
> the task from the ascb it finds by looping over the pending queue.
>
> If you're willing, could you try this patch which may correct the
> problem? It's sort of like falling off a cliff: if you never go near
> the edge (i.e. you upgrade the drive fw) you never fall off;
> alternatively, it would be nice if you could help me put up guard rails
> just in case.
>

Well, that made life interesting....
but didn't seem to fix anything.

The behavior is about the same as before, but with more verbose errors. I failed one member of the raid and had it rebuild as a test...which hangs for a while and the drive falls off-line.

Please grab the dmesg output in all its gory glory from here:
http://wiki.hopnet.net/dokuwiki/lib/exe/fetch.php?media=myit:sas:dmesg-20080218-wpatch-fail.txt.gz

The drive is a Dell OEM drive, but it's not in a Dell system. There is at least one firmware (S527) upgrade for it, but the Dell loader refuses to load it (because it isn't in a Dell system...)
Does anyone know a generic way to load a new firmware onto a SAS drive?

--Keith
* Re: aic94xx: failing on high load (another data point)
  2008-02-18 14:26           ` Keith Hopkins
@ 2008-02-18 16:18             ` James Bottomley
  2008-02-19 16:22             ` James Bottomley
  1 sibling, 0 replies; 16+ messages in thread
From: James Bottomley @ 2008-02-18 16:18 UTC
To: Keith Hopkins; +Cc: Darrick J. Wong, Jan Sembera, linux-scsi

On Mon, 2008-02-18 at 22:26 +0800, Keith Hopkins wrote:
> Well, that made life interesting....
> but didn't seem to fix anything.
>
> The behavior is about the same as before, but with more verbose
> errors. I failed one member of the raid and had it rebuild as a
> test...which hangs for a while and the drive falls off-line.

Actually, it now finds the task and tries to do error handling for
it ... so we've now uncovered bugs in the error handler. It may not
look like it, but this is actually progress. Although, I'm afraid it's
going to be a bit like peeling an onion: every time one error gets
fixed, you just get to the next layer of errors.

> Please grab the dmesg output in all its gory glory from here:
> http://wiki.hopnet.net/dokuwiki/lib/exe/fetch.php?media=myit:sas:dmesg-20080218-wpatch-fail.txt.gz
>
> The drive is a Dell OEM drive, but it's not in a Dell system. There
> is at least one firmware (S527) upgrade for it, but the Dell loader
> refuses to load it (because it isn't in a Dell system...)
> Does anyone know a generic way to load a new firmware onto a SAS drive?

The firmware upgrade tools are usually vendor specific, though, because
the format of the firmware file is vendor specific. Could you just put
it in a Dell box to upgrade?

James
* Re: aic94xx: failing on high load (another data point)
  2008-02-18 14:26 ` Keith Hopkins
  2008-02-18 16:18 ` James Bottomley
@ 2008-02-19 16:22 ` James Bottomley
  2008-02-19 18:44 ` [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load) Darrick J. Wong
  2008-02-20 3:48 ` aic94xx: failing on high load (another data point) James Bottomley
  1 sibling, 2 replies; 16+ messages in thread
From: James Bottomley @ 2008-02-19 16:22 UTC (permalink / raw)
  To: Keith Hopkins; +Cc: Darrick J. Wong, Jan Sembera, linux-scsi

On Mon, 2008-02-18 at 22:26 +0800, Keith Hopkins wrote:
> Well, that made life interesting....
> but didn't seem to fix anything.
>
> The behavior is about the same as before, but with more verbose
> errors. I failed one member of the raid and had it rebuild as a
> test...which hangs for a while and the drive falls off-line.
>
> Please grab the dmesg output in all its gory glory from here:
> http://wiki.hopnet.net/dokuwiki/lib/exe/fetch.php?media=myit:sas:dmesg-20080218-wpatch-fail.txt.gz

I had a look through this. Amazingly, in spite of the message spew, up
to here:

> sas: Enter sas_scsi_recover_host
> sas: trying to find task 0xffff81033c3d3d80
> sas: sas_scsi_find_task: aborting task 0xffff81033c3d3d80
> aic94xx: tmf timed out
> aic94xx: tmf came back

Everything is going normally (the REQ_TASK_ABORT are properly aborted
and retried). At this point (around L3449 in the trace) the aborts
start failing.

Unfortunately, there's a bug in TMF timeout handling in the driver, it
leaves the sequencer entry pending, but frees the ascb. If the
sequencer ever picks this up it will get very confused, as it does a
while down in the trace:

> aic94xx: BUG:sequencer:dl:no ascb?!
> aic94xx: BUG:sequencer:dl:no ascb?!

That's where the sequencer adds an ascb to the done list that we've
already freed. From this point on confusion reigns and the error
handler eventually offlines the device.

I'll see if I can come up with patches to fix this ... or at least
mitigate the problems it causes.

James

^ permalink raw reply [flat|nested] 16+ messages in thread
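[Editorial sketch, not part of the original message.] The failure mode described above can be modelled in a few lines of C. This is only a toy under stated assumptions: `slot_table`, `slot_submit`, `slot_timeout_free`, `slot_timeout_keep` and `slot_complete` are hypothetical names invented for illustration and are not aic94xx driver functions. A "slot" stands in for an ascb, and `slot_complete` returning false stands in for the "no ascb?!" condition, where the hardware reports a completion for something the driver has already freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TAGS 8

/* Hypothetical life cycle of one request handed to the hardware. */
enum slot_state { SLOT_FREE = 0, SLOT_PENDING, SLOT_TIMED_OUT };

struct slot_table {
	enum slot_state state[MAX_TAGS];
};

/* Hand a request to the hardware; returns its tag, or -1 if no tag is free. */
static int slot_submit(struct slot_table *t)
{
	for (int tag = 0; tag < MAX_TAGS; tag++) {
		if (t->state[tag] == SLOT_FREE) {
			t->state[tag] = SLOT_PENDING;
			return tag;
		}
	}
	return -1;
}

/* Buggy policy: on timeout, free the slot although the hardware still holds it. */
static void slot_timeout_free(struct slot_table *t, int tag)
{
	assert(tag >= 0 && tag < MAX_TAGS);
	t->state[tag] = SLOT_FREE;
}

/* Safer policy: on timeout, mark the slot but leave it owned by the hardware. */
static void slot_timeout_keep(struct slot_table *t, int tag)
{
	assert(tag >= 0 && tag < MAX_TAGS);
	t->state[tag] = SLOT_TIMED_OUT;
}

/*
 * The hardware puts a tag on its done list. Returns false when no live
 * slot matches, which is the toy equivalent of "no ascb?!".
 */
static bool slot_complete(struct slot_table *t, int tag)
{
	if (tag < 0 || tag >= MAX_TAGS || t->state[tag] == SLOT_FREE)
		return false;
	t->state[tag] = SLOT_FREE;
	return true;
}
```

With the buggy policy a late completion finds nothing to complete; with the keep policy the late completion is matched and the slot is released exactly once.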
* [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load)
  2008-02-19 16:22 ` James Bottomley
@ 2008-02-19 18:44 ` Darrick J. Wong
  2008-02-19 18:52 ` James Bottomley
  2008-02-28 14:56 ` Keith Hopkins
  2008-02-20 3:48 ` aic94xx: failing on high load (another data point) James Bottomley
  1 sibling, 2 replies; 16+ messages in thread
From: Darrick J. Wong @ 2008-02-19 18:44 UTC (permalink / raw)
  To: James Bottomley
  Cc: Keith Hopkins, Jan Sembera, linux-scsi, Alexis Bruemmer,
	Peter Bogdanovic, Gilbert Wu

If we send an ABORT_TASK ascb that doesn't return within the timeout period,
we should not free that ascb because the sequencer is still holding onto it.
Hopefully it will fix what James Bottomley describes below:

On Tue, Feb 19, 2008 at 10:22:20AM -0600, James Bottomley wrote:

> Unfortunately, there's a bug in TMF timeout handling in the driver, it
> leaves the sequencer entry pending, but frees the ascb. If the
> sequencer ever picks this up it will get very confused, as it does a
> while down in the trace:
>
> > aic94xx: BUG:sequencer:dl:no ascb?!
> > aic94xx: BUG:sequencer:dl:no ascb?!
>
> That's where the sequencer adds an ascb to the done list that we've
> already freed. From this point on confusion reigns and the error
> handler eventually offlines the device.
>
> I'll see if I can come up with patches to fix this ... or at least
> mitigate the problems it causes.

Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
---

 drivers/scsi/aic94xx/aic94xx_tmf.c | 7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/aic94xx/aic94xx_tmf.c b/drivers/scsi/aic94xx/aic94xx_tmf.c
index b52124f..4b24bd3 100644
--- a/drivers/scsi/aic94xx/aic94xx_tmf.c
+++ b/drivers/scsi/aic94xx/aic94xx_tmf.c
@@ -463,7 +463,7 @@ int asd_abort_task(struct sas_task *task)
 						    AIC94XX_SCB_TIMEOUT);
 		spin_lock_irqsave(&task->task_state_lock, flags);
 		if (leftover < 1)
-			res = TMF_RESP_FUNC_FAILED;
+			goto out_not_reported;
 		if (task->task_state_flags & SAS_TASK_STATE_DONE)
 			res = TMF_RESP_FUNC_COMPLETE;
 		spin_unlock_irqrestore(&task->task_state_lock, flags);
@@ -487,6 +487,11 @@ out:
 	asd_ascb_free(ascb);
 	ASD_DPRINTK("task 0x%p aborted, res: 0x%x\n", task, res);
 	return res;
+
+out_not_reported:
+	spin_unlock_irqrestore(&task->task_state_lock, flags);
+	ASD_DPRINTK("task 0x%p aborted? but not reported.\n", task);
+	return res;
 }
 
 /**

^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load) 2008-02-19 18:44 ` [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load) Darrick J. Wong @ 2008-02-19 18:52 ` James Bottomley 2008-02-28 14:56 ` Keith Hopkins 1 sibling, 0 replies; 16+ messages in thread From: James Bottomley @ 2008-02-19 18:52 UTC (permalink / raw) To: Darrick J. Wong Cc: Keith Hopkins, Jan Sembera, linux-scsi, Alexis Bruemmer, Peter Bogdanovic, Gilbert Wu On Tue, 2008-02-19 at 10:44 -0800, Darrick J. Wong wrote: > If we send an ABORT_TASK ascb that doesn't return within the timeout period, > we should not free that ascb because the sequencer is still holding onto it. > Hopefully it will fix what James Bottomley describes below: > > On Tue, Feb 19, 2008 at 10:22:20AM -0600, James Bottomley wrote: > > > Unfortunately, there's a bug in TMF timeout handling in the driver, it > > leaves the sequencer entry pending, but frees the ascb. If the > > sequencer ever picks this up it will get very confused, as it does a > > while down in the trace: > > > > > aic94xx: BUG:sequencer:dl:no ascb?! > > > aic94xx: BUG:sequencer:dl:no ascb?! > > > > That's where the sequencer adds an ascb to the done list that we've > > already freed. From this point on confusion reigns and the error > > handler eventually offlines the device. > > > > I'll see if I can come up with patches to fix this ... or at least > > mitigate the problems it causes. > > Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Actually, unfortunately, this is only a tiny part of it. The message that triggered all of this is > sas: sas_scsi_find_task: aborting task 0xffff81033c3d3d80 > aic94xx: tmf timed out > aic94xx: tmf came back That's caused by a timeout at asd_enqueue_internal() further up in the code base. James ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load) 2008-02-19 18:44 ` [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load) Darrick J. Wong 2008-02-19 18:52 ` James Bottomley @ 2008-02-28 14:56 ` Keith Hopkins 2008-02-28 16:10 ` James Bottomley 1 sibling, 1 reply; 16+ messages in thread From: Keith Hopkins @ 2008-02-28 14:56 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-scsi On 02/20/2008 02:44 AM, Darrick J. Wong wrote: > If we send an ABORT_TASK ascb that doesn't return within the timeout period, > we should not free that ascb because the sequencer is still holding onto it. > Hopefully it will fix what James Bottomley describes below: > > On Tue, Feb 19, 2008 at 10:22:20AM -0600, James Bottomley wrote: > >> Unfortunately, there's a bug in TMF timeout handling in the driver, it >> leaves the sequencer entry pending, but frees the ascb. If the >> sequencer ever picks this up it will get very confused, as it does a >> while down in the trace: >> >>> aic94xx: BUG:sequencer:dl:no ascb?! >>> aic94xx: BUG:sequencer:dl:no ascb?! >> That's where the sequencer adds an ascb to the done list that we've >> already freed. From this point on confusion reigns and the error >> handler eventually offlines the device. >> >> I'll see if I can come up with patches to fix this ... or at least >> mitigate the problems it causes. > > Signed-off-by: Darrick J. 
Wong <djwong@us.ibm.com> > --- > > drivers/scsi/aic94xx/aic94xx_tmf.c | 7 ++++++- > 1 files changed, 6 insertions(+), 1 deletions(-) > > diff --git a/drivers/scsi/aic94xx/aic94xx_tmf.c b/drivers/scsi/aic94xx/aic94xx_tmf.c > index b52124f..4b24bd3 100644 > --- a/drivers/scsi/aic94xx/aic94xx_tmf.c > +++ b/drivers/scsi/aic94xx/aic94xx_tmf.c > @@ -463,7 +463,7 @@ int asd_abort_task(struct sas_task *task) > AIC94XX_SCB_TIMEOUT); > spin_lock_irqsave(&task->task_state_lock, flags); > if (leftover < 1) > - res = TMF_RESP_FUNC_FAILED; > + goto out_not_reported; > if (task->task_state_flags & SAS_TASK_STATE_DONE) > res = TMF_RESP_FUNC_COMPLETE; > spin_unlock_irqrestore(&task->task_state_lock, flags); > @@ -487,6 +487,11 @@ out: > asd_ascb_free(ascb); > ASD_DPRINTK("task 0x%p aborted, res: 0x%x\n", task, res); > return res; > + > +out_not_reported: > + spin_unlock_irqrestore(&task->task_state_lock, flags); > + ASD_DPRINTK("task 0x%p aborted? but not reported.\n", task); > + return res; > } > > /** > - Hi Darrick, Is this the only patch for ascb sequencer use after free problems, or are you still looking into that? --Keith ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load) 2008-02-28 14:56 ` Keith Hopkins @ 2008-02-28 16:10 ` James Bottomley 0 siblings, 0 replies; 16+ messages in thread From: James Bottomley @ 2008-02-28 16:10 UTC (permalink / raw) To: Keith Hopkins; +Cc: Darrick J. Wong, linux-scsi On Thu, 2008-02-28 at 22:56 +0800, Keith Hopkins wrote: > On 02/20/2008 02:44 AM, Darrick J. Wong wrote: > > If we send an ABORT_TASK ascb that doesn't return within the timeout period, > > we should not free that ascb because the sequencer is still holding onto it. > > Hopefully it will fix what James Bottomley describes below: > > > > On Tue, Feb 19, 2008 at 10:22:20AM -0600, James Bottomley wrote: > > > >> Unfortunately, there's a bug in TMF timeout handling in the driver, it > >> leaves the sequencer entry pending, but frees the ascb. If the > >> sequencer ever picks this up it will get very confused, as it does a > >> while down in the trace: > >> > >>> aic94xx: BUG:sequencer:dl:no ascb?! > >>> aic94xx: BUG:sequencer:dl:no ascb?! > >> That's where the sequencer adds an ascb to the done list that we've > >> already freed. From this point on confusion reigns and the error > >> handler eventually offlines the device. > >> > >> I'll see if I can come up with patches to fix this ... or at least > >> mitigate the problems it causes. > > > > Signed-off-by: Darrick J. 
Wong <djwong@us.ibm.com> > > --- > > > > drivers/scsi/aic94xx/aic94xx_tmf.c | 7 ++++++- > > 1 files changed, 6 insertions(+), 1 deletions(-) > > > > diff --git a/drivers/scsi/aic94xx/aic94xx_tmf.c b/drivers/scsi/aic94xx/aic94xx_tmf.c > > index b52124f..4b24bd3 100644 > > --- a/drivers/scsi/aic94xx/aic94xx_tmf.c > > +++ b/drivers/scsi/aic94xx/aic94xx_tmf.c > > @@ -463,7 +463,7 @@ int asd_abort_task(struct sas_task *task) > > AIC94XX_SCB_TIMEOUT); > > spin_lock_irqsave(&task->task_state_lock, flags); > > if (leftover < 1) > > - res = TMF_RESP_FUNC_FAILED; > > + goto out_not_reported; > > if (task->task_state_flags & SAS_TASK_STATE_DONE) > > res = TMF_RESP_FUNC_COMPLETE; > > spin_unlock_irqrestore(&task->task_state_lock, flags); > > @@ -487,6 +487,11 @@ out: > > asd_ascb_free(ascb); > > ASD_DPRINTK("task 0x%p aborted, res: 0x%x\n", task, res); > > return res; > > + > > +out_not_reported: > > + spin_unlock_irqrestore(&task->task_state_lock, flags); > > + ASD_DPRINTK("task 0x%p aborted? but not reported.\n", task); > > + return res; > > } > > > > /** > > - > > Hi Darrick, > > Is this the only patch for ascb sequencer use after free problems, or are you still looking into that? Sorry, I forgot to cc you. Actually this one is the full one: http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-rc-fixes-2.6.git;a=commit;h=e2396f1e4ecd438a15fa653a028b93e95013caa3 Unfortunately, there are another five patches in that git tree that you'll also need to see if we can get aic94xx working on your box. If you're willing, could you use 2.6.25-rc3 as the base kernel and just apply http://www.kernel.org/pub/linux/kernel/people/jejb/scsi-rc-fixes-2.6.diff On top of it? That should give you a kernel patched with all of the pending aic94xx and libsas fixes. Thanks, James ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: aic94xx: failing on high load (another data point)
  2008-02-19 16:22 ` James Bottomley
  2008-02-19 18:44 ` [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load) Darrick J. Wong
@ 2008-02-20 3:48 ` James Bottomley
  2008-02-20 9:54 ` Keith Hopkins
  1 sibling, 1 reply; 16+ messages in thread
From: James Bottomley @ 2008-02-20 3:48 UTC (permalink / raw)
  To: Keith Hopkins; +Cc: Darrick J. Wong, Jan Sembera, linux-scsi

On Tue, 2008-02-19 at 10:22 -0600, James Bottomley wrote:
> I'll see if I can come up with patches to fix this ... or at least
> mitigate the problems it causes.

Darrick's working on the ascb sequencer use after free problem.

I looked into some of the error handling in libsas, and apparently
that's a bit of a huge screw up too. There are a number of places where
we won't complete a task that is being errored out and thus causes
timeout errors. This patch is actually for libsas to fix all of this.

I've managed to reproduce some of your problem by firing random resets
across a disk under load, and this recovers the protocol errors for me.
However, I can't reproduce the TMF timeout which caused the sequencer
screw up, so you still need to wait for Darrick's fix as well.

James

---

diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index f869fba..b656e29 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -51,8 +51,6 @@ static void sas_scsi_task_done(struct sas_task *task)
 {
 	struct task_status_struct *ts = &task->task_status;
 	struct scsi_cmnd *sc = task->uldd_task;
-	struct sas_ha_struct *sas_ha = SHOST_TO_SAS_HA(sc->device->host);
-	unsigned ts_flags = task->task_state_flags;
 	int hs = 0, stat = 0;
 
 	if (unlikely(!sc)) {
@@ -120,11 +118,7 @@ static void sas_scsi_task_done(struct sas_task *task)
 	sc->result = (hs << 16) | stat;
 	list_del_init(&task->list);
 	sas_free_task(task);
-	/* This is very ugly but this is how SCSI Core works. */
-	if (ts_flags & SAS_TASK_STATE_ABORTED)
-		scsi_eh_finish_cmd(sc, &sas_ha->eh_done_q);
-	else
-		sc->scsi_done(sc);
+	sc->scsi_done(sc);
 }
 
 static enum task_attribute sas_scsi_get_task_attr(struct scsi_cmnd *cmd)
@@ -255,13 +249,27 @@ out:
 	return res;
 }
 
+static void sas_eh_finish_cmd(struct scsi_cmnd *cmd)
+{
+	struct sas_task *task = TO_SAS_TASK(cmd);
+	struct sas_ha_struct *sas_ha = SHOST_TO_SAS_HA(cmd->device->host);
+
+	/* First off call task_done. However, task will
+	 * be free'd after this */
+	task->task_done(task);
+	/* now finish the command and move it on to the error
+	 * handler done list, this also takes it off the
+	 * error handler pending list */
+	scsi_eh_finish_cmd(cmd, &sas_ha->eh_done_q);
+}
+
 static void sas_scsi_clear_queue_lu(struct list_head *error_q, struct scsi_cmnd *my_cmd)
 {
 	struct scsi_cmnd *cmd, *n;
 
 	list_for_each_entry_safe(cmd, n, error_q, eh_entry) {
 		if (cmd == my_cmd)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -274,7 +282,7 @@ static void sas_scsi_clear_queue_I_T(struct list_head *error_q,
 		struct domain_device *x = cmd_to_domain_dev(cmd);
 
 		if (x == dev)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -288,7 +296,7 @@ static void sas_scsi_clear_queue_port(struct list_head *error_q,
 		struct asd_sas_port *x = dev->port;
 
 		if (x == port)
-			list_del_init(&cmd->eh_entry);
+			sas_eh_finish_cmd(cmd);
 	}
 }
 
@@ -528,14 +536,14 @@ Again:
 		case TASK_IS_DONE:
 			SAS_DPRINTK("%s: task 0x%p is done\n", __FUNCTION__,
 				    task);
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			continue;
 		case TASK_IS_ABORTED:
 			SAS_DPRINTK("%s: task 0x%p is aborted\n",
 				    __FUNCTION__, task);
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			continue;
@@ -547,7 +555,7 @@ Again:
 					    "recovered\n",
 					    SAS_ADDR(task->dev),
 					    cmd->device->lun);
-				task->task_done(task);
+				sas_eh_finish_cmd(cmd);
 				if (need_reset)
 					try_to_reset_cmd_device(shost, cmd);
 				sas_scsi_clear_queue_lu(work_q, cmd);
@@ -562,7 +570,7 @@ Again:
 			if (tmf_resp == TMF_RESP_FUNC_COMPLETE) {
 				SAS_DPRINTK("I_T %016llx recovered\n",
 					    SAS_ADDR(task->dev->sas_addr));
-				task->task_done(task);
+				sas_eh_finish_cmd(cmd);
 				if (need_reset)
 					try_to_reset_cmd_device(shost, cmd);
 				sas_scsi_clear_queue_I_T(work_q, task->dev);
@@ -577,7 +585,7 @@ Again:
 				if (res == TMF_RESP_FUNC_COMPLETE) {
 					SAS_DPRINTK("clear nexus port:%d "
 						    "succeeded\n", port->id);
-					task->task_done(task);
+					sas_eh_finish_cmd(cmd);
 					if (need_reset)
 						try_to_reset_cmd_device(shost, cmd);
 					sas_scsi_clear_queue_port(work_q,
@@ -591,10 +599,10 @@ Again:
 				if (res == TMF_RESP_FUNC_COMPLETE) {
 					SAS_DPRINTK("clear nexus ha "
 						    "succeeded\n");
-					task->task_done(task);
+					sas_eh_finish_cmd(cmd);
 					if (need_reset)
 						try_to_reset_cmd_device(shost, cmd);
-					goto out;
+					goto clear_q;
 				}
 			}
 			/* If we are here -- this means that no amount
@@ -606,21 +614,18 @@ Again:
 				    SAS_ADDR(task->dev->sas_addr),
 				    cmd->device->lun);
 
-			task->task_done(task);
+			sas_eh_finish_cmd(cmd);
 			if (need_reset)
 				try_to_reset_cmd_device(shost, cmd);
 			goto clear_q;
 		}
 	}
-out:
 	return list_empty(work_q);
 clear_q:
 	SAS_DPRINTK("--- Exit %s -- clear_q\n", __FUNCTION__);
-	list_for_each_entry_safe(cmd, n, work_q, eh_entry) {
-		struct sas_task *task = TO_SAS_TASK(cmd);
-		list_del_init(&cmd->eh_entry);
-		task->task_done(task);
-	}
+	list_for_each_entry_safe(cmd, n, work_q, eh_entry)
+		sas_eh_finish_cmd(cmd);
+
 	return list_empty(work_q);
 }
 

^ permalink raw reply related [flat|nested] 16+ messages in thread
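[Editorial sketch, not part of the original message.] The idea behind the sas_eh_finish_cmd() helper in the patch above, as its comments describe it, is that every command leaving the error-handler queue gets its task completed and is moved to the done list in a single step, so no exit path can do one without the other. The standalone C below imitates that shape with a toy singly linked queue. It is an illustration under assumptions, not libsas code: `struct cmd`, `struct queue`, `eh_finish_cmd` and `eh_clear_queue` are hypothetical names, and the counter merely stands in for calling `task->task_done()`.

```c
#include <assert.h>
#include <stddef.h>

/* Toy command: the counter stands in for "the task was completed". */
struct cmd {
	struct cmd *next;
	int task_done_calls;
};

struct queue {
	struct cmd *head;
};

static void queue_push(struct queue *q, struct cmd *c)
{
	c->next = q->head;
	q->head = c;
}

/* Unlink c from q. Returns 1 if it was on the queue, 0 otherwise. */
static int queue_remove(struct queue *q, struct cmd *c)
{
	for (struct cmd **pp = &q->head; *pp; pp = &(*pp)->next) {
		if (*pp == c) {
			*pp = c->next;
			c->next = NULL;
			return 1;
		}
	}
	return 0;
}

static size_t queue_len(const struct queue *q)
{
	size_t n = 0;

	for (const struct cmd *c = q->head; c; c = c->next)
		n++;
	return n;
}

/* Complete the task and move the command from pending to done together. */
static void eh_finish_cmd(struct queue *pending, struct queue *done,
			  struct cmd *c)
{
	int found = queue_remove(pending, c);

	assert(found);
	(void)found;
	c->task_done_calls++;
	queue_push(done, c);
}

/* Give up on everything still pending: finish each command exactly once. */
static void eh_clear_queue(struct queue *pending, struct queue *done)
{
	while (pending->head)
		eh_finish_cmd(pending, done, pending->head);
}
```

Because the unlink, the completion and the move to the done queue live in one helper, the flush loop and the individual recovery paths cannot drift apart, which is the class of mistake the message above says led to uncompleted tasks and timeouts.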
* Re: aic94xx: failing on high load (another data point) 2008-02-20 3:48 ` aic94xx: failing on high load (another data point) James Bottomley @ 2008-02-20 9:54 ` Keith Hopkins 2008-02-20 16:22 ` James Bottomley 0 siblings, 1 reply; 16+ messages in thread From: Keith Hopkins @ 2008-02-20 9:54 UTC (permalink / raw) To: James Bottomley, Darrick J. Wong; +Cc: Jan Sembera, linux-scsi On 02/20/2008 11:48 AM, James Bottomley wrote: > On Tue, 2008-02-19 at 10:22 -0600, James Bottomley wrote: >> I'll see if I can come up with patches to fix this ... or at least >> mitigate the problems it causes. > > Darrick's working on the ascb sequencer use after free problem. > > I looked into some of the error handling in libsas, and apparently > that's a bit of a huge screw up too. There are a number of places where > we won't complete a task that is being errored out and thus causes > timeout errors. This patch is actually for libsas to fix all of this. > > I've managed to reproduce some of your problem by firing random resets > across a disk under load, and this recovers the protocol errors for me. > However, I can't reproduce the TMF timeout which caused the sequencer > screw up, so you still need to wait for Darrick's fix as well. > > James > Hi James, Darrick, Thanks again for looking more into this. I'll wait for Darrick's patch and try it together with this libsas patch. Should I leave James' first patch in also? I'm still looking for a Dell machine to use, and will upgrade the drives' firmware the first chance I get. Thanks again, --Keith ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: aic94xx: failing on high load (another data point) 2008-02-20 9:54 ` Keith Hopkins @ 2008-02-20 16:22 ` James Bottomley 0 siblings, 0 replies; 16+ messages in thread From: James Bottomley @ 2008-02-20 16:22 UTC (permalink / raw) To: Keith Hopkins; +Cc: Darrick J. Wong, Jan Sembera, linux-scsi On Wed, 2008-02-20 at 17:54 +0800, Keith Hopkins wrote: > On 02/20/2008 11:48 AM, James Bottomley wrote: > > On Tue, 2008-02-19 at 10:22 -0600, James Bottomley wrote: > >> I'll see if I can come up with patches to fix this ... or at least > >> mitigate the problems it causes. > > > > Darrick's working on the ascb sequencer use after free problem. > > > > I looked into some of the error handling in libsas, and apparently > > that's a bit of a huge screw up too. There are a number of places where > > we won't complete a task that is being errored out and thus causes > > timeout errors. This patch is actually for libsas to fix all of this. > > > > I've managed to reproduce some of your problem by firing random resets > > across a disk under load, and this recovers the protocol errors for me. > > However, I can't reproduce the TMF timeout which caused the sequencer > > screw up, so you still need to wait for Darrick's fix as well. > > > > James > > > > Hi James, Darrick, > > Thanks again for looking more into this. I'll wait for Darrick's > patch and try it together with this libsas patch. Should I leave > James' first patch in also? Yes, that's a requirement just to get the REQ_TASK_ABORT for the protocol errors actually to work ... I'm afraid this is like peeling an onion as I said .. and you're going to build up layers of patches. However, the ones that are obvious bug fixes and I can test (all of them so far), I'm putting in the rc fixes tree of SCSI, so you can download a rollup here: http://www.kernel.org/pub/linux/kernel/people/jejb/scsi-rc-fixes-2.6.diff James ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: aic94xx: failing on high load (another data point)
@ 2008-01-30 10:55 Keith Hopkins
  0 siblings, 0 replies; 16+ messages in thread
From: Keith Hopkins @ 2008-01-30 10:55 UTC (permalink / raw)
  To: linux-scsi

> We've tried new adaptec firmware shipped with SLES and we got
> ourselves new error string that appears just above error messages that you
> have seen before and that were attached to the original message:
> kernel: aic94xx: escb_tasklet_complete: REQ_TASK_ABORT, reason=0x6
> kernel: aic94xx: escb_tasklet_complete: Can't find task (tc=71) to abort!
>
> Do you think they have any significance?

Hi Jan,

Which firmware version is that? I get similar errors under a high load
(rebuilding sw raid1 partitions) with sequencer Firmware version 1.1 (V28),
which will eventually hang my box. Previous firmware versions would also
hang in similar situations.

My box is OpenSuSE 10.3 with a 2.6.22.13-0.3-default kernel, 2x quad core
Xeon CPU E5345 @ 2.33GHz stepping 0b, 14G of DDR2-667 memory, and an
Adaptec 48300 (AIC-9410W SAS/SATA Host Adapter, device 0000:05:02.0)
directly connected to two SEAGATE ST3146855SS drives.

My 2 bits.

--Keith Hopkins

^ permalink raw reply [flat|nested] 16+ messages in thread
Thread overview: 16+ messages
[not found] <479FB3ED.3080401@hopnet.net>
[not found] ` <20080130091403.GA14887@alaris.suse.cz>
2008-01-30 10:59 ` aic94xx: failing on high load (another data point) Keith Hopkins
2008-01-30 19:29 ` Darrick J. Wong
2008-02-14 16:11 ` Keith Hopkins
2008-02-15 15:28 ` James Bottomley
2008-02-15 16:28 ` Keith Hopkins
2008-02-18 14:26 ` Keith Hopkins
2008-02-18 16:18 ` James Bottomley
2008-02-19 16:22 ` James Bottomley
2008-02-19 18:44 ` [PATCH] aic94xx: Don't free ABORT_TASK SCBs that are timed out (Was: Re: aic94xx: failing on high load) Darrick J. Wong
2008-02-19 18:52 ` James Bottomley
2008-02-28 14:56 ` Keith Hopkins
2008-02-28 16:10 ` James Bottomley
2008-02-20 3:48 ` aic94xx: failing on high load (another data point) James Bottomley
2008-02-20 9:54 ` Keith Hopkins
2008-02-20 16:22 ` James Bottomley
2008-01-30 10:55 Keith Hopkins