From mboxrd@z Thu Jan 1 00:00:00 1970 From: Luben Tuikov Subject: Re: aic79xx blowups in 2.6.8-1.521smp (RHAT) Date: Tue, 05 Oct 2004 17:51:24 -0400 Sender: linux-scsi-owner@vger.kernel.org Message-ID: <4163175C.8010001@adaptec.com> References: Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Transfer-Encoding: 7bit Return-path: Received: from magic.adaptec.com ([216.52.22.17]:33505 "EHLO magic.adaptec.com") by vger.kernel.org with ESMTP id S266127AbUJEVvd (ORCPT ); Tue, 5 Oct 2004 17:51:33 -0400 In-Reply-To: List-Id: linux-scsi@vger.kernel.org To: Bryce as root Cc: linux-scsi@vger.kernel.org Bryce as root wrote: > where to start,.. > > I have a system which has an onboard aic79xx controller > > lspci -vv : > 03:0a.0 SCSI storage controller: Adaptec AIC-7902 U320 (rev 03) > Subsystem: Adaptec: Unknown device ffff > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- > ParErr- Stepping- SERR- FastB2B- > Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- > SERR- Latency: 32 (10000ns min, 6250ns max), Cache Line Size 08 > Interrupt: pin A routed to IRQ 217 > Region 0: I/O ports at 7c00 [disabled] > Region 1: Memory at f3008000 (64-bit, non-prefetchable) [size=8K] > Region 3: I/O ports at 8000 [disabled] [size=256] > Capabilities: [dc] Power Management version 1 > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA > PME(D0-,D1-,D2-,D3hot-,D3cold-) > Status: D0 PME-Enable- DSel=0 DScale=0 PME- > Capabilities: [a0] Message Signalled Interrupts: 64bit+ > Queue=0/1 Enable- > Address: 0000000000000000 Data: 0000 > Capabilities: [94] PCI-X non-bridge device. > Command: DPERE- ERO+ RBC=0 OST=4 > Status: Bus=255 Dev=31 Func=0 64bit+ 133MHz+ SCD- USC-, > DC=simple, DMMRBC=0, DMOST=4, DMCRS=1, RSCEM- > > This is connected by certified U320 cable to a 15K U320 Seagate 37Gb drive > > SCSI subsystem initialized > ACPI: PCI interrupt 0000:03:0a.0[A] -> GSI 22 (level, low) -> IRQ 217 > ACPI: PCI interrupt 0000:03:0a.1[B] -> GSI 19 (level, low) -> IRQ 177 > scsi0 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 1.3.11 > > aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI 33 or 66Mhz, > 512 SCBs > > (scsi0:A:0): 320.000MB/s transfers (160.000MHz DT|IU|QAS, 16bit) > Vendor: SEAGATE Model: ST336753LW Rev: 0006 > Type: Direct-Access ANSI SCSI revision: 03 > scsi0:A:0:0: Tagged Queuing enabled. Depth 4 > SCSI device sda: 71687372 512-byte hdwr sectors (36704 MB) > SCSI device sda: drive cache: write back > sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 > > Attached scsi disk sda at scsi0, channel 0, id 0, lun 0 > scsi1 : Adaptec AIC79XX PCI-X SCSI HBA DRIVER, Rev 1.3.11 > > aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI 33 or 66Mhz, > 512 SCBs > > Ok, to the problem itself. I've found that the system will choke badly > after > anything from a few days to a couple of weeks, always over a LQI Phase > error. > I've asked around to help work out what the LQI Phase error is about and > I still don't understand it. There was an "illegal" phase change while receiving an LQI packet, since you're at 320, and have parallel information units on. The card dump state isn't complete as some registers were cleared when the card was paused and gotten out of critial region, but it looks like there was a READ(10) and the chip was just about to transfer some data to host memory. If the failure can be reproduced consistently, a SCSI Bus trace would narrow down where the problem is. Luben > kernel dmesg dump : > Reseting Channel for LQI Phase error > >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<< > scsi0: Dumping Card State at program address 0x8 Mode 0x33 > Card was paused > HS_MAILBOX[0x0] INTCTL[0x80] SEQINTSTAT[0x0] SAVED_MODE[0x11] > DFFSTAT[0x11] SCSISIGI[0x0] SCSIPHASE[0x0] SCSIBUS[0x0] > LASTPHASE[0x1] SCSISEQ0[0x0] SCSISEQ1[0x12] SEQCTL0[0x10] > SEQINTCTL[0x0] SEQ_FLAGS[0x0] SEQ_FLAGS2[0x0] SSTAT0[0x0] > SSTAT1[0x9] SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0] > SIMODE1[0xa4] LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0] > LQOSTAT0[0x0] LQOSTAT1[0x0] LQOSTAT2[0x1] > > SCB Count = 4 CMDS_PENDING = 2 LASTSCB 0x1 CURRSCB 0x0 NEXTSCB 0xff00 > qinstart = 40786 qinfifonext = 40786 > QINFIFO: > WAITING_TID_QUEUES: > Pending list: > 0 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x7] > 1 FIFO_USE[0x0] SCB_CONTROL[0x60] SCB_SCSIID[0x7] > Total 2 > Kernel Free SCB list: 2 3 > Sequencer Complete DMA-inprog list: > Sequencer Complete list: > Sequencer DMA-Up and Complete list: > > scsi0: FIFO0 Free, LONGJMP == 0x80ff, SCB 0x1 > SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89] > SG_CACHE_SHADOW[0x2] SG_STATE[0x0] DFFSXFRCTL[0x0] > SOFFCNT[0x66] MDFFSTAT[0x5] SHADDR = 0x00, SHCNT = 0x0 > HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x88] > scsi0: FIFO1 Active, LONGJMP == 0x257, SCB 0x1 > SEQIMODE[0x3f] SEQINTSRC[0x0] DFCNTRL[0x8] DFSTATUS[0x0] > SG_CACHE_SHADOW[0x30] SG_STATE[0x6] DFFSXFRCTL[0x0] > SOFFCNT[0x66] MDFFSTAT[0xa] SHADDR = 0x06c420c04, SHCNT = 0x3fc > HADDR = 0x06c420ce0, HCNT = 0x320 CCSGCTL[0x98] > LQIN: 0x4 0x0 0x0 0x1 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x32 0x0 > 0x0 0x0 0 > x2 0x0 > scsi0: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE = 0x42 > scsi0: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x2 > SIMODE0[0xc] > CCSCBCTL[0x4] > scsi0: REG0 == 0x3, SINDEX = 0x133, DINDEX = 0x102 > scsi0: SCBPTR == 0x0, SCB_NEXT == 0xff00, SCB_NEXT2 == 0xff11 > CDB 28 0 1 93 e9 d6 > STACK: 0x125 0x0 0x0 0x257 0x240 0x93 0x29 0x1 > <<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>> > DevQ(0:0:0): 0 waiting > SCSI error : <0 0 0 0> return code = 0x10000 > end_request: I/O error, dev sda, sector 26470870 > SCSI error : <0 0 0 0> return code = 0x10000 > end_request: I/O error, dev sda, sector 26470878 > SCSI error : <0 0 0 0> return code = 0x10000 > end_request: I/O error, dev sda, sector 26470886 > SCSI error : <0 0 0 0> return code = 0x10000 > end_request: I/O error, dev sda, sector 26470894 > SCSI error : <0 0 0 0> return code = 0x8000002 > Info fld=0x0, Current sda: sense key Aborted Command > end_request: I/O error, dev sda, sector 26470766 > SCSI error : <0 0 0 0> return code = 0x8000002 > Info fld=0x0, Current sda: sense key Aborted Command > end_request: I/O error, dev sda, sector 26470774 > SCSI error : <0 0 0 0> return code = 0x8000002 > Info fld=0x0, Current sda: sense key Aborted Command > end_request: I/O error, dev sda, sector 51448252 > Buffer I/O error on device sda7, logical block 11048 > lost page write due to I/O error on sda7 > SCSI error : <0 0 0 0> return code = 0x8000002 > Info fld=0x0, Current sda: sense key Aborted Command > end_request: I/O error, dev sda, sector 26470782 > SCSI error : <0 0 0 0> return code = 0x8000002 > Info fld=0x0, Current sda: sense key Aborted Command > end_request: I/O error, dev sda, sector 68501204 > Buffer I/O error on device sda7, logical block 2142667 > lost page write due to I/O error on sda7 > (scsi0:A:0): 320.000MB/s transfers (160.000MHz DT|IU|QAS, 16bit) > > > When the system is up and running, the disk is stable and I > get a quite respectable 72MB/s max throughput. > > I have other card dumps however since this isn't my area of expertise I've > not included them as they're roughly the same problem always starting off > with a "Reseting Channel for LQI Phase error" error. > > I should mention that I have a sister machine with the exact same HW > that exhibits the issues and though not a 100% indication, it > would suggest the issue is more to do with the driver. > > I did also try to rebuild the kernel with aic79xx-linux-2.6-20040522-tar.gz > however I've not run it long enough on the box to say that it fixes the > issue > > Ideas anyone? or shall I just keep banging on 2.0.12 and hope it > fixes/masks > the problem? > > Phil > =--= > - > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >