From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Gabor FUNK"
Subject: Re: JMicron - hard resetting link
Date: Fri, 15 Feb 2008 00:02:32 +0100
Message-ID: <005c01c86f5d$b050b4e0$4d0fa8c0@M2007>
References: <009401c86d5c$5eb57bf0$4d0fa8c0@M2007> <47B19997.1010404@gmail.com> <003801c86d84$fdae0510$4d0fa8c0@M2007> <47B1B299.3010208@gmail.com> <002f01c86d9c$94542f50$4d0fa8c0@M2007> <47B230CA.9060506@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; format=flowed; charset="iso-8859-2"; reply-type=original
Content-Transfer-Encoding: 7bit
Return-path:
Received: from ns1.huweb.hu ([62.112.193.37]:55603 "EHLO ns1.huweb.hu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754813AbYBNXCp (ORCPT ); Thu, 14 Feb 2008 18:02:45 -0500
Sender: linux-ide-owner@vger.kernel.org
List-Id: linux-ide@vger.kernel.org
To: Tejun Heo
Cc: IDE/ATA development list

To be honest, I didn't believe that changing anything about the PSU would
make a difference. However, it seems it did. I have also updated the BIOS,
but I guess that has little to do with it.

So a different-brand PSU was installed in addition, and it now powers the
motherboard and the 4 disks that were failing. The "old" PSU powers the
second set of 4 HDDs and the 2 system HDDs.

The test was started yesterday (Feb 13) at about 16:30 CET and included
building the array and copying files. Today (Feb 14) at about 20:22 the
problem appeared again, but it seems to have "moved" with the PSU to the
other group of 4 disks (on the nvidia controller); more precisely, to only
2 of them (the array is still operational).

Feb 14 20:22:32 storage1 kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Feb 14 20:22:32 storage1 kernel: ata10.00: cmd c8/00:00:c3:d5:3b/00:00:00:00:00/e2 tag 0 dma 131072 in
Feb 14 20:22:32 storage1 kernel: res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 14 20:22:32 storage1 kernel: ata10.00: status: { DRDY }
Feb 14 20:22:32 storage1 kernel: ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Feb 14 20:22:32 storage1 kernel: ata9.00: cmd c8/00:00:c3:d5:3b/00:00:00:00:00/e2 tag 0 dma 131072 in
Feb 14 20:22:32 storage1 kernel: res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 14 20:22:32 storage1 kernel: ata9.00: status: { DRDY }
Feb 14 20:22:33 storage1 kernel: ata10: soft resetting link
Feb 14 20:22:33 storage1 kernel: ata9: soft resetting link
Feb 14 20:22:33 storage1 kernel: ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 14 20:22:33 storage1 kernel: ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 14 20:23:03 storage1 kernel: ata9.00: qc timeout (cmd 0x27)
Feb 14 20:23:03 storage1 kernel: ata9.00: failed to read native max address (err_mask=0x4)
Feb 14 20:23:03 storage1 kernel: ata9.00: HPA support seems broken, will skip HPA handling
Feb 14 20:23:03 storage1 kernel: ata9.00: revalidation failed (errno=-5)
Feb 14 20:23:03 storage1 kernel: ata9: failed to recover some devices, retrying in 5 secs
Feb 14 20:23:03 storage1 kernel: ata10.00: qc timeout (cmd 0x27)
Feb 14 20:23:03 storage1 kernel: ata10.00: failed to read native max address (err_mask=0x4)
Feb 14 20:23:03 storage1 kernel: ata10.00: HPA support seems broken, will skip HPA handling
Feb 14 20:23:03 storage1 kernel: ata10.00: revalidation failed (errno=-5)
Feb 14 20:23:03 storage1 kernel: ata10: failed to recover some devices, retrying in 5 secs
Feb 14 20:23:08 storage1 kernel: ata9: hard resetting link
Feb 14 20:23:08 storage1 kernel: ata10: hard resetting link
...
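
To see at a glance which ports are affected and when each one first failed, a log like the excerpt above can be tallied per port. The following is only a minimal Python sketch, assuming syslog-format lines shaped like the ones quoted here; `LINE_RE`, `summarize` and `SAMPLE` are hypothetical names made up for illustration and are not part of libata or any existing tool.

```python
import re
from collections import defaultdict

# Assumed line shape (taken from the excerpt above), e.g.:
#   Feb 14 20:22:33 storage1 kernel: ata10: soft resetting link
#   Feb 14 20:23:03 storage1 kernel: ata9.00: qc timeout (cmd 0x27)
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: "
    r"ata(?P<port>\d+)(?:\.\d+)?: (?P<msg>.*)$"
)


def summarize(lines):
    """Return {port: {"first": timestamp, "timeouts": n, "resets": n}}."""
    ports = defaultdict(lambda: {"first": None, "timeouts": 0, "resets": 0})
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skips "res ..." continuation lines and unrelated messages
        entry = ports[int(m["port"])]
        if entry["first"] is None:
            entry["first"] = m["stamp"]  # first message seen for this port
        msg = m["msg"]
        if "qc timeout" in msg:
            entry["timeouts"] += 1
        if "resetting link" in msg:  # counts both soft and hard resets
            entry["resets"] += 1
    return dict(ports)


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "Feb 14 20:22:32 storage1 kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen",
    "Feb 14 20:22:32 storage1 kernel: res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)",
    "Feb 14 20:22:33 storage1 kernel: ata10: soft resetting link",
    "Feb 14 20:23:03 storage1 kernel: ata9.00: qc timeout (cmd 0x27)",
    "Feb 14 20:23:08 storage1 kernel: ata9: hard resetting link",
]

if __name__ == "__main__":
    for port, info in sorted(summarize(SAMPLE).items()):
        print(f"ata{port}: {info}")
```

Run against a full kern.log (for example `summarize(open("kern.log"))`, where the file name is just a placeholder), this would show whether the failing ports all hang off the same PSU.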

Full kern.log is at:
http://www.huweb.hu/maques/tmp/jmicron/kern0214.log

So it seems there is definitely something wrong with the "old" PSU.
Also, I tried to mount the failed drives, without success.

I thought I'd let you know. Now I will try with only the "new" PSU to see
what happens...

G.

----- Original Message -----
From: "Tejun Heo"
To: "Gabor FUNK"
Cc: "IDE/ATA development list"
Sent: Wednesday, February 13, 2008 12:50 AM
Subject: Re: JMicron - hard resetting link

> Hello,
>
> Gabor FUNK wrote:
>>> What I said was that timeouts occurring due to transmission errors
>>> should be recoverable. It seems like IRQ delivery didn't work probably
>>> due to screaming IRQ. I need to see the messages before the first
>>> relevant error message. It's always a good idea to post full kernel log
>>> from boot till failure. Things which don't seem relevant are often
>>> relevant.
>> Naturally. Full kern.log with boot:
>> http://www.huweb.hu/maques/tmp/jmicron/kern.log
>> (no edits, there are really only those 2 lines between Feb 6 and Feb 9's
>> 1st exception)
>
> Hmmm... Indeed. This is the first time this mode of failure is reported.
>
>> Previously there was kernel 2.6.23.9 and I noticed the following in
>> syslog by then:
>> Feb 6 19:10:19 storage1 kernel: ata4: D2H reg with I during NCQ, this
>> message won't be printed again
>> Feb 6 19:10:20 storage1 kernel: ata1: D2H reg with I during NCQ, this
>> message won't be printed again
>> Feb 6 19:10:20 storage1 kernel: ata2: D2H reg with I during NCQ, this
>> message won't be printed again
>> Feb 6 19:10:21 storage1 kernel: ata3: D2H reg with I during NCQ, this
>> message won't be printed again
>>
>> I googled and saw that there was some fixes related to this (maybe it
>> was you), so that's why we hoped that 2.6.24 will fix this. Actually the
>> above error messages were gone, but...
>
> Yeap, those are gone.
>
>>> Till now, none of this kind of problem has been tracked down to MB or
>>> the controller while 90% of hardware problems turned out to be power
>>> related.
>> I'll put a brand new, probably different PSU in the case and put the MB
>> and the 4 disks of the problematic controller on it, and put the 2 system
>> and other 4 disks to this one (or even another one).
>
> Yeap, please keep me posted.
>
>> Meanwhile I'd welcome if you have any suggestion why controller reset
>> causing a "fatal error"...
>> BTW, the drives were accessible after the array broke (when I got there).
>
> What do you mean by 'drives were accessible'? /dev/sdX nodes were
> accessible?
>
> --
> tejun
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ide" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>