From mboxrd@z Thu Jan 1 00:00:00 1970
From: Tejun Heo
Subject: Re: JMicron - hard resetting link
Date: Tue, 12 Feb 2008 23:52:09 +0900
Message-ID: <47B1B299.3010208@gmail.com>
References: <009401c86d5c$5eb57bf0$4d0fa8c0@M2007> <47B19997.1010404@gmail.com> <003801c86d84$fdae0510$4d0fa8c0@M2007>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-2
Content-Transfer-Encoding: 7bit
Return-path:
Received: from el-out-1112.google.com ([209.85.162.176]:42481 "EHLO el-out-1112.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754615AbYBLOwR (ORCPT ); Tue, 12 Feb 2008 09:52:17 -0500
Received: by el-out-1112.google.com with SMTP id v27so1985247ele.23 for ; Tue, 12 Feb 2008 06:52:16 -0800 (PST)
In-Reply-To: <003801c86d84$fdae0510$4d0fa8c0@M2007>
Sender: linux-ide-owner@vger.kernel.org
List-Id: linux-ide@vger.kernel.org
To: Gabor FUNK
Cc: IDE/ATA development list

Gabor FUNK wrote:
>> It shouldn't kill the RAID. Hmmm... The log is truncated. Can you
>> please post full kernel log spanning from boot to array death?
>
> RAID "dies" because controller dies, then it loses 4 disks out of 8...
> Actually, the server last time was up and running for 2 months.
> Then when it failed the 1st time, I did some tests and it went on for
> 3 days, including building the raid and heavy test file copy.
> The full log from the 1st relevant error message till the death of
> the array is here:
> http://www.huweb.hu/maques/tmp/jmicron/syslog

What I said was that timeouts occurring due to transmission errors
should be recoverable. It seems like IRQ delivery didn't work, probably
due to a screaming IRQ. I need to see the messages before the first
relevant error message. It's always a good idea to post the full kernel
log from boot till failure. Things which don't seem relevant are often
relevant.

>> Move half of the drives to the new PSU and see whether the problem goes
>> away.
>
> This is a new server, with a Chieftec GPS650AB, 650W PSU in it.
> Though AFAIK a harddisk consumes around 10W, and I will try to use
> more than one PSU-s.

I've recently tracked down IO problems in a server product line from a
major (really, one of the top three) vendor to a malfunctioning PSU, so
don't trust the labeling too much.

> The main problem is that I can't immediately see if it helps or not.
> Even if it will work without this problem for a week, I can't be sure it
> still will in 2 months...
> Because of this - and because I believe that this problem related to the HW
> (motherboard, chipset) - I'd rather just throw away the MB and use an
> other one with two extra 4 port SATA cards.

Till now, none of this kind of problem has been tracked down to the MB
or the controller, while 90% of hardware problems turned out to be power
related.

Thanks.

--
tejun