From mboxrd@z Thu Jan  1 00:00:00 1970
From: pg_mh@mh.to.sabi.co.UK (Peter Grandi)
Subject: RE: RAID halting
Date: Sun, 5 Apr 2009 23:20:43 +0100
Message-ID: <18905.11963.968493.29417@tree.ty.sabi.co.uk>
References: <20090405203331.FWRT1944.cdptpa-omta02.mail.rr.com@Leslie>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20090405203331.FWRT1944.cdptpa-omta02.mail.rr.com@Leslie>
Sender: linux-raid-owner@vger.kernel.org
To: Linux RAID
List-Id: linux-raid.ids

> [ ... ] The evidence so far does not strongly suggest a
> hardware issue, at least not a drive issue, [ ... ]

> [ ... ] the drive system previously reported tons of sector
> remaps when the drives were in a different, clearly broken,
> enclosure, and they continue to do so on the 320G drive with
> known issues.

>> * Did you look into firmware? Are the drives and/or firmware
>> revisions qualified by your controller vendor?

> Yes. I did that before purchasing the controller. No, I did not
> look into the drives. The controller vendor does not qualify
> drives. Controllers don't get any more generic than the one I
> purchased (I don't recall the brand at this time - it's based on
> the Silicon Image SiI3124 controller chip).

Uhhh, I'd invest in something else, just in case. The SiI chips
are a bit low end, and most SiI-based cards I have seen were of
the cheap and cheerful variety, and those sometimes have fairly
marginal electrical/noise designs.

> More importantly, the fact the system ran for months without the
> problem, and the problem only occurred after changing the array
> chassis and the file system strongly suggests this is not the
> root of the issue.

Not necessarily: a different file system may trigger different
bugs in the host adapter firmware and in the drive firmware by
doing operations in a different sequence with different timing.

> [ ... ] "HOW DO I RUN A FULL BLOCK-LEVEL HARDWARE TEST?"
I agree that it seems unlikely that it is a physically defective
disk; more likely bad cabling, a bad backplane, bad firmware, or
a bad electrical/noise design.

Anyhow, it is practically impossible on modern drives to run a
full block-level hardware test: current disk drives are more like
block servers, with several layers of indirection between the
command level and the hardware. To run a *logical* block test,
however, 'badblocks' from the 'e2fsprogs' package is the common
choice.

But I'd also leave the CERN "silent corruption" daemon and other
checks/diagnostics running, and look carefully at the system logs
for host adapter errors.

For most people doing significant storage systems, and for
self-built systems of a certain size, keeping current with the
HEPiX workshops seems to me a good idea.
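For illustration, here is one harmless way to try 'badblocks':
point it at a scratch file instead of a live device (the file
path and size below are arbitrary choices of mine). On a real
drive you would substitute its device node, e.g. /dev/sdX, and
you would NEVER use the destructive '-w' write test on a disk
holding data; the read-only scan shown here is the safe mode.

```shell
#!/bin/sh
# Sketch: read-only logical block scan with badblocks(8) from e2fsprogs,
# run against a throwaway image file so nothing real is at risk.
set -e
img=$(mktemp /tmp/bbscan.XXXXXX)             # scratch stand-in for /dev/sdX
dd if=/dev/zero of="$img" bs=1M count=8 2>/dev/null
# -s: show progress, -v: verbose, -b 4096: scan in 4 KiB blocks
badblocks -sv -b 4096 "$img" && echo "scan clean"
rm -f "$img"
```

On a real member drive, fail/remove it from the array first, and
expect a full-surface read pass to take hours on a large disk.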