From mboxrd@z Thu Jan 1 00:00:00 1970
From: Gavin Flower
Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
Date: Sun, 10 Apr 2011 23:50:07 -0700 (PDT)
Message-ID: <165228.90505.qm@web65113.mail.ac2.yahoo.com>
References: <20110408215000.15c881bb@notabene.brown>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
In-Reply-To: <20110408215000.15c881bb@notabene.brown>
Sender: linux-raid-owner@vger.kernel.org
To: NeilBrown
Cc: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

--- On Fri, 8/4/11, NeilBrown wrote:

> From: NeilBrown
> Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
> To: "Gavin Flower"
> Cc: linux-raid@vger.kernel.org
> Date: Friday, 8 April, 2011, 23:50
> On Fri, 8 Apr 2011 02:59:52 -0700 (PDT) Gavin Flower wrote:
>
> >
> > --- On Fri, 8/4/11, NeilBrown wrote:
> >
> > > From: NeilBrown
> > > Subject: Re: RAID6 data-check took almost 2 hours, clicking sounds, system unresponsive
[...]
> > > Obviously there is some sort of hardware issue - possibly a drive, possibly a
> > > bus problem - I really don't know.
> > >
> > > Apart from that things look normal.
> > >
> > > What exactly did you want explained?
> > >
> > > NeilBrown
> >
> > I guess I was surprised that the RAID system appeared normal and that it did not register any errors. I was hoping to get an idea as to which drive was problematic.
>
> sdc2 was reporting read error. md/raid6 computed the data from the other
> devices and wrote it back to sdc2. This appeared to work so md/raid6 assumed
> everything was fine again. It reported this:
>
> Apr  7 08:42:08 saturn kernel: [210414.109880] md/raid:md1: read error corrected (8 sectors at 17195840 on sdc2)
>
> but didn't fail anything.
>
>
> >
> > I get the feeling, from your reply, that this is not specifically a RAID problem, that it just happens to affect a RAID array.
>
> No, it was clearly a disk-drive problem.
> e.g.
> Apr  7 14:42:12 saturn kernel: [231957.756023] ata3.00: failed command: READ FPDMA QUEUED
>
> a READ command sent to an 'ata' device failed. i.e. disk error.
>
> >
> > I had thought that the RAID system should have been able to give me better diagnostics, but possibly I am being (inadvertently) unreasonable!
>
> Well.... it did tell you that it got a read error and corrected it.
>
>
> >
> > Not sure what the significance of this mismatch is, and what I should do about it.
> > # cat /sys/block/md2/md/mismatch_cnt
> > 28904
> > #
>
> I'm not sure if read errors end up counting as mismatches.. They seem to for
> raid1. The raid6 code is more complex and I don't feel like decoding it
> right now.
>
> In terms of "what to do about it" - the first thing must be to fix sdc.
> Maybe there is a loose cable or a broken cable. Maybe the device needs to be
> replaced.
>
> Once you have resolved that and are fairly sure your drives are all working,
>     echo check > /sys/block/md2/md/sync_action
>
> once that finishes mismatch_cnt should ideally be zero. If it isn't, try
>     echo repair > /sys/block/md2/md/sync_action
>
> but only do that if you are confident that your devices are good.
> This will result in the same mismatch_cnt. However a subsequent 'check'
> should then show zero.
>
> NeilBrown

Thanks, I followed your suggestions and all 'appears' to be fine now. Reality was a wee bit more dramatic than I would have liked! Machine refused to boot this morning, complaining about disk errors.
Fortunately, I had arranged for a hardware-capable friend to come around. He adjusted the cable on the offending drive and I ran fsck twice (lots of alarming messages first time). On rebooting, the system came up, but a video driver problem prevented the desktop from working. Fortunately I was able to log in from another machine and apply your suggested remedy. After the repair, I rebooted and was able to get into my desktop; subsequent checks revealed the mismatch counts to be all zero (I checked the failed RAID array and the other 2).

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
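
For anyone finding this thread later: the check / repair / re-check sequence quoted in this message could be wrapped in a few shell functions, roughly as below. This is only a sketch. The function names, the MD_DIR and POLL_SECONDS variables, and the polling loop are my own assumptions and not anything md provides; the only things taken from the thread are the sync_action and mismatch_cnt files and the md2 path. It assumes sync_action reads back "idle" once a check or repair has finished.

```shell
#!/bin/sh
# Sketch only: helper names and variables here are hypothetical.
# MD_DIR is the sysfs directory of the array (md2 is the array from the thread).
MD_DIR="${MD_DIR:-/sys/block/md2/md}"

# Print the current mismatch count for the array.
mismatch_count() {
    cat "$MD_DIR/mismatch_cnt"
}

# Start an action ("check" or "repair") and wait until the array is idle again.
# Needs root; only use "repair" once the drives are believed to be good.
run_sync_action() {
    echo "$1" > "$MD_DIR/sync_action"
    while [ "$(cat "$MD_DIR/sync_action")" != "idle" ]; do
        sleep "${POLL_SECONDS:-60}"
    done
}

# Given a mismatch count, say whether a repair pass is still needed.
next_step() {
    if [ "$1" -eq 0 ]; then
        echo "clean"
    else
        echo "repair"
    fi
}

# Example sequence (not run automatically):
#   run_sync_action check
#   if [ "$(next_step "$(mismatch_count)")" = "repair" ]; then
#       run_sync_action repair   # mismatch_cnt stays the same after this pass
#       run_sync_action check    # a following check should then show zero
#   fi
```

Nothing runs when the file is sourced; the functions have to be called by hand, as root, and only after the hardware fault (here the cable on sdc) has been dealt with.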