From mboxrd@z Thu Jan 1 00:00:00 1970
From: Neil Brown
Subject: Re: read errors corrected
Date: Fri, 31 Dec 2010 10:12:43 +1100
Message-ID: <20101231101243.666e0f9e@notabene.brown>
References: <20101230201501.2f39a85f@notabene.brown>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path:
In-Reply-To:
Sender: linux-raid-owner@vger.kernel.org
To: James
Cc: linux-raid@vger.kernel.org
List-Id: linux-raid.ids

On Thu, 30 Dec 2010 11:35:59 -0500 James wrote:

> Sorry Neil, I meant to reply-all.
>
> -james
>
> On Thu, Dec 30, 2010 at 11:35, James wrote:
> > Inline.
> >
> > On Thu, Dec 30, 2010 at 04:15, Neil Brown wrote:
> >> On Thu, 30 Dec 2010 03:20:48 +0000 James wrote:
> >>
> >>> All,
> >>>
> >>> I'm looking for a bit of guidance here. I have a RAID 6 set up on my
> >>> system and am seeing some errors in my logs as follows:
> >>>
> >>> # cat messages | grep "read erro"
> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
> >>> sectors at 974262528 on sda4)
> >>> Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8
> >>> sectors at 974262536 on sda4)
> >> .....
> >>
> >>>
> >>> I've Google'd the heck out of this error message but am not seeing a
> >>> clear and concise message: is this benign? What would cause these
> >>> errors? Should I be concerned?
> >>>
> >>> There is an error message (read error corrected) on each of the drives
> >>> in the array. They all seem to be functioning properly. The I/O on the
> >>> drives is pretty heavy for some parts of the day.
> >>>
> >>> # cat /proc/mdstat
> >>> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5]
> >>> [raid4] [multipath]
> >>> md1 : active raid6 sdb1[1] sda1[0] sdd1[3] sdc1[2]
> >>>       497792 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
> >>>
> >>> md2 : active raid6 sdb2[1] sda2[0] sdd2[3] sdc2[2]
> >>>       4000000 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
> >>>
> >>> md3 : active raid6 sdb3[1] sda3[0] sdd3[3] sdc3[2]
> >>>       25992960 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
> >>>
> >>> md4 : active raid6 sdb4[1] sda4[0] sdd4[3] sdc4[2]
> >>>       2899780480 blocks level 6, 64k chunk, algorithm 2 [4/4] [UUUU]
> >>>
> >>> unused devices:
> >>>
> >>> I have a really hard time believing there's something wrong with all
> >>> of the drives in the array, although admittedly they're the same model
> >>> from the same manufacturer.
> >>>
> >>> Can someone point me in the right direction?
> >>> (a) what causes these errors precisely?
> >>
> >> When md/raid6 tries to read from a device and gets a read error, it tries to
> >> read from the other devices.  When that succeeds it computes the data that
> >> it had tried to read and then writes it back to the original drive.  If this
> >> succeeds it assumes that the read error has been corrected by the write, and
> >> prints the message that you see.
> >>
> >>
> >>> (b) is the error benign? How can I determine if it is *likely* a
> >>> hardware problem? (I imagine it's probably impossible to tell if it's
> >>> HW until it's too late)
> >>
> >> A few occasional messages like this are fairly benign.  They could be a sign
> >> that the drive surface is degrading.  If you see lots of these messages, then
> >> you should seriously consider replacing the drive.
> >
> > Wow, this is hard for me to believe considering this is happening on
> > all the drives.
> > It's not impossible, however, since the drives are
> > likely from the same batch.
> >
> >> As you are seeing these messages across all devices, it is possible that the
> >> problem is with the sata controller rather than the disks.  To know which, you
> >> should check the errors that are reported in dmesg.  If you don't understand
> >> these messages, then post them to the list - feel free to post several hundred
> >> lines of logs - too much is much much better than not enough.
> >
> > I posted a few errors in my response to the thread a bit ago -- here's
> > another snippet:
> >
> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Unhandled error code
> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] Result: hostbyte=0x00
> > driverbyte=0x06
> > Dec 29 01:55:03 nuova kernel: sd 1:0:0:0: [sdc] CDB: cdb[0]=0x28: 28
> > 00 25 a2 a0 6a 00 00 80 00
> > Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdc, sector 631414890
> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Unhandled error code
> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] Result: hostbyte=0x00
> > driverbyte=0x06

"Unhandled error code" sounds like it could be a driver problem...
Try googling that error message...

http://us.generation-nt.com/answer/2-6-33-libata-issues-via-sata-pata-controller-help-197123882.html

"Also, please try the latest 2.6.34-rc kernel, as that has several fixes for
both pata_via and sata_via which did not make 2.6.33."

What kernel are you running???
NeilBrown

> > Dec 29 01:55:03 nuova kernel: sd 0:0:1:0: [sdb] CDB: cdb[0]=0x28: 28
> > 00 25 a2 a0 ea 00 00 38 00
> > Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdb, sector 631415018
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923648 on sdb4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923656 on sdb4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923664 on sdb4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923672 on sdb4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923680 on sdb4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923688 on sdb4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923696 on sdb4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923520 on sdc4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923528 on sdc4)
> > Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8
> > sectors at 600923536 on sdc4)
> >
> > Is there a good way to determine if the issue is with the motherboard
> > (where the SATA controller is), or with the drives themselves?
> >
> >> NeilBrown
> >>
> >>
> >>
> >>> (c) are these errors expected in a RAID array that is heavily used?
> >>> (d) what kind of errors should I see regarding "read errors" that
> >>> *would* indicate an imminent hardware failure?
> >>>
> >>> Thoughts and ideas would be welcomed. I'm sure a thread where some
> >>> hefty discussion is thrown at this topic will help future Googlers
> >>> like me.
> >>> :)
> >>>
> >>> -james
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >>> the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>
> >
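
P.S. For anyone trying to judge whether the "read error corrected" messages
cluster on one drive or are spread across all of them, here is a minimal
sketch of a per-device tally. It assumes each syslog message sits on a single
line (not wrapped as in the quotes above) and uses the message format shown in
this thread. The names tally_corrected_reads and SAMPLE are made up for
illustration; this is not part of md or mdadm.

```python
import re
from collections import Counter

# Matches the md message format quoted in this thread, e.g.
#   md/raid:md4: read error corrected (8 sectors at 600923648 on sdb4)
PATTERN = re.compile(
    r"md/raid:(?P<array>\w+): read error corrected "
    r"\((?P<count>\d+) sectors at (?P<sector>\d+) on (?P<dev>\w+)\)"
)


def tally_corrected_reads(lines):
    """Count corrected-read messages per (array, device) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("array"), match.group("dev"))] += 1
    return counts


# Hypothetical input: messages from this thread, unwrapped to one per line.
SAMPLE = [
    "Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262528 on sda4)",
    "Dec 15 15:40:34 nuova kernel: md/raid:md4: read error corrected (8 sectors at 974262536 on sda4)",
    "Dec 29 01:55:03 nuova kernel: end_request: I/O error, dev sdb, sector 631415018",
    "Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923648 on sdb4)",
    "Dec 29 01:58:23 nuova kernel: md/raid:md4: read error corrected (8 sectors at 600923520 on sdc4)",
]

if __name__ == "__main__":
    for (array, dev), n in sorted(tally_corrected_reads(SAMPLE).items()):
        print(f"{array} {dev}: {n}")
```

To run it against a real log, pass an open file instead of SAMPLE, e.g.
tally_corrected_reads(open("/var/log/messages")) (the path is an assumption
and varies by distribution). As discussed above, a few messages are fairly
benign, many on one device points at that drive, and messages spread across
every device make the controller or driver the more likely suspect.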