RAID6 check found different events, how should I proceed?

linux-raid.vger.kernel.org archive mirror

* RAID6 check found different events, how should I proceed?
@ 2011-08-06 13:23 Mathias Burén
  2011-08-06 16:02 ` Mathias Burén
  2011-08-06 17:54 ` Alexander Kühn
  0 siblings, 2 replies; 5+ messages in thread
From: Mathias Burén @ 2011-08-06 13:23 UTC (permalink / raw)
  To: Linux-RAID

First, thanks for this:

> The primary purpose of data scrubbing a RAID is to detect & correct
> read errors on any of the member devices; both check and repair
> perform this function.  Finding (and w/ repair correcting) mismatches
> is only a secondary purpose - it is only if there are no read errors
> but the data copy or parity blocks are found to be inconsistent that a
> mismatch is reported.  In order to repair a mismatch, MD needs to
> restore consistency, by over writing the inconsistent data copy or
> parity blocks w/ the correct data.  But, because the underlying member
> devices did not return any errors, MD has no way of knowing which
> blocks are correct, and which are incorrect; when it is told to do a
> repair, it makes the assumption that the first copy in a RAID1 or
> RAID10, or the data (non-parity) blocks in RAID4/5/6 are correct, and
> corrects the mismatch based on that assumption.
>
> That assumption may or may not be correct, but MD has no way of
> determining that reliably - but the user might be able to, by using
> additional knowledge or tools, so MD gives the user the option to
> perform data scrubbing either with (repair) or without (check) MD
> correcting the mismatches using that assumption.
>
>
> I hope that answers your question,
> Beolach

My RAID6 is currently degraded with one HDD (panic mail on the list),
and my weekly cron job kicked in doing the RAID6 check action. This is
the result:

DEV   EVENTS   REALL  PEND  UNCORR  CRC  RAW  ZONE  END
sdb1  6239487  0      0     0       2    0    0
sdc1  6239487  0      0     0       0    0    0
sdd1  6239487  0      0     0       0    0    0
sde1  6239487  0      0     0       0    0    0
sdf1  6239490  0      0     0       0    49   6
sdg1  6239491  0      0     0       0    0    0
sdh1  (missing, on RMA trip)


(so the SMART is actually fine for all drives)

Personalities : [raid6] [raid5] [raid4]
md0 : active raid6 sdf1[5] sdg1[0] sdd1[4] sde1[7] sdc1[3] sdb1[1]
      9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2
[7/6] [UUUUU_U]

unused devices: <none>
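
The `[7/6] [UUUUU_U]` field above can also be read mechanically. A minimal Python sketch, assuming the status field always has the `[n/m] [U_...]` shape shown here; the function name `parse_mdstat_status` is made up for illustration and is not part of any md tool:

```python
import re

def parse_mdstat_status(text):
    """Return (raid_devices, active_devices, missing_slots) from mdstat-style text."""
    # Matches e.g. "[7/6] [UUUUU_U]": expected/active device counts,
    # then one character per slot ('U' = up, '_' = missing).
    m = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", text)
    if m is None:
        raise ValueError("no [n/m] [U_...] status field found")
    total, active = int(m.group(1)), int(m.group(2))
    missing = [i for i, c in enumerate(m.group(3)) if c == "_"]
    return total, active, missing

# Sample text copied from the /proc/mdstat output in this message.
sample = (
    "9751756800 blocks super 1.2 level 6, 64k chunk, algorithm 2\n"
    "[7/6] [UUUUU_U]"
)
total, active, missing = parse_mdstat_status(sample)
print(total, active, missing)  # 7 6 [5]
```

Slot 5 is the one reported missing, which agrees with the `removed` entry at RaidDevice 5 in the `mdadm` listing in this message.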


/dev/md0:
        Version : 1.2
  Creation Time : Tue Oct 19 08:58:41 2010
     Raid Level : raid6
     Array Size : 9751756800 (9300.00 GiB 9985.80 GB)
  Used Dev Size : 1950351360 (1860.00 GiB 1997.16 GB)
   Raid Devices : 7
  Total Devices : 6
    Persistence : Superblock is persistent

    Update Time : Sat Aug  6 14:13:08 2011
          State : clean, degraded
 Active Devices : 6
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : ion:0  (local to host ion)
           UUID : e6595c64:b3ae90b3:f01133ac:3f402d20
         Events : 6239491

    Number   Major   Minor   RaidDevice State
       0       8       97        0      active sync   /dev/sdg1
       1       8       17        1      active sync   /dev/sdb1
       4       8       49        2      active sync   /dev/sdd1
       3       8       33        3      active sync   /dev/sdc1
       5       8       81        4      active sync   /dev/sdf1
       5       0        0        5      removed
       7       8       65        6      active sync   /dev/sde1

So sdf1 and sdg1 have a different event count. Does this mean the HDDs
have silently corrupted the data? I have no way of checking if the
data itself is corrupt or not, except for perhaps a fsck of the
filesystem? Does that make sense?

* Should I run a repair?
* Should I run a check again, to see if the event count changes?
* Is it likely I have 2 more bad hard drives that will die soon?
* Is it wise to run another smartctl -t long on all devices?

Thanks,
Mathias
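
For reference, the check/repair distinction quoted at the top of this message maps onto the md sysfs interface: writing `check` to `sync_action` scrubs without rewriting mismatches, and `mismatch_cnt` reports what the last pass found. A minimal Python sketch, assuming the array is `md0` (so the directory is `/sys/block/md0/md`) and that the script runs as root; the helper names are hypothetical:

```python
from pathlib import Path

def start_check(md_sysfs_dir):
    """Request a scrub that only counts mismatches ('check'), never rewrites them."""
    Path(md_sysfs_dir, "sync_action").write_text("check\n")

def read_sync_action(md_sysfs_dir):
    """Return the current action, e.g. 'idle' or 'check'."""
    return Path(md_sysfs_dir, "sync_action").read_text().strip()

def read_mismatch_cnt(md_sysfs_dir):
    """Return the mismatch count recorded by the most recent check or repair pass."""
    return int(Path(md_sysfs_dir, "mismatch_cnt").read_text().strip())

# Usage on a real system (assumption: array is md0, run as root):
#   start_check("/sys/block/md0/md")
#   ... wait until read_sync_action("/sys/block/md0/md") == "idle" ...
#   print(read_mismatch_cnt("/sys/block/md0/md"))
```

The directory is passed in as a parameter so that the helpers can be exercised against a scratch directory before pointing them at a live array.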


* Re: RAID6 check found different events, how should I proceed?
  2011-08-06 13:23 RAID6 check found different events, how should I proceed? Mathias Burén
@ 2011-08-06 16:02 ` Mathias Burén
       [not found]   ` <CALvtuFTKBRtco1VFw9xv6x3qsLx+rdJg2wL8E9+1g3LQf=Xkuw@mail.gmail.com>
  2011-08-08 22:57   ` NeilBrown
  2011-08-06 17:54 ` Alexander Kühn
  1 sibling, 2 replies; 5+ messages in thread
From: Mathias Burén @ 2011-08-06 16:02 UTC (permalink / raw)
  To: Linux-RAID

On 6 August 2011 14:23, Mathias Burén <mathias.buren@gmail.com> wrote:
> My RAID6 is currently degraded with one HDD (panic mail on the list),
> and my weekly cron job kicked in doing the RAID6 check action. This is
> the result:
>
> DEV   EVENTS   REALL  PEND  UNCORR  CRC  RAW  ZONE  END
> sdb1  6239487  0      0     0       2    0    0
> sdc1  6239487  0      0     0       0    0    0
> sdd1  6239487  0      0     0       0    0    0
> sde1  6239487  0      0     0       0    0    0
> sdf1  6239490  0      0     0       0    49   6
> sdg1  6239491  0      0     0       0    0    0
> sdh1  (missing, on RMA trip)
>
(snip)
> * Should I run a repair?
> * Should I run a check again, to see if the event count changes?
> * Is it likely I have 2 more bad hard drives that will die soon?
> * Is it wise to run another smartctl -t long on all devices?
>
> Thanks,
> Mathias
>

A follow-up:

I ran smartctl -t long on all devices, and they all passed, SMART is
fine. The number of events is also the same for all HDDs now:

DEV   EVENTS   REALL  PEND  UNCORR  CRC  RAW  ZONE  END
sdb1  6244415  0      0     0       2    0    0
sdc1  6244415  0      0     0       0    0    0
sdd1  6244415  0      0     0       0    0    0
sde1  6244415  0      0     0       0    0    0
sdf1  6244415  0      0     0       0    49   6
sdg1  6244415  0      0     0       0    0    0
sdh1

This is without me running repair or anything like that.

Mathias


[parent not found: <CALvtuFTKBRtco1VFw9xv6x3qsLx+rdJg2wL8E9+1g3LQf=Xkuw@mail.gmail.com>]

* Re: RAID6 check found different events, how should I proceed?
       [not found]   ` <CALvtuFTKBRtco1VFw9xv6x3qsLx+rdJg2wL8E9+1g3LQf=Xkuw@mail.gmail.com>
@ 2011-08-06 17:09     ` Cal Leeming [Simplicity Media Ltd]
  0 siblings, 0 replies; 5+ messages in thread
From: Cal Leeming [Simplicity Media Ltd] @ 2011-08-06 17:09 UTC (permalink / raw)
  To: Mathias Burén; +Cc: Linux-RAID

Can't offer any advice on this issue, but would be very interested to
hear the debrief once the situation is resolved.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: RAID6 check found different events, how should I proceed?
  2011-08-06 16:02 ` Mathias Burén
       [not found]   ` <CALvtuFTKBRtco1VFw9xv6x3qsLx+rdJg2wL8E9+1g3LQf=Xkuw@mail.gmail.com>
@ 2011-08-08 22:57   ` NeilBrown
  1 sibling, 0 replies; 5+ messages in thread
From: NeilBrown @ 2011-08-08 22:57 UTC (permalink / raw)
  To: Mathias Burén; +Cc: Linux-RAID

On Sat, 6 Aug 2011 17:02:48 +0100 Mathias Burén <mathias.buren@gmail.com>
wrote:

> On 6 August 2011 14:23, Mathias Burén <mathias.buren@gmail.com> wrote:
> > My RAID6 is currently degraded with one HDD (panic mail on the list),
> > and my weekly cron job kicked in doing the RAID6 check action. This is
> > the result:
> >
> > DEV   EVENTS   REALL  PEND  UNCORR  CRC  RAW  ZONE  END
> > sdb1  6239487  0      0     0       2    0    0
> > sdc1  6239487  0      0     0       0    0    0
> > sdd1  6239487  0      0     0       0    0    0
> > sde1  6239487  0      0     0       0    0    0
> > sdf1  6239490  0      0     0       0    49   6
> > sdg1  6239491  0      0     0       0    0    0
> > sdh1  (missing, on RMA trip)
> >
> (snip)
> > * Should I run a repair?
> > * Should I run a check again, to see if the event count changes?
> > * Is it likely I have 2 more bad hard drives that will die soon?
> > * Is it wise to run another smartctl -t long on all devices?
> >
> > Thanks,
> > Mathias
> >
> 
> A follow-up:
> 
> I ran smartctl -t long on all devices, and they all passed, SMART is
> fine. The number of events is also the same for all HDDs now:
> 
> DEV   EVENTS   REALL  PEND  UNCORR  CRC  RAW  ZONE  END
> sdb1  6244415  0      0     0       2    0    0
> sdc1  6244415  0      0     0       0    0    0
> sdd1  6244415  0      0     0       0    0    0
> sde1  6244415  0      0     0       0    0    0
> sdf1  6244415  0      0     0       0    49   6
> sdg1  6244415  0      0     0       0    0    0
> sdh1
> 
> This is without me running repair or anything like that.

What produced the change was simply that you let time pass.

Presumably there was a (maybe small) delay between extracting the
'events' number from sde1 and from sdf1, and again between sdf1 and sdg1.
During those delays the event count on all devices in the array was
updated.  This implies some thread was writing, though possibly not very
heavily.

When you sampled them all the second time and got the same number, there
were presumably no writes happening, so the event numbers didn't change.

When there are occasional writes the array oscillates between 'clean' and
'active', and each change updates the 'events' number.

NeilBrown
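
The sampling effect described above can be made concrete with the numbers posted in this thread. A minimal Python sketch, assuming each device's count is read from text containing an `Events : N` line like the one in the mdadm output earlier in the thread; `parse_events` and `events_spread` are hypothetical helper names:

```python
import re

def parse_events(mdadm_text):
    """Extract the integer from an 'Events : N' line in mdadm-style output."""
    m = re.search(r"^\s*Events\s*:\s*(\d+)", mdadm_text, re.MULTILINE)
    if m is None:
        raise ValueError("no Events field found")
    return int(m.group(1))

def events_spread(samples):
    """Given {device: events}, return the gap between highest and lowest count."""
    values = list(samples.values())
    return max(values) - min(values)

# First sample from the thread: devices were read one after another while
# the array was still seeing occasional writes.
first = {"sdb1": 6239487, "sdc1": 6239487, "sdd1": 6239487,
         "sde1": 6239487, "sdf1": 6239490, "sdg1": 6239491}
# Second sample from the thread: taken later, with the array apparently idle.
second = {dev: 6244415 for dev in first}

print(parse_events("         Events : 6239491"))  # 6239491
print(events_spread(first))   # 4
print(events_spread(second))  # 0
```

A small non-zero spread in counts that were read at slightly different moments is what one would expect on an array that is still being written to. It is the spread in a sample taken while the array is idle that is worth looking at.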



* Re: RAID6 check found different events, how should I proceed?
  2011-08-06 13:23 RAID6 check found different events, how should I proceed? Mathias Burén
  2011-08-06 16:02 ` Mathias Burén
@ 2011-08-06 17:54 ` Alexander Kühn
  1 sibling, 0 replies; 5+ messages in thread
From: Alexander Kühn @ 2011-08-06 17:54 UTC (permalink / raw)
  To: Mathias Burén; +Cc: Linux-RAID

I'd do _nothing_ until I got a replacement drive. Then plug that in  
and let it regain full redundancy.
After that you can start stressing the disks with the actions you  
suggested if you like.
Alex.



end of thread, last message 2011-08-08 22:57 UTC

Thread overview: 5+ messages
2011-08-06 13:23 RAID6 check found different events, how should I proceed? Mathias Burén
2011-08-06 16:02 ` Mathias Burén
     [not found]   ` <CALvtuFTKBRtco1VFw9xv6x3qsLx+rdJg2wL8E9+1g3LQf=Xkuw@mail.gmail.com>
2011-08-06 17:09     ` Cal Leeming [Simplicity Media Ltd]
2011-08-08 22:57   ` NeilBrown
2011-08-06 17:54 ` Alexander Kühn
