data scrubbing

* data scrubbing
@ 2011-07-29  8:50 Nikolay Kichukov
  2011-07-29 10:03 ` Mikael Abrahamsson
  2011-07-29 17:17 ` Thomas Harold
  0 siblings, 2 replies; 8+ messages in thread
From: Nikolay Kichukov @ 2011-07-29  8:50 UTC (permalink / raw)
  To: linux-raid

Hi all,

It was recently discussed on this list that data scrubbing is good practice for some RAID levels.
Can someone advise which RAID levels need that operation scheduled on a regular basis? Perhaps all RAID arrays that have the

/sys/block/md*/md/sync_action

[sync_action] property?

For example, is it good for a raid1 array?

Cheers,
-Nik

* Re: data scrubbing
  2011-07-29  8:50 data scrubbing Nikolay Kichukov
@ 2011-07-29 10:03 ` Mikael Abrahamsson
  2011-07-29 13:25   ` Nikolay Kichukov
  2011-07-29 17:17 ` Thomas Harold
  1 sibling, 1 reply; 8+ messages in thread
From: Mikael Abrahamsson @ 2011-07-29 10:03 UTC (permalink / raw)
  To: Nikolay Kichukov; +Cc: linux-raid

On Fri, 29 Jul 2011, Nikolay Kichukov wrote:

> For example is it good for raid1 array?

Yes, it's good for all RAID levels that have any kind of redundancy. You 
want to read the data on the drives regularly to make sure it can 
still be read; if it can't, it can be recomputed from the redundancy 
(the mirror copy or the parity) and written back.

Otherwise, rarely read data might have an error on one drive; then 
another drive fails, and when you try to rebuild, that data is suddenly 
not available anywhere (RAID1 and RAID5), and you had no idea about it.

Scrubbing is good; do it regularly (at least monthly).
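
On the sysfs part of the question, here is a minimal sketch. It assumes,
as the question suggests, that an array exposing md/sync_action is one
that can be scrubbed. The function name list_scrubbable and its optional
root argument are made up for illustration; the argument exists only so
the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Sketch only: print the names of md arrays that expose md/sync_action.
# list_scrubbable and its optional root argument are hypothetical names.
list_scrubbable() {
    root="${1:-/sys/block}"
    for f in "$root"/md*/md/sync_action; do
        # An unmatched glob stays literal, so skip paths that do not exist.
        [ -e "$f" ] || continue
        dir="${f%/md/sync_action}"    # strip the trailing /md/sync_action
        echo "${dir##*/}"             # keep only the array name, e.g. md0
    done
}
```

Calling list_scrubbable with no argument looks under /sys/block; the
output could feed whatever scheduler starts the monthly scrub.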

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

* Re: data scrubbing
  2011-07-29 10:03 ` Mikael Abrahamsson
@ 2011-07-29 13:25   ` Nikolay Kichukov
  2011-07-29 20:48     ` Beolach
  0 siblings, 1 reply; 8+ messages in thread
From: Nikolay Kichukov @ 2011-07-29 13:25 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: linux-raid

Hi,

This is good to know!

Just performed a check on a raid1 and got:

Jul 29 15:37:36 hanna64 mdadm[2277]: RebuildFinished event detected on md device /dev/md1, component device  mismatches found: 128

So I presume those mismatches have now been rewritten to both disks successfully. Am I wrong there?

cat /sys/block/md1/md/mismatch_cnt
128


Cheers,
-Nik

* Re: data scrubbing
  2011-07-29 13:25   ` Nikolay Kichukov
@ 2011-07-29 20:48     ` Beolach
  2011-07-29 21:51       ` Mathias Burén
  0 siblings, 1 reply; 8+ messages in thread
From: Beolach @ 2011-07-29 20:48 UTC (permalink / raw)
  To: Nikolay Kichukov; +Cc: Mdadm

On Fri, Jul 29, 2011 at 07:25, Nikolay Kichukov <hijacker@oldum.net> wrote:
> Just performed a check on a raid1 and got:
>
> Jul 29 15:37:36 hanna64 mdadm[2277]: RebuildFinished event detected on md device /dev/md1, component device  mismatches found: 128
>
> So I presume those mismatches have now been rewritten to both disks successfully. Am I wrong there?
>
> cat /sys/block/md1/md/mismatch_cnt
> 128
>
>

That depends on whether you did a "check" or a "repair"; see the SCRUBBING
AND MISMATCHES section of the md(4) man page:
"If check was used, then no action is taken to handle the mismatch,
it is simply recorded.  If repair was used, then a mismatch will
be repaired in the same way that resync repairs arrays."
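
The same section of md(4) also notes that mismatch_cnt counts sectors,
and that md adds the size of the whole I/O unit it was checking rather
than the exact number of bad sectors. So a value of 128 may well mean a
single 64 KiB unit. A sketch of that arithmetic, assuming 512-byte
sectors; mismatch_kib is a made-up helper name:

```shell
# mismatch_kib is a hypothetical helper name.
# $1: the value read from mismatch_cnt, taken to be 512-byte sectors.
mismatch_kib() {
    echo $(( $1 * 512 / 1024 ))
}

mismatch_kib 128    # prints 64
```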


Good luck,
Beolach
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: data scrubbing
  2011-07-29 20:48     ` Beolach
@ 2011-07-29 21:51       ` Mathias Burén
  2011-07-29 22:16         ` David Brown
  2011-07-29 22:37         ` Beolach
  0 siblings, 2 replies; 8+ messages in thread
From: Mathias Burén @ 2011-07-29 21:51 UTC (permalink / raw)
  To: Beolach; +Cc: Nikolay Kichukov, Mdadm

On 29 July 2011 21:48, Beolach <beolach@gmail.com> wrote:
> That depends on if you did a "check" or a "repair" - see the SCRUBBING
> AND MISMATCHES section of the md(4) man page:
> "If check was used, then no action is taken to handle the mismatch,
> it is simply recorded.  If repair was used, then a mismatch will
> be repaired in the same way that resync repairs arrays."

Sorry to chime in like this. After reading the above, is there a
reason why anyone shouldn't _always_ use repair instead of check on a
weekly RAID6 check? You have to run repair anyway after a check if any
issues are found, right?

Or does the system become vulnerable during a repair? (less redundant)

Thanks,
Mathias

* Re: data scrubbing
  2011-07-29 21:51       ` Mathias Burén
@ 2011-07-29 22:16         ` David Brown
  2011-07-29 22:37         ` Beolach
  1 sibling, 0 replies; 8+ messages in thread
From: David Brown @ 2011-07-29 22:16 UTC (permalink / raw)
  To: linux-raid

On 29/07/11 23:51, Mathias Burén wrote:
> Sorry to chime in like this. After reading the above, is there a
> reason why anyone shouldn't _always_ use repair instead of check on a
> weekly RAID6 check? You have to run repair anyway after a check if any
> issues are found, right?
>
> Or does the system become vulnerable during a repair? (less redundant)
>
> Thanks,
> Mathias

If you do a repair, then when a mismatch is found one of the disks is 
taken as the "bad" one, and re-created.  For raid1, the first copy is 
assumed correct.  For raid5/6, the data blocks are assumed correct and 
the parity is re-created.  As Neil Brown explained on his blog, without 
any more information this is as good as md raid can do.

However, it is not necessarily as good as /you/ can do.  For example, 
you might be able to determine which files use the blocks in the 
mismatched stripe, and figure out which block was bad.  Or for a 3-disk 
raid1 you could pick the bad block as the odd one out (assuming the 
other two match).  For raid6, it's possible to spot a single-disk 
mismatch and correct that one disk: for each disk in turn, assume it is 
missing and re-create it from the other disks using normal raid6 
recovery; if the stripe is then consistent, you've fixed the mismatch.

None of these approaches is guaranteed to be the correct one, however.  
Thus "repair" just does the simplest and fastest correction of the 
mismatch, and "check" does not change the stripe, in case you want to 
manually pick a different method.

<http://neil.brown.name/blog/20100211050355>
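
The 3-disk raid1 "odd one out" idea can be sketched as below. This
assumes the three copies of the suspect block have already been dumped to
three files; getting them off the member devices is left out because the
offsets depend on the array's metadata layout. odd_one_out is a made-up
helper name, not an mdadm feature.

```shell
# odd_one_out is a hypothetical helper: given three files holding the three
# copies of one block, print the path of the copy that disagrees with the
# other two. Prints nothing if all three agree; returns 1 if no two agree.
odd_one_out() {
    if cmp -s "$1" "$2"; then
        cmp -s "$1" "$3" || echo "$3"
    elif cmp -s "$1" "$3"; then
        echo "$2"
    elif cmp -s "$2" "$3"; then
        echo "$1"
    else
        return 1
    fi
}
```

A majority vote like this only says which copy differs, not that the
majority is right, which is the same caveat as for the other approaches.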


* Re: data scrubbing
  2011-07-29 21:51       ` Mathias Burén
  2011-07-29 22:16         ` David Brown
@ 2011-07-29 22:37         ` Beolach
  1 sibling, 0 replies; 8+ messages in thread
From: Beolach @ 2011-07-29 22:37 UTC (permalink / raw)
  To: Mathias Burén; +Cc: Mdadm

On Fri, Jul 29, 2011 at 15:51, Mathias Burén <mathias.buren@gmail.com> wrote:
> Sorry to chime in like this. After reading the above, is there a
> reason why anyone shouldn't _always_ use repair instead of check on a
> weekly RAID6 check? You have to run repair anyway after a check if any
> issues are found, right?
>
> Or does the system become vulnerable during a repair? (less redundant)
>
> Thanks,
> Mathias
>

The primary purpose of scrubbing a RAID is to detect and correct
read errors on any of the member devices; both check and repair
do this.  Finding (and, with repair, correcting) mismatches is only a
secondary purpose: a mismatch is reported only when there are no read
errors but the data copies or parity blocks are found to be
inconsistent.  To repair a mismatch, MD has to restore consistency by
overwriting the inconsistent copy or parity blocks.  But because the
underlying member devices did not return any errors, MD has no way of
knowing which blocks are correct and which are not.  When told to
repair, it assumes that the first copy in a RAID1 or RAID10, or the
data (non-parity) blocks in RAID4/5/6, are correct, and fixes the
mismatch on that basis.

That assumption may or may not be correct, and MD has no way of
determining that reliably.  The user might be able to, using
additional knowledge or tools, so MD gives the user the option to
scrub either with (repair) or without (check) MD correcting the
mismatches based on that assumption.
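
In practice the choice is just which word gets written to sync_action.
A minimal sketch of a wrapper that accepts only those two actions;
start_scrub and its optional third argument are hypothetical, the
argument being there only so it can be pointed at a scratch directory
instead of /sys/block:

```shell
# start_scrub is a hypothetical wrapper around writing to sync_action.
# Usage: start_scrub check|repair mdX [sysfs-root]
start_scrub() {
    action="$1"; array="$2"; root="${3:-/sys/block}"
    case "$action" in
        check|repair) ;;
        *) echo "usage: start_scrub check|repair mdX" >&2; return 2 ;;
    esac
    echo "$action" > "$root/$array/md/sync_action"
}

# A read-only pass first; "repair" only once MD's assumption has been
# judged acceptable for this array:
#   start_scrub check md0
#   start_scrub repair md0
```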

I hope that answers your question,
Beolach

* Re: data scrubbing
  2011-07-29  8:50 data scrubbing Nikolay Kichukov
  2011-07-29 10:03 ` Mikael Abrahamsson
@ 2011-07-29 17:17 ` Thomas Harold
  1 sibling, 0 replies; 8+ messages in thread
From: Thomas Harold @ 2011-07-29 17:17 UTC (permalink / raw)
  To: Nikolay Kichukov; +Cc: linux-raid

On 7/29/2011 4:50 AM, Nikolay Kichukov wrote:
> For example is it good for raid1 array?
>

Yes, we run a script every week (different arrays on different nights)
that looks like:

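#!/bin/sh

# Start a scrub; "check" records mismatches without rewriting them.
echo check > /sys/block/md0/md/sync_action
# Block until the scrub has finished.
mdadm --wait /dev/md0
# Print the number of mismatched sectors found.
cat /sys/block/md0/md/mismatch_cnt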
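
A possible variation on the script above, as a sketch rather than what
we actually run: return non-zero when mismatch_cnt is not 0, so a cron
job can flag the run. report_mismatches and its optional second argument
are made-up names; the argument only lets the function be tried against
a scratch directory instead of /sys/block.

```shell
# report_mismatches is a hypothetical helper.
# Usage: report_mismatches mdX [sysfs-root]
# Prints the count; returns 0 if it is zero, 1 if not, 2 if unreadable.
report_mismatches() {
    array="$1"; root="${2:-/sys/block}"
    count=$(cat "$root/$array/md/mismatch_cnt") || return 2
    echo "$array: mismatch_cnt=$count"
    [ "$count" -eq 0 ]
}

# Intended use once the scrub has finished, e.g.:
#   echo check > /sys/block/md0/md/sync_action
#   mdadm --wait /dev/md0
#   report_mismatches md0 || exit 1
```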

