Re: raid 5 mismatch_cnt errors


From: Doug Ledford <dledford@redhat.com>
To: Neil Brown <neilb@suse.de>
Cc: Trey Scarborough <treys@locallinux.com>,
	"linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Subject: Re: raid 5 mismatch_cnt errors
Date: Thu, 20 May 2010 22:16:07 -0400
Message-ID: <4BF5ECE7.7020907@redhat.com>
In-Reply-To: <20100521083819.54680dfb@notabene.brown>

On 05/20/2010 06:38 PM, Neil Brown wrote:
> On Thu, 20 May 2010 17:29:37 -0500
> Trey Scarborough <treys@locallinux.com> wrote:
> 
>> Neil Brown wrote:
>>> On Thu, 20 May 2010 12:02:23 -0500
>>> Trey Scarborough <treys@locallinux.com> wrote:
>>>
>>>   
>>>> I have a raid 5 array with 9 disks and I have a mismatch_cnt that keeps 
>>>> growing. This is causing file corruption on the underlying file systems 
>>>> as well.  I can copy a group of 100 100 MB files and then do an md5sum on 
>>>> them and 1-3 will be corrupt. If a bad drive is causing this, is there 
>>>> any way to get a report of how many of these mismatches occur on each 
>>>> drive? I have run SMART tests and do not see one drive that stands 
>>>> out as causing errors. Could something else be causing these errors?
>>>>     

While a bad drive is certainly a possibility here, this is precisely the
type of failure scenario that would make me suspect bad RAM,
motherboard, or CPU.  So I wouldn't rule those out as possibilities either.

>>>
>>> When RAID5 detects an inconsistency there is no way to know which device was
>>> wrong.
>>> SMART only detects some errors, not all.
>>> I have had hard drives before which appear to have a single-bit error in
>>> their internal buffer.  No error would be reported, but data you read would
>>> sometimes be wrong.
>>> RAID5 cannot help you with this sort of error.
>>>
>>> I would suggest backing up all your data (if it isn't already too late),
>>> breaking the array, and testing each device individually.
>>> e.g. create a filesystem on the device and try copying data on and reading it
>>> off.
>>>
>>> NeilBrown
>>>   
>> That's what I was afraid of. The problem I have, if I back it up, is 
>> knowing what data is bad. Luckily it appears to be a write error, because 
>> once a file is written correctly I can do sums on all the files and I do 
>> not see any more errors. I was thinking that there might be a way of doing 
>> a resync and turning up the debug level somehow so that it would log each 
>> mismatch along with the drives it was reading from at the time. I could 
>> then take that information and, considering there are 9 drives in the 
>> array, the one that shows up most often should be the culprit. I could 
>> then remove that drive from the array and test it, leaving the rest in a 
>> state that could be rebuilt, with the data consistent because the 
>> drive with the bad writes would be removed. Is this something that 
>> might be possible?
> 
> To detect a mismatch, raid5 reads from all drives in parallel, calculates the
> parity across the data blocks and compares that to the parity block.
> So no: something like that is not possible.
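
To make the reasoning above concrete, here is a minimal Python sketch of a RAID5-style parity check. It is an illustration only, not the md driver's code: the block size, the 8-data-plus-1-parity layout, and every name in it are assumptions chosen for the example. It shows that a flipped bit in a data block and the same flipped bit in the parity block produce the same XOR result across the stripe, so the check can say that a stripe is wrong but not which device is wrong.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """True if the parity computed from the data matches the stored parity."""
    return xor_blocks(data_blocks) == parity_block


# Hypothetical stripe on a 9-device array: 8 data blocks plus 1 parity block.
data = [bytes([i] * 8) for i in range(1, 9)]
parity = xor_blocks(data)

# Case A: one bit flips in a data block (device 3 in this example).
bad_data = list(data)
bad_data[3] = bytes([data[3][0] ^ 0x01]) + data[3][1:]

# Case B: the same bit flips in the parity block instead.
bad_parity = bytes([parity[0] ^ 0x01]) + parity[1:]

# XOR of every block in the stripe; all zeros means the stripe is consistent.
syndrome_a = xor_blocks(bad_data + [parity])
syndrome_b = xor_blocks(data + [bad_parity])

print(stripe_is_consistent(bad_data, parity))  # mismatch detected
print(stripe_is_consistent(data, bad_parity))  # mismatch detected
print(syndrome_a == syndrome_b)                # the two cases look identical
```

With a single parity block, the only information available is that XOR result, which locates the bad byte within the stripe but carries nothing that points at a particular device.
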
> 
> The only thing I can suggest:
> 
> - add a write-intent bitmap so you can remove/re-add devices fairly cheaply
> - create a very large file.
> - write random data to the file without truncating it (use dd of=file
>   conv=notrunc), then read it back and see if it matches.  If it does, then
>   this approach doesn't help.  If it doesn't:
> 
>   One by one, fail/remove a drive from the array.  Write new random data to
>   the same file, read it back and compare.  Then --re-add the missing device.
>   I'm hoping that you will get an error every time except when the 'bad'
>   device has been removed.
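
The write-and-verify step of the procedure above could be scripted along these lines. This is a hedged sketch, not a tested diagnostic tool: the function names, chunk size, and use of SHA-256 are assumptions made for the example. Opening the file without O_TRUNC plays the role of dd's conv=notrunc, so the same blocks are overwritten in place on each pass. The posix_fadvise call is only a hint to the kernel to drop cached pages; for the read to be sure of coming from the disks rather than from the page cache, the caches would have to be dropped separately or direct I/O used.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # write and read in 1 MiB pieces (arbitrary choice)


def write_random(path, total_bytes):
    """Overwrite the start of `path` in place with random data.

    Returns the SHA-256 hex digest of what was written.
    """
    digest = hashlib.sha256()
    # O_CREAT without O_TRUNC: existing blocks are reused, not freed.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        remaining = total_bytes
        while remaining > 0:
            block = os.urandom(min(CHUNK, remaining))
            view = memoryview(block)
            while view:  # handle short writes
                view = view[os.write(fd, view):]
            digest.update(block)
            remaining -= len(block)
        os.fsync(fd)  # push the data down to the array
        if hasattr(os, "posix_fadvise"):
            # Hint only: ask the kernel to discard cached pages for this file.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return digest.hexdigest()


def read_digest(path, total_bytes):
    """Read back the first `total_bytes` of `path` and return their SHA-256."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        remaining = total_bytes
        while remaining > 0:
            block = f.read(min(CHUNK, remaining))
            if not block:
                break
            digest.update(block)
            remaining -= len(block)
    return digest.hexdigest()


def roundtrip_ok(path, total_bytes):
    """Write random data, read it back, and report whether the digests match."""
    return write_random(path, total_bytes) == read_digest(path, total_bytes)
```

A call such as roundtrip_ok("/mnt/array/testfile", 10 * 1024**3) would be repeated once per device, with that device failed and removed first (for example mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1) and re-added afterwards (mdadm /dev/md0 --re-add /dev/sdb1). The mount point, array name, and device name here are hypothetical placeholders.
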
> 
> NeilBrown


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband



