raid 5 mismatch_cnt errors

linux-raid.vger.kernel.org archive mirror

* raid 5 mismatch_cnt errors
@ 2010-05-20 16:58 Trey Scarborough
  0 siblings, 0 replies; 12+ messages in thread
From: Trey Scarborough @ 2010-05-20 16:58 UTC (permalink / raw)
  To: linux-raid

I have a RAID 5 array with 9 disks, and its mismatch_cnt keeps 
growing. This is also causing file corruption on the underlying file 
systems.  I can copy a group of 100 files of 100 MB each, then run md5sum 
on them, and 1-3 will be corrupt. If a bad drive is to blame, is there 
any way to get a report of how many of these mismatches occur per 
drive? I have run smartmontools tests and do not see any one drive that 
stands out as causing errors. Could something else be causing these 
errors?
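
For readers following along, the counter discussed here is exposed through sysfs. The sketch below is only an illustration: the array name "md0" and the helper names are assumptions, while the `mismatch_cnt` and `sync_action` files are the standard md sysfs attributes. The count is per array, not per drive, which is why the question above has no direct answer.

```python
from pathlib import Path


def read_mismatch_cnt(md_name="md0", sysfs_root="/sys/block"):
    """Return the current mismatch_cnt of an md array (array-wide, not per drive).

    md_name ("md0") is a placeholder; substitute the real array name.
    """
    path = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())


def request_check(md_name="md0", sysfs_root="/sys/block"):
    """Ask md to start a read-only consistency check (needs root on a real system)."""
    path = Path(sysfs_root) / md_name / "md" / "sync_action"
    path.write_text("check\n")
```

On a real system one would call `request_check("md0")`, wait for the check to finish, and then call `read_mismatch_cnt("md0")`.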


* raid 5 mismatch_cnt errors
@ 2010-05-20 17:02 Trey Scarborough
  2010-05-20 21:16 ` Neil Brown
  0 siblings, 1 reply; 12+ messages in thread
From: Trey Scarborough @ 2010-05-20 17:02 UTC (permalink / raw)
  To: linux-raid

I have a RAID 5 array with 9 disks, and its mismatch_cnt keeps 
growing. This is also causing file corruption on the underlying file 
systems.  I can copy a group of 100 files of 100 MB each, then run md5sum 
on them, and 1-3 will be corrupt. If a bad drive is to blame, is there 
any way to get a report of how many of these mismatches occur per 
drive? I have run smartmontools tests and do not see any one drive that 
stands out as causing errors. Could something else be causing these 
errors?


* Re: raid 5 mismatch_cnt errors
  2010-05-20 17:02 Trey Scarborough
@ 2010-05-20 21:16 ` Neil Brown
  2010-05-20 22:29   ` Trey Scarborough
  0 siblings, 1 reply; 12+ messages in thread
From: Neil Brown @ 2010-05-20 21:16 UTC (permalink / raw)
  To: Trey Scarborough; +Cc: linux-raid

On Thu, 20 May 2010 12:02:23 -0500
Trey Scarborough <treys@locallinux.com> wrote:

> I have a RAID 5 array with 9 disks, and its mismatch_cnt keeps 
> growing. This is also causing file corruption on the underlying file 
> systems.  I can copy a group of 100 files of 100 MB each, then run md5sum 
> on them, and 1-3 will be corrupt. If a bad drive is to blame, is there 
> any way to get a report of how many of these mismatches occur per 
> drive? I have run smartmontools tests and do not see any one drive that 
> stands out as causing errors. Could something else be causing these 
> errors?

When RAID5 detects an inconsistency, there is no way to know which device was
wrong.
SMART only detects some errors, not all.
I have had hard drives before which appeared to have a single-bit error in
their internal buffer.  No error would be reported, but the data you read would
sometimes be wrong.
RAID5 cannot help you with this sort of error.

I would suggest backing up all your data (if it isn't already too late),
breaking the array, and testing each device individually,
e.g. create a filesystem on the device and try copying data on and reading it
off.
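
The copy-and-read-back test suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names and directory arguments are hypothetical, and SHA-256 stands in for the md5sum used earlier in the thread (any digest would do). Note that a read straight after a write may be served from the page cache rather than the disk, so on a real system the test set should be larger than RAM or the caches dropped between the copy and the verify.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_and_verify(src_dir, dst_dir):
    """Copy each regular file from src_dir to dst_dir; return names whose digests differ.

    dst_dir is assumed to live on the filesystem created on the device under test.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    corrupt = []
    for src in sorted(src_dir.iterdir()):
        if not src.is_file():
            continue
        dst = dst_dir / src.name
        shutil.copyfile(src, dst)
        if file_digest(src) != file_digest(dst):
            corrupt.append(src.name)
    return corrupt
```

An empty list from `copy_and_verify` means every copy matched its source on that run; a device that corrupts only occasionally may need many runs before it shows up.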

NeilBrown


* Re: raid 5 mismatch_cnt errors
  2010-05-20 21:16 ` Neil Brown
@ 2010-05-20 22:29   ` Trey Scarborough
  2010-05-20 22:38     ` Neil Brown
  0 siblings, 1 reply; 12+ messages in thread
From: Trey Scarborough @ 2010-05-20 22:29 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid@vger.kernel.org

Neil Brown wrote:
> When RAID5 detects an inconsistency, there is no way to know which device was
> wrong.
> SMART only detects some errors, not all.
> I have had hard drives before which appeared to have a single-bit error in
> their internal buffer.  No error would be reported, but the data you read would
> sometimes be wrong.
> RAID5 cannot help you with this sort of error.
>
> I would suggest backing up all your data (if it isn't already too late),
> breaking the array, and testing each device individually,
> e.g. create a filesystem on the device and try copying data on and reading it
> off.
>
> NeilBrown
>   
That's what I was afraid of. The problem I have, if I back it up, is 
knowing which data is bad. Luckily it appears to be a write error, because 
once a file is written correctly I can run checksums on all the files and 
I do not see any more errors. I was thinking that there might be a way of 
doing a resync and turning up the debugging somehow so that it would log 
each mismatch along with the drives it was reading from at the time. I 
could then take that information and, considering there are 9 drives in 
the array, the one with the most mismatches should be the culprit. I could 
then remove that drive from the array and test it, leaving the rest in a 
state that could be rebuilt, with consistent data, because the drive with 
the bad writes would be removed. Is this something that might be possible?

Thanks,
Trey



* Re: raid 5 mismatch_cnt errors
  2010-05-20 22:29   ` Trey Scarborough
@ 2010-05-20 22:38     ` Neil Brown
  2010-05-21  2:16       ` Doug Ledford
  0 siblings, 1 reply; 12+ messages in thread
From: Neil Brown @ 2010-05-20 22:38 UTC (permalink / raw)
  To: Trey Scarborough; +Cc: linux-raid@vger.kernel.org

On Thu, 20 May 2010 17:29:37 -0500
Trey Scarborough <treys@locallinux.com> wrote:

> That's what I was afraid of. The problem I have, if I back it up, is 
> knowing which data is bad. Luckily it appears to be a write error, because 
> once a file is written correctly I can run checksums on all the files and 
> I do not see any more errors. I was thinking that there might be a way of 
> doing a resync and turning up the debugging somehow so that it would log 
> each mismatch along with the drives it was reading from at the time. I 
> could then take that information and, considering there are 9 drives in 
> the array, the one with the most mismatches should be the culprit. I could 
> then remove that drive from the array and test it, leaving the rest in a 
> state that could be rebuilt, with consistent data, because the drive with 
> the bad writes would be removed. Is this something that might be possible?

To detect a mismatch, raid5 reads from all drives in parallel, calculates the
parity across the data blocks and compares that to the parity block.
So no: something like that is not possible.
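
The reason per-drive logging cannot work can be shown with a toy model of the check described above. This is a simplified sketch, not the md implementation: the tiny 4-byte blocks and the helper names are assumptions made for illustration. It shows that a flipped bit in a data block and the same flipped bit in the parity block produce exactly the same evidence, so the check alone cannot say which device was wrong.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(data_blocks, parity_block):
    """RAID5-style check: parity recomputed from the data must equal the stored parity."""
    return xor_blocks(data_blocks) == parity_block


# A 9-disk RAID5 stripe holds 8 data blocks and 1 parity block.
data = [bytes([i] * 4) for i in range(1, 9)]
parity = xor_blocks(data)

# Case 1: one bit flipped in data block 3.
bad_data = list(data)
bad_data[3] = bytes([bad_data[3][0] ^ 0x01]) + bad_data[3][1:]

# Case 2: the same bit flipped in the parity block instead.
bad_parity = bytes([parity[0] ^ 0x01]) + parity[1:]

# XOR of all nine blocks: all zeros when consistent, non-zero on a mismatch.
syndrome_data_fault = xor_blocks(bad_data + [parity])
syndrome_parity_fault = xor_blocks(data + [bad_parity])
```

Both faulty stripes fail `stripe_is_consistent`, and `syndrome_data_fault` equals `syndrome_parity_fault`: the mismatch says that some block is wrong, not which one.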

The only thing I can suggest:

- add a write-intent bitmap so you can remove/re-add devices fairly cheaply
- create a very large file.
- write random data to the file without truncating it (use dd of=file
  conv=notrunc), then read it back and see if it matches.   If it does, then
  this approach doesn't help.  If it doesn't:

  One by one, fail/remove a drive from the array.  Write new random data to the
  same file, read it back and compare.  Then --re-add the missing device.
  I'm hoping that you will get an error every time except when the 'bad'
  device has been removed.
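
The write-and-compare step of the procedure above might look roughly like this. It is a sketch under stated assumptions, not a tested diagnostic: the function name is hypothetical, the file is assumed to exist already at its full size, and Python's "r+b" mode plays the role of dd's conv=notrunc by overwriting in place. The posix_fadvise call only hints that cached pages may be dropped; a file much larger than RAM, as suggested above, is the more reliable way to force reads from disk.

```python
import hashlib
import os


def rewrite_and_verify(path, chunk_size=1 << 20):
    """Overwrite an existing file in place with random data, then read it back.

    Returns True if the data read back matches what was written.
    """
    size = os.path.getsize(path)
    written = hashlib.sha256()
    with open(path, "r+b") as f:  # "r+b" keeps the file's length, like conv=notrunc
        remaining = size
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            f.write(chunk)
            written.update(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # push the data out to the block device
        if hasattr(os, "posix_fadvise"):
            # Hint only: ask the kernel to drop this file's cached pages.
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            read_back.update(chunk)
    return written.digest() == read_back.digest()
```

The mdadm steps (adding the bitmap, failing and removing a drive, and re-adding it afterwards) would be run by hand between calls; a False result with eight of the nine drive removals and True with the ninth would point at that ninth drive.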

NeilBrown


* Re: raid 5 mismatch_cnt errors
  2010-05-20 22:38     ` Neil Brown
@ 2010-05-21  2:16       ` Doug Ledford
  2010-05-21 16:40         ` MRK
  2010-05-26 15:07         ` Bill Davidsen
  0 siblings, 2 replies; 12+ messages in thread
From: Doug Ledford @ 2010-05-21  2:16 UTC (permalink / raw)
  To: Neil Brown; +Cc: Trey Scarborough, linux-raid@vger.kernel.org


On 05/20/2010 06:38 PM, Neil Brown wrote:
> On Thu, 20 May 2010 17:29:37 -0500
> Trey Scarborough <treys@locallinux.com> wrote:
> 
>> Neil Brown wrote:
>>> On Thu, 20 May 2010 12:02:23 -0500
>>> Trey Scarborough <treys@locallinux.com> wrote:
>>>
>>>   
>>>> I have a RAID 5 array with 9 disks, and its mismatch_cnt keeps 
>>>> growing. This is also causing file corruption on the underlying file 
>>>> systems.  I can copy a group of 100 files of 100 MB each, then run md5sum 
>>>> on them, and 1-3 will be corrupt. If a bad drive is to blame, is there 
>>>> any way to get a report of how many of these mismatches occur per 
>>>> drive? I have run smartmontools tests and do not see any one drive that 
>>>> stands out as causing errors. Could something else be causing these 
>>>> errors?
>>>>     

While a bad drive is certainly a possibility here, this is precisely the
type of failure scenario that would make me suspect bad RAM,
motherboard, or CPU.  So I wouldn't rule those out as possibilities either.



-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband




* Re: raid 5 mismatch_cnt errors
  2010-05-21  2:16       ` Doug Ledford
@ 2010-05-21 16:40         ` MRK
  2010-05-21 20:57           ` Doug Ledford
  2010-05-26 15:07         ` Bill Davidsen
  1 sibling, 1 reply; 12+ messages in thread
From: MRK @ 2010-05-21 16:40 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Neil Brown, Trey Scarborough, linux-raid@vger.kernel.org

On 05/21/2010 04:16 AM, Doug Ledford wrote:
> While a bad drive is certainly a possibility here, this is precisely the
> type of failure scenario that would make me suspect bad RAM,
> motherboard, or CPU.  So I wouldn't rule those out as possibilities either.
>    

Could the cabling to the drive be causing this? (maybe failing or maybe 
it's partly disconnected)
I don't remember at what point Linux is at implementing the checksums 
between the controller and the drive.


* Re: raid 5 mismatch_cnt errors
  2010-05-21 16:40         ` MRK
@ 2010-05-21 20:57           ` Doug Ledford
  2010-05-24  9:34             ` Tim Small
  0 siblings, 1 reply; 12+ messages in thread
From: Doug Ledford @ 2010-05-21 20:57 UTC (permalink / raw)
  To: MRK; +Cc: Neil Brown, Trey Scarborough, linux-raid@vger.kernel.org


On 05/21/2010 12:40 PM, MRK wrote:
> Could the cabling to the drive be causing this? (maybe failing or maybe
> it's partly disconnected)
> I don't remember at what point Linux is at implementing the checksums
> between the controller and the drive.

I don't know.  I'm not up on the SATA signaling details so I don't know
if it uses CRC on the signal, but I suspect it does and a bad cable
would cause failed requests.  But I wouldn't bet my house on it, so I
would ask some SATA gurus.




* Re: raid 5 mismatch_cnt errors
  2010-05-21 20:57           ` Doug Ledford
@ 2010-05-24  9:34             ` Tim Small
  2010-05-25 19:09               ` Robert Hancock
  0 siblings, 1 reply; 12+ messages in thread
From: Tim Small @ 2010-05-24  9:34 UTC (permalink / raw)
  To: Doug Ledford
  Cc: MRK, Neil Brown, Trey Scarborough, linux-raid@vger.kernel.org,
	linux-ide

On 21/05/10 21:57, Doug Ledford wrote:
> On 05/21/2010 12:40 PM, MRK wrote:
>    
>> On 05/21/2010 04:16 AM, Doug Ledford wrote:
>>      
>> Could the cabling to the drive be causing this? (maybe failing or maybe
>> it's partly disconnected)
>> I don't remember at what point Linux is at implementing the checksums
>> between the controller and the drive.
>>      
> I don't know.  I'm not up on the SATA signaling details so I don't know
> if it uses CRC on the signal, but I suspect it does and a bad cable
> would cause failed requests.  But I wouldn't bet my house on it, so I
> would ask some SATA gurus.
>    

I wouldn't call myself that, but I believe PATA and SATA-level CRC 
errors show up in the UDMA_CRC_Error_Count SMART attribute - look for a 
non-zero raw value in the smartctl output.  This is presumably just the 
error count from the drive's point of view (bad data received at the drive 
end).  I don't know what happens with CRC errors detected at the Linux 
end, or whether detection is controller-dependent.  Better ask on 
linux-ide.

From the SMART attribute name, presumably the earlier PATA transfer 
modes don't support CRC error detection.

An easy thing to check might be to reduce the libata transfer speed from 
3 Gbps to 1.5 Gbps.  Similarly, try to test each drive and SATA port in 
isolation if you can.
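
Checking the raw value mentioned above across nine drives is easier with a small parser. The sketch below is only an illustration: the function names are hypothetical, and it assumes the usual ten-column attribute table that `smartctl -A` prints, with the attribute name in the second column and the raw value in the tenth.

```python
import subprocess


def udma_crc_error_count(smartctl_output):
    """Return the raw UDMA_CRC_Error_Count from `smartctl -A` text, or None if absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Assumed columns: ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW
        if len(fields) >= 10 and fields[1] == "UDMA_CRC_Error_Count":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def crc_count_for_device(device):
    """Run `smartctl -A` on a device (needs root and smartmontools) and parse the result."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return udma_crc_error_count(result.stdout)
```

A non-zero count on one drive would point at that drive's cable or port rather than at the platters, though as noted above it only reflects errors the drive itself saw.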

Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309


* Re: raid 5 mismatch_cnt errors
  2010-05-24  9:34             ` Tim Small
@ 2010-05-25 19:09               ` Robert Hancock
  0 siblings, 0 replies; 12+ messages in thread
From: Robert Hancock @ 2010-05-25 19:09 UTC (permalink / raw)
  To: Tim Small
  Cc: Doug Ledford, MRK, Neil Brown, Trey Scarborough,
	linux-raid@vger.kernel.org, linux-ide

On 05/24/2010 03:34 AM, Tim Small wrote:
> On 21/05/10 21:57, Doug Ledford wrote:
>> On 05/21/2010 12:40 PM, MRK wrote:
>>> On 05/21/2010 04:16 AM, Doug Ledford wrote:
>>> Could the cabling to the drive be causing this? (maybe failing or maybe
>>> it's partly disconnected)
>>> I don't remember at what point Linux is at implementing the checksums
>>> between the controller and the drive.
>> I don't know. I'm not up on the SATA signaling details so I don't know
>> if it uses CRC on the signal, but I suspect it does and a bad cable
>> would cause failed requests. But I wouldn't bet my house on it, so I
>> would ask some SATA gurus.
>
> I wouldn't call myself that, but I believe PATA and SATA-level CRC
> errors show up in the UDMA_CRC_Error_Count SMART attribute - look for a
> non-zero raw value in the smartctl output. This is presumably just the
> error count from the drive's point of view (bad data received at the drive
> end). I don't know what happens with CRC errors detected at the Linux end,
> or whether detection is controller-dependent. Better ask on linux-ide.
>
>
> From the SMART attribute name, presumably the earlier PATA transfer
> modes don't support CRC error detection.
>
> An easy thing to check might be to reduce the libata transfer speed from
> 3 Gbps to 1.5 Gbps. Similarly, try to test each drive and SATA port in
> isolation if you can.

ATA transfer errors should cause a bad CRC, resulting in a failed 
transfer which will cause complaints in the kernel log. For PATA, only 
UDMA modes can detect CRC errors; PIO and MWDMA transfers can't.

There are other places where data corruption can occur, however, such as 
inside the controller or the drive itself.
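
One way to look for the kernel-log complaints mentioned above is to count CRC-related lines per ATA port. This is a rough sketch with loud assumptions: the function name is hypothetical, the log text is assumed to come from dmesg or a saved kernel log, and the marker strings "ICRC" and "BadCRC" are assumed to be how libata labels interface CRC errors in its error reports. An empty result does not prove the links are clean.

```python
import re

# Assumed markers: "ICRC" in the device error line, "BadCRC" in the SError decode.
CRC_PATTERN = re.compile(r"\b(ata\d+(?:\.\d+)?):.*\b(ICRC|BadCRC)\b")


def count_crc_complaints(log_text):
    """Count kernel log lines that mention an ATA CRC error, keyed by port or device."""
    counts = {}
    for line in log_text.splitlines():
        match = CRC_PATTERN.search(line)
        if match:
            port = match.group(1)
            counts[port] = counts.get(port, 0) + 1
    return counts
```

Feeding it the output of dmesg gives a per-port tally; a port that stands out is a candidate for a cable or controller-port swap.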


* Re: raid 5 mismatch_cnt errors
  2010-05-21  2:16       ` Doug Ledford
  2010-05-21 16:40         ` MRK
@ 2010-05-26 15:07         ` Bill Davidsen
  2010-05-26 15:49           ` Doug Ledford
  1 sibling, 1 reply; 12+ messages in thread
From: Bill Davidsen @ 2010-05-26 15:07 UTC (permalink / raw)
  To: Doug Ledford; +Cc: Neil Brown, Trey Scarborough, linux-raid@vger.kernel.org

Doug Ledford wrote:
> While a bad drive is certainly a possibility here, this is precisely the
> type of failure scenario that would make me suspect bad RAM,
> motherboard, or CPU.  So I wouldn't rule those out as possibilities either.
>   

I have the same thought. I would remove half the RAM from the system and 
test again, then swap to the "other" half and repeat. Of course running 
memtest first is a good idea, but I have seen failures which only happen 
on disk access.

If the system is overclocked, obviously the first step is to cut the speed back...

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein



* Re: raid 5 mismatch_cnt errors
  2010-05-26 15:07         ` Bill Davidsen
@ 2010-05-26 15:49           ` Doug Ledford
  0 siblings, 0 replies; 12+ messages in thread
From: Doug Ledford @ 2010-05-26 15:49 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Neil Brown, Trey Scarborough, linux-raid@vger.kernel.org


On 05/26/2010 11:07 AM, Bill Davidsen wrote:
> Doug Ledford wrote:
>> While a bad drive is certainly a possibility here, this is precisely the
>> type of failure scenario that would make me suspect bad RAM,
>> motherboard, or CPU.  So I wouldn't rule those out as possibilities
>> either.
>>   
> 
> I have the same thought. I would remove half the RAM from the system and
> test again, then swap to the "other" half and repeat. Of course running
> memtest first is a good idea, but I have seen failures which only happen
> on disk access.

Indeed, I've seen lots of failures that only happen with disk access and
not with memory testers.  Hence why I have a shell script on my web page
in my sig that uses disk access to test memory.

> If the system is overclocked, obviously the first step is to cut the speed back...
> 


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband



