Question: how to identify failing disk in a RAID1

linux-raid.vger.kernel.org archive mirror

* Question: how to identify failing disk in a RAID1
@ 2008-04-13 19:14 Maurice Hilarius
  2008-04-13 19:29 ` Justin Piszcz
  0 siblings, 1 reply; 16+ messages in thread
From: Maurice Hilarius @ 2008-04-13 19:14 UTC (permalink / raw)
  To: linux-raid

Hi there.

Recently I have frequently been seeing a damaged filesystem on a RAID1
on boot.
A lengthy fsck does get it working, but I am seeing files disappearing
as a result.

I am pretty sure that one of the drives has developed some issues and 
needs to be replaced.

How does one identify which of the 2 disks is the one that is failing?

The system has 2 identical disks, and  / is on md0

fstab:
/dev/md0                /                       ext3    defaults        1 1
LABEL=/boot1            /boot                   ext2    defaults        1 2
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
LABEL=/boot11           /boot1                  ext2    defaults        1 2
LABEL=SWAP-sdb3         swap                    swap    defaults        0 0
LABEL=SWAP-sda2         swap                    swap    defaults        0 0

fdisk -l shows me:
Disk /dev/sda: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          13      104391   83  Linux
/dev/sda2              14         535     4192965   82  Linux swap / Solaris
/dev/sda3             536       48641   386411445   fd  Linux raid autodetect

Disk /dev/sdb: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          13      104391   83  Linux
/dev/sdb2              14       48118   386403412+  fd  Linux raid autodetect
/dev/sdb3           48119       48640     4192965   82  Linux swap / Solaris

Disk /dev/md0: 395.6 GB, 395677007872 bytes
2 heads, 4 sectors/track, 96600832 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Anyone have a suggestion, please?
Responses off list are probably most appropriate.

Thanks for any help.

-- 
Regards, Maurice
mhilarius@gmail.com





* Re: Question: how to identify failing disk in a RAID1
  2008-04-13 19:14 Question: how to identify failing disk in a RAID1 Maurice Hilarius
@ 2008-04-13 19:29 ` Justin Piszcz
  2008-04-14  1:14   ` Bill Davidsen
  0 siblings, 1 reply; 16+ messages in thread
From: Justin Piszcz @ 2008-04-13 19:29 UTC (permalink / raw)
  To: Maurice Hilarius; +Cc: linux-raid



On Sun, 13 Apr 2008, Maurice Hilarius wrote:

> Hi there.
>
> Recently I have been frequently seeing a damaged filesystem on a RAID1 on 
> boot.
> a lengthy fsck does get it working, but I am seeing files disappearing as a 
> result.
>
> I am pretty sure that one of the drives has developed some issues and needs 
> to be replaced.
>
> How does one identify which of the 2 disks is the one that is failing?
>
> The system has 2 identical disks, and  / is on md0
>
> fstab:
> /dev/md0                /                       ext3    defaults        1 1
> LABEL=/boot1            /boot                   ext2    defaults        1 2
> tmpfs                   /dev/shm                tmpfs   defaults        0 0
> devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
> sysfs                   /sys                    sysfs   defaults        0 0
> proc                    /proc                   proc    defaults        0 0
> LABEL=/boot11           /boot1                  ext2    defaults        1 2
> LABEL=SWAP-sdb3         swap                    swap    defaults        0 0
> LABEL=SWAP-sda2         swap                    swap    defaults        0 0
>
> fdisk -l shows me:
> Disk /dev/sda: 400.0 GB, 400088457216 bytes
> 255 heads, 63 sectors/track, 48641 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
>  Device Boot      Start         End      Blocks   Id  System
> /dev/sda1   *           1          13      104391   83  Linux
> /dev/sda2              14         535     4192965   82  Linux swap / Solaris
> /dev/sda3             536       48641   386411445   fd  Linux raid autodetect
>
> Disk /dev/sdb: 400.0 GB, 400088457216 bytes
> 255 heads, 63 sectors/track, 48641 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
>  Device Boot      Start         End      Blocks   Id  System
> /dev/sdb1   *           1          13      104391   83  Linux
> /dev/sdb2              14       48118   386403412+  fd  Linux raid autodetect
> /dev/sdb3           48119       48640     4192965   82  Linux swap / Solaris
>
> Disk /dev/md0: 395.6 GB, 395677007872 bytes
> 2 heads, 4 sectors/track, 96600832 cylinders
> Units = cylinders of 8 * 512 = 4096 bytes
>
> Anyone have a suggestion, please?
> Responses off list are probably most appropriate.
>
> Thanks for any help.
>
> -- 
> Regards, Maurice
> mhilarius@gmail.com
>
>
>
>

smartctl -a /dev/sda
smartctl -a /dev/sdb

also, how come swap was not on the raid1?
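
For comparing the two drives side by side, a minimal sketch follows. It assumes the drives report the usual ATA attribute names (they vary by vendor) and that the raw value is the tenth column of `smartctl -A` output; the helper name smart_key_attrs is made up for illustration.

```shell
#!/bin/sh
# Print the SMART attributes that most often point at a dying disk.
# Reads `smartctl -A` output on stdin; prints "NAME RAW_VALUE" per match.
# Assumption: attribute name is column 2 and the raw value is column 10.
smart_key_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ { print $2, $10 }'
}

# Usage (as root; device names taken from the original post):
#   for d in /dev/sda /dev/sdb; do
#       echo "== $d"; smartctl -A "$d" | smart_key_attrs
#   done
```

Non-zero and growing raw values on only one of the two disks would be a reasonable hint as to which one is failing, though a clean SMART report does not prove a disk is healthy.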


* Re: Question: how to identify failing disk in a RAID1
  2008-04-13 19:29 ` Justin Piszcz
@ 2008-04-14  1:14   ` Bill Davidsen
       [not found]     ` <4802CDA2.605@harddata.com>
       [not found]     ` <480F7105.9030405@harddata.com>
  0 siblings, 2 replies; 16+ messages in thread
From: Bill Davidsen @ 2008-04-14  1:14 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: Maurice Hilarius, linux-raid

Justin Piszcz wrote:
>
>
> On Sun, 13 Apr 2008, Maurice Hilarius wrote:
>
>> Hi there.
>>
>> Recently I have been frequently seeing a damaged filesystem on a 
>> RAID1 on boot.
>> a lengthy fsck does get it working, but I am seeing files 
>> disappearing as a result.
>>
>> I am pretty sure that one of the drives has developed some issues and 
>> needs to be replaced.
>>
>> How does one identify which of the 2 disks is the one that is failing?
>>
>> The system has 2 identical disks, and  / is on md0
>>
>> fstab:
>> /dev/md0                /                       ext3    defaults        1 1
>> LABEL=/boot1            /boot                   ext2    defaults        1 2
>> tmpfs                   /dev/shm                tmpfs   defaults        0 0
>> devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
>> sysfs                   /sys                    sysfs   defaults        0 0
>> proc                    /proc                   proc    defaults        0 0
>> LABEL=/boot11           /boot1                  ext2    defaults        1 2
>> LABEL=SWAP-sdb3         swap                    swap    defaults        0 0
>> LABEL=SWAP-sda2         swap                    swap    defaults        0 0
>>
>> fdisk -l shows me:
>> Disk /dev/sda: 400.0 GB, 400088457216 bytes
>> 255 heads, 63 sectors/track, 48641 cylinders
>> Units = cylinders of 16065 * 512 = 8225280 bytes
>>
>>  Device Boot      Start         End      Blocks   Id  System
>> /dev/sda1   *           1          13      104391   83  Linux
>> /dev/sda2              14         535     4192965   82  Linux swap / Solaris
>> /dev/sda3             536       48641   386411445   fd  Linux raid autodetect
>>
>> Disk /dev/sdb: 400.0 GB, 400088457216 bytes
>> 255 heads, 63 sectors/track, 48641 cylinders
>> Units = cylinders of 16065 * 512 = 8225280 bytes
>>
>>  Device Boot      Start         End      Blocks   Id  System
>> /dev/sdb1   *           1          13      104391   83  Linux
>> /dev/sdb2              14       48118   386403412+  fd  Linux raid autodetect
>> /dev/sdb3           48119       48640     4192965   82  Linux swap / Solaris
>>
>> Disk /dev/md0: 395.6 GB, 395677007872 bytes
>> 2 heads, 4 sectors/track, 96600832 cylinders
>> Units = cylinders of 8 * 512 = 4096 bytes
>>
>> Anyone have a suggestion, please?
>> Responses off list are probably most appropriate.
>>
>> Thanks for any help.
>>
>> -- 
>> Regards, Maurice
>> mhilarius@gmail.com
>>
>
> smartctl -a /dev/sda
> smartctl -a /dev/sdb
>
> also, how come swap was not on the raid1?

Very unexpected that the data would be bad without any hardware errors.
Did you look at your logs to see if one of your drives, or perhaps
both, are getting hardware errors? I would run a 'check' and see
what mdadm finds on the array; you may have other problems.

Actually, I think I would run memtest86 for at least a few hours, 
starting from a really cold system (not just a cold boot, off for a few 
hours). The "on boot" symptom you describe may come from memory or another
component which needs to physically get up to temperature before working
reliably.
Particularly if you don't get additional errors after you have been up 
for a while.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 




[parent not found: <4802CDA2.605@harddata.com>]

* Re: Question: how to identify failing disk in a RAID1
       [not found]     ` <4802CDA2.605@harddata.com>
@ 2008-04-14 16:38       ` Bill Davidsen
       [not found]         ` <4804CD4F.7080303@harddata.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Bill Davidsen @ 2008-04-14 16:38 UTC (permalink / raw)
  To: Maurice Hilarius; +Cc: Linux RAID

Maurice Hilarius wrote:
> Bill Davidsen wrote:
>> ..
>>>> I am pretty sure that one of the drives has developed some issues 
>>>> and needs to be replaced.
>>>> ..
>>
>> Very unexpected that the data would be bad without any hardware errors. 
> I DID say:
> "I am pretty sure that one of the drives has developed some issues and 
> needs to be replaced. "
>> Did you look at your logs to see if one of your drives, or perhaps
>> both, are getting hardware errors?
> Oh, I KNOW one does..
> The question is WHICH one?
>
I no longer have any old logs showing errors, but /var/log/messages 
and/or dmesg should have an error message with a drive identification if 
you are getting disk errors.
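
A rough way to tally those messages per drive is sketched below. The helper name count_disk_errors and the match patterns are assumptions; kernel message formats differ between versions, and libata errors logged only against ataN will not be attributed to sdX by this filter.

```shell
#!/bin/sh
# Count kernel log lines that mention an error, failure or timeout for
# each disk. Reads log text on stdin; arguments are device names such as
# sda sdb. Prints one "DEVICE COUNT" line per argument.
count_disk_errors() {
    log=$(cat)
    for dev in "$@"; do
        # First keep only lines that look like errors, then count those
        # naming the device (or one of its partitions) as a whole word.
        n=$(printf '%s\n' "$log" \
            | grep -Ei 'error|fail|timeout' \
            | grep -Ec "(^|[^[:alnum:]])${dev}[0-9]*([^[:alnum:]]|\$)" || true)
        echo "$dev $n"
    done
}

# Usage:
#   dmesg | count_disk_errors sda sdb
#   count_disk_errors sda sdb < /var/log/messages
```

If one device collects nearly all the hits, that is the one to look at first.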
>> I would run a 'check' and see what mdadm finds on the array, you
>> may have other problems.
>>
> Pardon my stupidity, care to share some syntax for that?

cd /sys/block/md0/md
echo check >sync_action; cat mismatch_cnt

That's the count of errors found. Replace 'check' with 'repair' to make 
the errors go away, reboot, run 'check' again.
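
A slightly more careful version of the two commands above, as a sketch: the check runs in the background, so mismatch_cnt is only meaningful once sync_action reads idle again. The function name md_check, its arguments and the polling interval are illustrative; the sysfs files sync_action and mismatch_cnt are the ones used above.

```shell
#!/bin/sh
# Start an md consistency check and wait for it to finish before reading
# mismatch_cnt.
#   $1  md sysfs directory (default /sys/block/md0/md)
#   $2  seconds between polls (default 30)
md_check() {
    md_dir=${1:-/sys/block/md0/md}
    poll=${2:-30}

    # Refuse to start if a resync, recovery or earlier check is running.
    state=$(cat "$md_dir/sync_action")
    if [ "$state" != "idle" ]; then
        echo "array is busy ($state); try again later" >&2
        return 1
    fi

    echo check > "$md_dir/sync_action" || return 1

    # Give the kernel a moment to start, then poll until it is idle again.
    sleep 1
    while [ "$(cat "$md_dir/sync_action")" != "idle" ]; do
        sleep "$poll"
    done

    cat "$md_dir/mismatch_cnt"
}

# Usage (as root):
#   md_check /sys/block/md0/md 30
```

It prints the final mismatch count, or returns non-zero without starting anything if another sync action is already in progress.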

>> Actually, I think I would run memtest86 for at least a few hours, 
>> starting from a really cold system (not just a cold boot, off for a 
>> few hours).
> Did that already.
>> Your comment "on boot" may come from memory or other component which 
>> needs to physically get up to temperature before working reliably. 
>> Particularly if you don't get additional errors after you have been 
>> up for a while.
>>
> It happens cold or hot.
>
>
> -- 
> Regards, Maurice
>


-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 




[parent not found: <4804CD4F.7080303@harddata.com>]

* Re: Question: how to identify failing disk in a RAID1
       [not found]         ` <4804CD4F.7080303@harddata.com>
@ 2008-04-15 18:14           ` Bill Davidsen
       [not found]             ` <48050DD6.7020404@harddata.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Bill Davidsen @ 2008-04-15 18:14 UTC (permalink / raw)
  To: Maurice Hilarius; +Cc: Linux RAID

Maurice Hilarius wrote:
> Bill Davidsen wrote:
>> ..
>> cd /sys/block/md0/md
>> echo check >sync_action; cat mismatch_cnt
>>
> Hi Bill,
>
> I am doing this as root.
> I am seeing:
>
> [root@localhost md]# echo check >sync_action; cat mismatch_cnt
> -bash: echo: write error: Device or resource busy
> 0
>
> any suggestions?

First, did you cd to the /sys/block/mdX/md directory? And did you wait 
for the check to finish, watching /proc/mdstat?
I left that out, assumed you had read it in the man pages...
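
For watching progress, here is a small sketch that pulls the current action and percentage for one array out of /proc/mdstat. The helper name md_progress is hypothetical, and it assumes the usual progress-line layout (`check = NN.N% ...`), which may differ between kernel versions. One plausible cause of the "Device or resource busy" error is that a resync or an earlier check was still running when the write was attempted, and this would show it.

```shell
#!/bin/sh
# Report what one md array is doing right now.
# Reads /proc/mdstat-style text on stdin; $1 is the array name (e.g. md0).
# Prints "ACTION PERCENT" (for example "check 12.6%") or "idle".
md_progress() {
    awk -v md="$1" '
        $2 == ":" { in_dev = ($1 == md); next }  # header line of an array
        NF == 0   { in_dev = 0 }                 # blank line ends the stanza
        in_dev {
            # A progress line looks like: [==>...]  check = 12.6% (...)
            for (i = 1; i < NF; i++)
                if ($(i + 1) == "=") { print $i, $(i + 2); found = 1 }
        }
        END { if (!found) print "idle" }
    '
}

# Usage:
#   md_progress md0 < /proc/mdstat
```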

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 




[parent not found: <48050DD6.7020404@harddata.com>]

[parent not found: <48055EFA.8060505@tmr.com>]

[parent not found: <480607F2.3060504@harddata.com>]

* Re: Question: how to identify failing disk in a RAID1
       [not found]                 ` <480607F2.3060504@harddata.com>
@ 2008-04-17 13:12                   ` Bill Davidsen
       [not found]                     ` <48076096.2020804@harddata.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Bill Davidsen @ 2008-04-17 13:12 UTC (permalink / raw)
  To: Maurice Hilarius, Linux RAID

Maurice Hilarius wrote:
> Morning Bill.
>
> BTW, I want to say "Thanks for your help with this" first.
> Just in case I forgot.
>
> So, I ran "check" once. It complained, and failed.
>
Does the failure provide any useful information?

> A few hours later, I ran it again, and it immediately returned "0"
>
I totally don't understand that, assuming that the first check was done.

> I am still puzzled:
> Why it failed the first time
> Why it returned a result in a couple of seconds the second time.
> What it tells me?
>    I gather this means the md0 device is healthy.
>
I don't think so. I've never had a check *fail*; I just expect it to
tell me how bad things are.

> So, meanwhile back at the ranch, I still think sda is failing..

I think it's time to be keeping a good backup, and hopefully someone 
else has a good thought on running this down more.

> Any thoughts on that?

The only thought I have at the moment is marginal power supply, and 
that's just because it can generate all manner of odd behaviors, rather 
than any other hints. Sorry.

If you aren't getting errors from SMART or logs, and I don't remember 
you sending me that info, I'm not sure how you determine which drive is 
the problem.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 




[parent not found: <48076096.2020804@harddata.com>]

* Re: Question: how to identify failing disk in a RAID1
       [not found]                     ` <48076096.2020804@harddata.com>
@ 2008-04-18 13:17                       ` Bill Davidsen
  0 siblings, 0 replies; 16+ messages in thread
From: Bill Davidsen @ 2008-04-18 13:17 UTC (permalink / raw)
  To: Maurice Hilarius; +Cc: vger majordomo for lists

Maurice Hilarius wrote:
> Bill Davidsen wrote:
>> Maurice Hilarius wrote:
>>> Morning Bill.
>>>
>>> BTW, I want to say "Thanks for your help with this" first.
>>> Just in case I forgot.
>>>
>>> So, I ran "check" once. It complained, and failed.
>>>
>> Does the failure provide any useful information?
>>
> No.
> Here is what I got the first time:
>
> [root@localhost md]# echo check >sync_action; cat mismatch_cnt
> -bash: echo: write error: Device or resource busy
> 0
>
> Later, on my second try, a few hours later, it worked, reporting no error.
> ..
> [maurice@localhost ~]$ su -
> Password:
> [root@localhost ~]# cd /sys/block/md0/md
> [root@localhost md]# cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md0 : active raid1 sda3[0] sdb2[1]
>       386403328 blocks [2/2] [UU]
>
> unused devices: <none>
> [root@localhost md]# echo check >sync_action; cat mismatch_cnt
> 0
>
>>
>> I think it's time to be keeping a good backup, and hopefully someone 
>> else has a good thought on running this down more.
>>
> Thanks, updated that backup at the first sign of trouble
>>> Any thoughts on that?
>>
>> The only thought I have at the moment is marginal power supply, and 
>> that's just because it can generate all manner of odd behaviors, 
>> rather than any other hints. Sorry.
>>
> Yeah. I am going to replace *both* disks, and then run the 
> manufacturers utility (Seatest) on them.
>> If you aren't getting errors from SMART or logs, and I don't remember 
>> you sending me that info, I'm not sure how you determine which drive 
>> is the problem.
> Exactly.
>
> Thanks a LOT for trying, Bill..

Actually, my thought is that you may not actually be getting hardware
errors, which is why they are not being reported by either the kernel or
SMART. That's why I thought of memory and/or power issues, either of
which could cause what you are seeing.

Guess I have to leave it there, maybe someone else will have a thought.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 




[parent not found: <480F7105.9030405@harddata.com>]

* Re: Question: how to identify failing disk in a RAID1
       [not found]     ` <480F7105.9030405@harddata.com>
@ 2008-04-23 18:54       ` Justin Piszcz
       [not found]         ` <480F8830.6020207@harddata.com>
  0 siblings, 1 reply; 16+ messages in thread
From: Justin Piszcz @ 2008-04-23 18:54 UTC (permalink / raw)
  To: Maurice Hilarius; +Cc: Bill Davidsen, linux-raid



On Wed, 23 Apr 2008, Maurice Hilarius wrote:

> Hi all.
>
> With much appreciated help from Bill Davidsen and Justin Piszcz I recently
> dealt with a problem with a RAID1 set, caused by a failing hard disk.
>
> At the end, there is one question remaining, which I think is quite 
> important:
> When one has a RAID5 or RAID6, and a disk starts "acting up" mdadm rapidly 
> kicks out the offending device.
> Some might say "too easily" but that is another thread.
>
> On a RAID1 set, until the failing disk completely "packs it in" it remains 
> part of the RAID.
>
> Why??
>
> Some more background:
> Since the issue was reported and explored I have recreated this on a test 
> machine.
> Installed RAID1 with one known good and one known error-prone drive.
> Easy to do as the error drive has a thermal issue.
> Keep it cold, no problems, but after 30 minutes use in a +25C room it starts
> to generate data errors.
> I reproduced exactly the problem I saw before:
> Data errors occur, the other drive in the RAID1 set gets "infected" with the 
> bad data, and the file system will get corrupted.
> On BOTH drives.
>
> This is highly reproducible.
>
> In summary:
> 1) RAID1 lacks significant protection from the effects of a data error 
> condition on a failing drive
> 2) I recommend anyone using mdadm refrain from using RAID1 until this issue
> is addressed and resolved.
>
> Thanks again.
I can confirm this: with RAID1, the failing disk is kicked out only when
you actually REBOOT the host, whereas with RAID5, in my experience, it is
kicked out right away. We would need to wait for the linux-raid
developers to answer this one.

Justin.


[parent not found: <480F8830.6020207@harddata.com>]

* Re: Question: how to identify failing disk in a RAID1
       [not found]         ` <480F8830.6020207@harddata.com>
@ 2008-04-23 19:26           ` Justin Piszcz
  2008-04-27 17:03             ` Keith Roberts
  0 siblings, 1 reply; 16+ messages in thread
From: Justin Piszcz @ 2008-04-23 19:26 UTC (permalink / raw)
  To: Maurice Hilarius; +Cc: Bill Davidsen, linux-raid



On Wed, 23 Apr 2008, Maurice Hilarius wrote:

> Justin Piszcz wrote:
>> 
>> 
>> On Wed, 23 Apr 2008, Maurice Hilarius wrote:
>> 
>>> Hi all.
>>> 
>>> With much appreciated help from Bill Davidsen and Justin Piszcz I recently
>>> dealt with a problem with a RAID1 set, caused by a failing hard disk.
>>> 
>>> At the end, there is one question remaining, which I think is quite 
>>> important:
>>> When one has a RAID5 or RAID6, and a disk starts "acting up" mdadm rapidly 
>>> kicks out the offending device.
>>> Some might say "too easily" but that is another thread.
>>> 
>>> On a RAID1 set, until the failing disk completely "packs it in" it remains 
>>> part of the RAID.
>>> 
>>> Why??
>>> 
>>> Some more background:
>>> Since the issue was reported and explored I have recreated this on a test 
>>> machine.
>>> Installed RAID1 with one known good and one known error-prone drive.
>>> Easy to do as the error drive has a thermal issue.
>>> Keep it cold, no problems, but after 30 minutes use in a +25C room it
>>> starts to generate data errors.
>>> I reproduced exactly the problem I saw before:
>>> Data errors occur, the other drive in the RAID1 set gets "infected" with 
>>> the bad data, and the file system will get corrupted.
>>> On BOTH drives.
>>> 
>>> This is highly reproducible.
>>> 
>>> In summary:
>>> 1) RAID1 lacks significant protection from the effects of a data error 
>>> condition on a failing drive
>>> 2) I recommend anyone using mdadm refrain from using RAID1 until this
>>> issue is addressed and resolved.
>>> 
>>> Thanks again.
>> I can confirm this, until you actually REBOOT the host with RAID1 only then 
>> will it kick it out.  Whereas with RAID5, I experienced the same thing, it 
>> kicks it out right away, would need to wait for the linux-raid/developers 
>> to answer this one.
>> 
>> Justin.
>> 
> Actually reboot does not help me.
> mdadm seems to NEVER "kick out" the bad disk.
> Even when it is horribly erroring.
>
> I think this is a CRITICAL problem, as, if one is using RAID1 thinking it 
> will enhance their data reliability,
> they stand a very good chance of getting a nasty surprise.
Yikes. What kernel+mobo+chipset+drives are in use (the developers will
want to know)? Also, are the drives on different channels, or are they,
e.g., two drives on one IDE cable? (To summarize for the developers.)

Justin.


* Re: Question: how to identify failing disk in a RAID1
  2008-04-23 19:26           ` Justin Piszcz
@ 2008-04-27 17:03             ` Keith Roberts
  2008-04-27 19:28               ` Richard Scobie
  2008-04-27 21:53               ` Mark Hahn
  0 siblings, 2 replies; 16+ messages in thread
From: Keith Roberts @ 2008-04-27 17:03 UTC (permalink / raw)
  To: linux-raid

On Wed, 23 Apr 2008, Justin Piszcz wrote:

> To: Maurice Hilarius <maurice@harddata.com>
> From: Justin Piszcz <jpiszcz@lucidpixels.com>
> Subject: Re: Question: how to identify failing disk in a RAID1
> 
>
>
> On Wed, 23 Apr 2008, Maurice Hilarius wrote:
>
>> Justin Piszcz wrote:
>>> 
>>> 
>>> On Wed, 23 Apr 2008, Maurice Hilarius wrote:
>>> 
>>>> Hi all.
>>>> 
>>>> With much appreciated help from Bill Davidsen and Justin Piszcz I
>>>> recently dealt with a problem with a RAID1 set, caused by a failing
>>>> hard disk.
>>>> 
>>>> At the end, there is one question remaining, which I think is quite 
>>>> important:
>>>> When one has a RAID5 or RAID6, and a disk starts "acting up" mdadm 
>>>> rapidly kicks out the offending device.
>>>> Some might say "too easily" but that is another thread.
>>>> 
>>>> On a RAID1 set, until the failing disk completely "packs it in" it 
>>>> remains part of the RAID.
>>>> 
>>>> Why??
>>>> 
>>>> Some more background:
>>>> Since the issue was reported and explored I have recreated this on a 
>>>> test machine.
>>>> Installed RAID1 with one known good and one known error-prone drive.
>>>> Easy to do as the error drive has a thermal issue.
>>>> Keep it cold, no problems, but after 30 minutes use in a +25C room it
>>>> starts to generate data errors.
>>>> I reproduced exactly the problem I saw before:
>>>> Data errors occur, the other drive in the RAID1 set gets "infected" 
>>>> with the bad data, and the file system will get corrupted.
>>>> On BOTH drives.
>>>> 
>>>> This is highly reproducible.
>>>> 
>>>> In summary:
>>>> 1) RAID1 lacks significant protection from the effects of a data 
>>>> error condition on a failing drive
>>>> 2) I recommend anyone using mdadm refrain from using RAID1 until
>>>> this issue is addressed and resolved.
>>>> 
>>>> Thanks again.
>>> I can confirm this, until you actually REBOOT the host with RAID1 only 
>>> then will it kick it out.  Whereas with RAID5, I experienced the same 
>>> thing, it kicks it out right away, would need to wait for the 
>>> linux-raid/developers to answer this one.
>>> 
>>> Justin.
>>> 
>> Actually reboot does not help me.
>> mdadm seems to NEVER "kick out" the bad disk.
>> Even when it is horribly erroring.
>> 
>> I think this is a CRITICAL problem, as, if one is using RAID1 thinking 
>> it will enhance their data reliability,
>> they stand a very good chance of getting a nasty surprise.
> Yikes, what kernel+mobo+chipset+drives are in use (the developers will 
> want to know) also are you using drives on different channels?  Or e.g., 
> two drives on one ide cable?  (To summarize for the developers)
>
> Justin.

I'm now looking at using smartmontools to monitor my hard
drives' status, perhaps instead of using RAID1 arrays.

http://en.wikipedia.org/wiki/S.M.A.R.T.

http://smartmontools.sourceforge.net/

It appears that smartmontools will not work with the linux 
software RAID layer. So I guess I need to make a choice of 
which one to use - smartmontools or RAID1 mirrors?

Obviously I don't want to be mirroring corrupted drive data.

It would be nice to be able to use smartmontools to monitor 
the health of the drives in a RAID1 array. Get the best of 
both worlds then.

Is there any way that the smartmontools code can be included 
in the md driver code, to allow mdadm access to the SMART 
data on a RAID1 set of disks please?

Kind Regards

Keith Roberts

-----------------------------------------------------------------
Websites:
http://www.php-debuggers.net
http://www.karsites.net
http://www.raised-from-the-dead.org.uk

All email addresses are challenge-response protected with
TMDA [http://tmda.net]
-----------------------------------------------------------------




* Re: Question: how to identify failing disk in a RAID1
  2008-04-27 17:03             ` Keith Roberts
@ 2008-04-27 19:28               ` Richard Scobie
  2008-04-28  5:29                 ` Keith Roberts
  2008-04-27 21:53               ` Mark Hahn
  1 sibling, 1 reply; 16+ messages in thread
From: Richard Scobie @ 2008-04-27 19:28 UTC (permalink / raw)
  To: Linux RAID Mailing List

Keith Roberts wrote:

> It would be nice to be able to use smartmontools to monitor the health 
> of the drives in a RAID1 array. Get the best of both worlds then.

I have been doing this for years.

What problems are you seeing using smartd on a RAID1?

Regards,

Richard


* Re: Question: how to identify failing disk in a RAID1
  2008-04-27 19:28               ` Richard Scobie
@ 2008-04-28  5:29                 ` Keith Roberts
  2008-04-28  6:06                   ` Michael Tokarev
  2008-04-28  7:01                   ` Richard Scobie
  0 siblings, 2 replies; 16+ messages in thread
From: Keith Roberts @ 2008-04-28  5:29 UTC (permalink / raw)
  To: linux-raid

On Mon, 28 Apr 2008, Richard Scobie wrote:

> To: Linux RAID Mailing List <linux-raid@vger.kernel.org>
> From: Richard Scobie <richard@sauce.co.nz>
> Subject: Re: Question: how to identify failing disk in a RAID1
> 
> Keith Roberts wrote:
>
>> It would be nice to be able to use smartmontools to monitor the health 
>> of the drives in a RAID1 array. Get the best of both worlds then.
>
> I have been doing this for years.
>
> What problems are you seeing using smartd on a RAID1?
>
> Regards,
>
> Richard

Reading the documentation for smartmontools I got the 
impression that it cannot work with RAID controllers, apart 
from 3ware and some Highpoint. Maybe I'm getting mixed up 
with hardware raid?

So is it safe to use all features of smartmontools, 
including running tests, on a Linux software RAID1 array?

Kind Regards

Keith Roberts

-----------------------------------------------------------------
Websites:
http://www.php-debuggers.net
http://www.karsites.net
http://www.raised-from-the-dead.org.uk

All email addresses are challenge-response protected with
TMDA [http://tmda.net]
-----------------------------------------------------------------




* Re: Question: how to identify failing disk in a RAID1
  2008-04-28  5:29                 ` Keith Roberts
@ 2008-04-28  6:06                   ` Michael Tokarev
  2008-04-28  7:01                   ` Richard Scobie
  1 sibling, 0 replies; 16+ messages in thread
From: Michael Tokarev @ 2008-04-28  6:06 UTC (permalink / raw)
  To: linux-raid

Keith Roberts wrote:
[]

I started writing a reply but later noticed this:

> All email addresses are challenge-response protected with
> TMDA [http://tmda.net]

And removed the reply.  Don't outsource YOUR mail filtering
to everyone else.  Thank you.

/mjt


* Re: Question: how to identify failing disk in a RAID1
  2008-04-28  5:29                 ` Keith Roberts
  2008-04-28  6:06                   ` Michael Tokarev
@ 2008-04-28  7:01                   ` Richard Scobie
  1 sibling, 0 replies; 16+ messages in thread
From: Richard Scobie @ 2008-04-28  7:01 UTC (permalink / raw)
  To: linux-raid

Keith Roberts wrote:

> Reading the documentation for smartmontools I got the impression that it 
> cannot work with RAID controllers, apart from 3ware and some Highpoint. 
> Maybe I'm getting mixed up with hardware raid?

Hardware controllers are a different story (you can add LSI to the
above). There are no problems with md RAID.

> So is it safe to use all features of smartmontools, including running 
> tests, on a Linux software RAID1 array?

No problems at all that I am aware of. I run smartd and perform long 
self checks weekly on all drives I have in live RAID 1 and 5 arrays.
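
As a sketch of that routine: `smartctl -t long` starts an extended self-test and `smartctl -l selftest` prints the self-test log. The helper below, whose name selftest_failures is made up here, only filters that log for entries whose status mentions a failure; the exact status wording is an assumption based on typical smartmontools output.

```shell
#!/bin/sh
# Print self-test log entries that ended in a failure.
# Reads `smartctl -l selftest` output on stdin. Exit status 0 means no
# failed entry was found; 1 means at least one was printed.
selftest_failures() {
    # Log entries start with "# N"; failed ones carry "failure" in Status.
    if grep -E '^# *[0-9]+ .*failure'; then
        return 1
    fi
    return 0
}

# Usage (as root; the long test takes hours and runs inside the drive):
#   smartctl -t long /dev/sda
#   smartctl -l selftest /dev/sda | selftest_failures \
#       || echo "sda has failed self-tests"
```

The self-test runs on the member disk itself, below the md layer, so the array stays online while it runs.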

Regards,

Richard


* Re: Question: how to identify failing disk in a RAID1
  2008-04-27 17:03             ` Keith Roberts
  2008-04-27 19:28               ` Richard Scobie
@ 2008-04-27 21:53               ` Mark Hahn
  1 sibling, 0 replies; 16+ messages in thread
From: Mark Hahn @ 2008-04-27 21:53 UTC (permalink / raw)
  To: Keith Roberts; +Cc: linux-raid

> Is there any way that the smartmontools code can be included in the md driver 
> code, to allow mdadm access to the SMART data on a RAID1 set of disks please?

eh?  smart monitors disks.  the disks that comprise your raids are still 
available as disks, and smart can monitor them just fine...
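
In practice that means finding which disks sit under the array and pointing smartctl at them. A minimal sketch, assuming sdX-style partition names as in this thread (the helper name md_member_disks is hypothetical):

```shell
#!/bin/sh
# List the whole-disk devices behind one md array.
# Reads /proc/mdstat-style text on stdin; $1 is the array name (e.g. md0).
# Assumption: members are named like sda3, so stripping trailing digits
# yields the parent disk. That does not hold for names like nvme0n1p2.
md_member_disks() {
    awk -v md="$1" '$1 == md && $2 == ":" {
        for (i = 3; i <= NF; i++) {
            if ($i ~ /\[[0-9]+\]/) {       # members look like sda3[0]
                dev = $i
                sub(/\[.*$/, "", dev)      # drop the "[0]" role suffix
                sub(/[0-9]+$/, "", dev)    # drop the partition number
                print "/dev/" dev
            }
        }
    }' | sort -u
}

# Usage (as root):
#   for d in $(md_member_disks md0 < /proc/mdstat); do
#       echo "== $d"; smartctl -H "$d"
#   done
```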


* Re: Question: how to identify failing disk in a RAID1
@ 2008-04-18 17:36 David Lethe
  0 siblings, 0 replies; 16+ messages in thread
From: David Lethe @ 2008-04-18 17:36 UTC (permalink / raw)
  To: Bill Davidsen, Maurice Hilarius; +Cc: vger majordomo for lists

The symptoms are indicative of a standard bad block reallocation. Depending on make, model, firmware rev and even the location of the new defect, it could take several seconds for the disk to grab a spare from the reserved area and fix the defect. No reason for concern ... The system worked like it was designed to.

-----Original Message-----

From:  "Bill Davidsen" <davidsen@tmr.com>
Subj:  Re: Question: how to identify failing disk in a RAID1
Date:  Fri Apr 18, 2008 8:15 am
Size:  2K
To:  "Maurice Hilarius" <maurice@harddata.com>
cc:  "vger majordomo for lists" <linux-raid@vger.kernel.org>

Maurice Hilarius wrote: 
> Bill Davidsen wrote: 
>> Maurice Hilarius wrote: 
>>> Morning Bill. 
>>> 
>>> BTW, I want to say "Thanks for your help with this" first. 
>>> Just in case I forgot. 
>>> 
>>> So, I ran "check" once. It complained, and failed. 
>>> 
>> Does the failure provide any useful information? 
>> 
> No. 
> Here is what I got the first time: 
> 
> [root@localhost md]# echo check >sync_action; cat mismatch_cnt 
> -bash: echo: write error: Device or resource busy 
> 0 
> 
> Later, on my second try, a few hours later, it worked, reporting no error. 
> .. 
> [maurice@localhost ~]$ su - 
> Password: 
> [root@localhost ~]# cd /sys/block/md0/md 
> [root@localhost md]# cat /proc/mdstat 
> Personalities : [raid1] [raid6] [raid5] [raid4] 
> md0 : active raid1 sda3[0] sdb2[1] 
>       386403328 blocks [2/2] [UU] 
> 
> unused devices: <none> 
> [root@localhost md]# echo check >sync_action; cat mismatch_cnt 
> 0 
> 
>> 
>> I think it's time to be keeping a good backup, and hopefully someone  
>> else has a good thought on running this down more. 
>> 
> Thanks, updated that backup at the first sign of trouble 
>>> Any thoughts on that? 
>> 
>> The only thought I have at the moment is marginal power supply, and  
>> that's just because it can generate all manner of odd behaviors,  
>> rather than any other hints. Sorry. 
>> 
> Yeah. I am going to replace *both* disks, and then run the  
> manufacturer's utility (Seatest) on them. 
>> If you aren't getting errors from SMART or logs, and I don't remember  
>> you sending me that info, I'm not sure how you determine which drive  
>> is the problem. 
> Exactly. 
> 
> Thanks a LOT for trying, Bill.. 
 
Actually, my thought is that you may not actually be getting hardware
errors, which is why they are not being reported by either the kernel or
SMART. That's why I thought of memory and/or power issues, either of
which could cause what you are seeing.
 
Guess I have to leave it there, maybe someone else will have a thought. 
 
--  
Bill Davidsen <davidsen@tmr.com> 
  "Woe unto the statesman who makes war without a reason that will still 
  be valid when the war is over..." Otto von Bismarck
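
In the session quoted above, mismatch_cnt is read in the same command line that writes "check", so the 0 it shows may only mean the check had barely started rather than that it finished clean. A minimal sketch of waiting for the check to complete before reading the counter follows. It assumes the md sysfs files used in the thread (sync_action and mismatch_cnt under /sys/block/md0/md) and that sync_action reads "idle" once the check is done; md_mismatch_when_idle is a hypothetical helper name.

```shell
#!/bin/sh
# Wait until the array in sysfs directory $1 is idle, then print its mismatch count.
# $2 is the polling interval in seconds (default 60).
md_mismatch_when_idle() {
    dir=$1
    interval=${2:-60}
    while [ "$(cat "$dir/sync_action")" != "idle" ]; do
        sleep "$interval"
    done
    cat "$dir/mismatch_cnt"
}

# Example usage (as root):
#   echo check > /sys/block/md0/md/sync_action
#   md_mismatch_when_idle /sys/block/md0/md
```

On an array of this size a full check can take hours, so the default polling interval is deliberately coarse.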
 
 
 




end of thread, other threads:[~2008-04-28  7:01 UTC | newest]

Thread overview: 16+ messages
2008-04-13 19:14 Question: how to identify failing disk in a RAID1 Maurice Hilarius
2008-04-13 19:29 ` Justin Piszcz
2008-04-14  1:14   ` Bill Davidsen
     [not found]     ` <4802CDA2.605@harddata.com>
2008-04-14 16:38       ` Bill Davidsen
     [not found]         ` <4804CD4F.7080303@harddata.com>
2008-04-15 18:14           ` Bill Davidsen
     [not found]             ` <48050DD6.7020404@harddata.com>
     [not found]               ` <48055EFA.8060505@tmr.com>
     [not found]                 ` <480607F2.3060504@harddata.com>
2008-04-17 13:12                   ` Bill Davidsen
     [not found]                     ` <48076096.2020804@harddata.com>
2008-04-18 13:17                       ` Bill Davidsen
     [not found]     ` <480F7105.9030405@harddata.com>
2008-04-23 18:54       ` Justin Piszcz
     [not found]         ` <480F8830.6020207@harddata.com>
2008-04-23 19:26           ` Justin Piszcz
2008-04-27 17:03             ` Keith Roberts
2008-04-27 19:28               ` Richard Scobie
2008-04-28  5:29                 ` Keith Roberts
2008-04-28  6:06                   ` Michael Tokarev
2008-04-28  7:01                   ` Richard Scobie
2008-04-27 21:53               ` Mark Hahn
  -- strict thread matches above, loose matches on Subject: below --
2008-04-18 17:36 David Lethe
