* Problem diagnosing rebuilding raid5 array
From: peter @ 2013-10-14 16:31 UTC
To: linux-raid
Hi!
I'm having some problems with a raid 5 array and I'm not sure how to
diagnose the problem and how to proceed so I figured I need to ask the
experts :-)
I actually suspect I may have several problems at the same time.
The machine has two raid arrays, one raid 1 (md0) and one raid 5
(md1). The raid 5 array consists of 5 x 2TB WD RE4-GP drives.
I found some read errors in the log on /dev/sdh so I replaced it with
a new RE4 GP drive and did mdadm --add /dev/md1 /dev/sdh.
The array was rebuilding and I left it for the night.
In the morning cat /proc/mdstat showed that 2 drives were down. I may
remember incorrectly but I think that /dev/sdh showed up as a spare
and another drive showed as failed, but the array showed up as active.
Anyway, I'm not sure which drive showed as failed but I disconnected the
system for more diagnosis. This was a couple of days ago.
I found that the CPU fan had stopped working and replaced it. The case
has several fans and the heatsink seemed cool even without the fan
(it's an i3-530 that does nothing more than samba so it's mostly
idle). Possibly the hard drives have been running hotter than normal for
a while though.
Anyway, now when I reboot I get this:
> cat /proc/mdstat
Personalities : [raid1]
md1 : inactive sdd[1](S) sdh[5](S) sdg[4](S) sdf[2](S) sde[0](S)
9767572480 blocks
md0 : active raid1 sda[0] sdb[1]
1953514496 blocks [2/2] [UU]
unused devices: <none>
I'm not sure what is happening and what my next step is. I would
appreciate any help on this so I don't screw up the system more than
it already is :-)
Below is the output of "mdadm --examine" for the drives in the raid 5 array.
BTW, don't know if it matters but the system is running an older
Debian (lenny?) with a 2.6.32 backport kernel; the mdadm version is 2.6.7.2.
Best Regards,
Peter
> mdadm --examine /dev/sd?
/dev/sdd:
Magic : a92b4efc
Version : 00.90.00
UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
Creation Time : Thu Jun 24 15:12:41 2010
Raid Level : raid5
Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 1
Update Time : Wed Oct 9 20:29:41 2013
State : clean
Active Devices : 3
Working Devices : 4
Failed Devices : 1
Spare Devices : 1
Checksum : 3dc0af1a - correct
Events : 1288444
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
this 1 8 48 1 active sync /dev/sdd
0 0 0 0 0 removed
1 1 8 48 1 active sync /dev/sdd
2 2 8 80 2 active sync /dev/sdf
3 3 0 0 3 faulty removed
4 4 8 96 4 active sync /dev/sdg
5 5 8 112 5 spare /dev/sdh
/dev/sde:
Magic : a92b4efc
Version : 00.90.00
UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
Creation Time : Thu Jun 24 15:12:41 2010
Raid Level : raid5
Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 1
Update Time : Tue Oct 8 03:26:05 2013
State : clean
Active Devices : 4
Working Devices : 5
Failed Devices : 1
Spare Devices : 1
Checksum : 3dbe6d93 - correct
Events : 1288428
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
this 0 8 64 0 active sync /dev/sde
0 0 8 64 0 active sync /dev/sde
1 1 8 48 1 active sync /dev/sdd
2 2 8 80 2 active sync /dev/sdf
3 3 0 0 3 faulty removed
4 4 8 96 4 active sync /dev/sdg
5 5 8 112 5 spare /dev/sdh
/dev/sdf:
Magic : a92b4efc
Version : 00.90.00
UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
Creation Time : Thu Jun 24 15:12:41 2010
Raid Level : raid5
Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 1
Update Time : Wed Oct 9 20:29:41 2013
State : clean
Active Devices : 3
Working Devices : 4
Failed Devices : 1
Spare Devices : 1
Checksum : 3dc0af3c - correct
Events : 1288444
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
this 2 8 80 2 active sync /dev/sdf
0 0 0 0 0 removed
1 1 8 48 1 active sync /dev/sdd
2 2 8 80 2 active sync /dev/sdf
3 3 0 0 3 faulty removed
4 4 8 96 4 active sync /dev/sdg
5 5 8 112 5 spare /dev/sdh
/dev/sdg:
Magic : a92b4efc
Version : 00.90.00
UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
Creation Time : Thu Jun 24 15:12:41 2010
Raid Level : raid5
Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 1
Update Time : Wed Oct 9 20:29:41 2013
State : clean
Active Devices : 3
Working Devices : 4
Failed Devices : 1
Spare Devices : 1
Checksum : 3dc0af50 - correct
Events : 1288444
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
this 4 8 96 4 active sync /dev/sdg
0 0 0 0 0 removed
1 1 8 48 1 active sync /dev/sdd
2 2 8 80 2 active sync /dev/sdf
3 3 0 0 3 faulty removed
4 4 8 96 4 active sync /dev/sdg
5 5 8 112 5 spare /dev/sdh
/dev/sdh:
Magic : a92b4efc
Version : 00.90.00
UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
Creation Time : Thu Jun 24 15:12:41 2010
Raid Level : raid5
Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 1
Update Time : Wed Oct 9 20:29:41 2013
State : clean
Active Devices : 3
Working Devices : 4
Failed Devices : 1
Spare Devices : 1
Checksum : 3dc0af5c - correct
Events : 1288444
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
this 5 8 112 5 spare /dev/sdh
0 0 0 0 0 removed
1 1 8 48 1 active sync /dev/sdd
2 2 8 80 2 active sync /dev/sdf
3 3 0 0 3 faulty removed
4 4 8 96 4 active sync /dev/sdg
5 5 8 112 5 spare /dev/sdh
* Re: Problem diagnosing rebuilding raid5 array
From: Brian Candler @ 2013-10-14 17:28 UTC
To: peter, linux-raid
On 14/10/2013 17:31, peter@steinhoff.se wrote:
>
> I found that the CPU fan had stopped working and replaced it. The case
> has several fans and the heatsink seemed cool even without the fan
> (it's an i3-530 that does nothing more than samba so it's mostly
> idle). Possibly the hard drives have been running hotter than normal for
> a while though.
>
Aside: in some cases it might be a good idea to disable the case fan
control - in the BIOS if your system supports it, or by removing the fan
control header completely.
I saw this on a system with 24 drives and 3 LSI HBAs. The case fan control
was based on the CPU temperature alone. Therefore if the CPU was idle,
the fan speed went very low, which meant that the drives and the HBAs
got very hot.
This led to the perverse situation that when I was testing the system
heavily with lots of reads and writes it went for weeks without
problems, but if I left it idle for a day or two the HBAs crashed!
Regards,
Brian.
* Re: Problem diagnosing rebuilding raid5 array
From: peter @ 2013-10-15 12:40 UTC
To: Brian Candler; +Cc: linux-raid
Thanks Brian, that's a good point.
In this case the case fans run at constant speed and only the CPU
fan is PWM-controlled. So the temperature over the drives should have
been more or less OK.
Unfortunately the BIOS on this Intel motherboard didn't show fan speeds
or temperatures as it probably should, so there were no alarms going off.
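I think I'll set up a periodic check of the drive temperatures from
userspace now; a minimal sketch, assuming smartmontools is installed:
> smartctl -A /dev/sdd | grep -i temperature
Most drives report a Temperature_Celsius attribute there, so a cron
job comparing it against a threshold would at least have raised an alarm.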
On another note, I took the failed drive I replaced in my array
(/dev/sdh), put it in another machine (win7) and ran Western
Digital's diagnostic software, and it says the drive is OK.
I'm wondering if perhaps it's possible the CPU has been running too
hot and the raid array failed because of that.
Anyway, I'm still at a loss what to do and what my next step should be...
Thanks,
Peter
Quoting Brian Candler <b.candler@pobox.com>:
> On 14/10/2013 17:31, peter@steinhoff.se wrote:
>>
>> I found that the CPU fan had stopped working and replaced it. The
>> case has several fans and the heatsink seemed cool even without
>> the fan (it's an i3-530 that does nothing more than samba so it's
>> mostly idle). Possibly the hard drives have been running hotter than
>> normal for a while though.
>>
> Aside: in some cases it might be a good idea to disable the case fan
> control - in the BIOS if your system supports it, or by removing the
> fan control header completely.
>
> I saw this on a system with 24 drives and 3 LSI HBAs. The case fan
> control was based on the CPU temperature alone. Therefore if the CPU
> was idle, the fan speed went very low, which meant that the drives
> and the HBAs got very hot.
>
> This led to the perverse situation that when I was testing the
> system heavily with lots of reads and writes it went for weeks
> without problems, but if I left it idle for a day or two the HBAs
> crashed!
>
> Regards,
>
> Brian.
>
>
* Re: Problem diagnosing rebuilding raid5 array
From: Brian Candler @ 2013-10-15 12:50 UTC
To: peter; +Cc: linux-raid
On 15/10/2013 13:40, peter@steinhoff.se wrote:
>
> On another note, I took the failed drive I replaced in my array
> (/dev/sdh), put it in another machine (win7) and ran Western
> Digital's diagnostic software, and it says the drive is OK.
>
Another type of drive test you can do is with smartctl:
smartctl -t conveyance /dev/sdX
... typically takes a couple of minutes
smartctl -l selftest /dev/sdX # to monitor progress / completion
smartctl -t long /dev/sdX
... typically takes 6 hours on a 3TB drive
smartctl -l selftest /dev/sdX # to monitor progress / completion
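If those pass, it is also worth eyeballing the SMART attributes
themselves (attribute names vary slightly by vendor):
smartctl -A /dev/sdX
... watch Reallocated_Sector_Ct, Current_Pending_Sector and
Offline_Uncorrectable; non-zero raw values there are a bad sign.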
Regards,
Brian.
* Re: Problem diagnosing rebuilding raid5 array
From: Brian Candler @ 2013-10-15 13:15 UTC
To: peter; +Cc: linux-raid
On 15/10/2013 13:40, peter@steinhoff.se wrote:
>
> Anyway, I'm still at a loss what to do and what my next step should be...
>
Well, I'm not the world's authority on this, but from what I can see:
$ egrep '^/|UUID|State :|Events :' ert
/dev/sdd:
UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
State : clean
Events : 1288444
/dev/sde:
UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
State : clean
Events : 1288428
/dev/sdf:
UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
State : clean
Events : 1288444
/dev/sdg:
UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
State : clean
Events : 1288444
/dev/sdh:
UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
State : clean
Events : 1288444
So it looks like sde is stale with respect to the other drives (which
have a larger event count) and therefore is not being used. But sdh is a
spare (you said it was rebuilding onto this?), so you have N-2 usable
data disks, which is not enough to start RAID5.
DON'T do the following before someone else on the list confirms this is
the right course of action, but you can force the array to assemble using:
mdadm --stop /dev/mdXXX
mdadm --assemble --force --run /dev/mdXXX /dev/sd{d,e,f,g,h}
But since the state of sde is old, I think there is a real risk that
data corruption has taken place. Do an fsck before mounting. It may be
better to restore from a trusted backup.
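For the fsck, I would start with a read-only pass so nothing gets
written until you have seen the extent of the damage (assuming ext3
directly on the array):
fsck.ext3 -n /dev/mdXXX
... -n answers "no" to every question, so it only reports problems.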
Regards,
Brian.
* Re: Problem diagnosing rebuilding raid5 array
From: peter @ 2013-10-15 14:14 UTC
To: Brian Candler; +Cc: linux-raid
Thanks Brian.
Yes, I replaced sdh because it showed read errors in dmesg and was
kicked out of the array.
But has the new sdh been rebuilt completely? In that case there should
be n-1 drives? Or does "spare" just mean that it can be used but has
no data yet?
Also, I have my old sdh with data on it; I don't know how current
that data is, but perhaps I can use that somehow?
Thanks,
Peter
Quoting Brian Candler <b.candler@pobox.com>:
> On 15/10/2013 13:40, peter@steinhoff.se wrote:
>>
>> Anyway, I'm still at a loss what to do and what my next step should be...
>>
> Well, I'm not the world's authority on this, but from what I can see:
>
> $ egrep '^/|UUID|State :|Events :' ert
> /dev/sdd:
> UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
> State : clean
> Events : 1288444
> /dev/sde:
> UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
> State : clean
> Events : 1288428
> /dev/sdf:
> UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
> State : clean
> Events : 1288444
> /dev/sdg:
> UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
> State : clean
> Events : 1288444
> /dev/sdh:
> UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
> State : clean
> Events : 1288444
>
> So it looks like sde is stale with respect to the other drives
> (which have a larger event count) and therefore is not being used.
> But sdh is a spare (you said it was rebuilding onto this?), so you
> have N-2 usable data disks, which is not enough to start RAID5.
>
> DON'T do the following before someone else on the list confirms this
> is the right course of action, but you can force the array to
> assemble using:
>
> mdadm --stop /dev/mdXXX
> mdadm --assemble --force --run /dev/mdXXX /dev/sd{d,e,f,g,h}
>
> But since the state of sde is old, I think there is a real risk that
> data corruption has taken place. Do an fsck before mounting. It may
> be better to restore from a trusted backup.
>
> Regards,
>
> Brian.
>
* Re: Problem diagnosing rebuilding raid5 array
From: Brian Candler @ 2013-10-15 14:20 UTC
To: peter; +Cc: linux-raid
On 15/10/2013 15:14, peter@steinhoff.se wrote:
> Thanks Brian.
>
> Yes, I replaced sdh because it showed read errors in dmesg and was
> kicked out of the array.
>
> But has the new sdh been rebuilt completely? In that case there should
> be n-1 drives? Or does "spare" just mean that it can be used but has
> no data yet?
>
That's what I'm not sure about. It is "clean" but it is also "spare".
I'm not sure what state would be seen while it is rebuilding.
When the old sdh had a problem, did you mdadm /dev/mdXXX --fail /dev/sdh?
And after you inserted the new drive, and presumably did mdadm
/dev/mdXXX --add /dev/sdh, did you see it start to rebuild in /proc/mdstat?
> Also, I have my old sdh with data on it; I don't know how current
> that data is, but perhaps I can use that somehow?
>
You could mdadm --examine it, but I suspect that the event count will be
way out of line by now.
* Re: Problem diagnosing rebuilding raid5 array
From: peter @ 2013-10-15 14:20 UTC
To: Brian Candler; +Cc: linux-raid
I have used those tests in the past but I'm a little worried that I
might make matters worse if I test the drives currently in the array
before having tried to recover whatever information I can.
Thanks,
Peter
Quoting Brian Candler <b.candler@pobox.com>:
> On 15/10/2013 13:40, peter@steinhoff.se wrote:
>>
>> On another note, I took the failed drive I replaced in my array
>> (/dev/sdh), put it in another machine (win7) and ran Western
>> Digital's diagnostic software, and it says the drive is OK.
>>
> Another type of drive test you can do is with smartctl:
>
> smartctl -t conveyance /dev/sdX
> ... typically takes a couple of minutes
> smartctl -l selftest /dev/sdX # to monitor progress / completion
>
> smartctl -t long /dev/sdX
> ... typically takes 6 hours on a 3TB drive
> smartctl -l selftest /dev/sdX # to monitor progress / completion
>
> Regards,
>
> Brian.
>
* Re: Problem diagnosing rebuilding raid5 array
From: peter @ 2013-10-15 15:01 UTC
To: Brian Candler; +Cc: linux-raid
>> But has the new sdh been rebuilt completely? In that case there
>> should be n-1 drives? Or does "spare" just mean that it can be
>> used but has no data yet?
>>
> That's what I'm not sure about. It is "clean" but it is also
> "spare". I'm not sure what state would be seen while it is rebuilding.
>
> When the old sdh had a problem, did you mdadm /dev/mdXXX --fail /dev/sdh?
>
No, I didn't think I needed to do that. I just powered down the
machine and took it out.
> And after you inserted the new drive, and presumably did mdadm
> /dev/mdXXX --add /dev/sdh, did you see it start to rebuild in
> /proc/mdstat?
>
Yes I did. It was around 20% or so when I left it.
>> Also, I have my old sdh with data on it; I don't know how current
>> that data is, but perhaps I can use that somehow?
>>
> You could mdadm --examine it, but I suspect that the event count
> will be way out of line by now.
>
I will put it in and see what it shows. Is it correct to assume that
--examine reads the superblock and I can insert the drive into any
machine and examine it there?
Thanks,
Peter
* Re: Problem diagnosing rebuilding raid5 array
From: Brian Candler @ 2013-10-15 15:04 UTC
To: peter; +Cc: linux-raid
On 15/10/2013 16:01, peter@steinhoff.se wrote:
>>>
>>> Also, I have my old sdh with data on it; I don't know how current
>>> that data is, but perhaps I can use that somehow?
>>>
>> You could mdadm --examine it, but I suspect that the event count will
>> be way out of line by now.
>>
>
> I will put it in and see what it shows. Is it correct to assume that
> --examine reads the superblock and I can insert the drive into any
> machine and examine it there?
>
Yes and yes.
* Re: Problem diagnosing rebuilding raid5 array
From: NeilBrown @ 2013-10-16 6:11 UTC
To: peter; +Cc: linux-raid
On Mon, 14 Oct 2013 12:31:04 -0400 peter@steinhoff.se wrote:
> Hi!
>
> I'm having some problems with a raid 5 array and I'm not sure how to
> diagnose the problem and how to proceed so I figured I need to ask the
> experts :-)
>
> I actually suspect I may have several problems at the same time.
>
> The machine has two raid arrays, one raid 1 (md0) and one raid 5
> (md1). The raid 5 array consists of 5 x 2TB WD RE4-GP drives.
>
> I found some read errors in the log on /dev/sdh so I replaced it with
> a new RE4 GP drive and did mdadm --add /dev/md1 /dev/sdh.
>
> The array was rebuilding and I left it for the night.
>
> In the morning cat /proc/mdstat showed that 2 drives were down. I may
> remember incorrectly but I think that /dev/sdh showed up as a spare
> and another drive showed as failed, but the array showed up as active.
>
> Anyway, I'm not sure which drive showed as failed but I disconnected the
> system for more diagnosis. This was a couple of days ago.
>
> I found that the CPU fan had stopped working and replaced it. The case
> has several fans and the heatsink seemed cool even without the fan
> (it's an i3-530 that does nothing more than samba so it's mostly
> idle). Possibly the hard drives have been running hotter than normal for
> a while though.
>
> Anyway, now when I reboot I get this:
>
> > cat /proc/mdstat
> Personalities : [raid1]
> md1 : inactive sdd[1](S) sdh[5](S) sdg[4](S) sdf[2](S) sde[0](S)
> 9767572480 blocks
>
> md0 : active raid1 sda[0] sdb[1]
> 1953514496 blocks [2/2] [UU]
>
> unused devices: <none>
>
>
> I'm not sure what is happening and what my next step is. I would
> appreciate any help on this so I don't screw up the system more than
> it already is :-)
We have no way of knowing how far recovery progressed onto sdh, so you need
to exclude it. With v1.x metadata we would know ... but it wouldn't really
help that much.
Your only option is to do a --force assemble of the other devices.
sde is a little bit out of date, but it cannot be much out of date as the
array would have stopped handling writes as soon as it failed.
This will assemble the array degraded. You should then 'fsck' and do
anything else to check that the data is OK.
Then you need to check that all your drives and your system are good (if
you haven't already), then add a good drive as a spare and let it rebuild.
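Concretely, something like this - device names as in your report, so
double-check them on your system before running anything:
mdadm --stop /dev/md1
mdadm --assemble --force /dev/md1 /dev/sdd /dev/sde /dev/sdf /dev/sdg
Note that sdh is deliberately left out of the assemble.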
NeilBrown
* Re: Problem diagnosing rebuilding raid5 array
From: peter @ 2013-10-17 2:27 UTC
To: NeilBrown; +Cc: linux-raid
Thanks Neil,
I've checked the drives, and sdd had unrecoverable read errors, which
was the reason the rebuild onto sdh failed in the first place.
Then I made a clone of sdd with ddrescue, and it looks like only 4096
bytes were completely unreadable.
Then I did the --force assemble as you suggested using the 3 good
drives and the cloned sdd. I ran e2fsck after that and temporarily
mounted the array to check that it looked OK.
Now I've added sdh again and the rebuilding process is underway.
Hopefully it will complete.
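(For reference, the clone was made with something along the lines of
> ddrescue -f /dev/sdd /dev/sdX sdd-rescue.log
where sdX was the blank target drive; the log file is what lets
ddrescue go back and retry just the bad areas on later runs.)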
I was just wondering: since I lost 4kB of data when I cloned sdd,
does that mean I will have 4x4kB of garbled data somewhere, since
I assembled 4 drives (n-1) and the raid system wouldn't know? Or would
that have been detected somehow when I ran fsck (ext3)?
Thanks,
Peter
Quoting NeilBrown <neilb@suse.de>:
> On Mon, 14 Oct 2013 12:31:04 -0400 peter@steinhoff.se wrote:
>
>> Hi!
>>
>> I'm having some problems with a raid 5 array and I'm not sure how to
>> diagnose the problem and how to proceed so I figured I need to ask the
>> experts :-)
>>
>> I actually suspect I may have several problems at the same time.
>>
>> The machine has two raid arrays, one raid 1 (md0) and one raid 5
>> (md1). The raid 5 array consists of 5 x 2TB WD RE4-GP drives.
>>
>> I found some read errors in the log on /dev/sdh so I replaced it with
>> a new RE4 GP drive and did mdadm --add /dev/md1 /dev/sdh.
>>
>> The array was rebuilding and I left it for the night.
>>
>> In the morning cat /proc/mdstat showed that 2 drives where down. I may
>> remember incorrectly but I think that /dev/sdh showed up as a spare
>> and another drive showed fail but the array showed up as active.
>>
>> Anyway, I'm not sure which drive showed fail but I disconnected the
>> system for more diagnosis. This was a couple of days ago.
>>
>> I found that the CPU fan had stopped working and replaced it. The case
>> have several fans and the heatsink seemed cool even without the fan
>> (it's an i3-530 that does nothing more than samba so it's mostly
>> idle). Possibly the hardrives has been running hotter than normal for
>> a while though.
>>
>> Anyway, now when I reboot I get this:
>>
>> > cat /proc/mdstat
>> Personalities : [raid1]
>> md1 : inactive sdd[1](S) sdh[5](S) sdg[4](S) sdf[2](S) sde[0](S)
>> 9767572480 blocks
>>
>> md0 : active raid1 sda[0] sdb[1]
>> 1953514496 blocks [2/2] [UU]
>>
>> unused devices: <none>
>>
>>
>> I'm not sure what is happening and what my next step is. I would
>> appreciate any help on this so I don't screw up the system more than
>> it already is :-)
>
> We have no way of knowing how far recovery progressed onto sdh, so you need
> to exclude it. With v1.x metadata we would know ... but it wouldn't really
> help that much.
>
> Your only option is to do a --force assemble of the other devices.
> sde is a little bit out of date, but it cannot be much out of date as the
> array would have stopped handling writes as soon as it failed.
>
> This will assemble the array degraded. You should then 'fsck' and do
> anything else to check that the data is OK.
>
> Then you need to check that all your drives and your system are good (if
> you haven't already), then add a good drive as a spare and let it rebuild.
>
> NeilBrown
* Re: Problem diagnosing rebuilding raid5 array
From: NeilBrown @ 2013-10-17 2:39 UTC
To: peter; +Cc: linux-raid
On Wed, 16 Oct 2013 22:27:54 -0400 peter@steinhoff.se wrote:
> Thanks Neil,
>
> I've checked the drives, and sdd had unrecoverable read errors, which
> was the reason the rebuild onto sdh failed in the first place.
>
> Then I made a clone of sdd with ddrescue, and it looks like only 4096
> bytes were completely unreadable.
>
> Then I did the --force assemble as you suggested using the 3 good
> drives and the cloned sdd. I ran e2fsck after that and temporarily
> mounted the array to check that it looked OK.
>
> Now I've added sdh again and the rebuilding process is underway.
> Hopefully it will complete.
>
>
> I was just wondering: since I lost 4kB of data when I cloned sdd,
> does that mean I will have 4x4kB of garbled data somewhere, since
> I assembled 4 drives (n-1) and the raid system wouldn't know? Or would
> that have been detected somehow when I ran fsck (ext3)?
You could have 2*4kB of garbled data (the block you lost, and the
corresponding block on the device that was rebuilt).
Or you could have 1*4kB, or 0*4kB garbled if either the corrupted block or
the recovered block were parity blocks.
Those bad blocks are not likely in filesystem metadata, or else fsck
should have noticed. They could be in some file(s), or in some free
space - in which case you'll never notice.
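If you ever work out which filesystem block the lost 4K maps to (the
ddrescue log gives the offset on the component device, though mapping
that through the raid5 layout takes some care), debugfs can tell you
whether a file owns it - a sketch, with made-up block and inode numbers:
debugfs -R 'icheck 123456' /dev/md1 # which inode uses this block?
debugfs -R 'ncheck 7890' /dev/md1 # which path is that inode?
If icheck says the block is not in use, the lost data was in free space.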
NeilBrown
* Re: Problem diagnosing rebuilding raid5 array
From: peter @ 2013-10-18 2:41 UTC
To: NeilBrown; +Cc: linux-raid
Thanks a lot Neil and also big thanks to Brian.
With your help I have now successfully rebuilt the array. When I
get the drives RMA'd and get new drives back, I'll add those to the
array and turn it into a raid 6.
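(Presumably something like the below will do it - I haven't tried it
yet, and from what I read the level change needs a newer mdadm and
kernel than the ones I'm running, plus a backup file on a device
outside the array:
> mdadm --grow /dev/md1 --level=6 --raid-devices=6 --backup-file=/root/md1-grow.bak
so upgrading comes first.)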
Thanks again,
Peter
Quoting NeilBrown <neilb@suse.de>:
> On Wed, 16 Oct 2013 22:27:54 -0400 peter@steinhoff.se wrote:
>> I was just wondering: since I lost 4kB of data when I cloned sdd,
>> does that mean I will have 4x4kB of garbled data somewhere, since
>> I assembled 4 drives (n-1) and the raid system wouldn't know? Or would
>> that have been detected somehow when I ran fsck (ext3)?
>
> You could have 2*4kB of garbled data (the block you lost, and the
> corresponding block on the device that was rebuilt).
> Or you could have 1*4kB, or 0*4kB garbled if either the corrupted block or
> the recovered block were parity blocks.
>
> Those bad blocks are not likely in filesystem metadata, or else fsck
> should have noticed. They could be in some file(s), or in some free
> space - in which case you'll never notice.
>
> NeilBrown
>