linux-raid.vger.kernel.org archive mirror
* Problem diagnosing rebuilding raid5 array
@ 2013-10-14 16:31 peter
  2013-10-14 17:28 ` Brian Candler
  2013-10-16  6:11 ` NeilBrown
  0 siblings, 2 replies; 14+ messages in thread
From: peter @ 2013-10-14 16:31 UTC (permalink / raw)
  To: linux-raid

Hi!

I'm having some problems with a raid 5 array and I'm not sure how to
diagnose the problem or how to proceed, so I figured I'd better ask the
experts :-)

I actually suspect I may have several problems at the same time.

The machine has two raid arrays, one raid 1 (md0) and one raid 5  
(md1). The raid 5 array consists of 5 x 2TB WD RE4-GP drives.

I found some read errors in the log on /dev/sdh so I replaced it with  
a new RE4 GP drive and did mdadm --add /dev/md1 /dev/sdh.
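
For reference, the usual replacement sequence is roughly the one below
(device names as in my setup):

mdadm /dev/md1 --fail /dev/sdh      # mark the failing member as faulty
mdadm /dev/md1 --remove /dev/sdh    # detach it from the array
mdadm /dev/md1 --add /dev/sdh       # add the replacement; the rebuild starts automatically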

The array was rebuilding and I left it for the night.

In the morning, cat /proc/mdstat showed that 2 drives were down. I may
be remembering incorrectly, but I think /dev/sdh showed up as a spare
and another drive showed as failed, while the array itself still showed as active.

Anyway, I'm not sure which drive was marked as failed, but I shut the
system down for further diagnosis. This was a couple of days ago.

I found that the CPU fan had stopped working and replaced it. The case
has several fans and the heatsink seemed cool even without the CPU fan
(it's an i3-530 that does nothing more than samba so it's mostly
idle). The hard drives may have been running hotter than normal for
a while, though.

Anyway, now when I reboot I get this:

> cat /proc/mdstat
Personalities : [raid1]
md1 : inactive sdd[1](S) sdh[5](S) sdg[4](S) sdf[2](S) sde[0](S)
       9767572480 blocks

md0 : active raid1 sda[0] sdb[1]
       1953514496 blocks [2/2] [UU]

unused devices: <none>
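
I assume the obvious things to look at are the kernel log and the
superblocks, something along the lines of:

dmesg | grep -i -e md -e raid      # kernel messages from the failed assembly
mdadm --detail /dev/md1            # what md currently thinks of the inactive array
mdadm --examine /dev/sd[defgh]     # per-drive superblocks (output below)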


I'm not sure what is happening or what my next step should be. I would
appreciate any help on this so I don't screw up the system more than
it already is :-)

Below is the output of "mdadm --examine" for the drives in the raid 5 array.

BTW, I don't know if it matters, but the system is running an older
Debian (lenny?) with a 2.6.32 backport kernel; the mdadm version is 2.6.7.2.

Best Regards,
Peter


> mdadm --examine /dev/sd?

/dev/sdd:
           Magic : a92b4efc
         Version : 00.90.00
            UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
   Creation Time : Thu Jun 24 15:12:41 2010
      Raid Level : raid5
   Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
      Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
    Raid Devices : 5
   Total Devices : 5
Preferred Minor : 1

     Update Time : Wed Oct  9 20:29:41 2013
           State : clean
  Active Devices : 3
Working Devices : 4
  Failed Devices : 1
   Spare Devices : 1
        Checksum : 3dc0af1a - correct
          Events : 1288444

          Layout : left-symmetric
      Chunk Size : 128K

       Number   Major   Minor   RaidDevice State
this     1       8       48        1      active sync   /dev/sdd

    0     0       0        0        0      removed
    1     1       8       48        1      active sync   /dev/sdd
    2     2       8       80        2      active sync   /dev/sdf
    3     3       0        0        3      faulty removed
    4     4       8       96        4      active sync   /dev/sdg
    5     5       8      112        5      spare   /dev/sdh


/dev/sde:
           Magic : a92b4efc
         Version : 00.90.00
            UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
   Creation Time : Thu Jun 24 15:12:41 2010
      Raid Level : raid5
   Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
      Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
    Raid Devices : 5
   Total Devices : 5
Preferred Minor : 1

     Update Time : Tue Oct  8 03:26:05 2013
           State : clean
  Active Devices : 4
Working Devices : 5
  Failed Devices : 1
   Spare Devices : 1
        Checksum : 3dbe6d93 - correct
          Events : 1288428

          Layout : left-symmetric
      Chunk Size : 128K

       Number   Major   Minor   RaidDevice State
this     0       8       64        0      active sync   /dev/sde

    0     0       8       64        0      active sync   /dev/sde
    1     1       8       48        1      active sync   /dev/sdd
    2     2       8       80        2      active sync   /dev/sdf
    3     3       0        0        3      faulty removed
    4     4       8       96        4      active sync   /dev/sdg
    5     5       8      112        5      spare   /dev/sdh


/dev/sdf:
           Magic : a92b4efc
         Version : 00.90.00
            UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
   Creation Time : Thu Jun 24 15:12:41 2010
      Raid Level : raid5
   Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
      Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
    Raid Devices : 5
   Total Devices : 5
Preferred Minor : 1

     Update Time : Wed Oct  9 20:29:41 2013
           State : clean
  Active Devices : 3
Working Devices : 4
  Failed Devices : 1
   Spare Devices : 1
        Checksum : 3dc0af3c - correct
          Events : 1288444

          Layout : left-symmetric
      Chunk Size : 128K

       Number   Major   Minor   RaidDevice State
this     2       8       80        2      active sync   /dev/sdf

    0     0       0        0        0      removed
    1     1       8       48        1      active sync   /dev/sdd
    2     2       8       80        2      active sync   /dev/sdf
    3     3       0        0        3      faulty removed
    4     4       8       96        4      active sync   /dev/sdg
    5     5       8      112        5      spare   /dev/sdh


/dev/sdg:
           Magic : a92b4efc
         Version : 00.90.00
            UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
   Creation Time : Thu Jun 24 15:12:41 2010
      Raid Level : raid5
   Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
      Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
    Raid Devices : 5
   Total Devices : 5
Preferred Minor : 1

     Update Time : Wed Oct  9 20:29:41 2013
           State : clean
  Active Devices : 3
Working Devices : 4
  Failed Devices : 1
   Spare Devices : 1
        Checksum : 3dc0af50 - correct
          Events : 1288444

          Layout : left-symmetric
      Chunk Size : 128K

       Number   Major   Minor   RaidDevice State
this     4       8       96        4      active sync   /dev/sdg

    0     0       0        0        0      removed
    1     1       8       48        1      active sync   /dev/sdd
    2     2       8       80        2      active sync   /dev/sdf
    3     3       0        0        3      faulty removed
    4     4       8       96        4      active sync   /dev/sdg
    5     5       8      112        5      spare   /dev/sdh


/dev/sdh:
           Magic : a92b4efc
         Version : 00.90.00
            UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
   Creation Time : Thu Jun 24 15:12:41 2010
      Raid Level : raid5
   Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
      Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
    Raid Devices : 5
   Total Devices : 5
Preferred Minor : 1

     Update Time : Wed Oct  9 20:29:41 2013
           State : clean
  Active Devices : 3
Working Devices : 4
  Failed Devices : 1
   Spare Devices : 1
        Checksum : 3dc0af5c - correct
          Events : 1288444

          Layout : left-symmetric
      Chunk Size : 128K

       Number   Major   Minor   RaidDevice State
this     5       8      112        5      spare   /dev/sdh

    0     0       0        0        0      removed
    1     1       8       48        1      active sync   /dev/sdd
    2     2       8       80        2      active sync   /dev/sdf
    3     3       0        0        3      faulty removed
    4     4       8       96        4      active sync   /dev/sdg
    5     5       8      112        5      spare   /dev/sdh



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problem diagnosing rebuilding raid5 array
  2013-10-14 16:31 Problem diagnosing rebuilding raid5 array peter
@ 2013-10-14 17:28 ` Brian Candler
  2013-10-15 12:40   ` peter
  2013-10-16  6:11 ` NeilBrown
  1 sibling, 1 reply; 14+ messages in thread
From: Brian Candler @ 2013-10-14 17:28 UTC (permalink / raw)
  To: peter, linux-raid

On 14/10/2013 17:31, peter@steinhoff.se wrote:
>
> I found that the CPU fan had stopped working and replaced it. The case 
> have several fans and the heatsink seemed cool even without the fan 
> (it's an i3-530 that does nothing more than samba so it's mostly 
> idle). Possibly the hardrives has been running hotter than normal for 
> a while though.
>
Aside: in some cases it might be a good idea to disable the case fan 
control - in the BIOS if your system supports it, or by taking the fans 
off the fan control header completely.

In my case this was a system with 24 drives and 3 LSI HBAs. The case fan 
control was based on the CPU temperature alone. Therefore if the CPU was 
idle, the fan speed went very low, which meant that the drives and the 
HBAs got very hot.

This led to the perverse situation that when I was testing the system 
heavily with lots of reads and writes it went for weeks without 
problems, but if I left it idle for a day or two the HBAs crashed!

Regards,

Brian.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problem diagnosing rebuilding raid5 array
  2013-10-14 17:28 ` Brian Candler
@ 2013-10-15 12:40   ` peter
  2013-10-15 12:50     ` Brian Candler
  2013-10-15 13:15     ` Brian Candler
  0 siblings, 2 replies; 14+ messages in thread
From: peter @ 2013-10-15 12:40 UTC (permalink / raw)
  To: Brian Candler; +Cc: linux-raid

Thanks Brian, that's a good point.

In this case the case fans run at a constant speed and only the CPU
fan is PWM-controlled, so the temperature over the drives should have
been more or less OK.

Unfortunately the BIOS on this Intel motherboard doesn't show fan speeds
or temperatures as it probably should, so there were no alarms going off.

On another note, I took the failed drive I replaced in my array
(/dev/sdh), put it in another machine (win7) and ran Western
Digital's diagnostic software, and it says the drive is OK.

I'm wondering if perhaps the CPU has been running too hot and the
raid array failed because of that.

Anyway, I'm still at a loss as to what to do and what my next step should be...

Thanks,
Peter



Quoting Brian Candler <b.candler@pobox.com>:

> On 14/10/2013 17:31, peter@steinhoff.se wrote:
>>
>> I found that the CPU fan had stopped working and replaced it. The  
>> case have several fans and the heatsink seemed cool even without  
>> the fan (it's an i3-530 that does nothing more than samba so it's  
>> mostly idle). Possibly the hardrives has been running hotter than  
>> normal for a while though.
>>
> Aside: in some cases it might be a good idea to disable the case  
> control - in the BIOS if your system supports it, or by removing the  
> fan control header completely.
>
> This was a system with 24 drives and 3 LSI HBAs. The case fan  
> control was based on the CPU temperature alone. Therefore if the CPU  
> was idle, the fan speed went very low, which meant that the drives  
> and the HBAs got very hot.
>
> This led to the perverse situation that when I was testing the  
> system heavily with lots of reads and writes it went for weeks  
> without problems, but if I left it idle for a day or two the HBAs  
> crashed!
>
> Regards,
>
> Brian.
>
>




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problem diagnosing rebuilding raid5 array
  2013-10-15 12:40   ` peter
@ 2013-10-15 12:50     ` Brian Candler
  2013-10-15 14:20       ` peter
  2013-10-15 13:15     ` Brian Candler
  1 sibling, 1 reply; 14+ messages in thread
From: Brian Candler @ 2013-10-15 12:50 UTC (permalink / raw)
  To: peter; +Cc: linux-raid

On 15/10/2013 13:40, peter@steinhoff.se wrote:
>
> On another note, I took the failed drive I replaced in my array 
> (/dev/sdh) and put it in another machine (win7) and run Western 
> Digital's diagnosis software and it says the drive is OK.
>
Another type of drive test you can do is with smartctl:

smartctl -t conveyance /dev/sdX
... typically takes a couple of minutes
smartctl -l selftest /dev/sdX   # to monitor progress / completion

smartctl -t long /dev/sdX
... typically takes 6 hours on a 3TB drive
smartctl -l selftest /dev/sdX   # to monitor progress / completion
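
It can also be worth dumping the SMART attributes and error log directly,
e.g. (read-only, so it should be safe even on a suspect drive):

smartctl -a /dev/sdX   # attributes (reallocated/pending sectors) plus the drive's own error log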

Regards,

Brian.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problem diagnosing rebuilding raid5 array
  2013-10-15 12:40   ` peter
  2013-10-15 12:50     ` Brian Candler
@ 2013-10-15 13:15     ` Brian Candler
  2013-10-15 14:14       ` peter
  1 sibling, 1 reply; 14+ messages in thread
From: Brian Candler @ 2013-10-15 13:15 UTC (permalink / raw)
  To: peter; +Cc: linux-raid

On 15/10/2013 13:40, peter@steinhoff.se wrote:
>
> Anyway, I'm still at loss what to do and what my next step should be...
>
Well, I'm not the world's authority on this, but from what I can see:

$ egrep '^/|UUID|State :|Events :' ert
/dev/sdd:
             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
            State : clean
           Events : 1288444
/dev/sde:
             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
            State : clean
           Events : 1288428
/dev/sdf:
             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
            State : clean
           Events : 1288444
/dev/sdg:
             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
            State : clean
           Events : 1288444
/dev/sdh:
             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
            State : clean
           Events : 1288444

So it looks like sde is stale with respect to the other drives (which 
have a larger event count) and therefore is not being used. But sdh is a 
spare (you said it was rebuilding onto this?), so you have N-2 usable 
data disks, which is not enough to start RAID5.

DON'T do the following before someone else on the list confirms this is 
the right course of action, but you can force the array to assemble using:

mdadm --stop /dev/mdXXX
mdadm --assemble --force --run /dev/mdXXX /dev/sd{d,e,f,g,h}

But since the state of sde is old, I think there is a real risk that 
data corruption has taken place. Do an fsck before mounting. It may be 
better to restore from a trusted backup.
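
If you do go ahead, I would check the result read-only before letting
anything write to the array, something like:

cat /proc/mdstat              # confirm it came up degraded (4 of 5 devices)
e2fsck -n /dev/mdXXX          # read-only check, makes no changes
mount -o ro /dev/mdXXX /mnt   # look at the data before a real fsck and rw mount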

Regards,

Brian.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problem diagnosing rebuilding raid5 array
  2013-10-15 13:15     ` Brian Candler
@ 2013-10-15 14:14       ` peter
  2013-10-15 14:20         ` Brian Candler
  0 siblings, 1 reply; 14+ messages in thread
From: peter @ 2013-10-15 14:14 UTC (permalink / raw)
  To: Brian Candler; +Cc: linux-raid

Thanks Brian.

Yes, I replaced sdh because it showed read errors in dmesg and was  
kicked out of the array.

But has the new sdh been rebuilt completely? In that case there should
be N-1 usable drives? Or does "spare" just mean that it can be used but
has no data yet?

Also, I still have my old sdh with its data on it. I don't know how
current that data is, but perhaps I can use it somehow?

Thanks,
Peter



Quoting Brian Candler <b.candler@pobox.com>:

> On 15/10/2013 13:40, peter@steinhoff.se wrote:
>>
>> Anyway, I'm still at loss what to do and what my next step should be...
>>
> Well, I'm not the world's authority on this, but from what I can see:
>
> $ egrep '^/|UUID|State :|Events :' ert
> /dev/sdd:
>             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
>            State : clean
>           Events : 1288444
> /dev/sde:
>             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
>            State : clean
>           Events : 1288428
> /dev/sdf:
>             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
>            State : clean
>           Events : 1288444
> /dev/sdg:
>             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
>            State : clean
>           Events : 1288444
> /dev/sdh:
>             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
>            State : clean
>           Events : 1288444
>
> So it looks like sde is stale with respect to the other drives  
> (which have a larger event count) and therefore is not being used.  
> But sdh is a spare (you said it was rebuilding onto this?), so you  
> have N-2 usable data disks, which is not enough to start RAID5.
>
> DON'T do the following before someone else on the list confirms this  
> is the right course of action, but you can force the array to  
> assemble using:
>
> mdadm --stop /dev/mdXXX
> mdadm --assemble --force --run /dev/mdXXX /dev/sd{d,e,f,g,h}
>
> But since the state of sde is old, I think there is a real risk that  
> data corruption has taken place. Do an fsck before mounting. It may  
> be better to restore from a trusted backup.
>
> Regards,
>
> Brian.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problem diagnosing rebuilding raid5 array
  2013-10-15 14:14       ` peter
@ 2013-10-15 14:20         ` Brian Candler
  2013-10-15 15:01           ` peter
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Candler @ 2013-10-15 14:20 UTC (permalink / raw)
  To: peter; +Cc: linux-raid

On 15/10/2013 15:14, peter@steinhoff.se wrote:
> Thanks Brian.
>
> Yes, I replaced sdh because it showed read errors in dmesg and was 
> kicked out of the array.
>
> But has the new sdh been rebuilt completely? If that case there should 
> be n-1 drives? Or does "spare" just means that it can be used but has 
> no data yet?
>
That's what I'm not sure about. It is "clean" but it is also "spare". 
I'm not sure what state would be seen while it is rebuilding.

When the old sdh had a problem, did you mdadm /dev/mdXXX --fail /dev/sdh?

And after you inserted the new drive, and presumably did mdadm 
/dev/mdXXX --add /dev/sdh, did you see it start to rebuild in /proc/mdstat?

> Also I have my old sdh with data on it but I don't know how current 
> that data is but perhaps I can use that somehow?
>
You could mdadm --examine it, but I suspect that the event count will be 
way out of line by now.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problem diagnosing rebuilding raid5 array
  2013-10-15 12:50     ` Brian Candler
@ 2013-10-15 14:20       ` peter
  0 siblings, 0 replies; 14+ messages in thread
From: peter @ 2013-10-15 14:20 UTC (permalink / raw)
  To: Brian Candler; +Cc: linux-raid

I have used those tests in the past, but I'm a little worried that I
might make matters worse if I test the drives currently in the array
before I have tried to recover whatever information I can.

Thanks,
Peter

Quoting Brian Candler <b.candler@pobox.com>:

> On 15/10/2013 13:40, peter@steinhoff.se wrote:
>>
>> On another note, I took the failed drive I replaced in my array  
>> (/dev/sdh) and put it in another machine (win7) and run Western  
>> Digital's diagnosis software and it says the drive is OK.
>>
> Another type of drive test you can do is with smartctl:
>
> smartctl -t conveyance /dev/sdX
> ... typically takes a couple of minutes
> smartctl -l selftest /dev/sdX   # to monitor progress / completion
>
> smartctl -t long /dev/sdX
> ... typically takes 6 hours on a 3TB drive
> smartctl -l selftest /dev/sdX   # to monitor progress / completion
>
> Regards,
>
> Brian.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problem diagnosing rebuilding raid5 array
  2013-10-15 14:20         ` Brian Candler
@ 2013-10-15 15:01           ` peter
  2013-10-15 15:04             ` Brian Candler
  0 siblings, 1 reply; 14+ messages in thread
From: peter @ 2013-10-15 15:01 UTC (permalink / raw)
  To: Brian Candler; +Cc: linux-raid

>> But has the new sdh been rebuilt completely? If that case there  
>> should be n-1 drives? Or does "spare" just means that it can be  
>> used but has no data yet?
>>
> That's what I'm not sure about. It is "clean" but it is also  
> "spare". I'm not sure what state would be seen while it is rebuilding.
>
> When the old sdh had a problem, did you mdadm /dev/mdXXX --fail /dev/sdh?
>

No, I didn't think I needed to do that. I just powered down the  
machine and took it out.

> And after you inserted the new drive, and presumably did mdadm  
> /dev/mdXXX --add /dev/sdh, did you see it start to rebuild in  
> /proc/mdstat?
>

Yes, I did. It was around 20% or so when I left it.


>> Also I have my old sdh with data on it but I don't know how current  
>> that data is but perhaps I can use that somehow?
>>
> You could mdadm --examine it, but I suspect that the event count  
> will be way out of line by now.
>

I will put it in and see what it shows. Is it correct to assume that  
--examine reads the superblock and I can insert the drive into any  
machine and examine it there?

Thanks,
Peter




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problem diagnosing rebuilding raid5 array
  2013-10-15 15:01           ` peter
@ 2013-10-15 15:04             ` Brian Candler
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Candler @ 2013-10-15 15:04 UTC (permalink / raw)
  To: peter; +Cc: linux-raid

On 15/10/2013 16:01, peter@steinhoff.se wrote:
>>>
>>> Also I have my old sdh with data on it but I don't know how current 
>>> that data is but perhaps I can use that somehow?
>>>
>> You could mdadm --examine it, but I suspect that the event count will 
>> be way out of line by now.
>>
>
> I will put it in and see what it shows. Is it correct to assume that 
> --examine reads the superblock and I can insert the drive into any 
> machine and examine it there?
>
Yes and yes.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problem diagnosing rebuilding raid5 array
  2013-10-14 16:31 Problem diagnosing rebuilding raid5 array peter
  2013-10-14 17:28 ` Brian Candler
@ 2013-10-16  6:11 ` NeilBrown
  2013-10-17  2:27   ` peter
  1 sibling, 1 reply; 14+ messages in thread
From: NeilBrown @ 2013-10-16  6:11 UTC (permalink / raw)
  To: peter; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 9143 bytes --]

On Mon, 14 Oct 2013 12:31:04 -0400 peter@steinhoff.se wrote:

> Hi!
> 
> I'm having some problems with a raid 5 array and I'm not sure how to  
> diagnose the problem and how to proceed so I figured I need to ask the  
> experts :-)
> 
> I actually suspect I may have several problems at the same time.
> 
> The machine has two raid arrays, one raid 1 (md0) and one raid 5  
> (md1). The raid 5 array consists of 5 x 2TB WD RE4-GP drives.
> 
> I found some read errors in the log on /dev/sdh so I replaced it with  
> a new RE4 GP drive and did mdadm --add /dev/md1 /dev/sdh.
> 
> The array was rebuilding and I left it for the night.
> 
> In the morning cat /proc/mdstat showed that 2 drives where down. I may  
> remember incorrectly but I think that /dev/sdh showed up as a spare  
> and another drive showed fail but the array showed up as active.
> 
> Anyway, I'm not sure which drive showed fail but I disconnected the  
> system for more diagnosis. This was a couple of days ago.
> 
> I found that the CPU fan had stopped working and replaced it. The case  
> have several fans and the heatsink seemed cool even without the fan  
> (it's an i3-530 that does nothing more than samba so it's mostly  
> idle). Possibly the hardrives has been running hotter than normal for  
> a while though.
> 
> Anyway, now when I reboot I get this:
> 
> > cat /proc/mdstat
> Personalities : [raid1]
> md1 : inactive sdd[1](S) sdh[5](S) sdg[4](S) sdf[2](S) sde[0](S)
>        9767572480 blocks
> 
> md0 : active raid1 sda[0] sdb[1]
>        1953514496 blocks [2/2] [UU]
> 
> unused devices: <none>
> 
> 
> I'm not sure what is happening and what my next step is. I would  
> appreciate any help on this so I don't screw up the system more than  
> it already is :-)

We have no way of knowing how far recovery progressed onto sdh, so you need
to exclude it.  With v1.x metadata we would know ... but it wouldn't really
help that much.

Your only option is to do a --force assemble of the other devices.
sde is a little bit out of date, but it cannot be much out of date as the
array would have stopped handling writes as soon as it failed.

This will assemble the array degraded.  You should then 'fsck' and do
anything else to check that the data is OK.

Then you need to check that all your drives and your system are good (if
you haven't already), then add a good drive as a spare and let it rebuild.
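
Roughly, and adjusting the device names to your setup, that sequence would
look something like:

mdadm --stop /dev/md1
mdadm --assemble --force --run /dev/md1 /dev/sd[defg]   # leave sdh out
fsck /dev/md1                                           # plus whatever other data checks you can do
mdadm /dev/md1 --add /dev/sdh                           # only once the drives check out; rebuild restarts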

NeilBrown


> 
> Below is the ouput of "mdadm --examine" for the drives in the raid 5 array.
> 
> BTW, don't know if it matters but the system is running an older  
> debian (lenny?) with a 2.6.32 backport kernel, mdadm version is 2.6.7.2.
> 
> Best Regards,
> Peter
> 
> 
> > mdadm --examine /dev/sd?
> 
> /dev/sdd:
>            Magic : a92b4efc
>          Version : 00.90.00
>             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
>    Creation Time : Thu Jun 24 15:12:41 2010
>       Raid Level : raid5
>    Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
>       Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
>     Raid Devices : 5
>    Total Devices : 5
> Preferred Minor : 1
> 
>      Update Time : Wed Oct  9 20:29:41 2013
>            State : clean
>   Active Devices : 3
> Working Devices : 4
>   Failed Devices : 1
>    Spare Devices : 1
>         Checksum : 3dc0af1a - correct
>           Events : 1288444
> 
>           Layout : left-symmetric
>       Chunk Size : 128K
> 
>        Number   Major   Minor   RaidDevice State
> this     1       8       48        1      active sync   /dev/sdd
> 
>     0     0       0        0        0      removed
>     1     1       8       48        1      active sync   /dev/sdd
>     2     2       8       80        2      active sync   /dev/sdf
>     3     3       0        0        3      faulty removed
>     4     4       8       96        4      active sync   /dev/sdg
>     5     5       8      112        5      spare   /dev/sdh
> 
> 
> /dev/sde:
>            Magic : a92b4efc
>          Version : 00.90.00
>             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
>    Creation Time : Thu Jun 24 15:12:41 2010
>       Raid Level : raid5
>    Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
>       Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
>     Raid Devices : 5
>    Total Devices : 5
> Preferred Minor : 1
> 
>      Update Time : Tue Oct  8 03:26:05 2013
>            State : clean
>   Active Devices : 4
> Working Devices : 5
>   Failed Devices : 1
>    Spare Devices : 1
>         Checksum : 3dbe6d93 - correct
>           Events : 1288428
> 
>           Layout : left-symmetric
>       Chunk Size : 128K
> 
>        Number   Major   Minor   RaidDevice State
> this     0       8       64        0      active sync   /dev/sde
> 
>     0     0       8       64        0      active sync   /dev/sde
>     1     1       8       48        1      active sync   /dev/sdd
>     2     2       8       80        2      active sync   /dev/sdf
>     3     3       0        0        3      faulty removed
>     4     4       8       96        4      active sync   /dev/sdg
>     5     5       8      112        5      spare   /dev/sdh
> 
> 
> /dev/sdf:
>            Magic : a92b4efc
>          Version : 00.90.00
>             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
>    Creation Time : Thu Jun 24 15:12:41 2010
>       Raid Level : raid5
>    Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
>       Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
>     Raid Devices : 5
>    Total Devices : 5
> Preferred Minor : 1
> 
>      Update Time : Wed Oct  9 20:29:41 2013
>            State : clean
>   Active Devices : 3
> Working Devices : 4
>   Failed Devices : 1
>    Spare Devices : 1
>         Checksum : 3dc0af3c - correct
>           Events : 1288444
> 
>           Layout : left-symmetric
>       Chunk Size : 128K
> 
>        Number   Major   Minor   RaidDevice State
> this     2       8       80        2      active sync   /dev/sdf
> 
>     0     0       0        0        0      removed
>     1     1       8       48        1      active sync   /dev/sdd
>     2     2       8       80        2      active sync   /dev/sdf
>     3     3       0        0        3      faulty removed
>     4     4       8       96        4      active sync   /dev/sdg
>     5     5       8      112        5      spare   /dev/sdh
> 
> 
> /dev/sdg:
>            Magic : a92b4efc
>          Version : 00.90.00
>             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
>    Creation Time : Thu Jun 24 15:12:41 2010
>       Raid Level : raid5
>    Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
>       Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
>     Raid Devices : 5
>    Total Devices : 5
> Preferred Minor : 1
> 
>      Update Time : Wed Oct  9 20:29:41 2013
>            State : clean
>   Active Devices : 3
> Working Devices : 4
>   Failed Devices : 1
>    Spare Devices : 1
>         Checksum : 3dc0af50 - correct
>           Events : 1288444
> 
>           Layout : left-symmetric
>       Chunk Size : 128K
> 
>        Number   Major   Minor   RaidDevice State
> this     4       8       96        4      active sync   /dev/sdg
> 
>     0     0       0        0        0      removed
>     1     1       8       48        1      active sync   /dev/sdd
>     2     2       8       80        2      active sync   /dev/sdf
>     3     3       0        0        3      faulty removed
>     4     4       8       96        4      active sync   /dev/sdg
>     5     5       8      112        5      spare   /dev/sdh
> 
> 
> /dev/sdh:
>            Magic : a92b4efc
>          Version : 00.90.00
>             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
>    Creation Time : Thu Jun 24 15:12:41 2010
>       Raid Level : raid5
>    Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
>       Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
>     Raid Devices : 5
>    Total Devices : 5
> Preferred Minor : 1
> 
>      Update Time : Wed Oct  9 20:29:41 2013
>            State : clean
>   Active Devices : 3
> Working Devices : 4
>   Failed Devices : 1
>    Spare Devices : 1
>         Checksum : 3dc0af5c - correct
>           Events : 1288444
> 
>           Layout : left-symmetric
>       Chunk Size : 128K
> 
>        Number   Major   Minor   RaidDevice State
> this     5       8      112        5      spare   /dev/sdh
> 
>     0     0       0        0        0      removed
>     1     1       8       48        1      active sync   /dev/sdd
>     2     2       8       80        2      active sync   /dev/sdf
>     3     3       0        0        3      faulty removed
>     4     4       8       96        4      active sync   /dev/sdg
>     5     5       8      112        5      spare   /dev/sdh
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problem diagnosing rebuilding raid5 array
  2013-10-16  6:11 ` NeilBrown
@ 2013-10-17  2:27   ` peter
  2013-10-17  2:39     ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread
From: peter @ 2013-10-17  2:27 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Thanks Neil,

I've checked the drives, and sdd had unrecoverable read errors, which
is why the rebuild onto sdh failed in the first place.

Then I made a clone of sdd with ddrescue, and it looks like only 4096
bytes were completely unreadable.
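
For the record, the clone was made with something along these lines (the
destination drive name is of course specific to my setup):

ddrescue -f -n /dev/sdd /dev/sdX sdd-rescue.log    # first pass, skip the slow scraping of bad areas
ddrescue -f -r3 /dev/sdd /dev/sdX sdd-rescue.log   # then retry the remaining bad sectors a few times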

Then I did the --force assemble as you suggested, using the three good
drives and the cloned sdd. I ran e2fsck after that and temporarily
mounted the array to check that it looked OK.

Now I've added sdh again and the rebuilding process is underway.  
Hopefully it will complete.


I was just wondering: since I lost 4 kB of data when I cloned sdd,
does that mean that I will have 4 x 4 kB of garbled data somewhere, since
I assembled 4 drives (N-1) and the raid system wouldn't know? Or would
that have been detected somehow when I ran fsck (ext3)?


Thanks,
Peter



Quoting NeilBrown <neilb@suse.de>:

> On Mon, 14 Oct 2013 12:31:04 -0400 peter@steinhoff.se wrote:
>
>> Hi!
>>
>> I'm having some problems with a raid 5 array and I'm not sure how to
>> diagnose the problem and how to proceed so I figured I need to ask the
>> experts :-)
>>
>> I actually suspect I may have several problems at the same time.
>>
>> The machine has two raid arrays, one raid 1 (md0) and one raid 5
>> (md1). The raid 5 array consists of 5 x 2TB WD RE4-GP drives.
>>
>> I found some read errors in the log on /dev/sdh so I replaced it with
>> a new RE4 GP drive and did mdadm --add /dev/md1 /dev/sdh.
>>
>> The array was rebuilding and I left it for the night.
>>
>> In the morning cat /proc/mdstat showed that 2 drives where down. I may
>> remember incorrectly but I think that /dev/sdh showed up as a spare
>> and another drive showed fail but the array showed up as active.
>>
>> Anyway, I'm not sure which drive showed fail but I disconnected the
>> system for more diagnosis. This was a couple of days ago.
>>
>> I found that the CPU fan had stopped working and replaced it. The case
>> have several fans and the heatsink seemed cool even without the fan
>> (it's an i3-530 that does nothing more than samba so it's mostly
>> idle). Possibly the hardrives has been running hotter than normal for
>> a while though.
>>
>> Anyway, now when I reboot I get this:
>>
>> > cat /proc/mdstat
>> Personalities : [raid1]
>> md1 : inactive sdd[1](S) sdh[5](S) sdg[4](S) sdf[2](S) sde[0](S)
>>        9767572480 blocks
>>
>> md0 : active raid1 sda[0] sdb[1]
>>        1953514496 blocks [2/2] [UU]
>>
>> unused devices: <none>
>>
>>
>> I'm not sure what is happening and what my next step is. I would
>> appreciate any help on this so I don't screw up the system more than
>> it already is :-)
>
> We have no way of knowing how far recovery progressed onto sdh, so you need
> to exclude it.  With v1.x metadata we would know ... but it wouldn't really
> help the much.
>
> Your only option is to do a --force assemble of the other devices.
> sde is a little bit out of date, but it cannot be much out of date as the
> array would have stopped handling writes as soon as it failed.
>
> This will assemble the array degraded.  You should then 'fsck' and do
> anything else to check that the data is OK.
>
> Then you need to check that all your drives and are your system are good (if
> you haven't already), then add a good drive as a spare and let it rebuild.
>
> NeilBrown
>
>
>>
>> Below is the ouput of "mdadm --examine" for the drives in the raid 5 array.
>>
>> BTW, don't know if it matters but the system is running an older
>> debian (lenny?) with a 2.6.32 backport kernel, mdadm version is 2.6.7.2.
>>
>> Best Regards,
>> Peter
>>
>>
>> > mdadm --examine /dev/sd?
>>
>> /dev/sdd:
>>            Magic : a92b4efc
>>          Version : 00.90.00
>>             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
>>    Creation Time : Thu Jun 24 15:12:41 2010
>>       Raid Level : raid5
>>    Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
>>       Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
>>     Raid Devices : 5
>>    Total Devices : 5
>> Preferred Minor : 1
>>
>>      Update Time : Wed Oct  9 20:29:41 2013
>>            State : clean
>>   Active Devices : 3
>> Working Devices : 4
>>   Failed Devices : 1
>>    Spare Devices : 1
>>         Checksum : 3dc0af1a - correct
>>           Events : 1288444
>>
>>           Layout : left-symmetric
>>       Chunk Size : 128K
>>
>>        Number   Major   Minor   RaidDevice State
>> this     1       8       48        1      active sync   /dev/sdd
>>
>>     0     0       0        0        0      removed
>>     1     1       8       48        1      active sync   /dev/sdd
>>     2     2       8       80        2      active sync   /dev/sdf
>>     3     3       0        0        3      faulty removed
>>     4     4       8       96        4      active sync   /dev/sdg
>>     5     5       8      112        5      spare   /dev/sdh
>>
>>
>> /dev/sde:
>>            Magic : a92b4efc
>>          Version : 00.90.00
>>             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
>>    Creation Time : Thu Jun 24 15:12:41 2010
>>       Raid Level : raid5
>>    Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
>>       Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
>>     Raid Devices : 5
>>    Total Devices : 5
>> Preferred Minor : 1
>>
>>      Update Time : Tue Oct  8 03:26:05 2013
>>            State : clean
>>   Active Devices : 4
>> Working Devices : 5
>>   Failed Devices : 1
>>    Spare Devices : 1
>>         Checksum : 3dbe6d93 - correct
>>           Events : 1288428
>>
>>           Layout : left-symmetric
>>       Chunk Size : 128K
>>
>>        Number   Major   Minor   RaidDevice State
>> this     0       8       64        0      active sync   /dev/sde
>>
>>     0     0       8       64        0      active sync   /dev/sde
>>     1     1       8       48        1      active sync   /dev/sdd
>>     2     2       8       80        2      active sync   /dev/sdf
>>     3     3       0        0        3      faulty removed
>>     4     4       8       96        4      active sync   /dev/sdg
>>     5     5       8      112        5      spare   /dev/sdh
>>
>>
>> /dev/sdf:
>>            Magic : a92b4efc
>>          Version : 00.90.00
>>             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
>>    Creation Time : Thu Jun 24 15:12:41 2010
>>       Raid Level : raid5
>>    Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
>>       Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
>>     Raid Devices : 5
>>    Total Devices : 5
>> Preferred Minor : 1
>>
>>      Update Time : Wed Oct  9 20:29:41 2013
>>            State : clean
>>   Active Devices : 3
>> Working Devices : 4
>>   Failed Devices : 1
>>    Spare Devices : 1
>>         Checksum : 3dc0af3c - correct
>>           Events : 1288444
>>
>>           Layout : left-symmetric
>>       Chunk Size : 128K
>>
>>        Number   Major   Minor   RaidDevice State
>> this     2       8       80        2      active sync   /dev/sdf
>>
>>     0     0       0        0        0      removed
>>     1     1       8       48        1      active sync   /dev/sdd
>>     2     2       8       80        2      active sync   /dev/sdf
>>     3     3       0        0        3      faulty removed
>>     4     4       8       96        4      active sync   /dev/sdg
>>     5     5       8      112        5      spare   /dev/sdh
>>
>>
>> /dev/sdg:
>>            Magic : a92b4efc
>>          Version : 00.90.00
>>             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
>>    Creation Time : Thu Jun 24 15:12:41 2010
>>       Raid Level : raid5
>>    Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
>>       Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
>>     Raid Devices : 5
>>    Total Devices : 5
>> Preferred Minor : 1
>>
>>      Update Time : Wed Oct  9 20:29:41 2013
>>            State : clean
>>   Active Devices : 3
>> Working Devices : 4
>>   Failed Devices : 1
>>    Spare Devices : 1
>>         Checksum : 3dc0af50 - correct
>>           Events : 1288444
>>
>>           Layout : left-symmetric
>>       Chunk Size : 128K
>>
>>        Number   Major   Minor   RaidDevice State
>> this     4       8       96        4      active sync   /dev/sdg
>>
>>     0     0       0        0        0      removed
>>     1     1       8       48        1      active sync   /dev/sdd
>>     2     2       8       80        2      active sync   /dev/sdf
>>     3     3       0        0        3      faulty removed
>>     4     4       8       96        4      active sync   /dev/sdg
>>     5     5       8      112        5      spare   /dev/sdh
>>
>>
>> /dev/sdh:
>>            Magic : a92b4efc
>>          Version : 00.90.00
>>             UUID : 61a6a879:adb7ac7b:86c7b55e:eb5cc2b6
>>    Creation Time : Thu Jun 24 15:12:41 2010
>>       Raid Level : raid5
>>    Used Dev Size : 1953514496 (1863.02 GiB 2000.40 GB)
>>       Array Size : 7814057984 (7452.07 GiB 8001.60 GB)
>>     Raid Devices : 5
>>    Total Devices : 5
>> Preferred Minor : 1
>>
>>      Update Time : Wed Oct  9 20:29:41 2013
>>            State : clean
>>   Active Devices : 3
>> Working Devices : 4
>>   Failed Devices : 1
>>    Spare Devices : 1
>>         Checksum : 3dc0af5c - correct
>>           Events : 1288444
>>
>>           Layout : left-symmetric
>>       Chunk Size : 128K
>>
>>        Number   Major   Minor   RaidDevice State
>> this     5       8      112        5      spare   /dev/sdh
>>
>>     0     0       0        0        0      removed
>>     1     1       8       48        1      active sync   /dev/sdd
>>     2     2       8       80        2      active sync   /dev/sdf
>>     3     3       0        0        3      faulty removed
>>     4     4       8       96        4      active sync   /dev/sdg
>>     5     5       8      112        5      spare   /dev/sdh
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problem diagnosing rebuilding raid5 array
  2013-10-17  2:27   ` peter
@ 2013-10-17  2:39     ` NeilBrown
  2013-10-18  2:41       ` peter
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2013-10-17  2:39 UTC (permalink / raw)
  To: peter; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1371 bytes --]

On Wed, 16 Oct 2013 22:27:54 -0400 peter@steinhoff.se wrote:

> Thanks Neil,
> 
> I've checked the drives and sdd had unrecoverable read errors which  
> was the reason that the rebuilding of sdh failed in the first place.
> 
> Then I made a clone of sdd with ddrescue and it looks like only 4096  
> bytes was completely unreadable.
> 
> Then I did the --force assemble as you suggested using the 3 good  
> drives and the cloned sdd. I ran e2fsck after that and temporarily  
> mounted the array to check that it looked OK.
> 
> Now I've added sdh again and the rebuilding process is underway.  
> Hopefully it will complete.
> 
> 
> I was just wondering but since I lost 4kB of data when I cloned sdd,  
> does that mean that I will have 4x4kB of garbled data somewhere since  
> I assembled 4 drives (n-1) and the raid system wouldn't know? Or would  
> that have been detected somehow when I ran fsck (ext3)?

You could have 2*4kB of garbled data (the block you lost, and the
corresponding block on the device that was rebuilt).
Or you could have 1*4kB, or 0*4kB garbled if either the corrupted block or
the recovered block were parity blocks.

Those bad blocks are not likely to be in filesystem metadata, or else fsck
should have noticed.  They could be in some file(s), or in some free space - in
which case you'll never notice.
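
If you ever work out which filesystem block was affected (which means
translating the bad sector through the RAID5 layout), debugfs can tell you
whether a file owns it, roughly:

debugfs -R "icheck BLOCKNR" /dev/md1    # which inode, if any, uses that filesystem block
debugfs -R "ncheck INODENR" /dev/md1    # path name(s) for that inode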

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Problem diagnosing rebuilding raid5 array
  2013-10-17  2:39     ` NeilBrown
@ 2013-10-18  2:41       ` peter
  0 siblings, 0 replies; 14+ messages in thread
From: peter @ 2013-10-18  2:41 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Thanks a lot Neil and also big thanks to Brian.

With your help I have now successfully rebuilt the array. When I get
the failed drives RMA'd and the new drives back, I'll add those to the
array and turn it into a raid 6.
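
As I understand it, the conversion would be something along the lines of
the commands below, although I expect I'll need a newer mdadm/kernel than
what is currently on this box (the backup file path is just an example,
kept off the array being reshaped):

mdadm /dev/md1 --add /dev/sdX                    # add a new drive as a spare
mdadm --grow /dev/md1 --level=6 --raid-devices=6 \
      --backup-file=/root/md1-reshape.backup     # reshape the 5-drive raid5 into a 6-drive raid6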

Thanks again,
Peter


Quoting NeilBrown <neilb@suse.de>:
> On Wed, 16 Oct 2013 22:27:54 -0400 peter@steinhoff.se wrote:
>> I was just wondering but since I lost 4kB of data when I cloned sdd,
>> does that mean that I will have 4x4kB of garbled data somewhere since
>> I assembled 4 drives (n-1) and the raid system wouldn't know? Or would
>> that have been detected somehow when I ran fsck (ext3)?
>
> You could have 2*4kB of garbled data (the block you lost, and the
> corresponding block on the device that was rebuilt).
> Or you could have 1*4kB, or 0*4kB garbled if either the corrupted block or
> the recovered block were parity blocks.
>
> Those bad blocks are not likely in filesystem metadata else fsck should have
> noticed.  They could be in some file(s), or in some free space - in which
> case you'll never notice.
>
> NeilBrown
>




^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2013-10-18  2:41 UTC | newest]

Thread overview: 14+ messages
2013-10-14 16:31 Problem diagnosing rebuilding raid5 array peter
2013-10-14 17:28 ` Brian Candler
2013-10-15 12:40   ` peter
2013-10-15 12:50     ` Brian Candler
2013-10-15 14:20       ` peter
2013-10-15 13:15     ` Brian Candler
2013-10-15 14:14       ` peter
2013-10-15 14:20         ` Brian Candler
2013-10-15 15:01           ` peter
2013-10-15 15:04             ` Brian Candler
2013-10-16  6:11 ` NeilBrown
2013-10-17  2:27   ` peter
2013-10-17  2:39     ` NeilBrown
2013-10-18  2:41       ` peter
