linux-raid.vger.kernel.org archive mirror
* RAID5 - 4 disk reboot trouble.
@ 2006-05-11 11:46 Guido Moonen
From: Guido Moonen @ 2006-05-11 11:46 UTC (permalink / raw)
  To: linux-raid

Hi,

I'm running a RAID5 system, and when I reboot, my RAID seems to be
failing. (One disk is set to spare and the other disks seem to be okay in
the details output, but we get an Input/output error when trying to mount
it.)

We cannot seem to find the problem in this setup.

If you need more info please contact me using guido.moonen@axon.tv

Specs of the system:
- Kernel 2.6.15.6 (with unionfs patch, Marvell driver, vweb (internal
PCI card) driver, libata, IBM kernel debugger)
- 4x 250 GB SATA hard drives (to be used for the RAID)
- mdadm version v2.4.1 - 4 April 2006
- mke2fs version 1.37

Steps to reproduce our problem:
1. Create the raid system
"mdadm --create -n 4 -l 5 -x 0 /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 
/dev/sdd1"
2. Format the system to use ext3
"mke2fs -j /dev/md0"
3. Reboot (the hard way: turning off the power)
4. Reassemble the raid array
"mdadm --assemble --run --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 
/dev/sdd1"
5. Repeat 3 and 4 until the system no longer assembles and mounts the RAID correctly.
Then it reports:
mdadm: looking for devices for /dev/md0
mdadm: /dev/sda1 is identified as a member of /dev/md0, slot 0.
mdadm: /dev/sdb1 is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdc1 is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdd1 is identified as a member of /dev/md0, slot 4.
mdadm: added /dev/sdb1 to /dev/md0 as 1
mdadm: added /dev/sdc1 to /dev/md0 as 2
mdadm: no uptodate device for slot 3 of /dev/md0
mdadm: added /dev/sdd1 to /dev/md0 as 4
mdadm: added /dev/sda1 to /dev/md0 as 0
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

The line "mdadm: no uptodate device for slot 3 of /dev/md0" is what I see
on every boot, even when it runs correctly.
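
(For reference, a hedged sketch of one way to dig into a failed assembly
like this, using only standard mdadm options; the grep patterns assume
the version-0.90 superblock output format of this mdadm release:)

# Compare the event counter and state recorded in each member's
# superblock; a member with a lower Events count, or recorded with a
# "spare" role, is the one md will not use for slot 3.
for d in /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1; do
    echo "== $d =="
    mdadm --examine "$d" | grep -E 'Events|State|this'
done

# Kernel-side view of the arrays:
cat /proc/mdstat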


The RAID system is used to write a constant MPEG stream (512 kbit/s),
and we have a database (PostgreSQL) active on the RAID. Other than that,
there is no read activity on the RAID system.

** mdadm --detail /dev/md0 after step 2 **
/dev/md0:
        Version : 00.90.03
  Creation Time : Thu May 11 11:29:40 2006
     Raid Level : raid5
     Array Size : 732419136 (698.49 GiB 750.00 GB)
    Device Size : 244139712 (232.83 GiB 250.00 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu May 11 11:35:08 2006
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 2% complete

           UUID : 4d66978f:eab0d6ef:39e6cf38:7a7191ba
         Events : 0.3

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       4       8       49        3      spare rebuilding   /dev/sdd1

** mdadm --detail /dev/md0 after step 5 **
/dev/md0:
        Version : 00.90.03
  Creation Time : Thu May 11 11:29:40 2006
     Raid Level : raid5
    Device Size : 244139712 (232.83 GiB 250.00 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu May 11 11:43:07 2006
          State : active, degraded
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 4d66978f:eab0d6ef:39e6cf38:7a7191ba
         Events : 0.204

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       0        0        3      removed

       4       8       49        -      spare   /dev/sdd1



* Re: RAID5 - 4 disk reboot trouble.
@ 2006-05-11 11:52 ` Neil Brown
From: Neil Brown @ 2006-05-11 11:52 UTC (permalink / raw)
  To: Guido Moonen; +Cc: linux-raid

On Thursday May 11, guido.moonen@axon.tv wrote:
> Hi,
> 
> I'm running a RAID5 system, and when I reboot, my RAID seems to be 
> failing. (One disk is set to spare and the other disks seem to be okay in 
> the details output, but we get an Input/output error when trying to mount 
> it.)
> 
> We cannot seem to find the problem in this setup.
...
>           State : clean, degraded, recovering
                                     ^^^^^^^^^^

Do you ever let the recovery actually finish?  Until you do you don't
have real redundancy.
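
(A minimal sketch of one way to wait for that, assuming the standard
/proc/mdstat interface; newer mdadm releases also offer
"mdadm --wait /dev/md0":)

# Poll /proc/mdstat until no resync/recovery is in progress.
while grep -q 'recovery\|resync' /proc/mdstat; do
    sleep 10
done
echo "recovery finished: the array is fully redundant again"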

NeilBrown


* Re: RAID5 - 4 disk reboot trouble.
@ 2006-05-11 12:00   ` Guido Moonen
From: Guido Moonen @ 2006-05-11 12:00 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Hi,

Computers in the field will be able to complete the whole cycle of
recovering and having a redundant array, but this is a situation that
can happen, and we are not sure what is causing this problem. I will let
one complete the recovery and try to reproduce this bug. But when a
customer replaces one of the drives, this process starts again and there
will be a period where the system is not fully protected.

System use:
This system records a single channel 24/7 and saves the recorded data
(MPEG) on a RAID device. The system must be able to hold 90 days of
recorded material for compliance regulations. When the RAID fails, users
can lose up to 90 days of MPEG, which is not acceptable for compliance
(they must be able to produce the recorded MPEG for 90 days). So we
would like to know if this failure can be avoided, or if there is
another configuration that makes it possible to recover from this state.

Guido.

Neil Brown wrote:

> On Thursday May 11, guido.moonen@axon.tv wrote:
>> Hi,
>>
>> I'm running a RAID5 system, and when I reboot, my RAID seems to be
>> failing. (One disk is set to spare and the other disks seem to be okay in
>> the details output, but we get an Input/output error when trying to mount
>> it.)
>>
>> We cannot seem to find the problem in this setup.
> ...
>>           State : clean, degraded, recovering
>                                      ^^^^^^^^^^
>
> Do you ever let the recovery actually finish?  Until you do you don't
> have real redundancy.
>
> NeilBrown



* Re: RAID5 - 4 disk reboot trouble.
@ 2006-05-11 14:15     ` Guido Moonen
From: Guido Moonen @ 2006-05-11 14:15 UTC (permalink / raw)
  To: linux-raid; +Cc: Neil Brown

After some more tests:

A running system with a correct RAID array has no trouble rebooting and
re-assembling, but a system missing one of the disks also crashes the
RAID on a reboot.

I know we should have a fully synchronized 4-disk RAID array. But it
seems to me it should still be able to assemble the array without the
fourth disk, multiple times. Is there something that I should change in
my configuration, or anything I can do to prevent this?
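
(For reference, a hedged aside: kernels from 2.6.14 on refuse by default
to start a RAID5 array that is both degraded and dirty after an unclean
shutdown, because parity may be inconsistent, and RUN_ARRAY then fails
with an Input/output error much like the one above. Whether that is the
failure here is an assumption, but the documented override from
Documentation/md.txt is:)

# Kernel boot parameter: allow md to start a dirty, degraded array.
# Stripes that were being written at power-off may contain corrupt
# data, so this trades safety for availability.
md-mod.start_dirty_degraded=1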

Guido.

Print of a correct (fully synced) system:
[root@localhost ~]# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Thu May 11 12:05:31 2006
     Raid Level : raid5
     Array Size : 732419136 (698.49 GiB 750.00 GB)
    Device Size : 244139712 (232.83 GiB 250.00 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu May 11 13:36:45 2006
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 32c52389:27a260ee:ed154946:5e56f4ed
         Events : 0.4

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      active sync   /dev/sdd1

This configuration does not have the problem.

Print with one drive missing:
/dev/md0:
        Version : 00.90.03
  Creation Time : Thu May 11 12:05:31 2006
     Raid Level : raid5
    Device Size : 244139712 (232.83 GiB 250.00 GB)
   Raid Devices : 4
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu May 11 14:09:09 2006
          State : active, degraded
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 32c52389:27a260ee:ed154946:5e56f4ed
         Events : 0.455

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       0        0        3      removed

Guido Moonen wrote:

> Computers in the field will be able to complete the whole cycle of
> recovering and having a redundant array, but this is a situation that
> can happen, and we are not sure what is causing this problem.
> [...]


