linux-raid.vger.kernel.org archive mirror
* A few mdadm questions
@ 2004-11-13  3:34 Robert Osiel
  2004-11-13  4:21 ` Guy
  0 siblings, 1 reply; 9+ messages in thread
From: Robert Osiel @ 2004-11-13  3:34 UTC (permalink / raw)
  To: linux-raid

Hello.
I have a five-disk RAID 5 array  in which one disk's failure went 
unnoticed for an indeterminate time.  Once I finally noticed, I did a 
raidhotremove on the disk -- or what I thought was the disk.  
Unfortunately, I can't count.  Now my array has one 'failed' disk and 
one 'spare' disk.  Aaargh.

Since then, I've learned a lot, but I haven't been able to find 
reassurances and/or answers elsewhere on a few issues.

The two big questions are:
1) How can I mark the 'spare' disk as 'clean' and get it back in the 
array?  If I read the mdadm source correctly, it looks like 'removed' 
disks are skipped when trying to assemble.
2) If I --assemble --force the array and just specify (n-1) disks, does 
that ensure that (if the array starts) it starts in degraded mode and 
won't start re-writing the parity information? 

Thanks a bunch in advance for any help.

Bob



* RE: A few mdadm questions
  2004-11-13  3:34 A few mdadm questions Robert Osiel
@ 2004-11-13  4:21 ` Guy
  2004-11-13  7:32   ` Neil Brown
  0 siblings, 1 reply; 9+ messages in thread
From: Guy @ 2004-11-13  4:21 UTC (permalink / raw)
  To: 'Robert Osiel', linux-raid

First, stop using the old raid tools.  Use mdadm only!  mdadm would not have
allowed your error to occur.

If you start the array with n-1 disks, it can't rebuild.
I think you can recover.  I simulated your mistake; see the results:

Status of array before I started to trash it:
    Number   Major   Minor   RaidDevice State
       0       1        1        0      active sync   /dev/ram1
       1       1       14        1      active sync   /dev/ram14
       2       1       13        2      active sync   /dev/ram13
       3       1        0        3      active sync   /dev/ram0

Failed 1 disk:
# mdadm /dev/md3 -f /dev/ram14
mdadm: set /dev/ram14 faulty in /dev/md3

Attempt to remove another disk, but mdadm will not allow it:
# mdadm /dev/md3 -r /dev/ram13
mdadm: hot remove failed for /dev/ram13: Device or resource busy

Fail another disk; the array is now in a very bad state:
# mdadm /dev/md3 -f /dev/ram13
mdadm: set /dev/ram13 faulty in /dev/md3

Remove the second failed disk:
# mdadm /dev/md3 -r /dev/ram13
mdadm: hot removed /dev/ram13

Now I attempt to recover.

Stop the array:
# mdadm -S /dev/md3

Check the status:
# mdadm -D /dev/md3
mdadm: md device /dev/md3 does not appear to be active.

Now start the array, listing n-1 disks.
# mdadm --assemble --force /dev/md3 /dev/ram0 /dev/ram1 /dev/ram13
mdadm: forcing event count in /dev/ram13(2) from 66 upto 69
mdadm: clearing FAULTY flag for device 2 in /dev/md3 for /dev/ram13
mdadm: /dev/md3 has been started with 3 drives.

Add the disk that failed first:
# mdadm /dev/md3 -a /dev/ram14
mdadm: hot added /dev/ram14

After a re-sync the array is fine.

So, at this point, this is what you need to do:

Stop the array:
mdadm -S /dev/mdx

Start the array using the 4 good disks, not the disk that failed first.
mdadm --assemble --force /dev/mdx <list the 4 good disks>

Your array should be up at this point.

You can now add the failed disk:
mdadm /dev/mdx -a /dev/xxx
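
Once it is added back, you can watch the re-sync progress and check the
array state with (generic names, substitute your real md device):

cat /proc/mdstat
mdadm -D /dev/mdx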

Hope this helps!
If you have questions, just post again.

Guy


* RE: A few mdadm questions
  2004-11-13  4:21 ` Guy
@ 2004-11-13  7:32   ` Neil Brown
  2004-11-14  0:35     ` Robert Osiel
  0 siblings, 1 reply; 9+ messages in thread
From: Neil Brown @ 2004-11-13  7:32 UTC (permalink / raw)
  To: Guy; +Cc: 'Robert Osiel', linux-raid

On Friday November 12, bugzilla@watkins-home.com wrote:
> First, stop using the old raid tools.  Use mdadm only!  mdadm would not have
> allowed your error to occur.

I'm afraid this isn't correct, though the rest of Guy's advice is very
good (thanks Guy!).

  mdadm --remove
does exactly the same thing as
  raidhotremove

It is the kernel that should (and does) stop you from hot-removing a
device that is working and active.  So I'm not quite sure what
happened to Robert...

Robert: it is always useful to provide specifics, including the output of
    cat /proc/mdstat
and
    mdadm -D /dev/mdX

This avoids possible confusion over terminology.

NeilBrown


* Re: A few mdadm questions
  2004-11-13  7:32   ` Neil Brown
@ 2004-11-14  0:35     ` Robert Osiel
  2004-11-14  2:03       ` Guy
  0 siblings, 1 reply; 9+ messages in thread
From: Robert Osiel @ 2004-11-14  0:35 UTC (permalink / raw)
  To: linux-raid

Guy/Neil:

Thanks a lot for the help.
Sorry that I didn't include all of the info in my last message, but this 
box is off the network right now and doesn't even have a floppy or 
monitor, so I had to do a little work to get the info out.

I tried to start the array with the 3 good disks and the 1 spare, but I 
got an error to the effect that 3 good + 1 spare drives are not enough 
to start the array (see below).

 > cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5]  [multipath]
read_ahead not set
unused devices: <none>

 > mdadm -D /dev/md0
mdadm: md device /dev/md0 does not appear to be active

 > mdadm --assemble --force /dev/md0 /dev/hde1 /dev/hdi1 /dev/hdm1 /dev/hdo1
mdadm: /dev/md0 assembled from 3 drives and 1 spare - not enough to start the array

 > cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5]  [multipath]
read_ahead not set
md0: inactive ide/host2/bus0/target0/lun0/part1[0] ide/host4/bus0/target0/lun0/part1[5] ide/host6/bus1/target0/lun0/part1[4] ide/host6/bus0/target0/lun0/part1[3]

Some notes:
hdk1 is the disk which failed initially
hdi1 is the disk which I removed and which thinks it is a 'spare'

The other three drives report basically identical info, like this:
 > mdadm -E /dev/hde1

Magic : a92b4efc
Version : 00.90.00
UUID : ec2e64a8:fffd3e41:ffee5518:2f3e858c
Creation Time : Sun Oct 5 01:25:49 2003
Raid Level : raid5
Device Size : 160079488 (152.66 GiB 163.92 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 0

Update Time Sat Sep 25 22:07:26 2004
State : dirty
Active Devices : 3
Working Devices : 4
Failed Devices : 1
Spare Devices : 1
Checksum : 4ee5cc77 - correct
Events : 0.10

Layout : left-symmetric
Chunk Size :  128K

    Number        Major    Minor    RaidDevice    State
this    0        22        1        0        active sync
0        0        22        1        0        active sync
1        1        0        0        1        faulty removed
2        2        56        1        2        faulty          /dev/ide/host4/bus0/target0/lun0/part1
3        3        57        1        3        active sync     /dev/ide/host4/bus1/target0/lun0/part1
4        4        88        1        4        active sync     /dev/ide/host6/bus0/target0/lun0/part1
5        5        34        1        5        spare

Here are the two drives in question:

__________mdadm -E /dev/hdi1:

Magic : a92b4efc
Version : 00.90.00
UUID : ec2e64a8:fffd3e41:ffee5518:2f3e858c
Creation Time : Sun Oct 5 01:25:49 2003
Raid Level : raid5
Device Size : 160079488 (152.66 GiB 163.92 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 0

Update Time Sat Sep 25 22:07:26 2004
State : dirty
Active Devices : 3
Working Devices : 4
Failed Devices : 1
Spare Devices : 1
Checksum : 4ee5cc77 - correct
Events : 0.10

Layout : left-symmetric
Chunk Size :  128K

    Number        Major    Minor    RaidDevice    State
this    5        34        1        5        spare
0        0        22        1        0        active sync
1        1        0        0        1        faulty removed
2        2        56        1        2        faulty          /dev/ide/host4/bus0/target0/lun0/part1
3        3        57        1        3        active sync     /dev/ide/host4/bus1/target0/lun0/part1
4        4        88        1        4        active sync     /dev/ide/host6/bus0/target0/lun0/part1
5        5        34        1        5        spare


__________mdadm -E /dev/hdk1
Magic : a92b4efc
Version : 00.90.00
UUID : ec2e64a8:fffd3e41:ffee5518:2f3e858c
Creation Time : Sun Oct 5 01:25:49 2003
Raid Level : raid5
Device Size : 160079488 (152.66 GiB 163.92 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 0

Update Time Sat Sep 25 22:07:24 2004
State : dirty
Active Devices : 4
Working Devices : 5
Failed Devices : 0
Spare Devices : 1
Checksum : 4ee5cc77 - correct
Events : 0.9

Layout : left-symmetric
Chunk Size :  128K

    Number        Major    Minor    RaidDevice    State
this    2        56        1        2        active sync     /dev/ide/host4/bus0/target0/lun0/part1
0        0        22        1        0        active sync
1        1        0        0        1        faulty removed
2        2        56        1        2        active sync     /dev/ide/host4/bus0/target0/lun0/part1
3        3        57        1        3        active sync     /dev/ide/host4/bus1/target0/lun0/part1
4        4        88        1        4        active sync     /dev/ide/host6/bus0/target0/lun0/part1
5        5        34        1        5        spare





* RE: A few mdadm questions
  2004-11-14  0:35     ` Robert Osiel
@ 2004-11-14  2:03       ` Guy
  2004-11-14 16:12         ` Robert Osiel
  0 siblings, 1 reply; 9+ messages in thread
From: Guy @ 2004-11-14  2:03 UTC (permalink / raw)
  To: 'Robert Osiel', linux-raid

Your array had 5 disks, not counting any spares.
You need to start the array with at least 4 of the 5 disks; spares don't
help when starting an array.

I don't know why it thinks your disk (hdi1) is a spare.  But, that may
explain how it was removed from the array.  Unless Neil has some magic
incantations, I think you are out of luck.

If Neil has no ideas, you could try to start the array with the drive that
failed (hdk1), but that will cause corruption of any stripes that have
changed since the drive was removed from the array.  So, save this option as
a last resort.  Of course, if hdk1 has failed hard, you will not be able to
use it.
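
If you happen to have smartmontools installed (it may not be on your
system), you can get a rough idea of whether hdk is still responding
before you try anything:

smartctl -a /dev/hdk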

Last resort!!!  Corruption will occur!
mdadm --assemble --force /dev/md0 /dev/hde1 /dev/hdk1 /dev/hdm1 /dev/hdo1

Guy


* Re: A few mdadm questions
  2004-11-14  2:03       ` Guy
@ 2004-11-14 16:12         ` Robert Osiel
  2004-11-14 23:42           ` Neil Brown
  0 siblings, 1 reply; 9+ messages in thread
From: Robert Osiel @ 2004-11-14 16:12 UTC (permalink / raw)
  To: Guy; +Cc: linux-raid

Guy,

Thanks for the input.  I'm not sure why that disk is now a spare 
either.  I was hoping that there was some way to re-write that 
superblock to convince the array it was a good disk.  I saw some old 
(pre-mdadm) advice which mentioned using mkraid to rewrite (all) of the 
superblocks, but that seems really drastic.

In the worst case, as you mentioned, I would try to start with the other 
(failed) disk.  Most of the data on that drive is fairly static, so I 
hope to have some good recovery -- assuming the disk is still OK (in the 
past it has been something like a loose cable, so I'm hopeful).

I'll wait and see if Neil has any advice. *crosses fingers*

Bob



* Re: A few mdadm questions
  2004-11-14 16:12         ` Robert Osiel
@ 2004-11-14 23:42           ` Neil Brown
  2004-11-15  5:34             ` Guy
  2004-11-15 15:50             ` Robert Osiel
  0 siblings, 2 replies; 9+ messages in thread
From: Neil Brown @ 2004-11-14 23:42 UTC (permalink / raw)
  To: Robert Osiel; +Cc: Guy, linux-raid

On Sunday November 14, bob@osiel.org wrote:
> 
> I'll wait and see if Neil has any advice. *crosses fingers*
> 

Well, my reading of the information you sent (very complete, thanks),
is:

At
   Update Time Sat Sep 25 22:07:24 2004

when /dev/hdk1 last had a superblock update, the array had one failed
drive (not present) and one spare.
At this point it *should* have been rebuilding the spare to replace the
missing device, but I cannot tell if it actually was.

At
   Update Time Sat Sep 25 22:07:26 2004
(2 seconds later) when /dev/hdi1 was last written, another drive had
failed, apparently [major=56, minor=1], which is /dev/hdi1 on my
system but seems to be different for you.
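
(Robert: you can check which device major 56, minor 1 actually is on
your system with something like
    ls -l /dev/hd?1 /dev/ide/host*/bus*/target0/lun0/part1
-- the two numbers printed before the date are the major and minor.)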

If that drive, whichever it is, is really dead, then you have lost all
your data.  If, however, it was a transient error or even a
single-block error, then you can recover most of it with
  mdadm -A /dev/md0 --uuid=ec2e64a8:fffd3e41:ffee5518:2f3e858c --force /dev/hd?1

This will choose the best 4 drives and assemble a degraded array with
them.  It will only update the superblocks and assemble the array - it
won't touch the data at all.

You can then try mounting the filesystem read-only and dumping the
data to backup.
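
For example (assuming /dev/md0 and that you have somewhere to copy the
data to -- adjust mount point and destination to whatever you have):

    mount -o ro /dev/md0 /mnt
    cp -a /mnt/. /backup/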

When you add the 5th drive (hdi?) it should start rebuilding.  If it
gets a read error on one of the drives, the rebuild will fail, but the
data should still be safe.
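
(Something along the lines of
    mdadm /dev/md0 -a /dev/hdi1
assuming hdi1 really is the one currently marked as the spare; you can
then watch the recovery in /proc/mdstat.)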

I'm still very surprised that you managed to "raidhotremove" without
"raidsetfaulty" first... What kernel (exactly) are you running?

NeilBrown


* RE: A few mdadm questions
  2004-11-14 23:42           ` Neil Brown
@ 2004-11-15  5:34             ` Guy
  2004-11-15 15:50             ` Robert Osiel
  1 sibling, 0 replies; 9+ messages in thread
From: Guy @ 2004-11-15  5:34 UTC (permalink / raw)
  To: 'Neil Brown', 'Robert Osiel'; +Cc: linux-raid

I guess I have been confused.  I did not realize this was a 5-disk RAID5
with 1 spare -- six disks total.  Is this correct?

If the above is correct, then:
Based on Neil's email, I now see that /dev/hdi1 is and was the spare.
This is the device that was hot removed.
This device should have no data on it.  So it should not be included when
trying to recover.

Just to be safe, do you know the device names of the six devices in the
array?  If so, try to assemble, but don't include hdi1 or hdk1.  If Neil is
correct, your 2 failures were only 2 seconds apart.  Has your array been
down since Sat Sep 25 22:07:26 2004?  If so, I guess hdk1 can't be too out
of date!  So, use it if needed.

Neil said to use this command:
mdadm -A /dev/md0 --uuid=ec2e64a8:fffd3e41:ffee5518:2f3e858c --force /dev/hd?1
I am worried that it may attempt to use hdi1, which is the spare.
Also, I don't know that all of your devices match hd?1.  I have never seen
the complete list of devices.
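
A quick way to see exactly which devices the hd?1 pattern would pick up,
and whether they all carry the same array UUID (just a rough shell loop,
adjust the pattern to match your devices):

for d in /dev/hd?1; do echo $d; mdadm -E $d 2>/dev/null | grep UUID; done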

Guy


* Re: A few mdadm questions
  2004-11-14 23:42           ` Neil Brown
  2004-11-15  5:34             ` Guy
@ 2004-11-15 15:50             ` Robert Osiel
  1 sibling, 0 replies; 9+ messages in thread
From: Robert Osiel @ 2004-11-15 15:50 UTC (permalink / raw)
  To: Neil Brown; +Cc: Guy, linux-raid


Neil:
The machine is/was running plain Mandrake 8.0 (Debian wouldn't install
on that box); I built the box more than a year ago and haven't had to
mess with it since the first week or so -- no reboots until this drive
failed.  I believe Mandrake 8 shipped a 2.4.3 kernel (the box won't boot
on its own now, so I'm afraid I can't be more specific); currently I'm
booting via a live CD with a Gentoo 2.6 kernel.

Guy:
The array is a five-disk array -- hde1, hdi1, hdk1, hdm1, hdo1.  That's 
how I created it.  When I do mdadm -E on any of the drives, it also 
shows a device 1 as "faulty removed", but that device never existed and 
I don't know why it is listed (and it doesn't seem to matter to the 
array, since I've run degraded on 4 disks before).  I created the array 
with the old tools and a raidtab.

hde1, hdm1, and hdo1 are the "good" drives.
hdk1 is the drive which initially failed (and is listed as faulty).
hdi1 (device 5) is the disk I raidhotremoved (and is now a 'spare').

It has been a while since this happened, but as I remember it was a 
Saturday when I checked on the array and noticed that there was a bad 
disk.  What doesn't make sense is that it would have been in the 
afternoon -- is the time stamp in the superblock GMT?  If so, that might 
make sense.  However, I certainly did not notice a drive was out and do 
the remove within 2 seconds -- at a minimum it would have been several 
minutes, and could have been as long as weeks.

Continued thanks!

Bob.

>I'm still very surprised that you managed to "raidhotremove" without
>"raidsetfaulty" first... What kernel (exactly) are you running?
>
>NeilBrown
>  
>


