mdadm error when trying to replace a failed drive in RAID5 array

* mdadm error when trying to replace a failed drive in RAID5 array
@ 2008-01-19 23:08 Steve Fairbairn
  2008-01-20 21:10 ` Robin Hill
  0 siblings, 1 reply; 5+ messages in thread
From: Steve Fairbairn @ 2008-01-19 23:08 UTC (permalink / raw)
  To: linux-raid

Hi All,

Firstly, I must express my thanks to Neil Brown for being willing to
respond to the direct email I sent him as I couldn't for the life of me
find any forums on mdadm or this list...

I have a Software RAID 5 device configured, but one of the drives
failed. I removed the drive with the following command...

mdadm /dev/md0 --remove /dev/hdc1

[root@space ~]# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md1 : active raid5 hdk1[5] hdi1[3] hdh1[2] hdg1[1] hde1[0]
      976590848 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
      [====>................]  recovery = 22.1% (54175872/244147712) finish=3615.3min speed=872K/sec

md0 : active raid5 sdd1[4] sdc1[2] sdb1[1] sda1[0]
      1953535744 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUU_U]

unused devices: <none>
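
The "[5/4] [UUU_U]" fields above are what mark an array as degraded: five members expected, four present, and the underscore marks the empty slot. A minimal Python sketch of reading that status follows; the function name and the trimmed sample text are illustrative only and are not part of mdadm or the kernel.

```python
import re

def degraded_arrays(mdstat_text):
    """Return {array: (expected, present, slot_map)} for arrays missing members."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status looks like "[5/4] [UUU_U]": expected/present, then one flag per slot.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status:
            expected, present = int(status.group(1)), int(status.group(2))
            if present < expected:
                result[current] = (expected, present, status.group(3))
            current = None
    return result

# Trimmed copy of the /proc/mdstat output quoted above.
SAMPLE = """\
md1 : active raid5 hdk1[5] hdi1[3] hdh1[2] hdg1[1] hde1[0]
      976590848 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
md0 : active raid5 sdd1[4] sdc1[2] sdb1[1] sda1[0]
      1953535744 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUU_U]
"""

for name, (expected, present, slots) in degraded_arrays(SAMPLE).items():
    print(name, f"{present}/{expected}", "empty slot:", slots.index("_"))
```

On a live system the text would come from reading /proc/mdstat instead of the SAMPLE string.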

Please ignore /dev/md1 for now at least.  Now my array (/dev/md0) shows
the following...

[root@space ~]# mdadm -QD /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Wed Jan 9 18:57:53 2008
Raid Level : raid5
Array Size : 1953535744 (1863.04 GiB 2000.42 GB)
Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
Raid Devices : 5
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Tue Jan 4 04:28:03 2005
State : clean, degraded
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0

Layout : left-symmetric
Chunk Size : 64K

UUID : 382c157a:405e0640:c30f9e9e:888a5e63
Events : 0.337650

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       0        0        3      removed
       4       8       49        4      active sync   /dev/sdd1

Now, when I try to insert the replacement drive back in, I get the
following...

[root@space ~]# mdadm /dev/md0 --add /dev/hdc1
mdadm: add new device failed for /dev/hdc1 as 5: Invalid argument

It seems to be that mdadm is trying to add the device as number 5
instead of replacing number 3, but I have no idea why, or how to make it
replace number 3.

--- Neil has explained to me already that the drive should be added as
5, and then switched to 3 after a rebuild is complete.  Neil also
asked me if dmesg showed up anything when I tried adding the drive...

[root@space mdadm-2.6.4]# dmesg | tail
...
md: hdc1 has invalid sb, not importing!
md: md_import_device returned -22
md: hdc1 has invalid sb, not importing!
md: md_import_device returned -22

I have updated mdadm to the latest version I can find...

[root@space ~]# mdadm --version
mdadm - v2.6.4 - 19th October 2007

Still get the same error. I'm hoping someone will have some suggestion
as to how to sort this out. Backing up nearly 2TB of data isn't really a
viable option for me, so I'm quite desperate to get the redundancy back.

My linux distribution is a relatively new installation from CentOS 5.1
ISOs.  The Kernel version is 

[root@space ~]# uname -a
Linux space.homenet.com 2.6.18-53.1.4.el5 #1 SMP Fri Nov 30 00:45:55 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

Many Thanks,

Steve.

No virus found in this outgoing message.
Checked by AVG Free Edition. 
Version: 7.5.516 / Virus Database: 269.19.7/1232 - Release Date:
18/01/2008 19:32

* RE: mdadm error when trying to replace a failed drive in RAID5 array
       [not found] <4793AEEC.7090802@tmr.com>
@ 2008-01-20 21:01 ` Steve Fairbairn
  0 siblings, 0 replies; 5+ messages in thread
From: Steve Fairbairn @ 2008-01-20 21:01 UTC (permalink / raw)
  To: 'Bill Davidsen'; +Cc: linux-raid

Thanks for the response Bill.  Neil has responded to me a few times, but
I'm more than happy to try and keep it on this list instead as it feels
like I'm badgering Neil which really isn't fair...

Since my initial email, I came to believe it was down to
the superblock, and that --zero-superblock wasn't working, so a good few
hours and a dd if=/dev/zero of=/dev/hdc later, I tried adding it again
with the same result.

As it happens, I did the --zero-superblock, then tried to insert it
again and then examined (mdadm -E) again and the block was 'still there'
- What really happened was that the act of trying to add it writes in
the superblock.  So --zero-superblock is working fine for me, but it's
still refusing to add the device.

The only other thing I've tried is moving the replacement drive to
/dev/hdd instead (secondary slave) with a small old HD I had lying
around as hdc.

[root@space ~]# mdadm -E /dev/hdd1
mdadm: No md superblock detected on /dev/hdd1.

[root@space ~]# mdadm /dev/md0 --add /dev/hdd1
mdadm: add new device failed for /dev/hdd1 as 5: Invalid argument

[root@space ~]# dmesg | tail
...
md: hdd1 has invalid sb, not importing!
md: md_import_device returned -22

[root@space ~]# mdadm -E /dev/hdd1
/dev/hdd1:
Magic : a92b4efc
Version : 00.90.00
UUID : 382c157a:405e0640:c30f9e9e:888a5e63
Creation Time : Wed Jan 9 18:57:53 2008
Raid Level : raid5
Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
Array Size : 1953535744 (1863.04 GiB 2000.42 GB)
Raid Devices : 5
Total Devices : 4
Preferred Minor : 0
Update Time : Sun Jan 20 13:02:00 2008
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 1
Spare Devices : 0
Checksum : 198f8fb4 - correct
Events : 0.348270
Layout : left-symmetric
Chunk Size : 64K
      Number   Major   Minor   RaidDevice State
this     5      22       65       -1      spare   /dev/hdd1
   0     0       8        1        0      active sync   /dev/sda1
   1     1       8       17        1      active sync   /dev/sdb1
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       0        0        3      faulty removed
   4     4       8       49        4      active sync   /dev/sdd1

I have mentioned it to Neil, but didn't mention it here before.  I am a
C developer by trade, so can easily delve into the mdadm source for
extra debug if anyone thinks it could help.  I could also delve into md
in the kernel if really wanted, but my knowledge of building kernels on
Linux is some 4+ years out of date and forgotten, so if that's a yes,
then some pointers on how to get the CentOS kernel config and a choice
of kernel from www.kernel.org, or from the CentOS distro, would be
invaluable.

I'm away for a few days from tomorrow and probably won't be able to do
much if anything until I'm back on Thursday, so please be patient if I
don't respond before then.

Many Thanks,

Steve.

* Re: mdadm error when trying to replace a failed drive in RAID5 array
  2008-01-19 23:08 Steve Fairbairn
@ 2008-01-20 21:10 ` Robin Hill
  0 siblings, 0 replies; 5+ messages in thread
From: Robin Hill @ 2008-01-20 21:10 UTC (permalink / raw)
  To: linux-raid

On Sat Jan 19, 2008 at 11:08:43PM -0000, Steve Fairbairn wrote:

> 
> Hi All,
> 
> I have a Software RAID 5 device configured, but one of the drives
> failed. I removed the drive with the following command...
> 
> mdadm /dev/md0 --remove /dev/hdc1
> 
> Now, when I try to insert the replacement drive back in, I get the
> following...
> 
> [root@space ~]# mdadm /dev/md0 --add /dev/hdc1
> mdadm: add new device failed for /dev/hdc1 as 5: Invalid argument
> 
> [root@space mdadm-2.6.4]# dmesg | tail
> ...
> md: hdc1 has invalid sb, not importing!
> md: md_import_device returned -22
> md: hdc1 has invalid sb, not importing!
> md: md_import_device returned -22
> 
I've had the same error message trying to add a drive into an array
myself - in my case I'm almost certain it's because the drive is
slightly smaller than the others in the array (the array's currently
growing so I haven't delved any further yet).  Have you checked the
actual partition sizes?  Particularly if it's a different type of drive
as drives from different manufacturers can vary by quite a large
amount.
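
A size check along these lines can be sketched in Python by parsing the text of /proc/partitions (sizes are in 1 KiB blocks). This is only a sketch: the function names are made up for illustration, and the sample sizes below are hypothetical values, not figures taken from this thread.

```python
def partition_sizes(text):
    """Map partition name -> size in 1 KiB blocks from /proc/partitions text."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows are "major minor #blocks name"; header and blank lines are skipped.
        if len(fields) == 4 and fields[2].isdigit():
            sizes[fields[3]] = int(fields[2])
    return sizes

def shortfall(sizes, members, candidate):
    """Blocks by which the candidate is smaller than the smallest member (<= 0 is fine)."""
    return min(sizes[m] for m in members) - sizes[candidate]

# Hypothetical sample; on a real system use open("/proc/partitions").read().
SAMPLE = """\
major minor  #blocks  name

   8     1  488384001 sda1
   8    17  488384001 sdb1
  22    65  488383979 hdd1
"""

sizes = partition_sizes(SAMPLE)
missing = shortfall(sizes, ["sda1", "sdb1"], "hdd1")
print("hdd1 is", missing, "blocks smaller than the smallest member")
```

A positive result means the candidate partition is smaller than every existing member and may well be refused.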

Cheers,
        Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |

* RE: mdadm error when trying to replace a failed drive in RAID5 array
       [not found] <18323.45288.778496.719826@notabene.brown>
@ 2008-01-20 21:21 ` Steve Fairbairn
  2008-01-21 10:38   ` Ask Bjørn Hansen
  0 siblings, 1 reply; 5+ messages in thread
From: Steve Fairbairn @ 2008-01-20 21:21 UTC (permalink / raw)
  To: 'Neil Brown'; +Cc: linux-raid

> -----Original Message-----
> From: Neil Brown [mailto:neilb@suse.de] 
> Sent: 20 January 2008 20:37
> 
> > md: hdd1 has invalid sb, not importing!
> > md: md_import_device returned -22
> 
> In 2.6.18, the only things that can return this message
> without other, more explanatory messages are:
> 
>   2/ If the device appears to be too small.
> 
> Maybe it is the latter, though that seems unlikely.
> 

[root@space ~]# mdadm /dev/md0 --verbose --add /dev/hdd1
mdadm: added /dev/hdd1

HUGE thanks to Neil, and one white gold plated donkey award to me.

OK.  When I created /dev/md1 after creating /dev/md0, I was using a
mishmash of disks I had lying around.  As this selection of disks used
differing block sizes, I chose to create the raid partitions from the
first block, to a set size (+250G).  When I reinstalled the disk for
going into /dev/md0, I partitioned the disk the same way (+500G), which
it turns out isn't how I created the partitions when I created that
array.

So the device I was trying to add was about 22 blocks too small.  Taking
Neil's suggestion and looking at /proc/partitions showed this up
incredibly quickly.
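
Why a handful of blocks matters can be sketched as follows. The assumption here, stated loudly, is the usual description of the version 0.90 superblock layout: the superblock occupies a 64 KiB reserved region placed at a 64 KiB-aligned offset near the end of the device, so the usable size is the partition size rounded down to a multiple of 64 KiB, minus 64 KiB. The constant and the function names are illustrative, and the exact check the kernel performs is not shown in this thread.

```python
MD_RESERVED_KIB = 64  # assumed size of the reserved region for 0.90 metadata

def usable_kib_v090(partition_kib):
    """Usable data size in KiB under the assumed 0.90 layout."""
    aligned = partition_kib & ~(MD_RESERVED_KIB - 1)  # round down to a 64 KiB multiple
    return aligned - MD_RESERVED_KIB

USED_DEV_SIZE = 488383936  # "Used Dev Size" reported by mdadm -QD /dev/md0 above

def can_join(partition_kib, used_dev_size=USED_DEV_SIZE):
    """True if the partition offers at least as much usable space as the array needs."""
    return usable_kib_v090(partition_kib) >= used_dev_size

# Smallest partition (in KiB) that would satisfy this array under the assumption.
minimum = USED_DEV_SIZE + MD_RESERVED_KIB
print(minimum, can_join(minimum), can_join(minimum - 1))
```

Under that assumption any partition below 488384000 KiB falls short for this array, so a deficit of a couple of dozen blocks is enough to be refused.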

My sincere apologies for wasting all your time on a stupid error, and
again many many thanks for the solution...

md0 : active raid5 hdd1[5] sdd1[4] sdc1[2] sdb1[1] sda1[0]
      1953535744 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUU_U]
      [>....................]  recovery =  0.9% (4430220/488383936) finish=1110.8min speed=7259K/sec

Steve.

* Re: mdadm error when trying to replace a failed drive in RAID5 array
  2008-01-20 21:21 ` mdadm error when trying to replace a failed drive in RAID5 array Steve Fairbairn
@ 2008-01-21 10:38   ` Ask Bjørn Hansen
  0 siblings, 0 replies; 5+ messages in thread
From: Ask Bjørn Hansen @ 2008-01-21 10:38 UTC (permalink / raw)
  To: Steve Fairbairn; +Cc: 'Neil Brown', linux-raid

On Jan 20, 2008, at 1:21 PM, Steve Fairbairn wrote:

> So the device I was trying to add was about 22 blocks too small.  Taking
> Neil's suggestion and looking at /proc/partitions showed this up
> incredibly quickly.

Always leave a little space at the end; it makes sure you don't run
into that particular problem when you replace disks, and the end of the
disk is often significantly slower anyway.

From before the write-intent bitmap stuff I have/had a habit of
creating separate raids on relatively small partitions (joined  
together by LVM).  I'd just pick a fixed size (on 500GB disks I'd use  
90GB per partition for example) and create however many partitions  
would fit like that and leave the end for scratch space /  
experiments / whatever.
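
The arithmetic behind that habit is simple enough to sketch in Python. The function name is made up for illustration, and the sizes are nominal round GB as in the example above, not exact disk capacities.

```python
def plan_partitions(disk_gb, part_gb):
    """Return (number of fixed-size partitions, leftover GB at the end of the disk)."""
    count = disk_gb // part_gb
    leftover = disk_gb - count * part_gb
    return count, leftover

count, leftover = plan_partitions(500, 90)
print(count, "partitions of 90GB, leaving", leftover, "GB for scratch space")
```

Because every partition has the same fixed size, a replacement disk only needs to fit the same number of partitions, whatever its exact capacity.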

  - ask

-- 
http://develooper.com/ - http://askask.com/
