linux-raid.vger.kernel.org archive mirror
* strange RAID5 problem
@ 2006-05-09  5:30 Maurice Hilarius
  2006-05-09  5:45 ` Neil Brown
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Maurice Hilarius @ 2006-05-09  5:30 UTC
  To: linux-raid; +Cc: neilb

Good evening.

I am having a bit of a problem with a largish RAID5 set.
Now it is looking more and more like I am about to lose all the data on
it, so I am asking (begging?) to see if anyone can help me sort this out.


Here is the scenario: 16 SATA disks connected to a pair of AMCC (3Ware)
9550SX-12 controllers.

RAID 5, 15 disks, plus 1 hot spare.

SMART started reporting errors on a disk, so it was retired with the
3Ware CLI, then removed and replaced.
The new disk had a JBOD signature added with the 3Ware CLI, then a
single large partition was created with fdisk.

At this point I would expect to be able to add the disk back to the
array by:
[root@box ~]# mdadm /dev/md3 -a /dev/sdw1

But, I get this error message:
mdadm: hot add failed for /dev/sdw1: No such device

What? We just made the partition on sdw a moment ago in fdisk. It IS there!

So, we look around a bit:
# cat /proc/mdstat

md3 : inactive sdq1[0] sdaf1[15] sdae1[14] sdad1[13] sdac1[12] sdab1[11]
sdaa1[10] sdz1[9] sdy1[8] sdx1[7] sdv1[5] sdu1[4] sdt1[3] sds1[2]
sdr1[1]
      5860631040 blocks

Yup, that looks correct, missing sdw1[6]
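
As a rough cross-check of the member list above, the slot numbers in
brackets can be compared against the expected range. This is only a sketch:
missing_slots is a hypothetical helper, not part of mdadm, and it assumes
the md3 line has been joined back onto one line (as the kernel prints it)
rather than wrapped by the mail client.

```shell
# missing_slots LINE COUNT
# Prints every slot number in 0..COUNT-1 that does not appear as [N] in LINE.
# Hypothetical helper for illustration only.
missing_slots() {
    line=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        case "$line" in
            *"[$i]"*) ;;          # slot i is present
            *) echo "$i" ;;       # slot i is absent
        esac
        i=$((i + 1))
    done
}

md3_line='md3 : inactive sdq1[0] sdaf1[15] sdae1[14] sdad1[13] sdac1[12] sdab1[11] sdaa1[10] sdz1[9] sdy1[8] sdx1[7] sdv1[5] sdu1[4] sdt1[3] sds1[2] sdr1[1]'
missing_slots "$md3_line" 16    # prints 6
```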

Looking more:
# mdadm -D /dev/md3

/dev/md3:
        Version : 00.90.01
  Creation Time : Tue Jan 10 19:21:23 2006
     Raid Level : raid5
    Device Size : 390708736 (372.61 GiB 400.09 GB)
   Raid Devices : 16
  Total Devices : 15
Preferred Minor : 3
    Persistence : Superblock is persistent

    Update Time : Mon May  8 19:33:36 2006
          State : active, degraded
 Active Devices : 15
Working Devices : 15
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 256K

           UUID : 771aa4c0:48d9b467:44c847e2:9bc81c43
         Events : 0.1818687

    Number   Major   Minor   RaidDevice State
       0      65        1        0      active sync   /dev/sdq1
       1      65       17        1      active sync   /dev/sdr1
       2      65       33        2      active sync   /dev/sds1
       3      65       49        3      active sync   /dev/sdt1
       4      65       65        4      active sync   /dev/sdu1
       5      65       81        5      active sync   /dev/sdv1
     609       0        0        0      removed
       7      65      113        7      active sync   /dev/sdx1
       8      65      129        8      active sync   /dev/sdy1
       9      65      145        9      active sync   /dev/sdz1
      10      65      161       10      active sync   /dev/sdaa1
      11      65      177       11      active sync   /dev/sdab1
      12      65      193       12      active sync   /dev/sdac1
      13      65      209       13      active sync   /dev/sdad1
      14      65      225       14      active sync   /dev/sdae1
      15      65      241       15      active sync   /dev/sdaf1

That also looks to be as expected.
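
The sizes are at least self-consistent: a RAID5 set of n members holds
n - 1 members' worth of data, so a 16-device array should report 15 times
the per-device size. A quick check in shell arithmetic, assuming the shell
evaluates $(( )) in 64 bits (true of the common shells on current systems):

```shell
device_size_kib=390708736   # "Device Size" from mdadm -D
raid_devices=16             # "Raid Devices" from mdadm -D

# RAID5 usable size = (members - 1) * per-member size
echo $(( device_size_kib * (raid_devices - 1) ))
# prints 5860631040, the block count /proc/mdstat shows for md3
```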

So, let's try to assemble it again and force sdw1 into it:

[root@box ~]# mdadm --assemble /dev/md3 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1
/dev/sdv1 /dev/sdw1 /dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1
/dev/sdac1 /dev/sdad1 /dev/sdae1 /dev/sdaf1
mdadm: superblock on /dev/sdw1 doesn't match others - assembly aborted

[root@box ~]# mdadm --assemble /dev/md3 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1
/dev/sdv1 /dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1 /dev/sdac1
/dev/sdad1 /dev/sdae1 /dev/sdaf1
mdadm: failed to RUN_ARRAY /dev/md3: Invalid argument

[root@box ~]# mdadm -A /dev/md3 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1 /dev/sdv1
/dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1 /dev/sdac1
/dev/sdad1 /dev/sdae1 /dev/sdaf1
mdadm: device /dev/md3 already active - cannot assemble it

[root@box ~]# cat /proc/mdstat
Personalities : [raid1] [raid5]
md1 : active raid1 hdb3[1] hda3[0]
      115105600 blocks [2/2] [UU]

md2 : active raid5 sdp1[15] sdo1[14] sdn1[13] sdm1[12] sdl1[11] sdk1[10]
sdj1[9] sdi1[8] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1]
sda1[0]
      5860631040 blocks level 5, 256k chunk, algorithm 2 [16/16]
[UUUUUUUUUUUUUUUU]

md3 : inactive sdq1[0] sdaf1[15] sdae1[14] sdad1[13] sdac1[12] sdab1[11]
sdaa1[10] sdz1[9] sdy1[8] sdx1[7] sdv1[5] sdu1[4] sdt1[3] sds1[2]
sdr1[1]
      5860631040 blocks
md0 : active raid1 hdb1[1] hda1[0]
      104320 blocks [2/2] [UU]

unused devices: <none>

[root@box ~]# mdadm /dev/md3 -a /dev/sdw1
mdadm: hot add failed for /dev/sdw1: No such device

OK, let's mount the degraded RAID and try to copy the files to somewhere
else, so we can make it from scratch:

[root@box ~]# mount /dev/md3 /all/boxw16/
/dev/md3: Invalid argument
mount: /dev/md3: can't read superblock

[root@box ~]# fsck /dev/md3
fsck 1.35 (28-Feb-2004)
e2fsck 1.35 (28-Feb-2004)
fsck.ext2: Invalid argument while trying to open /dev/md3

The superblock could not be read..

[root@box ~]# mke2fs -n /dev/md3
mke2fs 1.35 (28-Feb-2004)
mke2fs: Device size reported to be zero.  Invalid partition specified,
or partition table wasn't reread after running fdisk, due to
a modified partition being busy and in use.  You may need to
reboot to re-read your partition table.


So, now what to do?

Any ideas would be DEEPLY appreciated!


-- 

Regards,
	Maurice



* Re: strange RAID5 problem
  2006-05-09  5:30 strange RAID5 problem Maurice Hilarius
@ 2006-05-09  5:45 ` Neil Brown
  2006-05-09  5:58 ` Luca Berra
  2006-05-09  6:12 ` strange RAID5 problem CaT
  2 siblings, 0 replies; 8+ messages in thread
From: Neil Brown @ 2006-05-09  5:45 UTC
  To: Maurice Hilarius; +Cc: linux-raid

On Monday May 8, maurice@harddata.com wrote:
> Good evening.
> 
> I am having a bit of a problem with a largish RAID5 set.
> Now it is looking more and more like I am about to lose all the data on
> it, so I am asking (begging?) to see if anyone can help me sort this out.
> 

Very thorough description, but you omitted the 'dmesg' output
corresponding to:

> 
> [root@box ~]# mdadm --assemble /dev/md3 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1
> /dev/sdv1 /dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1 /dev/sdac1
> /dev/sdad1 /dev/sdae1 /dev/sdaf1
> mdadm: failed to RUN_ARRAY /dev/md3: Invalid argument


Also, you don't seem to have tried '--force' with '--assemble'.  It
might help.
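
A sketch of what that might look like, written as a dry run so that nothing
is touched until the commands have been reviewed. The run wrapper and the
DRY_RUN variable are illustrative and not part of mdadm; --stop, --assemble
and --force are standard mdadm options. The member list is taken from the
mdadm -D output in the original message.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# The 15 surviving members from the mdadm -D listing (sdw1 left out).
members="/dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1 /dev/sdv1
/dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1 /dev/sdac1
/dev/sdad1 /dev/sdae1 /dev/sdaf1"

force_assemble() {
    run mdadm --stop /dev/md3      # release the inactive array first
    # $members is deliberately unquoted so it splits into separate arguments
    run mdadm --assemble --force /dev/md3 $members
    run dmesg                      # kernel messages behind a RUN_ARRAY failure
}

force_assemble
```

With DRY_RUN left unset this only prints the three commands.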

NeilBrown


* Re: strange RAID5 problem
  2006-05-09  5:30 strange RAID5 problem Maurice Hilarius
  2006-05-09  5:45 ` Neil Brown
@ 2006-05-09  5:58 ` Luca Berra
  2006-05-09 16:16   ` Maurice Hilarius
  2006-05-09  6:12 ` strange RAID5 problem CaT
  2 siblings, 1 reply; 8+ messages in thread
From: Luca Berra @ 2006-05-09  5:58 UTC
  To: linux-raid

On Mon, May 08, 2006 at 11:30:52PM -0600, Maurice Hilarius wrote:
>[root@box ~]# mdadm /dev/md3 -a /dev/sdw1
>
>But, I get this error message:
>mdadm: hot add failed for /dev/sdw1: No such device
>
>What? We just made the partition on sdw a moment ago in fdisk. It IS there!

I don't believe you, prove it (/proc/partitions)
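
One way to check that, sketched as a small helper. in_partitions is a
hypothetical name; it takes the file to read as an optional second argument
so it can also be tried against a saved copy.

```shell
# in_partitions NAME [FILE]
# Succeeds if NAME appears in the "name" column (4th field) of FILE,
# which defaults to /proc/partitions.
in_partitions() {
    awk -v dev="$1" '$4 == dev { found = 1 } END { exit !found }' \
        "${2:-/proc/partitions}"
}

if in_partitions sdw1; then
    echo "kernel knows about sdw1"
else
    echo "sdw1 is not in /proc/partitions"
fi
```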

>So, we look around a bit:
># cat /proc/mdstat
>
>md3 : inactive sdq1[0] sdaf1[15] sdae1[14] sdad1[13] sdac1[12] sdab1[11]
>sdaa1[10] sdz1[9] sdy1[8] sdx1[7] sdv1[5] sdu1[4] sdt1[3] sds1[2]
>sdr1[1]
>      5860631040 blocks
>
>Yup, that looks correct, missing sdw1[6]

no, it does not, it is 'inactive'

>[root@box ~]# cat /proc/mdstat
>Personalities : [raid1] [raid5]
...
>md3 : inactive sdq1[0] sdaf1[15] sdae1[14] sdad1[13] sdac1[12] sdab1[11]
>sdaa1[10] sdz1[9] sdy1[8] sdx1[7] sdv1[5] sdu1[4] sdt1[3] sds1[2]
>sdr1[1]
>      5860631040 blocks
...
>[root@box ~]# mdadm /dev/md3 -a /dev/sdw1
>mdadm: hot add failed for /dev/sdw1: No such device
>
>OK, let's mount the degraded RAID and try to copy the files to somewhere
>else, so we can make it from scratch:
>
>[root@box ~]# mount /dev/md3 /all/boxw16/
>/dev/md3: Invalid argument
>mount: /dev/md3: can't read superblock
>
it is still inactive, no wonder you cannot access it.

try running the array, or really stop it before assembling.
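
A sketch of both options. inactive_arrays is a hypothetical helper that
lists the arrays /proc/mdstat reports as inactive; the mdadm commands are
left as comments because only one of the two options should be used.

```shell
# inactive_arrays [FILE]
# Prints the name of every md array whose status word is "inactive".
# FILE defaults to /proc/mdstat.
inactive_arrays() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }' "${1:-/proc/mdstat}"
}

# Option 1: try to start the array as it stands (degraded).
#   mdadm --run /dev/md3
# Option 2: stop it completely, then assemble again from the members.
#   mdadm --stop /dev/md3
#   mdadm --assemble /dev/md3 /dev/sdq1 ...   # member list elided here
```

On the box in question, inactive_arrays with no argument would be expected
to print md3.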

L.

-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \


* Re: strange RAID5 problem
  2006-05-09  5:30 strange RAID5 problem Maurice Hilarius
  2006-05-09  5:45 ` Neil Brown
  2006-05-09  5:58 ` Luca Berra
@ 2006-05-09  6:12 ` CaT
  2 siblings, 0 replies; 8+ messages in thread
From: CaT @ 2006-05-09  6:12 UTC
  To: Maurice Hilarius; +Cc: linux-raid, neilb

On Mon, May 08, 2006 at 11:30:52PM -0600, Maurice Hilarius wrote:
> [root@box ~]# mdadm --assemble /dev/md3 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1
> /dev/sdv1 /dev/sdw1 /dev/sdx1 /dev/sdy1 /dev/sdz1 /dev/sdaa1 /dev/sdab1
> /dev/sdac1 /dev/sdad1 /dev/sdae1 /dev/sdaf1
> mdadm: superblock on /dev/sdw1 doesn't match others - assembly aborted

Have you tried zeroing the superblock with

mdadm --misc --zero-superblock /dev/sdw1

and then adding it in?
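
Spelled out as a dry run, with an --examine first so that whatever is on the
partition can be inspected before it is erased. The run wrapper, DRY_RUN and
readd_clean are illustrative names only. Zeroing a superblock is destructive,
so the device name is worth checking twice.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# readd_clean ARRAY PARTITION
readd_clean() {
    array=$1
    part=$2
    run mdadm --examine "$part"            # show any md superblock on it now
    run mdadm --zero-superblock "$part"    # erase that superblock
    run mdadm "$array" --add "$part"       # hot-add; long form of -a
}

readd_clean /dev/md3 /dev/sdw1
```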

> [root@box ~]# mount /dev/md3 /all/boxw16/
> /dev/md3: Invalid argument
> mount: /dev/md3: can't read superblock

Wow, that looks messy. Ummm, about the only thing I can think of is
failing /dev/sdw1 and removing it (I know it says it's not there,
but...)

Also, I'm not the biggest expert on RAID around here. ;)

-- 
    "To the extent that we overreact, we proffer the terrorists the
    greatest tribute."
    	- High Court Judge Michael Kirby


* Re: strange RAID5 problem
  2006-05-09  5:58 ` Luca Berra
@ 2006-05-09 16:16   ` Maurice Hilarius
  2006-05-09 19:20     ` Luca Berra
  0 siblings, 1 reply; 8+ messages in thread
From: Maurice Hilarius @ 2006-05-09 16:16 UTC
  To: Luca Berra; +Cc: linux-raid

Luca Berra wrote:
> On Mon, May 08, 2006 at 11:30:52PM -0600, Maurice Hilarius wrote:
>> [root@box ~]# mdadm /dev/md3 -a /dev/sdw1
>>
>> But, I get this error message:
>> mdadm: hot add failed for /dev/sdw1: No such device
>>
>> What? We just made the partition on sdw a moment ago in fdisk. It IS
>> there!
>
> I don't believe you, prove it (/proc/partitions)
>
>
I understand. Here we go then. Devices in question bracketed with "**":

[root@box ~]# cat /proc/partitions
major minor  #blocks  name

   3     0  117220824 hda
   3     1     104391 hda1
   3     2    2008125 hda2
   3     3  115105725 hda3
   3    64  117220824 hdb
   3    65     104391 hdb1
   3    66    2008125 hdb2
   3    67  115105725 hdb3
   8     0  390711384 sda
   8     1  390708801 sda1
   8    16  390711384 sdb
   8    17  390708801 sdb1
   8    32  390711384 sdc
   8    33  390708801 sdc1
   8    48  390711384 sdd
   8    49  390708801 sdd1
   8    64  390711384 sde
   8    65  390708801 sde1
   8    80  390711384 sdf
   8    81  390708801 sdf1
   8    96  390711384 sdg
   8    97  390708801 sdg1
   8   112  390711384 sdh
   8   113  390708801 sdh1
   8   128  390711384 sdi
   8   129  390708801 sdi1
   8   144  390711384 sdj
   8   145  390708801 sdj1
   8   160  390711384 sdk
   8   161  390708801 sdk1
   8   176  390711384 sdl
   8   177  390708801 sdl1
   8   192  390711384 sdm
   8   193  390708801 sdm1
   8   208  390711384 sdn
   8   209  390708801 sdn1
   8   224  390711384 sdo
   8   225  390708801 sdo1
   8   240  390711384 sdp
   8   241  390708801 sdp1
  65     0  390711384 sdq
  65     1  390708801 sdq1
  65    16  390711384 sdr
  65    17  390708801 sdr1
  65    32  390711384 sds
  65    33  390708801 sds1
  65    48  390711384 sdt
  65    49  390708801 sdt1
  65    64  390711384 sdu
  65    65  390708801 sdu1
  65    80  390711384 sdv
  65    81  390708801 sdv1
**
  65    96  390711384 sdw
  65    97  390708801 sdw1
**
  65   112  390711384 sdx
  65   113  390708801 sdx1
  65   128  390711384 sdy
  65   129  390708801 sdy1
  65   144  390711384 sdz
  65   145  390708801 sdz1
  65   160  390711384 sdaa
  65   161  390708801 sdaa1
  65   176  390711384 sdab
  65   177  390708801 sdab1
  65   192  390711384 sdac
  65   193  390708801 sdac1
  65   208  390711384 sdad
  65   209  390708801 sdad1
  65   224  390711384 sdae
  65   225  390708801 sdae1
  65   240  390711384 sdaf
  65   241  390708801 sdaf1
**
   9     0     104320 md0
**
   9     2 5860631040 md2
   9     1  115105600 md1



-- 

Regards,
	Maurice



* Re: strange RAID5 problem
  2006-05-09 16:16   ` Maurice Hilarius
@ 2006-05-09 19:20     ` Luca Berra
  2006-05-09 22:19       ` Maurice Hilarius
  0 siblings, 1 reply; 8+ messages in thread
From: Luca Berra @ 2006-05-09 19:20 UTC
  To: linux-raid

On Tue, May 09, 2006 at 10:16:25AM -0600, Maurice Hilarius wrote:
>Luca Berra wrote:
>> On Mon, May 08, 2006 at 11:30:52PM -0600, Maurice Hilarius wrote:
>>> [root@box ~]# mdadm /dev/md3 -a /dev/sdw1
>>>
>>> But, I get this error message:
>>> mdadm: hot add failed for /dev/sdw1: No such device
>>>
>>> What? We just made the partition on sdw a moment ago in fdisk. It IS
>>> there!
>>
>> I don't believe you, prove it (/proc/partitions)
>>
>>
>I understand. Here we go then. Devices in question bracketed with "**":
>
ok, now i do.
is the /dev/sdw1 device file correctly created?
you could try straceing mdadm to see what happens

what about the other suggestion? trying to stop the array and restart
it, since it is marked as inactive.
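
For the device-file question, one approach is to compare the node in /dev
with the major and minor numbers the kernel lists. A sketch, assuming GNU
stat (its %t and %T formats print major and minor in hex); expected_devnum
is a hypothetical helper, and /tmp/mdadm.trace is just an example path.

```shell
# expected_devnum NAME [FILE]
# Prints "major:minor" (decimal) for NAME as listed in FILE,
# which defaults to /proc/partitions.
expected_devnum() {
    awk -v dev="$1" '$4 == dev { print $1 ":" $2 }' "${2:-/proc/partitions}"
}

# Compare by hand against the node itself. GNU stat prints hex, so the
# 65:97 listed for sdw1 would appear as "41 61":
#   [ -b /dev/sdw1 ] && stat -c '%t %T' /dev/sdw1
#   expected_devnum sdw1
#
# Trace the failing hot-add to see which system call returns the error:
#   strace -o /tmp/mdadm.trace mdadm /dev/md3 -a /dev/sdw1
```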
L.

-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \


* Re: strange RAID5 problem
  2006-05-09 19:20     ` Luca Berra
@ 2006-05-09 22:19       ` Maurice Hilarius
  2006-05-10 14:54         ` Thanks! Was:[Re: strange RAID5 problem] Maurice Hilarius
  0 siblings, 1 reply; 8+ messages in thread
From: Maurice Hilarius @ 2006-05-09 22:19 UTC
  To: Luca Berra; +Cc: linux-raid, Neil Brown

Luca Berra wrote:
> ..
>>> I don't believe you, prove it (/proc/partitions)
>>>
>> I understand. Here we go then. Devices in question bracketed with "**":
>>
> ok, now i do.
> is the /dev/sdw1 device file correctly created?
> you could try straceing mdadm to see what happens
>
> what about the other suggestion? trying to stop the array and restart
> it, since it is marked as inactive.
> L.
>
Here is what we ended up doing that fixed it.
Thanks to Neil for the --force suggestion. Even with that, however,
ALL parameters had to be given on the mdadm -C or it still refused.
We used EVMS to rebuild, as that is what originally created the RAID.

mdadm -C /dev/md3 --chunk=256 --level=5 --parity=ls --raid-devices=16
--force /dev/evms/.nodes/sdq1 /dev/evms/.nodes/sdr1
/dev/evms/.nodes/sds1 /dev/evms/.nodes/sdt1 /dev/evms/.nodes/sdu1
/dev/evms/.nodes/sdv1 missing /dev/evms/.nodes/sdx1
/dev/evms/.nodes/sdy1 /dev/evms/.nodes/sdz1 /dev/evms/.nodes/sdaa1
/dev/evms/.nodes/sdab1 /dev/evms/.nodes/sdac1 /dev/evms/.nodes/sdad1
/dev/evms/.nodes/sdae1 /dev/evms/.nodes/sdaf1

Notice we are creating the array with a "missing" member, and the
devices are in "order" per: mdadm -D /dev/md3
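
The order matters because mdadm -C assigns slots in the order the devices
are given. A sketch of deriving that list from saved mdadm -D output,
printing "missing" for any row that has no device path. create_order is a
hypothetical helper that only understands the table layout shown earlier in
this thread, and it prints the plain /dev names rather than the
/dev/evms/.nodes paths used above.

```shell
# create_order FILE
# Reads saved `mdadm -D` output and prints the member table in listed order,
# one entry per line, with "missing" in place of a row without a device.
create_order() {
    awk '
        # Member rows start with three integers: Number, Major, Minor.
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
            if ($NF ~ /^\/dev\//) print $NF
            else                  print "missing"
        }
    ' "$1"
}

# Example, assuming the output was saved while the array was still listed:
#   mdadm -D /dev/md3 > md3.detail
#   create_order md3.detail
```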

This was the *only* way that it would come up. It was mountable, and the
data seems intact.
We started the rebuild with no errors by simply adding the device
as I mentioned before with -a.

Then sped it up via:

echo "100000" > /proc/sys/dev/raid/speed_limit_min

Because frankly we have the resources to do so and need it going as fast
as possible.
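
For reference, a sketch of reading and raising that limit. Per md(4) the
value is in kibibytes per second and applies per device. set_min_speed is a
hypothetical helper, and its optional path argument exists only so that it
can be tried on a scratch file first.

```shell
# set_min_speed KIB_PER_SEC [FILE]
# Shows the old value, writes the new one, and shows it again.
# FILE defaults to /proc/sys/dev/raid/speed_limit_min.
set_min_speed() {
    f=${2:-/proc/sys/dev/raid/speed_limit_min}
    echo "old: $(cat "$f")"
    echo "$1" > "$f"
    echo "new: $(cat "$f")"
}

# As root, then watch the rebuild progress:
#   set_min_speed 100000
#   cat /proc/mdstat
```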

-- 

Regards,
	Maurice



* Thanks! Was:[Re: strange RAID5 problem]
  2006-05-09 22:19       ` Maurice Hilarius
@ 2006-05-10 14:54         ` Maurice Hilarius
  0 siblings, 0 replies; 8+ messages in thread
From: Maurice Hilarius @ 2006-05-10 14:54 UTC
  To: linux-raid

Thanks to Neil, Luca, and CaT, who were all a big help.



-- 

With our best regards,


Maurice W. Hilarius        Telephone: 01-780-456-9771
Hard Data Ltd.             FAX:       01-780-456-9772
11060 - 166 Avenue         email:maurice@harddata.com
Edmonton, AB, Canada       http://www.harddata.com/
   T5X 1Y3


