linux-raid.vger.kernel.org archive mirror
* Half of RAID1 array missing on 2.6.7-rc3
@ 2004-08-05 14:08 John Stoffel
  2004-08-05 15:09 ` Alvin Oga
  2004-08-05 19:38 ` John Stoffel
  0 siblings, 2 replies; 7+ messages in thread
From: John Stoffel @ 2004-08-05 14:08 UTC (permalink / raw)
  To: linux-raid; +Cc: stoffel


Hi folks,

I've run into a problem on my Debian SMP system, running kernels
2.6.7-rc3 (as well as 2.6.8-rc2-mm2 and 2.6.8-rc3), where I can't seem
to add or remove devices from my /dev/md0 array.  The system is a
dual processor Xeon, 550 MHz, running Debian unstable, fairly
aggressively updated.

The root filesystems are all on SCSI disks, and I have a pair of WD
120 GB drives on a Promise HPT302 controller which are mirrored.  These
are /dev/hde and /dev/hdg respectively.  The other day, while I was
mucking around with getting a third 120 GB drive working in a
USB2.0/Firewire external case, I noticed that /dev/md0 had lost one of
its two disks, /dev/hdg.  I've been trying to re-add it, but
I can't.  

What I'm doing is setting up the two disks mirrored as /dev/md0 using
/dev/hde1 and /dev/hdg1.  Then I've set up a volume group using
DeviceMapper to hold a pair of filesystems on there, so that I can
grow/shrink them as needed down the line.  So far so good.  The data
is all there and I can still access it no problem, but I can't get my
data mirrored again!
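For anyone unfamiliar with this kind of stack, here is a sketch of how
such a mirror-plus-LVM2 setup is typically built.  The volume group and
LV names match the data_vg-home_lv device that shows up later in this
thread, but the size and the ext3 filesystem type are placeholders of
mine, not necessarily what was used here.  The commands are echoed
rather than executed, since they require root and real devices:

```shell
# Dry-run sketch of building a RAID1 + LVM2 stack like the one above.
# 'plan' echoes each command instead of running it; drop the function
# body down to "$@" to execute for real (as root, on real devices).
plan() { echo "+ $*"; }
plan mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hde1 /dev/hdg1
plan pvcreate /dev/md0                    # make the array an LVM physical volume
plan vgcreate data_vg /dev/md0            # volume group on top of the mirror
plan lvcreate -L 40G -n home_lv data_vg   # size is illustrative
plan mkfs -t ext3 /dev/mapper/data_vg-home_lv   # fs type is a placeholder
```

The point of layering LVM2 over md rather than the other way around is
exactly what's described above: the mirror handles disk failure, while
the logical volumes can be grown or shrunk later.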

I've run a complete badblocks on /dev/hdg and it passes without any
problems.  I suspect that, because there appear to be two UUIDs
associated with /dev/md0, something is screwed up somewhere.  I really
don't want to lose this data if I can help it.  

Here's some info on versions and setup.

    # mdadm --version
    mdadm - v1.6.0 - 4 June 2004

I had been using 1.4.0-3 before, but I upgraded in case there was
something wrong.  I can drop back if need be.

   # cat /proc/partitions
   major minor  #blocks  name

     33     0  117220824 hde
     33     1  117218241 hde1
     34     0  117220824 hdg
     34     1  117218241 hdg1
      8     0   17783000 sda
      8     1     248976 sda1
      8     2    4000185 sda2
      8     3     996030 sda3
      8     4          1 sda4
      8     5    4000153 sda5
      8     6    8000338 sda6
      8    16   17782540 sdb
      8    17     248976 sdb1
      8    18     996030 sdb2
      8    19   16530885 sdb3
      9     0  117218176 md0
      8    32  117220824 sdc
      8    33   58593496 sdc1
      8    34   48828024 sdc2
    253     0   53477376 dm-0
    253     1   36700160 dm-1
    253     2  117218241 dm-2
    253     3     248976 dm-3
    253     4     996030 dm-4
    253     5   16530885 dm-5
    253     6   58593496 dm-6
    253     7   48828024 dm-7


    # mdadm -QE --scan
    ARRAY /dev/md0 level=raid1 num-devices=2 UUID=2e078443:42b63ef5:cc179492:aecf0094
       devices=/dev/hde1
    ARRAY /dev/md0 level=raid1 num-devices=2 UUID=9835ebd0:5d02ebf0:907edc91:c4bf97b2
       devices=/dev/hde

This bothers me, why am I seeing two different UUIDs here?
	
    # mdadm --detail /dev/md0
    /dev/md0:
	    Version : 00.90.01
      Creation Time : Fri Oct 24 19:23:41 2003
	 Raid Level : raid1
	 Array Size : 117218176 (111.79 GiB 120.03 GB)
	Device Size : 117218176 (111.79 GiB 120.03 GB)
       Raid Devices : 2
      Total Devices : 1
    Preferred Minor : 0
	Persistence : Superblock is persistent

	Update Time : Thu Aug  5 09:33:35 2004
	      State : clean, degraded
     Active Devices : 1
    Working Devices : 1
     Failed Devices : 0
      Spare Devices : 0


	Number   Major   Minor   RaidDevice State
	   0      33        1        0      active sync   /dev/hde1
	   1       0        0       -1      removed
	       UUID : 2e078443:42b63ef5:cc179492:aecf0094
	     Events : 0.990424


Here's another strange thing.  I have Raid Devices = 2, but the Active
and Working Devices are both 1.  

I've unmounted both filesystems, stopped the volume group (vgchange -a
n) and now stopped the /dev/md0 device with:

   mdadm --stop --scan

Then I rebuilt it with:

    # mdadm --assemble /dev/md0 --auto --scan --update=summaries --verbose
    mdadm: looking for devices for /dev/md0
    mdadm: /dev/hde has wrong uuid.
    mdadm: /dev/hde1 is identified as a member of /dev/md0, slot 0.
    mdadm: no RAID superblock on /dev/hdg
    mdadm: /dev/hdg has wrong uuid.
    mdadm: no RAID superblock on /dev/hdg1
    mdadm: /dev/hdg1 has wrong uuid.
    mdadm: no RAID superblock on /dev/sda
    mdadm: /dev/sda has wrong uuid.
    mdadm: no RAID superblock on /dev/sda1
    mdadm: /dev/sda1 has wrong uuid.
    mdadm: no RAID superblock on /dev/sda2
    mdadm: /dev/sda2 has wrong uuid.
    mdadm: no RAID superblock on /dev/sda3
    mdadm: /dev/sda3 has wrong uuid.
    mdadm: no RAID superblock on /dev/sda4
    mdadm: /dev/sda4 has wrong uuid.
    mdadm: no RAID superblock on /dev/sda5
    mdadm: /dev/sda5 has wrong uuid.
    mdadm: no RAID superblock on /dev/sda6
    mdadm: /dev/sda6 has wrong uuid.
    mdadm: no RAID superblock on /dev/sdb
    mdadm: /dev/sdb has wrong uuid.
    mdadm: no RAID superblock on /dev/sdb1
    mdadm: /dev/sdb1 has wrong uuid.
    mdadm: no RAID superblock on /dev/sdb2
    mdadm: /dev/sdb2 has wrong uuid.
    mdadm: no RAID superblock on /dev/sdb3
    mdadm: /dev/sdb3 has wrong uuid.
    mdadm: no RAID superblock on /dev/sdc
    mdadm: /dev/sdc has wrong uuid.
    mdadm: no RAID superblock on /dev/sdc1
    mdadm: /dev/sdc1 has wrong uuid.
    mdadm: no RAID superblock on /dev/sdc2
    mdadm: /dev/sdc2 has wrong uuid.
    mdadm: no RAID superblock on /dev/evms/.nodes/hdg1
    mdadm: /dev/evms/.nodes/hdg1 has wrong uuid.
    mdadm: no RAID superblock on /dev/evms/.nodes/sdb1
    mdadm: /dev/evms/.nodes/sdb1 has wrong uuid.
    mdadm: no RAID superblock on /dev/evms/.nodes/sdb2
    mdadm: /dev/evms/.nodes/sdb2 has wrong uuid.
    mdadm: no RAID superblock on /dev/evms/.nodes/sdb3
    mdadm: /dev/evms/.nodes/sdb3 has wrong uuid.
    mdadm: no RAID superblock on /dev/evms/.nodes/sdc1
    mdadm: /dev/evms/.nodes/sdc1 has wrong uuid.
    mdadm: no RAID superblock on /dev/evms/.nodes/sdc2
    mdadm: /dev/evms/.nodes/sdc2 has wrong uuid.
    mdadm: no uptodate device for slot 1 of /dev/md0
    mdadm: added /dev/hde1 to /dev/md0 as 0
    mdadm: /dev/md0 has been started with 1 drive (out of 2).

Which is great, I can still see it without a problem.

    jfsnew:/etc/init.d# mdadm --detail /dev/md0
    /dev/md0:
	    Version : 00.90.01
      Creation Time : Fri Oct 24 19:23:41 2003
	 Raid Level : raid1
	 Array Size : 117218176 (111.79 GiB 120.03 GB)
	Device Size : 117218176 (111.79 GiB 120.03 GB)
       Raid Devices : 2
      Total Devices : 1
    Preferred Minor : 0
	Persistence : Superblock is persistent

	Update Time : Thu Aug  5 09:33:35 2004
	      State : clean, degraded
     Active Devices : 1
    Working Devices : 1
     Failed Devices : 0
      Spare Devices : 0


	Number   Major   Minor   RaidDevice State
	   0      33        1        0      active sync   /dev/hde1
	   1       0        0       -1      removed
	       UUID : 2e078443:42b63ef5:cc179492:aecf0094
	     Events : 0.990424


Well, no change there.  

    jfsnew:/etc/init.d# mdadm /dev/md0 -a /dev/hdg1
    mdadm: hot add failed for /dev/hdg1: Invalid argument

And this just fails.  I get the following error in /var/log/syslog.  

    Aug  5 09:58:09 jfsnew kernel: md: trying to hot-add hdg1 to md0 ... 
    Aug  5 09:58:09 jfsnew kernel: md: could not lock hdg1.
    Aug  5 09:58:09 jfsnew kernel: md: error, md_import_device() returned -16

Which doesn't seem to make any sense.  Can someone tell me what the
heck is going on here?  
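It does make sense once decoded: kernel functions return negative errno
values, and -16 is -EBUSY ("Device or resource busy"), meaning some
other subsystem has hdg1 open and md cannot get an exclusive lock on
it.  A quick way to decode such a code (assuming a python3 interpreter
is available; substitute whatever interpreter the box has):

```shell
# md_import_device() returned -16; kernel code returns negative errno
# values, so this is -EBUSY: another subsystem holds the device open.
code=16
name=$(python3 -c "import errno; print(errno.errorcode[$code])")
msg=$(python3 -c "import os; print(os.strerror($code))")
echo "-$code = -$name ($msg)"
```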

Thanks,
John
   John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
	 stoffel@lucent.com - http://www.lucent.com - 978-952-7548





* Re: Half of RAID1 array missing on 2.6.7-rc3
  2004-08-05 14:08 Half of RAID1 array missing on 2.6.7-rc3 John Stoffel
@ 2004-08-05 15:09 ` Alvin Oga
  2004-08-05 15:17   ` John Stoffel
  2004-08-05 19:38 ` John Stoffel
  1 sibling, 1 reply; 7+ messages in thread
From: Alvin Oga @ 2004-08-05 15:09 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-raid


hi ya john

On Thu, 5 Aug 2004, John Stoffel wrote:

> The root filesystems are all on SCSI disks, and I have a pair of WD
> 120 GB drives on a Promise HPT302 controller which are mirrored.  These

if it was me, i'd throw away the highpoint controller ... it aint worth
the risk of losing your data
	- i prefer sw raid and its flexibility over expensive hw raid 

isn't hpt (rocketraid) hardware raid ??
	- why are we using mdadm tools on a hw raid controller ??

> .... I noticed that /dev/md0 had lost one of
> its two disks, /dev/hdg.  I've been trying to re-add it, but
> I can't.  

you should monitor the raid so that you know within a few hours if a
disk crashed .. otherwise, losing the second disk loses all data on
the entire raid array

> What I'm doing is setting up the two disks mirrored as /dev/md0 using
> /dev/hde1 and /dev/hdg1.  Then I've set up a volume group using
> DeviceMapper to hold a pair of filesystems on there, so that I can
> grow/shrink them as needed down the line.  So far so good.  The data
> is all there and I can still access it no problem, but I can't get my
> data mirrored again!

then it's NOT "so good" so far ... ( raid is broken )

> I've run a complete badblocks on /dev/hdg and it passes without any
> problems. 

good

>     # mdadm -QE --scan
>     ARRAY /dev/md0 level=raid1 num-devices=2 UUID=2e078443:42b63ef5:cc179492:aecf0094
>        devices=/dev/hde1
>     ARRAY /dev/md0 level=raid1 num-devices=2 UUID=9835ebd0:5d02ebf0:907edc91:c4bf97b2
>        devices=/dev/hde
> 
> This bothers me, why am I seeing two different UUIDs here?

one is the entire disk ... other is a partition
 	
>     # mdadm --detail /dev/md0
> 	Update Time : Thu Aug  5 09:33:35 2004
> 	      State : clean, degraded

degraded is good... if you lost one disk
 
> 	Number   Major   Minor   RaidDevice State
> 	   0      33        1        0      active sync   /dev/hde1
> 	   1       0        0       -1      removed

good ... one removed

>     # mdadm --assemble /dev/md0 --auto --scan --update=summaries --verbose
>     mdadm: looking for devices for /dev/md0
>     mdadm: /dev/hde has wrong uuid.
>     mdadm: /dev/hde1 is identified as a member of /dev/md0, slot 0.

fun times w/ hw raid..
> 
>     jfsnew:/etc/init.d# mdadm /dev/md0 -a /dev/hdg1
>     mdadm: hot add failed for /dev/hdg1: Invalid argument

how about simple "raid stop" and "raid start" or at least the
commands that came with (possibly non-hw-raid) hpt302 ...
 
> And this just fails.  I get the following error in /var/log/syslog.  
> 
>     Aug  5 09:58:09 jfsnew kernel: md: trying to hot-add hdg1 to md0 ... 
>     Aug  5 09:58:09 jfsnew kernel: md: could not lock hdg1.
>     Aug  5 09:58:09 jfsnew kernel: md: error, md_import_device() returned -16
> 
> Which doesn't seem to make any sense.  Can someone tell me what the
> heck is going on here?  

i think you're using mdadm ( sw raid tools ) on a hardware raid controller 

c ya
alvin



* Re: Half of RAID1 array missing on 2.6.7-rc3
  2004-08-05 15:09 ` Alvin Oga
@ 2004-08-05 15:17   ` John Stoffel
  0 siblings, 0 replies; 7+ messages in thread
From: John Stoffel @ 2004-08-05 15:17 UTC (permalink / raw)
  To: Alvin Oga; +Cc: John Stoffel, linux-raid


Alvin> if it was me, i'd throw away the highpoint controller ... it
Alvin> aint worth the risk of losing your data - i prefer sw raid and
Alvin> its flexibility over expensive hw raid

It's purely an IDE controller, four ports, two channels, supposedly
ATA 133.  I'm only using SW RAID on there.  

Alvin> isn't hpt (rocketraid) hardware raid ??

Nope, at least not this one.

Thanks for the input.
John



* Re: Half of RAID1 array missing on 2.6.7-rc3
  2004-08-05 14:08 Half of RAID1 array missing on 2.6.7-rc3 John Stoffel
  2004-08-05 15:09 ` Alvin Oga
@ 2004-08-05 19:38 ` John Stoffel
  2004-08-05 19:52   ` Luca Berra
  1 sibling, 1 reply; 7+ messages in thread
From: John Stoffel @ 2004-08-05 19:38 UTC (permalink / raw)
  To: John Stoffel; +Cc: linux-raid, stoffel


Hi folks,

I think I've found the problem.  At least there are a couple of
problems here.

1. When the md code tries to run the various hot-add scripts, it gives
   back a fairly useless error.  It should instead tell you that the
   device is locked by some other user and hopefully tell you WHAT
   that user is.

I finally started poking around at device mapper stuff as well, and I
ran the command:

   # dmsetup status
   sdb3: 0 33061770 linear 
   sdb2: 0 1992060 linear 
   data_vg-local_lv: 0 62914560 linear 
   data_vg-local_lv: 62914560 10485760 linear 
   sdb1: 0 497952 linear 
   data_vg-home_lv: 0 83886080 linear 
   data_vg-home_lv: 83886080 23068672 linear 
   sdc2: 0 97656048 linear 
   sdc1: 0 117186993 linear 
   hdg1: 0 234436482 linear 

Notice how hdg1 is listed as a LINEAR device.  I certainly didn't set
that up myself; god knows how it got picked up.  But once I did:

   # dmsetup remove hdg1

It was removed!!  

   # dmsetup status
   sdb3: 0 33061770 linear 
   sdb2: 0 1992060 linear 
   data_vg-local_lv: 0 62914560 linear 
   data_vg-local_lv: 62914560 10485760 linear 
   sdb1: 0 497952 linear 
   data_vg-home_lv: 0 83886080 linear 
   data_vg-home_lv: 83886080 23068672 linear 
   sdc2: 0 97656048 linear 
   sdc1: 0 117186993 linear 

So now I was able to do:

   # mdadm /dev/md0 --force -a /dev/hdg1
   mdadm: hot added /dev/hdg1

Which was great to see.  

   # cat /proc/mdstat 
   Personalities : [linear] [raid0] [raid1] [raid5] 
   md0 : active raid1 hdg1[2] hde1[0]
	 117218176 blocks [2/1] [U_]
	 [>....................]  recovery =  0.5% (673664/117218176) finish=49.0min speed=39627K/sec
   unused devices: <none>

And now it's re-building the mirror properly.  So now I need to see
how I can stop the Device Mapper stuff from taking over and
controlling various devices.  The /dev/sdb? and /dev/sdc? entries are
also problematic, since those are just a second SCSI disk and a USB
storage device.  Hmm... I wonder whether, if I remove the usb-storage
device from device mapper, I'll finally be able to write all the data
to it without locking up the system.  Gotta try that out.
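To summarize, the sequence that recovered the array can be sketched as
a dry run.  The commands below are echoed rather than executed; anyone
in the same spot should run them as root, and only after confirming
from `dmsetup status` that the mapping really is a stray one and not
something a volume group depends on:

```shell
# Recovery sequence from this thread: release the stray device-mapper
# mapping that held hdg1 (the source of the -EBUSY), then hot-add the
# partition back into the degraded mirror.
plan() { echo "+ $*"; }
plan dmsetup status                        # confirm hdg1 appears as a stray linear target
plan dmsetup remove hdg1                   # release the device so md can lock it
plan mdadm /dev/md0 --force -a /dev/hdg1   # hot-add; resync starts automatically
plan cat /proc/mdstat                      # watch the rebuild progress
```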

I do appreciate the people who gave suggestions, even though it didn't
turn out to be the right solution in the end.  

Now to figure out how to make device mapper only look at /dev/md*
devices in the future.

John



* Re: Half of RAID1 array missing on 2.6.7-rc3
  2004-08-05 19:38 ` John Stoffel
@ 2004-08-05 19:52   ` Luca Berra
  2004-08-05 20:05     ` John Stoffel
  0 siblings, 1 reply; 7+ messages in thread
From: Luca Berra @ 2004-08-05 19:52 UTC (permalink / raw)
  To: linux-raid

On Thu, Aug 05, 2004 at 03:38:07PM -0400, John Stoffel wrote:
>Now to figure out how to make device mapper only look at /dev/md*
>devices in the future.

how are you using device mapper?
recent versions of lvm2 ignore md components by default
or you can specify a filter in the configuration file.

L.

-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \


* Re: Half of RAID1 array missing on 2.6.7-rc3
  2004-08-05 19:52   ` Luca Berra
@ 2004-08-05 20:05     ` John Stoffel
  2004-08-05 20:34       ` Luca Berra
  0 siblings, 1 reply; 7+ messages in thread
From: John Stoffel @ 2004-08-05 20:05 UTC (permalink / raw)
  To: linux-raid, Luca Berra


Luca> how are you using device mapper?
     
To set up some volume groups on top of an MD array of two mirrored
disks.  I think.  All I know currently is that LVM2 requires it to be
set up and available.  I have /dev/mapper/data_vg-*_lv and some other
devices currently.  All I really want covered are the LVM2 devices.
And since those devices are built on top of /dev/md0 (for now, I might
add more later), I don't need other devices looked at.
Period. 

Can you tell me what I'm doing wrong here?

Luca> recent versions of lvm2 ignore md components by default or you
Luca> can specify a filter in the configuration file.

Which file is that?  There's nothing in /etc/dm/... that I can see.
There's some stuff in /etc/lvm/lvm.conf, but I haven't touched that at
all.

Basically, I don't want Device Mapper to scan and take over all my
disks.  From my dmesg file, I see a bunch of these on startup now:

    device-mapper: error adding target to table
    device-mapper: : dm-linear: Device lookup failed

    device-mapper: error adding target to table
    device-mapper: : dm-linear: Device lookup failed

    device-mapper: error adding target to table
    device-mapper: : dm-linear: Device lookup failed

    device-mapper: error adding target to table
    device-mapper: : dm-linear: Device lookup failed

    device-mapper: error adding target to table
    device-mapper: : dm-linear: Device lookup failed

And I don't know where they are coming from.



* Re: Half of RAID1 array missing on 2.6.7-rc3
  2004-08-05 20:05     ` John Stoffel
@ 2004-08-05 20:34       ` Luca Berra
  0 siblings, 0 replies; 7+ messages in thread
From: Luca Berra @ 2004-08-05 20:34 UTC (permalink / raw)
  To: linux-raid

On Thu, Aug 05, 2004 at 04:05:16PM -0400, John Stoffel wrote:
>Luca> recent versions of lvm2 ignore md components by default or you
>Luca> can specify a filter in the configuration file.
>
>Which file is that?  There's nothing in /etc/dm/... that I can see.
>There's some stuff in /etc/lvm/lvm.conf, but I haven't touched that at
>all.

/etc/lvm/lvm.conf

if you have a recent version you will find a 
md_component_detection = 1
line,
if not you can change the
filter = [ "a/.*/" ]
to
filter = [ "a|/dev/md*|" ]
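
A note of caution on the pattern above: lvm.conf filter entries are
regular expressions, not shell globs, and a filter containing only
accept rules still accepts devices that match no rule at all.  A
stricter variant (my assumption; check it against the lvm.conf man
page for your version) pairs the accept with a catch-all reject:

```
# /etc/lvm/lvm.conf, "devices" section
md_component_detection = 1
filter = [ "a|^/dev/md.*|", "r|.*|" ]
```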


-- 
Luca Berra -- bluca@comedia.it
        Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \

