* Linux Raid confused about one drive and two arrays
@ 2004-01-22 14:34 AndyLiebman
From: AndyLiebman @ 2004-01-22 14:34 UTC (permalink / raw)
To: linux-raid
I have just encountered a very disturbing RAID problem. I hope somebody
understands what happened and can tell me how to fix it.
I have two RAID 5 arrays on my Linux machine -- md4 and md6. Each array
consists of 5 FireWire (1394a) drives -- one partition on each drive, 10 drives in
total. Because the device IDs on these drives can change, I always use mdadm
to create and manage my arrays based on UUIDs. I am using mdadm 1.3 on Mandrake
9.2 with Mandrake's 2.4.22-21 kernel.
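For what it's worth, the same UUID-based assembly can also be pinned down in /etc/mdadm.conf; a minimal sketch (an illustration only, not the configuration actually in use here) would be:
DEVICE /dev/sd*
ARRAY /dev/md4 UUID=62d8b91d:a2368783:6a78ca50:5793492f
ARRAY /dev/md6 UUID=57f26496:25520b96:41757b62:f83fcb7b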
After running these arrays successfully for two months -- rebooting my file
server every day -- one of my arrays came up in a degraded mode. It looks as if
the Linux RAID subsystem "thinks" one of my drives belongs to both arrays.
As you can see below, when I run mdadm -E on each of my ten FireWire drives,
mdadm tells me that for each of the drives in the md4 array (UUID group
62d8b91d:a2368783:6a78ca50:5793492f) there are 5 RAID devices and 6 total
devices with one failed. However, this array has only ever had 5 devices.
On the other hand, for most of the drives in the md6 array (UUID group
57f26496:25520b96:41757b62:f83fcb7b), mdadm tells me that there are 5 RAID
devices and 5 total devices with one failed.
However, when I run mdadm -E on the drive currently identified as /dev/sdh1
-- which also belongs to md6, the UUID group
57f26496:25520b96:41757b62:f83fcb7b -- mdadm tells me that sdh1 is part of an array with 6 total devices, 5
RAID devices, one failed.
/dev/sdh1 is identified as device number 3 in the RAID with the UUID
57f26496:25520b96:41757b62:f83fcb7b. However, when I run mdadm -E on the other 4
drives that belong to md6, mdadm tells me that device number 3 is faulty.
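(Incidentally, a quick way to see the disagreement is to compare the event counters and update times in the superblocks directly -- just a sketch:
mdadm -E /dev/sdh1 | grep -E 'Events|Update Time'
mdadm -E /dev/sdj1 | grep -E 'Events|Update Time'
In the full output below, sdh1 is stuck at event count 0.118 from Jan 15, while the other md6 members are at 0.137 from Jan 22.)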
My questions are:
How do I fix this problem?
Why did it occur?
How can I prevent it from occurring again?
Hope somebody can answer these questions today.
Here is all the output from starting up my arrays and running mdadm:
[root@localhost avidserver]# mdadm -Av /dev/md4
--uuid=62d8b91d:a2368783:6a78ca50:5793492f /dev/sd*
mdadm: looking for devices for /dev/md4
mdadm: /dev/sd is not a block device.
mdadm: /dev/sd has wrong uuid.
mdadm: no RAID superblock on /dev/sda
mdadm: /dev/sda has wrong uuid.
mdadm: /dev/sda1 is identified as a member of /dev/md4, slot 0.
mdadm: no RAID superblock on /dev/sdb
mdadm: /dev/sdb has wrong uuid.
mdadm: /dev/sdb1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdc
mdadm: /dev/sdc has wrong uuid.
mdadm: /dev/sdc1 is identified as a member of /dev/md4, slot 1.
mdadm: no RAID superblock on /dev/sdd
mdadm: /dev/sdd has wrong uuid.
mdadm: /dev/sdd1 has wrong uuid.
mdadm: no RAID superblock on /dev/sde
mdadm: /dev/sde has wrong uuid.
mdadm: /dev/sde1 is identified as a member of /dev/md4, slot 3.
mdadm: no RAID superblock on /dev/sdf
mdadm: /dev/sdf has wrong uuid.
mdadm: /dev/sdf1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdg
mdadm: /dev/sdg has wrong uuid.
mdadm: /dev/sdg1 is identified as a member of /dev/md4, slot 4.
mdadm: no RAID superblock on /dev/sdh
mdadm: /dev/sdh has wrong uuid.
mdadm: /dev/sdh1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdi
mdadm: /dev/sdi has wrong uuid.
mdadm: /dev/sdi1 is identified as a member of /dev/md4, slot 2.
mdadm: no RAID superblock on /dev/sdj
mdadm: /dev/sdj has wrong uuid.
mdadm: /dev/sdj1 has wrong uuid.
mdadm: added /dev/sdc1 to /dev/md4 as 1
mdadm: added /dev/sdi1 to /dev/md4 as 2
mdadm: added /dev/sde1 to /dev/md4 as 3
mdadm: added /dev/sdg1 to /dev/md4 as 4
mdadm: added /dev/sda1 to /dev/md4 as 0
mdadm: /dev/md4 has been started with 5 drives.
[root@localhost avidserver]# mdadm -Av /dev/md6
--uuid=57f26496:25520b96:41757b62:f83fcb7b /dev/sd*
mdadm: looking for devices for /dev/md6
mdadm: /dev/sd is not a block device.
mdadm: /dev/sd has wrong uuid.
mdadm: no RAID superblock on /dev/sda
mdadm: /dev/sda has wrong uuid.
mdadm: /dev/sda1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdb
mdadm: /dev/sdb has wrong uuid.
mdadm: /dev/sdb1 is identified as a member of /dev/md6, slot 0.
mdadm: no RAID superblock on /dev/sdc
mdadm: /dev/sdc has wrong uuid.
mdadm: /dev/sdc1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdd
mdadm: /dev/sdd has wrong uuid.
mdadm: /dev/sdd1 is identified as a member of /dev/md6, slot 1.
mdadm: no RAID superblock on /dev/sde
mdadm: /dev/sde has wrong uuid.
mdadm: /dev/sde1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdf
mdadm: /dev/sdf has wrong uuid.
mdadm: /dev/sdf1 is identified as a member of /dev/md6, slot 2.
mdadm: no RAID superblock on /dev/sdg
mdadm: /dev/sdg has wrong uuid.
mdadm: /dev/sdg1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdh
mdadm: /dev/sdh has wrong uuid.
mdadm: /dev/sdh1 is identified as a member of /dev/md6, slot 3.
mdadm: no RAID superblock on /dev/sdi
mdadm: /dev/sdi has wrong uuid.
mdadm: /dev/sdi1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdj
mdadm: /dev/sdj has wrong uuid.
mdadm: /dev/sdj1 is identified as a member of /dev/md6, slot 4.
mdadm: added /dev/sdd1 to /dev/md6 as 1
mdadm: added /dev/sdf1 to /dev/md6 as 2
mdadm: added /dev/sdh1 to /dev/md6 as 3
mdadm: added /dev/sdj1 to /dev/md6 as 4
mdadm: added /dev/sdb1 to /dev/md6 as 0
mdadm: /dev/md6 has been started with 4 drives (out of 5).
NOTE THAT mdadm identified sdh1 as being in slot 3 of md6, yet under cat
/proc/mdstat the slot 3 drive in md6 is reported as missing.
[root@localhost avidserver]# cat /proc/mdstat
Personalities : [raid5]
read_ahead 1024 sectors
md6 : active raid5 scsi/host1/bus0/target1/lun0/part1[0]
scsi/host5/bus0/target1/lun0/part1[4] scsi/host3/bus0/target1/lun0/part1[2]
scsi/host2/bus0/target1/lun0/part1[1]
796566528 blocks level 5, 128k chunk, algorithm 2 [5/4] [UUU_U]
md4 : active raid5 scsi/host1/bus0/target0/lun0/part1[0]
scsi/host4/bus0/target0/lun0/part1[4] scsi/host3/bus0/target0/lun0/part1[3]
scsi/host5/bus0/target0/lun0/part1[2] scsi/host2/bus0/target0/lun0/part1[1]
480214528 blocks level 5, 128k chunk, algorithm 2 [5/5] [UUUUU]
unused devices: <none>
[root@localhost avidserver]# mdadm -E /dev/sda1
/dev/sda1:
Magic : a92b4efc
Version : 00.90.00
UUID : 62d8b91d:a2368783:6a78ca50:5793492f
Creation Time : Fri Nov 22 09:13:16 2002
Raid Level : raid5
Device Size : 120053632 (114.49 GiB 122.93 GB)
Raid Devices : 5
Total Devices : 6
Preferred Minor : 4
Update Time : Thu Jan 22 08:42:49 2004
State : dirty, no-errors
Active Devices : 5
Working Devices : 5
Failed Devices : 1
Spare Devices : 0
Checksum : f55e948c - correct
Events : 0.146
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
this 0 8 1 0 active sync
/dev/scsi/host1/bus0/target0/lun0/part1
0 0 8 1 0 active sync
/dev/scsi/host1/bus0/target0/lun0/part1
1 1 8 33 1 active sync
/dev/scsi/host2/bus0/target0/lun0/part1
2 2 8 129 2 active sync
/dev/scsi/host5/bus0/target0/lun0/part1
3 3 8 65 3 active sync
/dev/scsi/host3/bus0/target0/lun0/part1
4 4 8 97 4 active sync
/dev/scsi/host4/bus0/target0/lun0/part1
[root@localhost avidserver]# mdadm -E /dev/sdb1
/dev/sdb1:
Magic : a92b4efc
Version : 00.90.00
UUID : 57f26496:25520b96:41757b62:f83fcb7b
Creation Time : Mon Nov 24 17:36:05 2003
Raid Level : raid5
Device Size : 199141632 (189.92 GiB 203.92 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 6
Update Time : Thu Jan 22 08:43:28 2004
State : dirty, no-errors
Active Devices : 4
Working Devices : 4
Failed Devices : 1
Spare Devices : 0
Checksum : ebd80d56 - correct
Events : 0.137
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
this 0 8 17 0 active sync
/dev/scsi/host1/bus0/target1/lun0/part1
0 0 8 17 0 active sync
/dev/scsi/host1/bus0/target1/lun0/part1
1 1 8 49 1 active sync
/dev/scsi/host2/bus0/target1/lun0/part1
2 2 8 81 2 active sync
/dev/scsi/host3/bus0/target1/lun0/part1
3 3 0 0 3 faulty removed
4 4 8 145 4 active sync
/dev/scsi/host5/bus0/target1/lun0/part1
[root@localhost avidserver]# mdadm -E /dev/sdc1
/dev/sdc1:
Magic : a92b4efc
Version : 00.90.00
UUID : 62d8b91d:a2368783:6a78ca50:5793492f
Creation Time : Fri Nov 22 09:13:16 2002
Raid Level : raid5
Device Size : 120053632 (114.49 GiB 122.93 GB)
Raid Devices : 5
Total Devices : 6
Preferred Minor : 4
Update Time : Thu Jan 22 08:42:49 2004
State : dirty, no-errors
Active Devices : 5
Working Devices : 5
Failed Devices : 1
Spare Devices : 0
Checksum : f55e94ae - correct
Events : 0.146
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
this 1 8 33 1 active sync
/dev/scsi/host2/bus0/target0/lun0/part1
0 0 8 1 0 active sync
/dev/scsi/host1/bus0/target0/lun0/part1
1 1 8 33 1 active sync
/dev/scsi/host2/bus0/target0/lun0/part1
2 2 8 129 2 active sync
/dev/scsi/host5/bus0/target0/lun0/part1
3 3 8 65 3 active sync
/dev/scsi/host3/bus0/target0/lun0/part1
4 4 8 97 4 active sync
/dev/scsi/host4/bus0/target0/lun0/part1
[root@localhost avidserver]# mdadm -E /dev/sdd1
/dev/sdd1:
Magic : a92b4efc
Version : 00.90.00
UUID : 57f26496:25520b96:41757b62:f83fcb7b
Creation Time : Mon Nov 24 17:36:05 2003
Raid Level : raid5
Device Size : 199141632 (189.92 GiB 203.92 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 6
Update Time : Thu Jan 22 08:43:28 2004
State : dirty, no-errors
Active Devices : 4
Working Devices : 4
Failed Devices : 1
Spare Devices : 0
Checksum : ebd80d78 - correct
Events : 0.137
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
this 1 8 49 1 active sync
/dev/scsi/host2/bus0/target1/lun0/part1
0 0 8 17 0 active sync
/dev/scsi/host1/bus0/target1/lun0/part1
1 1 8 49 1 active sync
/dev/scsi/host2/bus0/target1/lun0/part1
2 2 8 81 2 active sync
/dev/scsi/host3/bus0/target1/lun0/part1
3 3 0 0 3 faulty removed
4 4 8 145 4 active sync
/dev/scsi/host5/bus0/target1/lun0/part1
[root@localhost avidserver]# mdadm -E /dev/sde1
/dev/sde1:
Magic : a92b4efc
Version : 00.90.00
UUID : 62d8b91d:a2368783:6a78ca50:5793492f
Creation Time : Fri Nov 22 09:13:16 2002
Raid Level : raid5
Device Size : 120053632 (114.49 GiB 122.93 GB)
Raid Devices : 5
Total Devices : 6
Preferred Minor : 4
Update Time : Thu Jan 22 08:42:49 2004
State : dirty, no-errors
Active Devices : 5
Working Devices : 5
Failed Devices : 1
Spare Devices : 0
Checksum : f55e94d2 - correct
Events : 0.146
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
this 3 8 65 3 active sync
/dev/scsi/host3/bus0/target0/lun0/part1
0 0 8 1 0 active sync
/dev/scsi/host1/bus0/target0/lun0/part1
1 1 8 33 1 active sync
/dev/scsi/host2/bus0/target0/lun0/part1
2 2 8 129 2 active sync
/dev/scsi/host5/bus0/target0/lun0/part1
3 3 8 65 3 active sync
/dev/scsi/host3/bus0/target0/lun0/part1
4 4 8 97 4 active sync
/dev/scsi/host4/bus0/target0/lun0/part1
[root@localhost avidserver]# mdadm -E /dev/sdf1
/dev/sdf1:
Magic : a92b4efc
Version : 00.90.00
UUID : 57f26496:25520b96:41757b62:f83fcb7b
Creation Time : Mon Nov 24 17:36:05 2003
Raid Level : raid5
Device Size : 199141632 (189.92 GiB 203.92 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 6
Update Time : Thu Jan 22 08:43:28 2004
State : dirty, no-errors
Active Devices : 4
Working Devices : 4
Failed Devices : 1
Spare Devices : 0
Checksum : ebd80d9a - correct
Events : 0.137
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
this 2 8 81 2 active sync
/dev/scsi/host3/bus0/target1/lun0/part1
0 0 8 17 0 active sync
/dev/scsi/host1/bus0/target1/lun0/part1
1 1 8 49 1 active sync
/dev/scsi/host2/bus0/target1/lun0/part1
2 2 8 81 2 active sync
/dev/scsi/host3/bus0/target1/lun0/part1
3 3 0 0 3 faulty removed
4 4 8 145 4 active sync
/dev/scsi/host5/bus0/target1/lun0/part1
[root@localhost avidserver]# mdadm -E /dev/sdg1
/dev/sdg1:
Magic : a92b4efc
Version : 00.90.00
UUID : 62d8b91d:a2368783:6a78ca50:5793492f
Creation Time : Fri Nov 22 09:13:16 2002
Raid Level : raid5
Device Size : 120053632 (114.49 GiB 122.93 GB)
Raid Devices : 5
Total Devices : 6
Preferred Minor : 4
Update Time : Thu Jan 22 08:42:49 2004
State : dirty, no-errors
Active Devices : 5
Working Devices : 5
Failed Devices : 1
Spare Devices : 0
Checksum : f55e94f4 - correct
Events : 0.146
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
this 4 8 97 4 active sync
/dev/scsi/host4/bus0/target0/lun0/part1
0 0 8 1 0 active sync
/dev/scsi/host1/bus0/target0/lun0/part1
1 1 8 33 1 active sync
/dev/scsi/host2/bus0/target0/lun0/part1
2 2 8 129 2 active sync
/dev/scsi/host5/bus0/target0/lun0/part1
3 3 8 65 3 active sync
/dev/scsi/host3/bus0/target0/lun0/part1
4 4 8 97 4 active sync
/dev/scsi/host4/bus0/target0/lun0/part1
[root@localhost avidserver]# mdadm -E /dev/sdh1
/dev/sdh1:
Magic : a92b4efc
Version : 00.90.00
UUID : 57f26496:25520b96:41757b62:f83fcb7b
Creation Time : Mon Nov 24 17:36:05 2003
Raid Level : raid5
Device Size : 199141632 (189.92 GiB 203.92 GB)
Raid Devices : 5
Total Devices : 6
Preferred Minor : 6
Update Time : Thu Jan 15 08:18:48 2004
State : dirty, no-errors
Active Devices : 5
Working Devices : 5
Failed Devices : 1
Spare Devices : 0
Checksum : ebcecdda - correct
Events : 0.118
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
this 3 8 113 3 active sync
/dev/scsi/host4/bus0/target1/lun0/part1
0 0 8 17 0 active sync
/dev/scsi/host1/bus0/target1/lun0/part1
1 1 8 49 1 active sync
/dev/scsi/host2/bus0/target1/lun0/part1
2 2 8 81 2 active sync
/dev/scsi/host3/bus0/target1/lun0/part1
3 3 8 113 3 active sync
/dev/scsi/host4/bus0/target1/lun0/part1
4 4 8 145 4 active sync
/dev/scsi/host5/bus0/target1/lun0/part1
[root@localhost avidserver]# mdadm -E /dev/sdi1
/dev/sdi1:
Magic : a92b4efc
Version : 00.90.00
UUID : 62d8b91d:a2368783:6a78ca50:5793492f
Creation Time : Fri Nov 22 09:13:16 2002
Raid Level : raid5
Device Size : 120053632 (114.49 GiB 122.93 GB)
Raid Devices : 5
Total Devices : 6
Preferred Minor : 4
Update Time : Thu Jan 22 08:42:49 2004
State : dirty, no-errors
Active Devices : 5
Working Devices : 5
Failed Devices : 1
Spare Devices : 0
Checksum : f55e9510 - correct
Events : 0.146
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
this 2 8 129 2 active sync
/dev/scsi/host5/bus0/target0/lun0/part1
0 0 8 1 0 active sync
/dev/scsi/host1/bus0/target0/lun0/part1
1 1 8 33 1 active sync
/dev/scsi/host2/bus0/target0/lun0/part1
2 2 8 129 2 active sync
/dev/scsi/host5/bus0/target0/lun0/part1
3 3 8 65 3 active sync
/dev/scsi/host3/bus0/target0/lun0/part1
4 4 8 97 4 active sync
/dev/scsi/host4/bus0/target0/lun0/part1
[root@localhost avidserver]# mdadm -E /dev/sdj1
/dev/sdj1:
Magic : a92b4efc
Version : 00.90.00
UUID : 57f26496:25520b96:41757b62:f83fcb7b
Creation Time : Mon Nov 24 17:36:05 2003
Raid Level : raid5
Device Size : 199141632 (189.92 GiB 203.92 GB)
Raid Devices : 5
Total Devices : 5
Preferred Minor : 6
Update Time : Thu Jan 22 08:43:28 2004
State : dirty, no-errors
Active Devices : 4
Working Devices : 4
Failed Devices : 1
Spare Devices : 0
Checksum : ebd80dde - correct
Events : 0.137
Layout : left-symmetric
Chunk Size : 128K
Number Major Minor RaidDevice State
this 4 8 145 4 active sync
/dev/scsi/host5/bus0/target1/lun0/part1
0 0 8 17 0 active sync
/dev/scsi/host1/bus0/target1/lun0/part1
1 1 8 49 1 active sync
/dev/scsi/host2/bus0/target1/lun0/part1
2 2 8 81 2 active sync
/dev/scsi/host3/bus0/target1/lun0/part1
3 3 0 0 3 faulty removed
4 4 8 145 4 active sync
/dev/scsi/host5/bus0/target1/lun0/part1
* Re: Linux Raid confused about one drive and two arrays
@ 2004-01-23 0:39 ` Neil Brown
From: Neil Brown @ 2004-01-23 0:39 UTC (permalink / raw)
To: AndyLiebman; +Cc: linux-raid
On Thursday January 22, AndyLiebman@aol.com wrote:
> I have just encountered a very disturbing RAID problem. I hope somebody
> understands what happened and can tell me how to fix it.
It doesn't look very serious.
>
> I have two RAID 5 arrays on my Linux machine -- md4 and md6. Each array
> consists of 5 FireWire (1394a) drives -- one partition on each drive, 10 drives in
> total. Because the device IDs on these drives can change, I always use mdadm
> to create and manage my arrays based on UUIDs. I am using mdadm 1.3 on Mandrake
> 9.2 with Mandrake's 2.4.22-21 kernel.
>
> After running these arrays successfully for two months -- rebooting my file
> server every day -- one of my arrays came up in a degraded mode. It looks as if
> the Linux RAID subsystem "thinks" one of my drives belongs to both arrays.
>
> As you can see below, when I run mdadm -E on each of my ten firewire drives,
> mdadm is telling me that for each of the drives in the md4 array (UUID group
> 62d8b91d:a2368783:6a78ca50:5793492f ) there are 5 Raid devices and 6 total
> devices with one failed. However this array always only had 5
> devices.
The "total" and "failed" device counts are (unfortuantely) not very
reliable.
>
> On the other hand, for most of the drives in the md6 array (UUID group
> 57f26496:25520b96:41757b62:f83fcb7b), mdadm is telling me that there are 5 raid
> devices and 5 total devices with one failed.
>
> However, when I run mdadm -E on the drive currently identified as /dev/sdh1
> -- which also belongs to md6 or the UUID group
> 57f26496:25520b96:41757b62:f83fcb7b -- mdadm tells me that sdh1 is part of an array with 6 total devices, 5
> raid devices, one failed.
>
> /dev/sdh1 is identified as device number 3 in the RAID with the UUID
> 57f26496:25520b96:41757b62:f83fcb7b. However, when I run mdadm -E on the other 4
> drives that belong to md6, mdadm tells me that device number 3 is
> faulty.
So presumably md thought that sdh1 failed in some way and removed it
from the array. It updated the superblock on the remaining devices to
say that sdh1 had failed, but it didn't update the superblock on sdh1,
because it had failed, and writing to the superblock would be
pointless.
>
> My questions are:
>
> How do I fix this problem?
Check that sdh1 is ok (do a simple read check) and then
mdadm /dev/md6 -a /dev/sdh1
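Once the device has been re-added, the rebuild progress shows up in /proc/mdstat, e.g.:
cat /proc/mdstat
until the md6 line goes back from [5/4] [UUU_U] to [5/5] [UUUUU].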
> Why did it occur?
Look in your kernel logs to find out when and why sdh1 was removed
from the array.
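For example (the log location is an assumption -- on a stock Mandrake 9.2 install the kernel messages normally land in /var/log/messages):
dmesg | grep -i sdh
grep -iE 'sdh|md6|ieee1394|sbp2' /var/log/messages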
> How can I prevent it from occurring again?
You cannot. Drives fail occasionally. That is why we have raid.
Or maybe a better answer is:
Monitor your RAID arrays and correct problems when they occur.
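mdadm itself can do the monitoring; a sketch (the mail address and polling interval are placeholders, and exact options vary a little between versions):
mdadm --monitor --mail=root --delay=300 /dev/md4 /dev/md6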
NeilBrown
* Re: Linux Raid confused about one drive and two arrays
@ 2004-01-23 3:22 AndyLiebman
From: AndyLiebman @ 2004-01-23 3:22 UTC (permalink / raw)
To: neilb; +Cc: linux-raid
In a message dated 1/22/2004 7:42:48 PM Eastern Standard Time,
neilb@cse.unsw.edu.au writes:
>
> My questions are:
>
> How do I fix this problem?
Check that sdh1 is ok (do a simple read check) and then
mdadm /dev/md6 -a /dev/sdh1
Neil,
Thanks for taking the time to answer my previous email.
Please excuse my follow-up question. How do I do a "simple read check" on a
partition/drive that's been removed from an array but that doesn't have its
own file system on it?
Assuming the "failed removed" drive tests out okay on a read check, I
understand that you're suggesting I add the drive back to my array. But why isn't it
appropriate to Assemble the array with the "force" option? Is it NOT
appropriate to use this option once a drive/partition has been marked as "failed and
removed"? If not, when is "force" appropriate?
Wish there was some more info about these options (like "run" as well). The
manual pages give very brief explanations without examples of where they are
appropriate. I know it's not your job to educate the whole world. Is this
information in any book that one could buy? Is it in the Derek ??? book? Exhaustive
google searches (and searches through the Linux Raid archives) have given me
clues here and there. But no hard and fast rules of thumb from the person who
should know best about mdadm!
That said, I -- and others -- certainly appreciate all you contribute to
Linux!
Regards,
Andy Liebman
-----------------------------------------------------------------------------
FOR REFERENCE, FROM YOUR REPLY (QUOTED IN FULL ABOVE):
> Or maybe a better answer is:
> Monitor your RAID arrays and correct problems when they occur.
BY THE WAY, I WAS MONITORING MY ARRAYS -- WHICH IS WHY I PICKED UP THE
PROBLEM JUST AFTER IT OCCURRED.
* Re: Linux Raid confused about one drive and two arrays
@ 2004-01-23 3:27 ` Neil Brown
From: Neil Brown @ 2004-01-23 3:27 UTC (permalink / raw)
To: AndyLiebman; +Cc: linux-raid
On Thursday January 22, AndyLiebman@aol.com wrote:
> In a message dated 1/22/2004 7:42:48 PM Eastern Standard Time,
> neilb@cse.unsw.edu.au writes:
>
> >
> > My questions are:
> >
> > How do I fix this problem?
>
> Check that sdh1 is ok (do a simple read check) and then
> mdadm /dev/md6 -a /dev/sdh1
>
> Neil,
>
> Thanks for taking the time to answer my previous email.
>
> Please excuse my follow-up question. How do I do a "simple read check" on a
> partition/drive that's been removed from an array but that doesn't have its
> own file system on it?
dd if=/dev/sdh1 of=/dev/null bs=1024k
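A read-only badblocks scan would be another way to do the same check, assuming the badblocks utility from e2fsprogs is installed (an alternative sketch, nothing more):
badblocks -sv /dev/sdh1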
>
> Assuming the "failed removed" drive tests out okay on a read check, I
> understand that you're suggesting I add the drive back to my array. But why isn't it
> appropriate to Assemble the array with the "force" option? Is it NOT
> appropriate to use this option once a drive/partition has been marked as "failed and
> removed"? If not, when is "force" appropriate?
--force is a last resort. It is needed if two drives have fallen out
of a raid5 array. --force assumes the data on the devices is still
correct and reasonably consistent even though some drives appear to
have failed.
In your case, you haven't lost any data, so you want to simply add a
known-good drive into the array and let it rebuild.
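Purely as an illustration of that last-resort case (not something needed here), a forced assembly after a double failure would look something like:
mdadm --assemble --force /dev/md6 --uuid=57f26496:25520b96:41757b62:f83fcb7b /dev/sd*
whereas the normal recovery in your situation is just the hot-add already given above.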
>
> Wish there was some more info about these options (like "run" as well). The
> manual pages give very brief explanations without examples of where they are
> appropriate. I know it's not your job to educate the whole world. Is this
> information in any book that one could buy? Is it in the Derek ??? book? Exhaustive
> google searches (and searches through the Linux Raid archives) have given me
> clues here and there. But no hard and fast rules of thumb from the person who
> should know best about mdadm!
No, there isn't any really good, thorough documentation of these sorts
of things.
NeilBrown