OK, Now this is really weird

Linux RAID subsystem development

* OK, Now this is really weird
@ 2011-02-26  7:00 Leslie Rhorer
  2011-02-26  7:36 ` Jeff Woods
  0 siblings, 1 reply; 8+ messages in thread
From: Leslie Rhorer @ 2011-02-26  7:00 UTC (permalink / raw)
  To: 'Linux RAID'


	I have a pair of drives each of whose 3 partitions are members of a
set of 3 RAID arrays.  One of the two drives had a flaky power connection
which I thought I had fixed, but I guess not, because the drive was taken
offline again on Tuesday.  The significant issue, however, is that both
times the drive failed, mdadm behaved really oddly.  The first time I
thought it might just be some odd anomaly, but the second time it did
precisely the same thing.  Both times, when the drive was de-registered by
udev, the first two arrays properly responded to the failure, but the third
array did not.  Here is the layout:

ARRAY /dev/md1 metadata=0.90 UUID=4cde286c:0687556a:4d9996dd:dd23e701
ARRAY /dev/md2 metadata=1.2 name=Backup:2 UUID=d45ff663:9e53774c:6fcf9968:21692025
ARRAY /dev/md3 metadata=1.2 name=Backup:3 UUID=51d22c47:10f58974:0b27ef04:5609d357


	Here is the result from examining the live partitions:

/dev/sdl1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 4cde286c:0687556a:4d9996dd:dd23e701 (local to host Backup)
  Creation Time : Fri Jun 11 20:45:51 2010
     Raid Level : raid1
  Used Dev Size : 6144704 (5.86 GiB 6.29 GB)
     Array Size : 6144704 (5.86 GiB 6.29 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1

    Update Time : Sat Feb 26 00:47:19 2011
          State : clean
Internal Bitmap : present
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0
       Checksum : c127a1bf - correct
         Events : 1014


      Number   Major   Minor   RaidDevice State
this     1       8      177        1      active sync   /dev/sdl1

   0     0       0        0        0      removed
   1     1       8      177        1      active sync   /dev/sdl1


/dev/sdl2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : d45ff663:9e53774c:6fcf9968:21692025
           Name : Backup:2  (local to host Backup)
  Creation Time : Sat Dec 19 22:59:43 2009
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 554884828 (264.59 GiB 284.10 GB)
     Array Size : 554884828 (264.59 GiB 284.10 GB)
    Data Offset : 272 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : e0896263:c0f95d43:9c0cb92a:79a95210

Internal Bitmap : 8 sectors from superblock
    Update Time : Sat Feb 26 00:47:18 2011
       Checksum : 41881e60 - correct
         Events : 902752


   Device Role : Active device 1
   Array State : .A ('A' == active, '.' == missing)


/dev/sdl3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 51d22c47:10f58974:0b27ef04:5609d357
           Name : Backup:3  (local to host Backup)
  Creation Time : Sat May 29 14:16:22 2010
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 409593096 (195.31 GiB 209.71 GB)
     Array Size : 409593096 (195.31 GiB 209.71 GB)
    Data Offset : 144 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 982c9519:48d21940:3720b6d5:dfb0a312

Internal Bitmap : 8 sectors from superblock
    Update Time : Wed Feb  9 20:02:26 2011
       Checksum : 6c78f4a2 - correct
         Events : 364740


   Device Role : Active device 1
   Array State : AA ('A' == active, '.' == missing)


	Here are the array details:

/dev/md1:
        Version : 0.90
  Creation Time : Fri Jun 11 20:45:51 2010
     Raid Level : raid1
     Array Size : 6144704 (5.86 GiB 6.29 GB)
  Used Dev Size : 6144704 (5.86 GiB 6.29 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sat Feb 26 00:53:23 2011
          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 4cde286c:0687556a:4d9996dd:dd23e701 (local to host Backup)
         Events : 0.1016

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8      177        1      active sync   /dev/sdl1

       2       8      161        -      faulty spare

/dev/md2:
        Version : 1.2
  Creation Time : Sat Dec 19 22:59:43 2009
     Raid Level : raid1
     Array Size : 277442414 (264.59 GiB 284.10 GB)
  Used Dev Size : 277442414 (264.59 GiB 284.10 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sat Feb 26 00:53:47 2011
          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : Backup:2  (local to host Backup)
           UUID : d45ff663:9e53774c:6fcf9968:21692025
         Events : 902890

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       3       8      178        1      active sync   /dev/sdl2

       2       8      162        -      faulty spare

/dev/md3:
        Version : 1.2
  Creation Time : Sat May 29 14:16:22 2010
     Raid Level : raid1
     Array Size : 204796548 (195.31 GiB 209.71 GB)
  Used Dev Size : 204796548 (195.31 GiB 209.71 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Wed Feb  9 20:02:26 2011
          State : active, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           Name : Backup:3  (local to host Backup)
           UUID : 51d22c47:10f58974:0b27ef04:5609d357
         Events : 364740

    Number   Major   Minor   RaidDevice State
       2       8      163        0      faulty spare rebuilding
       3       8      179        1      active sync   /dev/sdl3

	So what gives?  /dev/sdk3 no longer even exists, so why hasn't it
been failed and removed on /dev/md3 like it has on /dev/md1 and /dev/md2?
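
A minimal Python sketch for comparing the superblocks above; it is not part
of the original report. It assumes text laid out like the `mdadm --examine`
dumps shown here, and the names summarize_examine and SAMPLE are made up for
illustration. It pulls out the update time, event count and array state per
member so that a superblock which has not been rewritten stands out.

```python
import re
from datetime import datetime

# Excerpt of the --examine output quoted above (sdl2 and sdl3 only).
SAMPLE = """\
/dev/sdl2:
    Update Time : Sat Feb 26 00:47:18 2011
         Events : 902752
    Array State : .A ('A' == active, '.' == missing)
/dev/sdl3:
    Update Time : Wed Feb  9 20:02:26 2011
         Events : 364740
    Array State : AA ('A' == active, '.' == missing)
"""


def summarize_examine(text):
    """Map each '/dev/xxx:' section to its update time, events and state."""
    summary = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        header = re.fullmatch(r"(/dev/\S+):", stripped)
        if header:
            current = summary.setdefault(header.group(1), {})
            continue
        if current is None or " : " not in stripped:
            continue
        key, value = (part.strip() for part in stripped.split(" : ", 1))
        if key == "Update Time":
            # Collapse the padded day ("Feb  9") before parsing.
            normalized = " ".join(value.split())
            current[key] = datetime.strptime(normalized, "%a %b %d %H:%M:%S %Y")
        elif key == "Events":
            current[key] = value  # kept as text; 0.90 arrays may show "0.1016"
        elif key == "Array State":
            current[key] = value.split()[0]  # e.g. ".A" or "AA"
    return summary


if __name__ == "__main__":
    for dev, info in summarize_examine(SAMPLE).items():
        print(dev, info["Update Time"].date(), info["Events"], info["Array State"])
```

On the excerpt, the sdl3 superblock still carries the Feb 9 update time and
an "AA" state, while sdl2 was updated on Feb 26 and shows ".A".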



* Re: OK, Now this is really weird
  2011-02-26  7:00 OK, Now this is really weird Leslie Rhorer
@ 2011-02-26  7:36 ` Jeff Woods
  2011-02-26 11:20   ` Leslie Rhorer
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Woods @ 2011-02-26  7:36 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

Quoting Leslie Rhorer <lrhorer@satx.rr.com>:
> 	I have a pair of drives each of whose 3 partitions are members of a
> set of 3 RAID arrays.  One of the two drives had a flaky power connection
> which I thought I had fixed, but I guess not, because the drive was taken
> offline again on Tuesday.  The significant issue, however, is that both
> times the drive failed, mdadm behaved really oddly.  The first time I
> thought it might just be some odd anomaly, but the second time it did
> precisely the same thing.  Both times, when the drive was de-registered by
> udev, the first two arrays properly responded to the failure, but the third
> array did not.  Here is the layout:

[snip lots of technical details]

> 	So what gives?  /dev/sdk3 no longer even exists, so why hasn't it
> been failed and removed on /dev /md3 like it has on /dev/md1 and /dev/md2?

Is it possible there has been no I/O request for /dev/md3 since  
/dev/sdk failed?
-- 
Jeff Woods <jeff@jeffwoods.us>


* RE: OK, Now this is really weird
  2011-02-26  7:36 ` Jeff Woods
@ 2011-02-26 11:20   ` Leslie Rhorer
  2011-02-26 11:35     ` Mathias Burén
  0 siblings, 1 reply; 8+ messages in thread
From: Leslie Rhorer @ 2011-02-26 11:20 UTC (permalink / raw)
  To: 'Jeff Woods'; +Cc: 'Linux RAID'



> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Jeff Woods
> Sent: Saturday, February 26, 2011 1:36 AM
> To: lrhorer@satx.rr.com
> Cc: 'Linux RAID'
> Subject: Re: OK, Now this is really weird
> 
> Quoting Leslie Rhorer <lrhorer@satx.rr.com>:
> > 	I have a pair of drives each of whose 3 partitions are members of a
> > set of 3 RAID arrays.  One of the two drives had a flaky power
> connection
> > which I thought I had fixed, but I guess not, because the drive was
> taken
> > offline again on Tuesday.  The significant issue, however, is that both
> > times the drive failed, mdadm behaved really oddly.  The first time I
> > thought it might just be some odd anomaly, but the second time it did
> > precisely the same thing.  Both times, when the drive was de-registered
> by
> > udev, the first two arrays properly responded to the failure, but the
> third
> > array did not.  Here is the layout:
> 
> [snip lots of technical details]
> 
> > 	So what gives?  /dev/sdk3 no longer even exists, so why hasn't it
> > been failed and removed on /dev /md3 like it has on /dev/md1 and
> /dev/md2?
> 
> Is it possible there has been no I/O request for /dev/md3 since
> /dev/sdk failed?

	Well, I thought about that.  It's swap space, so I suppose it's
possible.  I would have thought, however, that mdadm would fail a missing
member whether there is any I/O or not.



* Re: OK, Now this is really weird
  2011-02-26 11:20   ` Leslie Rhorer
@ 2011-02-26 11:35     ` Mathias Burén
  2011-02-26 21:34       ` NeilBrown
  2011-02-27  7:15       ` Leslie Rhorer
  0 siblings, 2 replies; 8+ messages in thread
From: Mathias Burén @ 2011-02-26 11:35 UTC (permalink / raw)
  To: lrhorer; +Cc: Jeff Woods, Linux RAID

On 26 February 2011 11:20, Leslie Rhorer <lrhorer@satx.rr.com> wrote:
>
>
>> -----Original Message-----
>> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
>> owner@vger.kernel.org] On Behalf Of Jeff Woods
>> Sent: Saturday, February 26, 2011 1:36 AM
>> To: lrhorer@satx.rr.com
>> Cc: 'Linux RAID'
>> Subject: Re: OK, Now this is really weird
>>
>> Quoting Leslie Rhorer <lrhorer@satx.rr.com>:
>> >     I have a pair of drives each of whose 3 partitions are members of a
>> > set of 3 RAID arrays.  One of the two drives had a flaky power
>> connection
>> > which I thought I had fixed, but I guess not, because the drive was
>> taken
>> > offline again on Tuesday.  The significant issue, however, is that both
>> > times the drive failed, mdadm behaved really oddly.  The first time I
>> > thought it might just be some odd anomaly, but the second time it did
>> > precisely the same thing.  Both times, when the drive was de-registered
>> by
>> > udev, the first two arrays properly responded to the failure, but the
>> third
>> > array did not.  Here is the layout:
>>
>> [snip lots of technical details]
>>
>> >     So what gives?  /dev/sdk3 no longer even exists, so why hasn't it
>> > been failed and removed on /dev /md3 like it has on /dev/md1 and
>> /dev/md2?
>>
>> Is it possible there has been no I/O request for /dev/md3 since
>> /dev/sdk failed?
>
>        Well, I thought about that.  It's swap space, so I suppose it's
> possible.  I would have thought, however, that mdadm would fail a missing
> member whether there is any I/O or not.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

I thought so as well. But how will mdadm know if the device is faulty
unless the device is generating errors (which usually happens only on
a read or write)?

// Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: OK, Now this is really weird
  2011-02-26 11:35     ` Mathias Burén
@ 2011-02-26 21:34       ` NeilBrown
  2011-02-27  7:22         ` Leslie Rhorer
  2011-02-27  7:15       ` Leslie Rhorer
  1 sibling, 1 reply; 8+ messages in thread
From: NeilBrown @ 2011-02-26 21:34 UTC (permalink / raw)
  To: Mathias Burén; +Cc: lrhorer, Jeff Woods, Linux RAID

On Sat, 26 Feb 2011 11:35:11 +0000 Mathias Burén <mathias.buren@gmail.com>
wrote:

> On 26 February 2011 11:20, Leslie Rhorer <lrhorer@satx.rr.com> wrote:
> >
> >
> >> -----Original Message-----
> >> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> >> owner@vger.kernel.org] On Behalf Of Jeff Woods
> >> Sent: Saturday, February 26, 2011 1:36 AM
> >> To: lrhorer@satx.rr.com
> >> Cc: 'Linux RAID'
> >> Subject: Re: OK, Now this is really weird
> >>
> >> Quoting Leslie Rhorer <lrhorer@satx.rr.com>:
> >> >     I have a pair of drives each of whose 3 partitions are members of a
> >> > set of 3 RAID arrays.  One of the two drives had a flaky power
> >> connection
> >> > which I thought I had fixed, but I guess not, because the drive was
> >> taken
> >> > offline again on Tuesday.  The significant issue, however, is that both
> >> > times the drive failed, mdadm behaved really oddly.  The first time I
> >> > thought it might just be some odd anomaly, but the second time it did
> >> > precisely the same thing.  Both times, when the drive was de-registered
> >> by
> >> > udev, the first two arrays properly responded to the failure, but the
> >> third
> >> > array did not.  Here is the layout:
> >>
> >> [snip lots of technical details]
> >>
> >> >     So what gives?  /dev/sdk3 no longer even exists, so why hasn't it
> >> > been failed and removed on /dev /md3 like it has on /dev/md1 and
> >> /dev/md2?
> >>
> >> Is it possible there has been no I/O request for /dev/md3 since
> >> /dev/sdk failed?
> >
> >        Well, I thought about that.  It's swap space, so I suppose it's
> > possible.  I would have thought, however, that mdadm would fail a missing
> > member whether there is any I/O or not.
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> 
> I thought so as well. But how will mdadm know is the device is faulty,
> unless the device is generating errors? (which usually only happens on
> read and/or write)

With very recent mdadm the command

   mdadm -If sdXX

will find any md array that has /dev/sdXX as a member and will fail and
remove it.
Note that the device name is 'sdXX', not '/dev/something'.  This is because,
at the time you want to do this, udev has probably removed all trace of the
device from /dev, so you need to use the name shown in /proc/mdstat
or /sys/block/mdXX/md/dev-$DEVNAME.
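
As a rough illustration of where those names come from, the following Python
sketch pulls member names out of /proc/mdstat-style text and lists the ones
that no longer have a node under /dev. The sample text is hypothetical (the
thread does not show /proc/mdstat), and parse_mdstat and vanished are
illustrative helper names, not part of mdadm.

```python
import os
import re

# Hypothetical /proc/mdstat excerpt, for illustration only.
SAMPLE = """\
Personalities : [raid1]
md3 : active raid1 sdk3[2](F) sdl3[3]
      204796548 blocks super 1.2 [2/1] [_U]
      bitmap: 0/2 pages [0KB], 65536KB chunk
"""

# A member looks like "sdk3[2]" optionally followed by flags such as "(F)".
MEMBER = re.compile(r"([^\s\[\]]+)\[(\d+)\]((?:\(\w\))*)")


def parse_mdstat(text):
    """Return {array: [(member_name, flags), ...]} from mdstat-style text."""
    arrays = {}
    for line in text.splitlines():
        match = re.match(r"(md\S*) : (.*)", line)
        if match:
            arrays[match.group(1)] = [
                (name, flags) for name, _slot, flags in MEMBER.findall(match.group(2))
            ]
    return arrays


def vanished(arrays, dev_dir="/dev"):
    """List member names that have no device node under dev_dir."""
    return sorted(
        name
        for members in arrays.values()
        for name, _flags in members
        if not os.path.exists(os.path.join(dev_dir, name))
    )


if __name__ == "__main__":
    for name in vanished(parse_mdstat(SAMPLE)):
        print("no device node for", name)
```

Each name reported this way is a candidate for `mdadm -If <name>`; on a live
system the text would be read from /proc/mdstat rather than from SAMPLE.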

You can set up a udev rule to run mdadm like this automatically when a device
is hot-unplugged, something like:

 SUBSYSTEM=="block", ACTION=="remove", RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"

NeilBrown


* RE: OK, Now this is really weird
  2011-02-26 21:34       ` NeilBrown
@ 2011-02-27  7:22         ` Leslie Rhorer
  2011-02-27  7:57           ` NeilBrown
  0 siblings, 1 reply; 8+ messages in thread
From: Leslie Rhorer @ 2011-02-27  7:22 UTC (permalink / raw)
  To: 'NeilBrown'; +Cc: 'Linux RAID'


> > >> >     So what gives?  /dev/sdk3 no longer even exists, so why hasn't
> it
> > >> > been failed and removed on /dev /md3 like it has on /dev/md1 and
> > >> /dev/md2?
> > >>
> > >> Is it possible there has been no I/O request for /dev/md3 since
> > >> /dev/sdk failed?
> > >
> > >        Well, I thought about that.  It's swap space, so I suppose it's
> > > possible.  I would have thought, however, that mdadm would fail a
> missing
> > > member whether there is any I/O or not.
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-raid"
> in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > >
> >
> > I thought so as well. But how will mdadm know is the device is faulty,
> > unless the device is generating errors? (which usually only happens on
> > read and/or write)
> 
> With very recent mdadm the command
> 
>    mdadm -If sdXX
> 
> will find any md array that has /dev/sdXX as a member and will fail and
> remove it.

	No, it's version 3.1.4, and that gives me a "Device or Resource
busy" error.  It does report that it set sdk3 faulty, but the hot remove
fails.

	So how can I remove the drive (so I can add it back)?



* Re: OK, Now this is really weird
  2011-02-27  7:22         ` Leslie Rhorer
@ 2011-02-27  7:57           ` NeilBrown
  0 siblings, 0 replies; 8+ messages in thread
From: NeilBrown @ 2011-02-27  7:57 UTC (permalink / raw)
  To: lrhorer; +Cc: 'Linux RAID'

On Sun, 27 Feb 2011 01:22:41 -0600 "Leslie Rhorer" <lrhorer@satx.rr.com>
wrote:

> 
> > > >> >     So what gives?  /dev/sdk3 no longer even exists, so why hasn't
> > it
> > > >> > been failed and removed on /dev /md3 like it has on /dev/md1 and
> > > >> /dev/md2?
> > > >>
> > > >> Is it possible there has been no I/O request for /dev/md3 since
> > > >> /dev/sdk failed?
> > > >
> > > >        Well, I thought about that.  It's swap space, so I suppose it's
> > > > possible.  I would have thought, however, that mdadm would fail a
> > missing
> > > > member whether there is any I/O or not.
> > > >
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-raid"
> > in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > >
> > >
> > > I thought so as well. But how will mdadm know is the device is faulty,
> > > unless the device is generating errors? (which usually only happens on
> > > read and/or write)
> > 
> > With very recent mdadm the command
> > 
> >    mdadm -If sdXX
> > 
> > will find any md array that has /dev/sdXX as a member and will fail and
> > remove it.
> 
> 	No, it's version 3.1.4, and that gives me a "Device or Resource
> busy" error.  It does report that it set sdk3 faulty, but the hot remove
> fails.
> 
> 	So how can I remove the drive (so I can add it back)?

Maybe:
  mdadm /dev/md3 --remove failed
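
A hedged sketch of how that could be scripted, in Python. It only builds and
prints the commands by default. It assumes the `failed` and `detached`
keywords described in the mdadm man page for --fail and --remove are
available in the installed version; recovery_commands and run are
hypothetical helper names, and the device passed to --re-add is a placeholder,
since the drive may come back under a different name.

```python
import shlex
import subprocess


def recovery_commands(array, readd_device=None):
    """Build mdadm invocations that clear a vanished member from an array."""
    commands = [
        # Mark members whose device has disappeared as faulty.
        ["mdadm", array, "--fail", "detached"],
        # Drop members that are no longer connected to the system.
        ["mdadm", array, "--remove", "detached"],
        # Drop anything else already marked failed.
        ["mdadm", array, "--remove", "failed"],
    ]
    if readd_device:
        # Placeholder device name; check what the drive is called after replug.
        commands.append(["mdadm", array, "--re-add", readd_device])
    return commands


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=False)


if __name__ == "__main__":
    run(recovery_commands("/dev/md3", readd_device="/dev/sdk3"))
```

Running it as is prints the four commands for review; passing dry_run=False
would execute them and needs root.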

NeilBrown


* RE: OK, Now this is really weird
  2011-02-26 11:35     ` Mathias Burén
  2011-02-26 21:34       ` NeilBrown
@ 2011-02-27  7:15       ` Leslie Rhorer
  1 sibling, 0 replies; 8+ messages in thread
From: Leslie Rhorer @ 2011-02-27  7:15 UTC (permalink / raw)
  To: 'Mathias Burén'; +Cc: 'Linux RAID'


> >> >     So what gives?  /dev/sdk3 no longer even exists, so why hasn't it
> >> > been failed and removed on /dev /md3 like it has on /dev/md1 and
> >> /dev/md2?
> >>
> >> Is it possible there has been no I/O request for /dev/md3 since
> >> /dev/sdk failed?
> >
> >        Well, I thought about that.  It's swap space, so I suppose it's
> > possible.  I would have thought, however, that mdadm would fail a
> missing
> > member whether there is any I/O or not.
> >
> 
> I thought so as well. But how will mdadm know is the device is faulty,
> unless the device is generating errors? (which usually only happens on
> read and/or write)

Well, reading here, I believe I have seen posts talking about mdadm waking
up sleeping spindles periodically, thereby killing part of the power saving
functions of "green" drives.  Have those posts been in error?  It's been
days since the drive "failed".


