linux-raid.vger.kernel.org archive mirror
* Failed, but "md: cannot remove active disk..."
@ 2012-05-13 18:21 Michał Sawicz
  2012-05-14 10:22 ` NeilBrown
  0 siblings, 1 reply; 7+ messages in thread
From: Michał Sawicz @ 2012-05-13 18:21 UTC (permalink / raw)
  To: linux-raid


Hey,

I've a weird issue with a RAID6 setup, /proc/mdstat says:

> md126 : active raid6 sda1[3] sdh1[6] sdg1[0](F) sdf1[5] sdi1[1] sdc[8] sdb[7]
>       9767559680 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]

So sdg1 is (F)ailed, yet `mdadm --remove` yields:

> md: cannot remove active disk sdg1 from md126 ...

in dmesg...
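
(The exact invocation isn't quoted above; the attempt was along these lines,
with device names taken from the mdstat output - an illustrative sketch, not
the literal command run:)

  mdadm /dev/md126 --remove /dev/sdg1
  # or the explicit fail-then-remove form:
  mdadm /dev/md126 --fail /dev/sdg1 --remove /dev/sdg1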

`mdadm --examine` shows:

> /dev/sdg1:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : ff9e032c:446ed0bd:fc9473f3:f8e090ed
>            Name : media:store  (local to host media)
>   Creation Time : Tue Sep 13 21:36:43 2011
>      Raid Level : raid6
>    Raid Devices : 7
> 
> Avail Dev Size : 3907024896 (1863.01 GiB 2000.40 GB)
>      Array Size : 19535119360 (9315.07 GiB 10001.98 GB)
>   Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
>           State : clean
>     Device UUID : 4bcee8e2:709419b6:fbeb3a8e:5c9bb68a
> 
>     Update Time : Sat May 12 21:57:27 2012
>        Checksum : ffb03189 - correct
>          Events : 304564
> 
>          Layout : left-symmetric
>      Chunk Size : 512K
> 
>    Device Role : Active device 0
>    Array State : AAAAAAA ('A' == active, '.' == missing)

So that superblock thinks it's active, but that's normal, right? It
simply wasn't updated because of the failure? Others correctly show:

> /dev/sdc:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : ff9e032c:446ed0bd:fc9473f3:f8e090ed
>            Name : media:store  (local to host media)
>   Creation Time : Tue Sep 13 21:36:43 2011
>      Raid Level : raid6
>    Raid Devices : 7
> 
>  Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
>      Array Size : 19535119360 (9315.07 GiB 10001.98 GB)
>   Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
>           State : clean
>     Device UUID : b713fd2b:eef145b0:ce91de0a:9077554b
> 
>     Update Time : Sat May 12 21:57:57 2012
>        Checksum : 80345876 - correct
>          Events : 304581
> 
>          Layout : left-symmetric
>      Chunk Size : 512K
> 
>    Device Role : Active device 2
>    Array State : .AAAAAA ('A' == active, '.' == missing)

Any ideas?

Cheers,
-- 
Michał Sawicz <michal@sawicz.net>



* Re: Failed, but "md: cannot remove active disk..."
  2012-05-13 18:21 Failed, but "md: cannot remove active disk..." Michał Sawicz
@ 2012-05-14 10:22 ` NeilBrown
  2012-05-14 10:53   ` Michał Sawicz
  0 siblings, 1 reply; 7+ messages in thread
From: NeilBrown @ 2012-05-14 10:22 UTC (permalink / raw)
  To: Michał Sawicz; +Cc: linux-raid


On Sun, 13 May 2012 20:21:48 +0200 Michał Sawicz <michal@sawicz.net> wrote:

> Hey,
> 
> I've a weird issue with a RAID6 setup, /proc/mdstat says:
> 
> > md126 : active raid6 sda1[3] sdh1[6] sdg1[0](F) sdf1[5] sdi1[1] sdc[8] sdb[7]
> >       9767559680 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
> 
> So sdg1 is (F)ailed, yet `mdadm --remove` yields:
> 
> > md: cannot remove active disk sdg1 from md126 ...

There is a period of time between when a device fails and when the raid456
module finally lets go of it so it can be removed.  You seem to be in this
period of time.
Normally it is very short.  It needs to wait for any requests that have
already been sent to the device to complete (probably with failure) and
very shortly after that it should be released.  So this is normally much less
than one second, but could be several seconds if some excessive retry is
happening.

But I'm guessing you have waited more than a few seconds.

I vaguely recall a bug in the not too distant past whereby RAID456 wouldn't
let go of a device quite as soon as it should.  Unfortunately I don't
remember the details.  You might be able to trigger it to release the drive
by adding a spare - if you have one - or maybe by just
  echo sync > /sys/block/md126/md/sync_action
it won't actually do a sync, but it might check things enough to make
progress.
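
A minimal sketch of both nudges (the spare device name /dev/sdX1 below is
only a placeholder, not something known from this thread):

  # offer a spare - md may then let go of the faulty member
  mdadm /dev/md126 --add /dev/sdX1

  # or poke the recovery machinery without doing a real resync
  echo sync > /sys/block/md126/md/sync_action

  # then retry the removal
  mdadm /dev/md126 --remove /dev/sdg1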

What kernel are you using?

NeilBrown


> 
> in dmesg...
> 
> `mdadm --examine` shows:
> 
> > /dev/sdg1:
> >           Magic : a92b4efc
> >         Version : 1.2
> >     Feature Map : 0x0
> >      Array UUID : ff9e032c:446ed0bd:fc9473f3:f8e090ed
> >            Name : media:store  (local to host media)
> >   Creation Time : Tue Sep 13 21:36:43 2011
> >      Raid Level : raid6
> >    Raid Devices : 7
> > 
> > Avail Dev Size : 3907024896 (1863.01 GiB 2000.40 GB)
> >      Array Size : 19535119360 (9315.07 GiB 10001.98 GB)
> >   Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
> >     Data Offset : 2048 sectors
> >    Super Offset : 8 sectors
> >           State : clean
> >     Device UUID : 4bcee8e2:709419b6:fbeb3a8e:5c9bb68a
> > 
> >     Update Time : Sat May 12 21:57:27 2012
> >        Checksum : ffb03189 - correct
> >          Events : 304564
> > 
> >          Layout : left-symmetric
> >      Chunk Size : 512K
> > 
> >    Device Role : Active device 0
> >    Array State : AAAAAAA ('A' == active, '.' == missing)
> 
> So that superblock thinks it's active, but that's normal, right? It
> simply wasn't updated because of the failure? Others correctly show:
> 
> > /dev/sdc:
> >           Magic : a92b4efc
> >         Version : 1.2
> >     Feature Map : 0x0
> >      Array UUID : ff9e032c:446ed0bd:fc9473f3:f8e090ed
> >            Name : media:store  (local to host media)
> >   Creation Time : Tue Sep 13 21:36:43 2011
> >      Raid Level : raid6
> >    Raid Devices : 7
> > 
> >  Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
> >      Array Size : 19535119360 (9315.07 GiB 10001.98 GB)
> >   Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
> >     Data Offset : 2048 sectors
> >    Super Offset : 8 sectors
> >           State : clean
> >     Device UUID : b713fd2b:eef145b0:ce91de0a:9077554b
> > 
> >     Update Time : Sat May 12 21:57:57 2012
> >        Checksum : 80345876 - correct
> >          Events : 304581
> > 
> >          Layout : left-symmetric
> >      Chunk Size : 512K
> > 
> >    Device Role : Active device 2
> >    Array State : .AAAAAA ('A' == active, '.' == missing)
> 
> Any ideas?
> 
> Cheers,




* Re: Failed, but "md: cannot remove active disk..."
  2012-05-14 10:22 ` NeilBrown
@ 2012-05-14 10:53   ` Michał Sawicz
  2012-05-14 11:36     ` NeilBrown
  0 siblings, 1 reply; 7+ messages in thread
From: Michał Sawicz @ 2012-05-14 10:53 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


On Mon, 2012-05-14 at 20:22 +1000, NeilBrown wrote:
> On Sun, 13 May 2012 20:21:48 +0200 Michał Sawicz <michal@sawicz.net> wrote:
> 
> > Hey,
> > 
> > I've a weird issue with a RAID6 setup, /proc/mdstat says:
> > 
> > > md126 : active raid6 sda1[3] sdh1[6] sdg1[0](F) sdf1[5] sdi1[1] sdc[8] sdb[7]
> > >       9767559680 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
> > 
> > So sdg1 is (F)ailed, yet `mdadm --remove` yields:
> > 
> > > md: cannot remove active disk sdg1 from md126 ...
> 
> There is a period of time between when a device fails and when the raid456
> module finally lets go of it so it can be removed.  You seem to be in this
> period of time.
> Normally it is very short.  It needs to wait for any requests that have
> already been sent to the device to complete (probably with failure) and
> very shortly after that it should be released.  So this is normally much less
> than one second, but could be several seconds if some excessive retry is
> happening.
> 
> But I'm guessing you have waited more than a few seconds.

Yup :)

> I vaguely recall a bug in the not too distant past whereby RAID456 wouldn't
> let go of a device quite as soon as it should.  Unfortunately I don't
> remember the details.  You might be able to trigger it to release the drive
> by adding a spare - if you have one - or maybe by just
>   echo sync > /sys/block/md126/md/sync_action
> it won't actually do a sync, but it might check things enough to make
> progress.

# echo sync > /sys/block/md126/md/sync_action
-bash: echo: write error: Device or resource busy

eh?

> What kernel are you using?

# uname -a
Linux media 2.6.38-gentoo-r6 #2 SMP Tue Sep 13 19:13:42 CEST 2011 x86_64
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ AuthenticAMD GNU/Linux

Thanks,
-- 
Michał Sawicz <michal@sawicz.net>



* Re: Failed, but "md: cannot remove active disk..."
  2012-05-14 10:53   ` Michał Sawicz
@ 2012-05-14 11:36     ` NeilBrown
  2012-05-14 11:44       ` Michał Sawicz
  0 siblings, 1 reply; 7+ messages in thread
From: NeilBrown @ 2012-05-14 11:36 UTC (permalink / raw)
  To: Michał Sawicz; +Cc: linux-raid


On Mon, 14 May 2012 12:53:00 +0200 Michał Sawicz <michal@sawicz.net> wrote:

> On Mon, 2012-05-14 at 20:22 +1000, NeilBrown wrote:
> > On Sun, 13 May 2012 20:21:48 +0200 Michał Sawicz <michal@sawicz.net> wrote:
> > 
> > > Hey,
> > > 
> > > I've a weird issue with a RAID6 setup, /proc/mdstat says:
> > > 
> > > > md126 : active raid6 sda1[3] sdh1[6] sdg1[0](F) sdf1[5] sdi1[1] sdc[8] sdb[7]
> > > >       9767559680 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/6] [_UUUUUU]
> > > 
> > > So sdg1 is (F)ailed, yet `mdadm --remove` yields:
> > > 
> > > > md: cannot remove active disk sdg1 from md126 ...
> > 
> > There is a period of time between when a device fails and when the raid456
> > module finally lets go of it so it can be removed.  You seem to be in this
> > period of time.
> > Normally it is very short.  It needs to wait for any requests that have
> > already been sent to the device to complete (probably with failure) and
> > very shortly after that it should be released.  So this is normally much less
> > than one second, but could be several seconds if some excessive retry is
> > happening.
> > 
> > But I'm guessing you have waited more than a few seconds.
> 
> Yup :)
> 
> > I vaguely recall a bug in the not too distant past whereby RAID456 wouldn't
> > let go of a device quite as soon as it should.  Unfortunately I don't
> > remember the details.  You might be able to trigger it to release the drive
> > by adding a spare - if you have one - or maybe by just
> >   echo sync > /sys/block/md126/md/sync_action
> > it won't actually do a sync, but it might check things enough to make
> > progress.
> 
> # echo sync > /sys/block/md126/md/sync_action
> -bash: echo: write error: Device or resource busy

Hmmm....

Looks like MD_RECOVERY_NEEDED is already set.
But remove_and_add_spares() isn't removing the failed device
from the array.

I cannot find anything since 2.6.38 that looks like your symptoms.

Is the array still functioning?
Are there any interesting messages appearing in the kernel logs?

What does
  grep . /sys/block/md126/md/dev*/*
show?

NeilBrown


> 
> eh?
> 
> > What kernel are you using?
> 
> # uname -a
> Linux media 2.6.38-gentoo-r6 #2 SMP Tue Sep 13 19:13:42 CEST 2011 x86_64
> AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ AuthenticAMD GNU/Linux
> 
> Thanks,




* Re: Failed, but "md: cannot remove active disk..."
  2012-05-14 11:36     ` NeilBrown
@ 2012-05-14 11:44       ` Michał Sawicz
  2012-05-15  3:38         ` NeilBrown
  0 siblings, 1 reply; 7+ messages in thread
From: Michał Sawicz @ 2012-05-14 11:44 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


On Mon, 2012-05-14 at 21:36 +1000, NeilBrown wrote:
> Is the array still functioning?
> Are there any interesting messages appearing in the kernel logs?

Yes and no - the array is fine and working; there's nothing interesting in
any logs that I can see.

> What does
>   grep . /sys/block/md126/md/dev*/*
> show?

# grep . /sys/block/md126/md/dev*/*
/sys/block/md126/md/dev-sda1/errors:0
/sys/block/md126/md/dev-sda1/offset:2048
/sys/block/md126/md/dev-sda1/recovery_start:none
/sys/block/md126/md/dev-sda1/size:1953512448
/sys/block/md126/md/dev-sda1/slot:3
/sys/block/md126/md/dev-sda1/state:in_sync
/sys/block/md126/md/dev-sdb/errors:0
/sys/block/md126/md/dev-sdb/offset:2048
/sys/block/md126/md/dev-sdb/recovery_start:none
/sys/block/md126/md/dev-sdb/size:1953513560
/sys/block/md126/md/dev-sdb/slot:5
/sys/block/md126/md/dev-sdb/state:in_sync
/sys/block/md126/md/dev-sdc/errors:0
/sys/block/md126/md/dev-sdc/offset:2048
/sys/block/md126/md/dev-sdc/recovery_start:none
/sys/block/md126/md/dev-sdc/size:1953513560
/sys/block/md126/md/dev-sdc/slot:2
/sys/block/md126/md/dev-sdc/state:in_sync
/sys/block/md126/md/dev-sdf1/errors:0
/sys/block/md126/md/dev-sdf1/offset:2048
/sys/block/md126/md/dev-sdf1/recovery_start:none
/sys/block/md126/md/dev-sdf1/size:1953512448
/sys/block/md126/md/dev-sdf1/slot:4
/sys/block/md126/md/dev-sdf1/state:in_sync
/sys/block/md126/md/dev-sdg1/errors:2776
/sys/block/md126/md/dev-sdg1/offset:2048
/sys/block/md126/md/dev-sdg1/recovery_start:none
/sys/block/md126/md/dev-sdg1/size:1953512448
/sys/block/md126/md/dev-sdg1/slot:0
/sys/block/md126/md/dev-sdg1/state:faulty
/sys/block/md126/md/dev-sdh1/errors:0
/sys/block/md126/md/dev-sdh1/offset:2048
/sys/block/md126/md/dev-sdh1/recovery_start:none
/sys/block/md126/md/dev-sdh1/size:1953512448
/sys/block/md126/md/dev-sdh1/slot:6
/sys/block/md126/md/dev-sdh1/state:in_sync
/sys/block/md126/md/dev-sdi1/errors:0
/sys/block/md126/md/dev-sdi1/offset:2048
/sys/block/md126/md/dev-sdi1/recovery_start:none
/sys/block/md126/md/dev-sdi1/size:1953512448
/sys/block/md126/md/dev-sdi1/slot:1
/sys/block/md126/md/dev-sdi1/state:in_sync

Thanks,
-- 
Michał Sawicz <michal@sawicz.net>



* Re: Failed, but "md: cannot remove active disk..."
  2012-05-14 11:44       ` Michał Sawicz
@ 2012-05-15  3:38         ` NeilBrown
  2012-05-15  7:56           ` Michał Sawicz
  0 siblings, 1 reply; 7+ messages in thread
From: NeilBrown @ 2012-05-15  3:38 UTC (permalink / raw)
  To: Michał Sawicz; +Cc: linux-raid


On Mon, 14 May 2012 13:44:25 +0200 Michał Sawicz <michal@sawicz.net> wrote:

> On Mon, 2012-05-14 at 21:36 +1000, NeilBrown wrote:
> > Is the array still functioning?
> > Are there any interesting messages appearing in the kernel logs?
> 
> Yes and no - the array is fine and working; there's nothing interesting in
> any logs that I can see.
> 
> > What does
> >   grep . /sys/block/md126/md/dev*/*
> > show?
>
//snip//


Thanks. No hints there - all normal.

If you can write to the array, then md_check_recovery must be getting run,
and the array is not read-only. So...

Is there a 'sync' thread still running?  It would be called
    md126_resync

That would stop things from progressing.
If there is, what does
    cat /proc/$PID/stack
show for the relevant PID ??

What about
    cat /sys/block/md126/md/sync_action
??
If that were 'frozen', that might explain it.

    cat /sys/block/md126/md/reshape_position

shows "none" I suspect?

I cannot think of anything else that could be getting in the way.

NeilBrown



* Re: Failed, but "md: cannot remove active disk..."
  2012-05-15  3:38         ` NeilBrown
@ 2012-05-15  7:56           ` Michał Sawicz
  0 siblings, 0 replies; 7+ messages in thread
From: Michał Sawicz @ 2012-05-15  7:56 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid


On Tue, 2012-05-15 at 13:38 +1000, NeilBrown wrote:
> On Mon, 14 May 2012 13:44:25 +0200 Michał Sawicz <michal@sawicz.net>
> wrote:
> 
> > On Mon, 2012-05-14 at 21:36 +1000, NeilBrown wrote:
> > > Is the array still functioning?
> > > Are there any interesting messages appearing in the kernel logs?
> > 
> > Yes and no - the array is fine and working; there's nothing
> > interesting in any logs that I can see.
> > 
> > > What does
> > >   grep . /sys/block/md126/md/dev*/*
> > > show?
> >
> //snip//
> 
> 
> Thanks. No hints there - all normal.
> 
--8<--

I ended up rebooting the machine and managed to get everything going
again. Thanks for the help.
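
(A note for anyone who lands on this thread later: after such a reboot the
failed slot is typically refilled along these lines - the device name is an
assumption, and --add triggers a full rebuild onto the disk:)

  mdadm /dev/md126 --remove /dev/sdg1   # should now succeed
  mdadm /dev/md126 --add /dev/sdg1      # start rebuilding onto the disk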

-- 
Michał Sawicz <michal@sawicz.net>



Thread overview: 7+ messages
2012-05-13 18:21 Failed, but "md: cannot remove active disk..." Michał Sawicz
2012-05-14 10:22 ` NeilBrown
2012-05-14 10:53   ` Michał Sawicz
2012-05-14 11:36     ` NeilBrown
2012-05-14 11:44       ` Michał Sawicz
2012-05-15  3:38         ` NeilBrown
2012-05-15  7:56           ` Michał Sawicz
