* Spare disk not becoming active
@ 2012-12-03 0:04 Tudor Holton
2012-12-19 23:19 ` Tudor Holton
2012-12-20 0:03 ` Roger Heflin
0 siblings, 2 replies; 5+ messages in thread
From: Tudor Holton @ 2012-12-03 0:04 UTC (permalink / raw)
To: linux-raid
Hello,
I'm having some trouble with an array that has become degraded.
The array's current state is:
md101 : active raid1 sdf1[0] sdb1[2](S)
1953511936 blocks [2/1] [U_]
mdadm --detail says:
/dev/md101:
Version : 0.90
Creation Time : Thu Jan 13 14:34:27 2011
Raid Level : raid1
Array Size : 1953511936 (1863.01 GiB 2000.40 GB)
Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 101
Persistence : Superblock is persistent
Update Time : Fri Nov 23 03:23:04 2012
State : clean, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
UUID : 43e92a79:90295495:0a76e71e:56c99031 (local to host barney)
Events : 0.2127
Number Major Minor RaidDevice State
0 8 81 0 active sync /dev/sdf1
1 0 0 1 removed
2 8 17 - spare /dev/sdb1
If I attempt to force the spare to become active, it begins to recover:
$ sudo mdadm -S /dev/md101
mdadm: stopped /dev/md101
$ sudo mdadm --assemble --force --no-degraded /dev/md101 /dev/sdf1 /dev/sdb1
mdadm: /dev/md101 has been started with 1 drive (out of 2) and 1 spare.
$ cat /proc/mdstat
md101 : active raid1 sdf1[0] sdb1[2]
1953511936 blocks [2/1] [U_]
[>....................] recovery = 0.0% (541440/1953511936)
finish=420.8min speed=77348K/sec
The recovery runs for the full estimated time, but at the end the disk simply returns to being a spare.
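(For reference, explicitly hot-removing and re-adding the spare should trigger the same rebuild; a sketch using the device names above:)
$ sudo mdadm /dev/md101 --remove /dev/sdb1
$ sudo mdadm /dev/md101 --add /dev/sdb1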
Neither disk partition reports errors:
$ cat /sys/block/md101/md/dev-sdf1/errors
0
$ cat /sys/block/md101/md/dev-sdb1/errors
0
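(For reference, the raw superblock on each member can also be dumped and compared; a stale event count or an unexpected device role can explain a spare that refuses to be promoted. A diagnostic sketch using the same device names:)
$ sudo mdadm --examine /dev/sdf1
$ sudo mdadm --examine /dev/sdb1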
Are there mdadm logs that would show why this is not recovering properly?
How else can I debug this?
Cheers,
Tudor.
* Re: Spare disk not becoming active
2012-12-03 0:04 Spare disk not becoming active Tudor Holton
@ 2012-12-19 23:19 ` Tudor Holton
2012-12-20 0:03 ` Roger Heflin
1 sibling, 0 replies; 5+ messages in thread
From: Tudor Holton @ 2012-12-19 23:19 UTC (permalink / raw)
To: linux-raid
I don't mean to be rude, but it's been two weeks and my system is still in this state. Bump, anyone?
A thorough search of the web (before I originally posted to this list) turned up nothing: no explanation of why this occurs, only reports that it has happened a number of times. Most reports say that completely stopping and reassembling the array fixes it, but I tried that and the disk still returned to spare. A few reports describe my situation, but none with a complete resolution; eventually the discussion turns to wiping the disks and starting again.
That seems a bit drastic, and I'm concerned that *one* of the disks is faulty but not being reported as such, so I don't want to wipe the superblock off the wrong one. mdadm reports no errors, but SMART indicates there may be a problem with the *active* disk, which is even more worrying, because without making the spare active I can't remove the active disk to test it properly.
Any ideas?
Cheers,
Tudor.
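(For reference, the SMART readings mentioned above can be gathered without removing either disk from the array; a quick sketch assuming smartmontools is installed:)
$ sudo smartctl -a /dev/sdf        # full report; watch Reallocated_Sector_Ct and Current_Pending_Sector
$ sudo smartctl -t long /dev/sdf   # start a long self-test; read the result later with smartctl -l selftest /dev/sdf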
* Re: Spare disk not becoming active
2012-12-03 0:04 Spare disk not becoming active Tudor Holton
2012-12-19 23:19 ` Tudor Holton
@ 2012-12-20 0:03 ` Roger Heflin
2012-12-24 7:24 ` Tudor Holton
1 sibling, 1 reply; 5+ messages in thread
From: Roger Heflin @ 2012-12-20 0:03 UTC (permalink / raw)
To: Tudor Holton; +Cc: linux-raid
On Sun, Dec 2, 2012 at 6:04 PM, Tudor Holton <tudor@smartguide.com.au> wrote:
> Are there mdadm logs that would show why this is not recovering properly?
> How else can I debug this?
Did you look in the various /var/log/messages files (the current one and the rotated ones) to see what they say happened around the time the recovery completed?
There is almost certainly something in there indicating what went wrong.
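(A sketch of the kind of search being suggested; log file names vary by distribution, and Debian/Ubuntu systems log kernel messages to /var/log/syslog or /var/log/kern.log instead. Rotated copies may be compressed, in which case zgrep works on them:)
$ grep -iE 'md101|sd[bf]' /var/log/messages*
$ dmesg | grep -iE 'md101|sd[bf]'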
* Re: Spare disk not becoming active
2012-12-20 0:03 ` Roger Heflin
@ 2012-12-24 7:24 ` Tudor Holton
2012-12-24 15:03 ` Roger Heflin
0 siblings, 1 reply; 5+ messages in thread
From: Tudor Holton @ 2012-12-24 7:24 UTC (permalink / raw)
To: Roger Heflin; +Cc: linux-raid
On 20/12/12 11:03, Roger Heflin wrote:
> Did you look in the various /var/log/messages files (the current one and the rotated ones) to see
> what they say happened around the time the recovery completed?
>
> There is almost certainly something in there indicating what went wrong.
Thanks. I watched the log messages during the recovery. During the last 0.1% (at 99.9%) messages like this appeared:
Dec 24 18:20:32 barney kernel: [2796835.703313] sd 2:0:0:0: [sdf] Unhandled sense code
Dec 24 18:20:32 barney kernel: [2796835.703316] sd 2:0:0:0: [sdf] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Dec 24 18:20:32 barney kernel: [2796835.703320] sd 2:0:0:0: [sdf] Sense Key : Medium Error [current] [descriptor]
Dec 24 18:20:32 barney kernel: [2796835.703325] Descriptor sense data with sense descriptors (in hex):
Dec 24 18:20:32 barney kernel: [2796835.703327] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Dec 24 18:20:32 barney kernel: [2796835.703335] e8 e0 5f 86
Dec 24 18:20:32 barney kernel: [2796835.703339] sd 2:0:0:0: [sdf] Add. Sense: Unrecovered read error - auto reallocate failed
Dec 24 18:20:32 barney kernel: [2796835.703345] sd 2:0:0:0: [sdf] CDB: Read(10): 28 00 e8 e0 5f 7f 00 00 08 00
Dec 24 18:20:32 barney kernel: [2796835.703353] end_request: I/O error, dev sdf, sector 3907018630
Dec 24 18:20:32 barney kernel: [2796835.703366] ata3: EH complete
Dec 24 18:20:32 barney kernel: [2796835.703383] md/raid1:md101: sdf: unrecoverable I/O read error for block 3907018496
Unfortunately, sdf is the active disk in this case. So I guess my only option left is to create a new array and copy over as much as it will let me?
Cheers,
Tudor.
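(To confirm the failure independently of md, the sector named in the log can be probed directly with hdparm; reading is harmless, whereas --write-sector destroys data. The pending/reallocated SMART counters are also worth a look:)
$ sudo hdparm --read-sector 3907018630 /dev/sdf
$ sudo smartctl -A /dev/sdf | grep -i -e reallocated -e pending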
* Re: Spare disk not becoming active
2012-12-24 7:24 ` Tudor Holton
@ 2012-12-24 15:03 ` Roger Heflin
0 siblings, 0 replies; 5+ messages in thread
From: Roger Heflin @ 2012-12-24 15:03 UTC (permalink / raw)
To: Tudor Holton; +Cc: linux-raid
On Mon, Dec 24, 2012 at 1:24 AM, Tudor Holton <tudor@smartguide.com.au> wrote:
> Dec 24 18:20:32 barney kernel: [2796835.703353] end_request: I/O error, dev sdf, sector 3907018630
> Dec 24 18:20:32 barney kernel: [2796835.703383] md/raid1:md101: sdf: unrecoverable I/O read error for block 3907018496
>
> Unfortunately, sdf is the active disk in this case. So I guess my only option left is to create a new array and copy over as much as it will let me?
If you are lucky that may be an unused area of the filesystem and you may not lose any data at all. Worst case, you will probably lose a couple of files; the way to tell which ones is that reading a file that contains the bad block will return an I/O error.
I don't know if there is a better way, but the process you mentioned is probably reasonable: build a new array and copy all of the data over, then add the bad disk back into the new mirror and let it rebuild. Since the rebuild writes to every block, it will either rewrite the weak sector in place or force the disk to relocate it if the sector is too far gone.
Once you have the new array built, make sure you run a check on it once a month or once a week (see the section on scrubbing in "man md"). If you were not doing this before, a sector can slowly go bad in an area that is never read, so the disk (unless it is doing its own test reads) won't know the sector is failing until something like a rebuild touches it, and by then it is too late. If the disk is forced to read the sector, or scans itself, it will find and relocate (or rewrite) the sector before it goes completely bad.
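(The scrubbing described above comes down to the md sysfs interface; a sketch for this array, plus a hypothetical cron entry for a monthly check. Debian-based systems typically ship an equivalent monthly checkarray job with the mdadm package.)
$ sudo sh -c 'echo check > /sys/block/md101/md/sync_action'   # start a scrub; progress appears in /proc/mdstat
$ cat /sys/block/md101/md/mismatch_cnt                        # inspect once it finishes
A hypothetical /etc/cron.d/md-scrub entry for a monthly check:
0 3 1 * * root echo check > /sys/block/md101/md/sync_action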