* Spare disk not becoming active
From: Tudor Holton @ 2012-12-03 0:04 UTC
To: linux-raid
Hallo,
I'm having some trouble with an array I have that has become degraded.
I have an array with this array state:
md101 : active raid1 sdf1[0] sdb1[2](S)
1953511936 blocks [2/1] [U_]
mdadm --detail says:
/dev/md101:
Version : 0.90
Creation Time : Thu Jan 13 14:34:27 2011
Raid Level : raid1
Array Size : 1953511936 (1863.01 GiB 2000.40 GB)
Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 101
Persistence : Superblock is persistent
Update Time : Fri Nov 23 03:23:04 2012
State : clean, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
UUID : 43e92a79:90295495:0a76e71e:56c99031 (local to host barney)
Events : 0.2127
Number Major Minor RaidDevice State
0 8 81 0 active sync /dev/sdf1
1 0 0 1 removed
2 8 17 - spare /dev/sdb1
If I attempt to force the spare to become active it begins to recover:
$ sudo mdadm -S /dev/md101
mdadm: stopped /dev/md101
$ sudo mdadm --assemble --force --no-degraded /dev/md101 /dev/sdf1 /dev/sdb1
mdadm: /dev/md101 has been started with 1 drive (out of 2) and 1 spare.
$ cat /proc/mdstat
md101 : active raid1 sdf1[0] sdb1[2]
1953511936 blocks [2/1] [U_]
[>....................] recovery = 0.0% (541440/1953511936)
finish=420.8min speed=77348K/sec
This runs for the estimated time, but the disk then returns to being a spare.
Neither disk partition reports errors:
$ cat /sys/block/md101/md/dev-sdf1/errors
0
$ cat /sys/block/md101/md/dev-sdb1/errors
0
Are there mdadm logs to find out why this is not recovering properly?
How otherwise do I debug this?
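In case it helps, I can also post the superblocks from both members; as I
understand it, comparing them is done with something like:
$ sudo mdadm --examine /dev/sdf1
$ sudo mdadm --examine /dev/sdb1
$ # compare the Events counters and the per-device State lines between the two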
Cheers,
Tudor.
* Re: Spare disk not becoming active
From: Tudor Holton @ 2012-12-19 23:19 UTC
To: linux-raid
I don't mean to be rude, but it's been two weeks and my system is still
in this state. Bump, anyone?
A thorough search of the web (before I originally posted this to the
list) revealed nothing. No explanation for why this occurs seemed
apparent, only that it has happened a number of times. Most reports
indicate that completely stopping the array and reassembling it fixes
the problem, but I tried that and the disk still returned to spare.
Some reports described my situation, but none of the responses seemed complete.
Eventually those discussions turn to wiping the disks and starting again.
That seems a bit drastic, and I'm concerned that *one* of the disks is
faulty but not being reported as such, so I don't want to pick the
wrong one to wipe the superblock from. mdadm reports no errors, but
SMART indicates there may be a problem with the *active* disk, which is
even more worrying because without making the spare active I can't
remove the active disk to test it properly.
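For reference, the SMART checks I have been running are along these lines
(smartmontools; I can post the full output if that would help):
$ sudo smartctl -a /dev/sdf              # attribute table and the drive's error log
$ sudo smartctl -t long /dev/sdf         # queue a long offline self-test
$ sudo smartctl -l selftest /dev/sdf     # read the self-test results once it finishes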
Any ideas?
Cheers,
Tudor.
On 03/12/12 11:04, Tudor Holton wrote:
> Hallo,
>
> I'm having some trouble with an array I have that has become degraded.
>
> I have an array with this array state:
>
> md101 : active raid1 sdf1[0] sdb1[2](S)
> 1953511936 blocks [2/1] [U_]
>
>
> mdadm --detail says:
>
> /dev/md101:
> Version : 0.90
> Creation Time : Thu Jan 13 14:34:27 2011
> Raid Level : raid1
> Array Size : 1953511936 (1863.01 GiB 2000.40 GB)
> Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
> Raid Devices : 2
> Total Devices : 2
> Preferred Minor : 101
> Persistence : Superblock is persistent
>
> Update Time : Fri Nov 23 03:23:04 2012
> State : clean, degraded
> Active Devices : 1
> Working Devices : 2
> Failed Devices : 0
> Spare Devices : 1
>
> UUID : 43e92a79:90295495:0a76e71e:56c99031 (local to host
> barney)
> Events : 0.2127
>
> Number Major Minor RaidDevice State
> 0 8 81 0 active sync /dev/sdf1
> 1 0 0 1 removed
>
> 2 8 17 - spare /dev/sdb1
>
>
> If I attempt to force the spare to become active it begins to recover:
> $ sudo mdadm -S /dev/md101
> mdadm: stopped /dev/md101
> $ sudo mdadm --assemble --force --no-degraded /dev/md101 /dev/sdf1
> /dev/sdb1
> mdadm: /dev/md101 has been started with 1 drive (out of 2) and 1 spare.
> $ cat /proc/mdstat
> md101 : active raid1 sdf1[0] sdb1[2]
> 1953511936 blocks [2/1] [U_]
> [>....................] recovery = 0.0% (541440/1953511936)
> finish=420.8min speed=77348K/sec
>
> This runs for the allotted time but returns to the state of spare.
>
> Neither disk partition report errors:
> $ cat /sys/block/md101/md/dev-sdf1/errors
> 0
> $ cat /sys/block/md101/md/dev-sdb1/errors
> 0
>
> Are there mdadm logs to find out why this is not recovering properly?
> How otherwise do I debug this?
>
> Cheers,
> Tudor.
>
* Re: Spare disk not becoming active
From: Roger Heflin @ 2012-12-20 0:03 UTC
To: Tudor Holton; +Cc: linux-raid
On Sun, Dec 2, 2012 at 6:04 PM, Tudor Holton <tudor@smartguide.com.au> wrote:
> Hallo,
>
> I'm having some trouble with an array I have that has become degraded.
>
> I have an array with this array state:
>
> md101 : active raid1 sdf1[0] sdb1[2](S)
> 1953511936 blocks [2/1] [U_]
>
>
> mdadm --detail says:
>
> /dev/md101:
> Version : 0.90
> Creation Time : Thu Jan 13 14:34:27 2011
> Raid Level : raid1
> Array Size : 1953511936 (1863.01 GiB 2000.40 GB)
> Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
> Raid Devices : 2
> Total Devices : 2
> Preferred Minor : 101
> Persistence : Superblock is persistent
>
> Update Time : Fri Nov 23 03:23:04 2012
> State : clean, degraded
> Active Devices : 1
> Working Devices : 2
> Failed Devices : 0
> Spare Devices : 1
>
> UUID : 43e92a79:90295495:0a76e71e:56c99031 (local to host barney)
> Events : 0.2127
>
> Number Major Minor RaidDevice State
> 0 8 81 0 active sync /dev/sdf1
> 1 0 0 1 removed
>
> 2 8 17 - spare /dev/sdb1
>
>
> If I attempt to force the spare to become active it begins to recover:
> $ sudo mdadm -S /dev/md101
> mdadm: stopped /dev/md101
> $ sudo mdadm --assemble --force --no-degraded /dev/md101 /dev/sdf1 /dev/sdb1
> mdadm: /dev/md101 has been started with 1 drive (out of 2) and 1 spare.
> $ cat /proc/mdstat
> md101 : active raid1 sdf1[0] sdb1[2]
> 1953511936 blocks [2/1] [U_]
> [>....................] recovery = 0.0% (541440/1953511936)
> finish=420.8min speed=77348K/sec
>
> This runs for the allotted time but returns to the state of spare.
>
> Neither disk partition report errors:
> $ cat /sys/block/md101/md/dev-sdf1/errors
> 0
> $ cat /sys/block/md101/md/dev-sdb1/errors
> 0
>
> Are there mdadm logs to find out why this is not recovering properly? How
> otherwise do I debug this?
>
> Cheers,
> Tudor.
Did you look in the various /var/log/messages files (current and previous
ones) to see what they indicate happened around the time the recovery
completed?
There is almost certainly something in there indicating what went wrong.
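Something along these lines should pull out the relevant entries (adjust
the file names to wherever your distribution writes kernel messages, and
the pattern to your devices):
$ grep -iE 'md101|raid1|sd[bf]' /var/log/messages /var/log/messages.1
$ zgrep -iE 'md101|raid1|sd[bf]' /var/log/messages.*.gz
$ # on Debian/Ubuntu these usually land in /var/log/syslog and /var/log/kern.log instead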
* Re: Spare disk not becoming active
From: Tudor Holton @ 2012-12-24 7:24 UTC
To: Roger Heflin; +Cc: linux-raid
On 20/12/12 11:03, Roger Heflin wrote:
> On Sun, Dec 2, 2012 at 6:04 PM, Tudor Holton <tudor@smartguide.com.au> wrote:
>> Hallo,
>>
>> I'm having some trouble with an array I have that has become degraded.
>>
>> I have an array with this array state:
>>
>> md101 : active raid1 sdf1[0] sdb1[2](S)
>> 1953511936 blocks [2/1] [U_]
>>
>>
>> mdadm --detail says:
>>
>> /dev/md101:
>> Version : 0.90
>> Creation Time : Thu Jan 13 14:34:27 2011
>> Raid Level : raid1
>> Array Size : 1953511936 (1863.01 GiB 2000.40 GB)
>> Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
>> Raid Devices : 2
>> Total Devices : 2
>> Preferred Minor : 101
>> Persistence : Superblock is persistent
>>
>> Update Time : Fri Nov 23 03:23:04 2012
>> State : clean, degraded
>> Active Devices : 1
>> Working Devices : 2
>> Failed Devices : 0
>> Spare Devices : 1
>>
>> UUID : 43e92a79:90295495:0a76e71e:56c99031 (local to host barney)
>> Events : 0.2127
>>
>> Number Major Minor RaidDevice State
>> 0 8 81 0 active sync /dev/sdf1
>> 1 0 0 1 removed
>>
>> 2 8 17 - spare /dev/sdb1
>>
>>
>> If I attempt to force the spare to become active it begins to recover:
>> $ sudo mdadm -S /dev/md101
>> mdadm: stopped /dev/md101
>> $ sudo mdadm --assemble --force --no-degraded /dev/md101 /dev/sdf1 /dev/sdb1
>> mdadm: /dev/md101 has been started with 1 drive (out of 2) and 1 spare.
>> $ cat /proc/mdstat
>> md101 : active raid1 sdf1[0] sdb1[2]
>> 1953511936 blocks [2/1] [U_]
>> [>....................] recovery = 0.0% (541440/1953511936)
>> finish=420.8min speed=77348K/sec
>>
>> This runs for the allotted time but returns to the state of spare.
>>
>> Neither disk partition report errors:
>> $ cat /sys/block/md101/md/dev-sdf1/errors
>> 0
>> $ cat /sys/block/md101/md/dev-sdb1/errors
>> 0
>>
>> Are there mdadm logs to find out why this is not recovering properly? How
>> otherwise do I debug this?
>>
>> Cheers,
>> Tudor.
> Did you look in the various /var/log/messages (current and previous
> ones) to see what it indicated happened the about the time it
> completed?
>
> There is almost certainly something in there indicating what went wrong.
Thanks. I watched the log messages during the recovery. During the
last 0.1% (at 99.9%), messages like this appeared:
Dec 24 18:20:32 barney kernel: [2796835.703313] sd 2:0:0:0: [sdf] Unhandled sense code
Dec 24 18:20:32 barney kernel: [2796835.703316] sd 2:0:0:0: [sdf] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Dec 24 18:20:32 barney kernel: [2796835.703320] sd 2:0:0:0: [sdf] Sense Key : Medium Error [current] [descriptor]
Dec 24 18:20:32 barney kernel: [2796835.703325] Descriptor sense data with sense descriptors (in hex):
Dec 24 18:20:32 barney kernel: [2796835.703327]        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Dec 24 18:20:32 barney kernel: [2796835.703335]        e8 e0 5f 86
Dec 24 18:20:32 barney kernel: [2796835.703339] sd 2:0:0:0: [sdf] Add. Sense: Unrecovered read error - auto reallocate failed
Dec 24 18:20:32 barney kernel: [2796835.703345] sd 2:0:0:0: [sdf] CDB: Read(10): 28 00 e8 e0 5f 7f 00 00 08 00
Dec 24 18:20:32 barney kernel: [2796835.703353] end_request: I/O error, dev sdf, sector 3907018630
Dec 24 18:20:32 barney kernel: [2796835.703366] ata3: EH complete
Dec 24 18:20:32 barney kernel: [2796835.703383] md/raid1:md101: sdf: unrecoverable I/O read error for block 3907018496
Unfortunately, sdf is the active disk in this case. So I guess my only
option left is to create a new array and copy as much over as it will
let me?
Cheers,
Tudor.
* Re: Spare disk not becoming active
From: Roger Heflin @ 2012-12-24 15:03 UTC
To: Tudor Holton; +Cc: linux-raid
On Mon, Dec 24, 2012 at 1:24 AM, Tudor Holton <tudor@smartguide.com.au> wrote:
> On 20/12/12 11:03, Roger Heflin wrote:
>>
>> On Sun, Dec 2, 2012 at 6:04 PM, Tudor Holton <tudor@smartguide.com.au>
>> wrote:
>>>
>>> Hallo,
>>>
>>> I'm having some trouble with an array I have that has become degraded.
>>>
>>> I have an array with this array state:
>>>
>>> md101 : active raid1 sdf1[0] sdb1[2](S)
>>> 1953511936 blocks [2/1] [U_]
>>>
>>>
>>> mdadm --detail says:
>>>
>>> /dev/md101:
>>> Version : 0.90
>>> Creation Time : Thu Jan 13 14:34:27 2011
>>> Raid Level : raid1
>>> Array Size : 1953511936 (1863.01 GiB 2000.40 GB)
>>> Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
>>> Raid Devices : 2
>>> Total Devices : 2
>>> Preferred Minor : 101
>>> Persistence : Superblock is persistent
>>>
>>> Update Time : Fri Nov 23 03:23:04 2012
>>> State : clean, degraded
>>> Active Devices : 1
>>> Working Devices : 2
>>> Failed Devices : 0
>>> Spare Devices : 1
>>>
>>> UUID : 43e92a79:90295495:0a76e71e:56c99031 (local to host
>>> barney)
>>> Events : 0.2127
>>>
>>> Number Major Minor RaidDevice State
>>> 0 8 81 0 active sync /dev/sdf1
>>> 1 0 0 1 removed
>>>
>>> 2 8 17 - spare /dev/sdb1
>>>
>>>
>>> If I attempt to force the spare to become active it begins to recover:
>>> $ sudo mdadm -S /dev/md101
>>> mdadm: stopped /dev/md101
>>> $ sudo mdadm --assemble --force --no-degraded /dev/md101 /dev/sdf1
>>> /dev/sdb1
>>> mdadm: /dev/md101 has been started with 1 drive (out of 2) and 1 spare.
>>> $ cat /proc/mdstat
>>> md101 : active raid1 sdf1[0] sdb1[2]
>>> 1953511936 blocks [2/1] [U_]
>>> [>....................] recovery = 0.0% (541440/1953511936)
>>> finish=420.8min speed=77348K/sec
>>>
>>> This runs for the allotted time but returns to the state of spare.
>>>
>>> Neither disk partition report errors:
>>> $ cat /sys/block/md101/md/dev-sdf1/errors
>>> 0
>>> $ cat /sys/block/md101/md/dev-sdb1/errors
>>> 0
>>>
>>> Are there mdadm logs to find out why this is not recovering properly?
>>> How
>>> otherwise do I debug this?
>>>
>>> Cheers,
>>> Tudor.
>>
>> Did you look in the various /var/log/messages (current and previous
>> ones) to see what it indicated happened the about the time it
>> completed?
>>
>> There is almost certainly something in there indicating what went wrong.
>
> Thanks. I watched the logs messages during the recovery. During the last
> 0.1% (at 99.9%) messages like this appeared:
> Dec 24 18:20:32 barney kernel: [2796835.703313] sd 2:0:0:0: [sdf] Unhandled sense code
> Dec 24 18:20:32 barney kernel: [2796835.703316] sd 2:0:0:0: [sdf] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> Dec 24 18:20:32 barney kernel: [2796835.703320] sd 2:0:0:0: [sdf] Sense Key : Medium Error [current] [descriptor]
> Dec 24 18:20:32 barney kernel: [2796835.703325] Descriptor sense data with sense descriptors (in hex):
> Dec 24 18:20:32 barney kernel: [2796835.703327]        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
> Dec 24 18:20:32 barney kernel: [2796835.703335]        e8 e0 5f 86
> Dec 24 18:20:32 barney kernel: [2796835.703339] sd 2:0:0:0: [sdf] Add. Sense: Unrecovered read error - auto reallocate failed
> Dec 24 18:20:32 barney kernel: [2796835.703345] sd 2:0:0:0: [sdf] CDB: Read(10): 28 00 e8 e0 5f 7f 00 00 08 00
> Dec 24 18:20:32 barney kernel: [2796835.703353] end_request: I/O error, dev sdf, sector 3907018630
> Dec 24 18:20:32 barney kernel: [2796835.703366] ata3: EH complete
> Dec 24 18:20:32 barney kernel: [2796835.703383] md/raid1:md101: sdf: unrecoverable I/O read error for block 3907018496
>
> Unfortunately, sdf is the active disk in this case. So I guess my only
> option left is to create a new array and copy as much over as it will let
> me?
If you are lucky that may be an unused area of the filesystem and you may
not lose any data at all. Worst case, you will probably lose a couple of
files; the way to tell which ones you lost is that reading a file that
contains the bad block will give an I/O error.
I don't know if there is a better way, but the process you mentioned is
probably reasonable: build a new array and copy all of the data over, then
add the bad disk back to the new mirror and let it rebuild. Since the
rebuild writes to every sector, it will write to that block and either
correct the error or, if the sector is too far gone, the disk will
relocate it.
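Roughly like this, once the data has been copied off (the new array name
/dev/md102 is just a placeholder for whatever you create):
$ sudo mdadm --zero-superblock /dev/sdf1   # clear the old metadata on the bad disk
$ sudo mdadm /dev/md102 --add /dev/sdf1    # add it to the new mirror; the rebuild rewrites every sector
$ cat /proc/mdstat                         # watch the resync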
Once you have the array rebuilt, make sure you run a check on the new
array either once a month or once a week (man md - see the section on
scrubbing). If you were not doing this before, a sector can slowly go bad
without ever being read, so the disk (if it is not doing its own test
reads) won't know the sector is failing until something like a rebuild
happens, and by then it is too late. If the disk is forced to read the
sector, or is scanning itself, it will find and relocate (or rewrite) it
before it goes completely bad.
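The check is driven through sysfs, something like this (adjust the array
name; some distributions already ship a cron job for it, e.g. the
checkarray script in Debian's mdadm package):
$ echo check | sudo tee /sys/block/md101/md/sync_action   # start a scrub
$ cat /proc/mdstat                                        # progress shows up here
$ cat /sys/block/md101/md/mismatch_cnt                    # mismatches found by the last check
$ # example /etc/cron.d entry for a weekly check, Sundays at 01:00:
$ # 0 1 * * 0 root echo check > /sys/block/md101/md/sync_action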