* Spare disk not becoming active
@ 2012-12-03 0:04 Tudor Holton
2012-12-19 23:19 ` Tudor Holton
2012-12-20 0:03 ` Roger Heflin
0 siblings, 2 replies; 5+ messages in thread
From: Tudor Holton @ 2012-12-03 0:04 UTC (permalink / raw)
To: linux-raid
Hello,
I'm having some trouble with an array that has become degraded.
The array's current state is:
md101 : active raid1 sdf1[0] sdb1[2](S)
1953511936 blocks [2/1] [U_]
mdadm --detail says:
/dev/md101:
Version : 0.90
Creation Time : Thu Jan 13 14:34:27 2011
Raid Level : raid1
Array Size : 1953511936 (1863.01 GiB 2000.40 GB)
Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 101
Persistence : Superblock is persistent
Update Time : Fri Nov 23 03:23:04 2012
State : clean, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
UUID : 43e92a79:90295495:0a76e71e:56c99031 (local to host barney)
Events : 0.2127
Number Major Minor RaidDevice State
0 8 81 0 active sync /dev/sdf1
1 0 0 1 removed
2 8 17 - spare /dev/sdb1
If I attempt to force the spare to become active, it begins to recover:
$ sudo mdadm -S /dev/md101
mdadm: stopped /dev/md101
$ sudo mdadm --assemble --force --no-degraded /dev/md101 /dev/sdf1 /dev/sdb1
mdadm: /dev/md101 has been started with 1 drive (out of 2) and 1 spare.
$ cat /proc/mdstat
md101 : active raid1 sdf1[0] sdb1[2]
1953511936 blocks [2/1] [U_]
[>....................] recovery = 0.0% (541440/1953511936)
finish=420.8min speed=77348K/sec
The recovery runs for the full estimated time, but at the end the disk simply returns to being a spare.
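(For reference, explicitly hot-removing and re-adding the spare should trigger the same rebuild; a sketch using the device names above:)
$ sudo mdadm /dev/md101 --remove /dev/sdb1
$ sudo mdadm /dev/md101 --add /dev/sdb1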
Neither disk partition reports errors:
$ cat /sys/block/md101/md/dev-sdf1/errors
0
$ cat /sys/block/md101/md/dev-sdb1/errors
0
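(For reference, the raw superblock on each member can also be dumped and compared; a stale event count or an unexpected device role can explain a spare that refuses to be promoted. A diagnostic sketch using the same device names:)
$ sudo mdadm --examine /dev/sdf1
$ sudo mdadm --examine /dev/sdb1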
Are there mdadm logs that would show why this is not recovering properly?
How else can I debug this?
Cheers,
Tudor.
* Re: Spare disk not becoming active
2012-12-03 0:04 Spare disk not becoming active Tudor Holton
@ 2012-12-19 23:19 ` Tudor Holton
2012-12-20 0:03 ` Roger Heflin
1 sibling, 0 replies; 5+ messages in thread
From: Tudor Holton @ 2012-12-19 23:19 UTC (permalink / raw)
To: linux-raid
I don't mean to be rude, but it's been two weeks and my system is still in this state. Bump, anyone?
A thorough search of the web (before I originally posted to this list) turned up nothing: no explanation of why this occurs, only reports that it has happened a number of times. Most reports say that completely stopping and reassembling the array fixes it, but I tried that and the disk still returned to spare. A few reports describe my situation, but none with a complete resolution; eventually the discussion turns to wiping the disks and starting again.
That seems a bit drastic, and I'm concerned that *one* of the disks is faulty but not being reported as such, so I don't want to wipe the superblock off the wrong one. mdadm reports no errors, but SMART indicates there may be a problem with the *active* disk, which is even more worrying, because without making the spare active I can't remove the active disk to test it properly.
Any ideas?
Cheers,
Tudor.
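(For reference, the SMART readings mentioned above can be gathered without removing either disk from the array; a quick sketch assuming smartmontools is installed:)
$ sudo smartctl -a /dev/sdf        # full report; watch Reallocated_Sector_Ct and Current_Pending_Sector
$ sudo smartctl -t long /dev/sdf   # start a long self-test; read the result later with smartctl -l selftest /dev/sdf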
* Re: Spare disk not becoming active
2012-12-03 0:04 Spare disk not becoming active Tudor Holton
2012-12-19 23:19 ` Tudor Holton
@ 2012-12-20 0:03 ` Roger Heflin
2012-12-24 7:24 ` Tudor Holton
1 sibling, 1 reply; 5+ messages in thread
From: Roger Heflin @ 2012-12-20 0:03 UTC (permalink / raw)
To: Tudor Holton; +Cc: linux-raid
On Sun, Dec 2, 2012 at 6:04 PM, Tudor Holton <tudor@smartguide.com.au> wrote:
> Are there mdadm logs that would show why this is not recovering properly?
> How else can I debug this?
Did you look in the various /var/log/messages files (the current one and the rotated ones) to see what they say happened around the time the recovery completed?
There is almost certainly something in there indicating what went wrong.
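(A sketch of the kind of search being suggested; log file names vary by distribution, and Debian/Ubuntu systems log kernel messages to /var/log/syslog or /var/log/kern.log instead. Rotated copies may be compressed, in which case zgrep works on them:)
$ grep -iE 'md101|sd[bf]' /var/log/messages*
$ dmesg | grep -iE 'md101|sd[bf]'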
* Re: Spare disk not becoming active
2012-12-20 0:03 ` Roger Heflin
@ 2012-12-24 7:24 ` Tudor Holton
2012-12-24 15:03 ` Roger Heflin
0 siblings, 1 reply; 5+ messages in thread
From: Tudor Holton @ 2012-12-24 7:24 UTC (permalink / raw)
To: Roger Heflin; +Cc: linux-raid
On 20/12/12 11:03, Roger Heflin wrote:
> Did you look in the various /var/log/messages files (the current one and the rotated ones) to see
> what they say happened around the time the recovery completed?
>
> There is almost certainly something in there indicating what went wrong.
Thanks. I watched the log messages during the recovery. During the last 0.1% (at 99.9%) messages like this appeared:
Dec 24 18:20:32 barney kernel: [2796835.703313] sd 2:0:0:0: [sdf] Unhandled sense code
Dec 24 18:20:32 barney kernel: [2796835.703316] sd 2:0:0:0: [sdf] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Dec 24 18:20:32 barney kernel: [2796835.703320] sd 2:0:0:0: [sdf] Sense Key : Medium Error [current] [descriptor]
Dec 24 18:20:32 barney kernel: [2796835.703325] Descriptor sense data with sense descriptors (in hex):
Dec 24 18:20:32 barney kernel: [2796835.703327] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Dec 24 18:20:32 barney kernel: [2796835.703335] e8 e0 5f 86
Dec 24 18:20:32 barney kernel: [2796835.703339] sd 2:0:0:0: [sdf] Add. Sense: Unrecovered read error - auto reallocate failed
Dec 24 18:20:32 barney kernel: [2796835.703345] sd 2:0:0:0: [sdf] CDB: Read(10): 28 00 e8 e0 5f 7f 00 00 08 00
Dec 24 18:20:32 barney kernel: [2796835.703353] end_request: I/O error, dev sdf, sector 3907018630
Dec 24 18:20:32 barney kernel: [2796835.703366] ata3: EH complete
Dec 24 18:20:32 barney kernel: [2796835.703383] md/raid1:md101: sdf: unrecoverable I/O read error for block 3907018496
Unfortunately, sdf is the active disk in this case. So I guess my only option left is to create a new array and copy over as much as it will let me?
Cheers,
Tudor.
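(To confirm the failure independently of md, the sector named in the log can be probed directly with hdparm; reading is harmless, whereas --write-sector destroys data. The pending/reallocated SMART counters are also worth a look:)
$ sudo hdparm --read-sector 3907018630 /dev/sdf
$ sudo smartctl -A /dev/sdf | grep -i -e reallocated -e pending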
* Re: Spare disk not becoming active
2012-12-24 7:24 ` Tudor Holton
@ 2012-12-24 15:03 ` Roger Heflin
0 siblings, 0 replies; 5+ messages in thread
From: Roger Heflin @ 2012-12-24 15:03 UTC (permalink / raw)
To: Tudor Holton; +Cc: linux-raid
On Mon, Dec 24, 2012 at 1:24 AM, Tudor Holton <tudor@smartguide.com.au> wrote:
> Dec 24 18:20:32 barney kernel: [2796835.703353] end_request: I/O error, dev sdf, sector 3907018630
> Dec 24 18:20:32 barney kernel: [2796835.703383] md/raid1:md101: sdf: unrecoverable I/O read error for block 3907018496
>
> Unfortunately, sdf is the active disk in this case. So I guess my only option left is to create a new array and copy over as much as it will let me?
If you are lucky that may be an unused area of the filesystem and you may not lose any data at all. Worst case, you will probably lose a couple of files; the way to tell which ones is that reading a file that contains the bad block will return an I/O error.
I don't know if there is a better way, but the process you mentioned is probably reasonable: build a new array and copy all of the data over, then add the bad disk back into the new mirror and let it rebuild. Since the rebuild writes to every block, it will either rewrite the weak sector in place or force the disk to relocate it if the sector is too far gone.
Once you have the new array built, make sure you run a check on it once a month or once a week (see the section on scrubbing in "man md"). If you were not doing this before, a sector can slowly go bad in an area that is never read, so the disk (unless it is doing its own test reads) won't know the sector is failing until something like a rebuild touches it, and by then it is too late. If the disk is forced to read the sector, or scans itself, it will find and relocate (or rewrite) the sector before it goes completely bad.
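(The scrubbing described above comes down to the md sysfs interface; a sketch for this array, plus a hypothetical cron entry for a monthly check. Debian-based systems typically ship an equivalent monthly checkarray job with the mdadm package.)
$ sudo sh -c 'echo check > /sys/block/md101/md/sync_action'   # start a scrub; progress appears in /proc/mdstat
$ cat /sys/block/md101/md/mismatch_cnt                        # inspect once it finishes
A hypothetical /etc/cron.d/md-scrub entry for a monthly check:
0 3 1 * * root echo check > /sys/block/md101/md/sync_action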