* Help with degraded array
From: Alex @ 2013-12-13 22:11 UTC (permalink / raw)
To: Linux RAID
Hi,
I have a RAID1 array that has entered a degraded state, and the disk
appears to have changed from sda to sdc during this process. I'm not
sure how this affects the array, and could really use some help to
keep me from screwing it up.
The system is an fc15 box with just one array, md0.
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb2[1]
148478904 blocks super 1.2 [2/1] [_U]
bitmap: 1/2 pages [4KB], 65536KB chunk
# mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Wed Mar 21 12:31:23 2012
Raid Level : raid1
Array Size : 148478904 (141.60 GiB 152.04 GB)
Used Dev Size : 148478904 (141.60 GiB 152.04 GB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri Dec 13 17:09:19 2013
State : active, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Name : hqmailrelay:0
UUID : 99acf2a0:afa1266c:b870423d:f06e4009
Events : 11504
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 18 1 active sync /dev/sdb2
# fdisk -l /dev/sda /dev/sdb /dev/sdc
Disk /dev/sdb: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders, total 312581808 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000e7781
Device Boot Start End Blocks Id System
/dev/sdb1 2048 6143 2048 83 Linux
/dev/sdb2 6144 296966143 148480000 fd Linux raid autodetect
/dev/sdb3 296966144 299014143 1024000 83 Linux
/dev/sdb4 299014144 312580095 6782976 82 Linux swap / Solaris
Disk /dev/sdc: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders, total 312581808 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c0feb
Device Boot Start End Blocks Id System
/dev/sdc1 2048 6143 2048 83 Linux
/dev/sdc2 * 6144 2054143 1024000 83 Linux
/dev/sdc3 2054144 299014143 148480000 fd Linux raid autodetect
/dev/sdc4 299014144 312580095 6782976 82 Linux swap / Solaris
Here also is the relevant info from dmesg:
ata1: exception Emask 0x50 SAct 0x0 SErr 0x40d0800 action 0xe frozen
ata1: irq_stat 0x00400040, connection status changed
ata1: SError: { HostInt PHYRdyChg CommWake 10B8B DevExch }
ata1: hard resetting link
ata1: SATA link down (SStatus 0 SControl 300)
ata1: hard resetting link
ata1: SATA link down (SStatus 0 SControl 300)
ata1: limiting SATA link speed to 1.5 Gbps
ata1: hard resetting link
ata1: SATA link down (SStatus 0 SControl 310)
ata1.00: disabled
ata1: EH complete
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: [sda] killing request
sd 0:0:0:0: rejecting I/O to offline device
md: super_written gets error=-5, uptodate=0
md/raid1:md0: Disk failure on sda3, disabling device.
md/raid1:md0: Operation continuing on 1 devices.
ata1.00: detaching (SCSI 0:0:0:0)
sd 0:0:0:0: [sda] Synchronizing SCSI cache
sd 0:0:0:0: [sda]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:0:0:0: [sda] Stopping disk
sd 0:0:0:0: [sda] START_STOP FAILED
sd 0:0:0:0: [sda]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Buffer I/O error on device sda2, logical block 65536
lost page write due to I/O error on sda2
JBD2: Error -5 detected when updating journal superblock for sda2-8.
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:1, o:0, dev:sda3
disk 1, wo:0, o:1, dev:sdb2
RAID1 conf printout:
--- wd:1 rd:2
disk 1, wo:0, o:1, dev:sdb2
md: unbind<sda3>
md: export_rdev(sda3)
ata1: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
ata1: irq_stat 0x00000040, connection status changed
ata1: SError: { CommWake DevExch }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-8: ST3160316AS, JC4B, max UDMA/133
ata1.00: 312581808 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata1.00: configured for UDMA/133
ata1: EH complete
scsi 0:0:0:0: Direct-Access ATA ST3160316AS JC4B PQ: 0 ANSI: 5
sd 0:0:0:0: Attached scsi generic sg0 type 0
sd 0:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 GB/149 GiB)
sd 0:0:0:0: [sdc] Write Protect is off
sd 0:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA
sdc: sdc1 sdc2 sdc3 sdc4
sd 0:0:0:0: [sdc] Attached SCSI disk
I'd really appreciate any guidance you could provide.
Thanks,
Alex
* Re: Help with degraded array
From: David C. Rankin @ 2013-12-14 10:38 UTC (permalink / raw)
To: mdraid
On 12/13/2013 04:11 PM, Alex wrote:
> I'd really appreciate any guidance you could provide.
>
> Thanks,
> Alex
It simply looks like the drive designation changed /dev/sda -> /dev/sdc and
mdadm kicked sda out of the array because it wasn't there. Are you not using
UUID designation to assemble the arrays? What does your /etc/mdadm.conf look like?
It looks like you could add /dev/sdc3 back into the array, let it rebuild if
necessary, then use:
mdadm --detail --scan >> /etc/mdadm.conf
to create a new UUID-based mdadm.conf that would prevent this from happening again.
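As a rough sketch, that recovery sequence might look like the following (device
names are assumed from the fdisk output above -- verify with --examine before
running anything, since the designation may change again at reboot):

```shell
# Confirm the partition really carries the array's superblock (UUID should
# match the one reported by mdadm -D /dev/md0)
mdadm --examine /dev/sdc3

# Re-add the member partition to the degraded array and watch the rebuild
mdadm --manage /dev/md0 --add /dev/sdc3
watch cat /proc/mdstat

# Once the rebuild completes, append a UUID-based ARRAY line
mdadm --detail --scan >> /etc/mdadm.conf
```

Because the ARRAY line is identified by UUID rather than device name, a future
sda/sdc swap would no longer prevent assembly.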
--
David C. Rankin, J.D.,P.E.
* Re: Help with degraded array
From: Alex @ 2013-12-14 17:40 UTC (permalink / raw)
To: David C. Rankin, Linux RAID
Hi,
> It simply looks like the drive designation changed /dev/sda -> /dev/sdc and
> mdadm kicked sda out of the array because it wasn't there. Are you not using
> UUID designation to assemble the arrays? What does your /etc/mdadm.conf look like?
>
> It looks like you could add /dev/sdc3 back into the array, let it rebuild if
> necessary, then use:
>
> mdadm --detail --scan >> /etc/mdadm.conf
>
> To create a new UUID based mdadm.conf that would prevent this from happening again.
# cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 level=raid1 num-devices=2
UUID=99acf2a0:afa1266c:b870423d:f06e4009
So I would then use "mdadm --add /dev/md0 /dev/sdc3" to add it,
correct? It's not necessary to first fail the device?
Thanks again,
Alex
* Re: Help with degraded array
From: David C. Rankin @ 2013-12-14 18:25 UTC (permalink / raw)
To: mdraid
On 12/14/2013 11:40 AM, Alex wrote:
> # cat /etc/mdadm.conf
>
> # mdadm.conf written out by anaconda
> MAILADDR root
> AUTO +imsm +1.x -all
> ARRAY /dev/md0 level=raid1 num-devices=2
> UUID=99acf2a0:afa1266c:b870423d:f06e4009
>
> So I would then use "mdadm --add /dev/md0 /dev/sdc3" to add it,
> correct? It's not necessary to first fail the device?
From your earlier post:
# mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Wed Mar 21 12:31:23 2012
Raid Level : raid1
Array Size : 148478904 (141.60 GiB 152.04 GB)
Used Dev Size : 148478904 (141.60 GiB 152.04 GB)
Raid Devices : 2
Total Devices : 1
^^^^^^^^^^^^^^^^^^^^^
Persistence : Superblock is persistent
<snip>
Number Major Minor RaidDevice State
0 0 0 0 removed
^^^^^^^^^^^^^^^^^^^^^^
1 8 18 1 active sync /dev/sdb2
The drive at /dev/sda3 has already been failed/removed. The reason was:
Buffer I/O error on device sda2, logical block 65536
lost page write due to I/O error on sda2
JBD2: Error -5 detected when updating journal superblock for sda2-8.
What I would do is run smartctl -t short /dev/sdc, then smartctl -a /dev/sdc,
and make sure the drive is reported as PASSED and not in imminent failure. Then
run fsck on /dev/sdc2 (the partition reporting the I/O error on what was sda2 --
DO NOT fsck /dev/sdc3).
Then I would reboot to see whether the drive designation reverts to sda and to
give mdadm a chance to reassemble the array automatically. If that did not work,
I would check the drive designation (sda/sdc) and try the "--add" as you
specified; if that failed, use "--add --force".
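The checks above might be sketched as follows (assuming the drive is currently
/dev/sdc -- adjust to whatever designation the box shows at the time):

```shell
# Kick off a short SMART self-test; it runs in the background on the drive,
# typically finishing within a couple of minutes
smartctl -t short /dev/sdc
sleep 180

# Read back the overall verdict and the attributes that most often signal
# a failing disk
smartctl -a /dev/sdc | grep -E 'overall-health|Reallocated|Current_Pending'

# Check only the filesystem that reported the journal error; it must be
# unmounted first. Do NOT touch the raid member partition.
umount /dev/sdc2
fsck -f /dev/sdc2
```

A PASSED verdict with zero reallocated/pending sectors would point at a
transient link problem (cable, controller) rather than the disk itself.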
Good luck.
--
David C. Rankin, J.D.,P.E.
* Re: Help with degraded array
From: Alex @ 2013-12-15 0:34 UTC (permalink / raw)
To: David C. Rankin, Linux RAID
Hi,
> What I would do is run smartctl -t short /dev/sdc, then smartctl -a /dev/sdc
> and make sure the drive was reported as PASSED and not in imminent failure. Then
> run fsck on /dev/sdc2 (the partition reporting the I/O error on what was sda2 --
> DO NOT fsck /dev/sdc3.
I ran a short (which succeeded) then a long (which also succeeded)
test on /dev/sdc.
I realized that /dev/sdc2 is a non-raid partition mounted as /boot, so
it can't be checked with fsck while mounted.
This box is in a remote datacenter. Do you still think it's okay to reboot it remotely?
Thanks,
Alex
* Re: Help with degraded array
From: Ken Drummond @ 2013-12-15 10:14 UTC (permalink / raw)
To: Linux RAID
On 15/12/2013 10:34 AM, Alex wrote:
> Hi,
>
>> What I would do is run smartctl -t short /dev/sdc, then smartctl -a /dev/sdc
>> and make sure the drive was reported as PASSED and not in imminent failure. Then
>> run fsck on /dev/sdc2 (the partition reporting the I/O error on what was sda2 --
>> DO NOT fsck /dev/sdc3.
> I ran a short (which succeeded) then a long (which also succeeded)
> test on /dev/sdc.
>
> I realized that /dev/sdc2 is a non-raid partition mounted as /boot, so
> it can't be checked.
>
> This box is in a remote datacenter. Still think it's okay to boot remotely?
>
> Thanks,
> Alex
You could also try running a non-destructive badblocks check on the
partition ("badblocks -nsv /dev/sdc3") to see if it fails again; it could
be the cable rather than the disk that is at fault. I would expect that
a reboot would set the drive back to sda.
* Re: Help with degraded array
From: David C. Rankin @ 2013-12-16 23:21 UTC (permalink / raw)
To: mdraid
On 12/14/2013 06:34 PM, Alex wrote:
> I ran a short (which succeeded) then a long (which also succeeded)
> test on /dev/sdc.
>
> I realized that /dev/sdc2 is a non-raid partition mounted as /boot, so
> it can't be checked.
>
> This box is in a remote datacenter. Still think it's okay to boot remotely?
>
> Thanks,
> Alex
Alex,
I think the drive is fine. It looks like a momentary hiccup (bad
block/cylinder, etc.). If it tested OK with smartctl, then there is no reason to
think it is dying/dead.
You can always just stop the array and then attempt to start (--assemble) it
before you reboot. That is probably a conservative approach to take before
rebooting. Of course things can always go haywire and fail, but I have boxes I
manage remotely and routinely reboot without issue. It obviously booted fine
when the reassignment of sda->sdc occurred.
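That stop/assemble test might look like this (a sketch only -- it assumes
nothing on /dev/md0 is mounted; --stop will refuse if the array is in use, so
it is not applicable when the root filesystem lives on the array):

```shell
# Stop the array; this fails safely if any filesystem on it is still mounted
mdadm --stop /dev/md0

# Reassemble from the superblocks, using the ARRAY lines in /etc/mdadm.conf
mdadm --assemble --scan

# Confirm the array came back and check its state
cat /proc/mdstat
```

If assembly succeeds here, a full reboot is much less likely to leave the box
unreachable.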
Good luck.
--
David C. Rankin, J.D.,P.E.
* Re: Help with degraded array
From: Alex @ 2013-12-18 16:49 UTC (permalink / raw)
To: David C. Rankin, Linux RAID
Hi,
> The drive at /dev/sda3 has already been failed/removed. The reason was:
>
> Buffer I/O error on device sda2, logical block 65536
> lost page write due to I/O error on sda2
> JBD2: Error -5 detected when updating journal superblock for sda2-8.
>
> What I would do is run smartctl -t short /dev/sdc, then smartctl -a /dev/sdc
> and make sure the drive was reported as PASSED and not in imminent failure. Then
> run fsck on /dev/sdc2 (the partition reporting the I/O error on what was sda2 --
> DO NOT fsck /dev/sdc3.
Okay, ran a non-destructive badblocks after the smartctl test and it
passed. Rebooted and it switched back to sda from sdc. Ran "mdadm
--add /dev/md0 /dev/sda3" and it's now rebuilding.
I've done all this before, but somehow got a little nervous with it
being 3000 miles away and at a customer's location :-)
Thanks so much, everyone, for all your help.
Regards,
Alex