* Help with degraded array
From: Alex @ 2013-12-13 22:11 UTC (permalink / raw)
To: Linux RAID
Hi,
I have a RAID1 array that has entered a degraded state, and the disk
appears to have changed from sda to sdc during this process. I'm not
sure how this affects the array, and could really use some help to
keep me from screwing it up.
The system is an fc15 box with just one array, md0.
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb2[1]
148478904 blocks super 1.2 [2/1] [_U]
bitmap: 1/2 pages [4KB], 65536KB chunk
# mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Wed Mar 21 12:31:23 2012
Raid Level : raid1
Array Size : 148478904 (141.60 GiB 152.04 GB)
Used Dev Size : 148478904 (141.60 GiB 152.04 GB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri Dec 13 17:09:19 2013
State : active, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Name : hqmailrelay:0
UUID : 99acf2a0:afa1266c:b870423d:f06e4009
Events : 11504
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 18 1 active sync /dev/sdb2
# fdisk -l /dev/sda /dev/sdb /dev/sdc
Disk /dev/sdb: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders, total 312581808 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000e7781
Device Boot Start End Blocks Id System
/dev/sdb1 2048 6143 2048 83 Linux
/dev/sdb2 6144 296966143 148480000 fd Linux raid autodetect
/dev/sdb3 296966144 299014143 1024000 83 Linux
/dev/sdb4 299014144 312580095 6782976 82 Linux swap / Solaris
Disk /dev/sdc: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders, total 312581808 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c0feb
Device Boot Start End Blocks Id System
/dev/sdc1 2048 6143 2048 83 Linux
/dev/sdc2 * 6144 2054143 1024000 83 Linux
/dev/sdc3 2054144 299014143 148480000 fd Linux raid autodetect
/dev/sdc4 299014144 312580095 6782976 82 Linux swap / Solaris
Here also is the relevant info from dmesg:
ata1: exception Emask 0x50 SAct 0x0 SErr 0x40d0800 action 0xe frozen
ata1: irq_stat 0x00400040, connection status changed
ata1: SError: { HostInt PHYRdyChg CommWake 10B8B DevExch }
ata1: hard resetting link
ata1: SATA link down (SStatus 0 SControl 300)
ata1: hard resetting link
ata1: SATA link down (SStatus 0 SControl 300)
ata1: limiting SATA link speed to 1.5 Gbps
ata1: hard resetting link
ata1: SATA link down (SStatus 0 SControl 310)
ata1.00: disabled
ata1: EH complete
sd 0:0:0:0: rejecting I/O to offline device
sd 0:0:0:0: [sda] killing request
sd 0:0:0:0: rejecting I/O to offline device
md: super_written gets error=-5, uptodate=0
md/raid1:md0: Disk failure on sda3, disabling device.
md/raid1:md0: Operation continuing on 1 devices.
ata1.00: detaching (SCSI 0:0:0:0)
sd 0:0:0:0: [sda] Synchronizing SCSI cache
sd 0:0:0:0: [sda]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 0:0:0:0: [sda] Stopping disk
sd 0:0:0:0: [sda] START_STOP FAILED
sd 0:0:0:0: [sda]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Buffer I/O error on device sda2, logical block 65536
lost page write due to I/O error on sda2
JBD2: Error -5 detected when updating journal superblock for sda2-8.
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:1, o:0, dev:sda3
disk 1, wo:0, o:1, dev:sdb2
RAID1 conf printout:
--- wd:1 rd:2
disk 1, wo:0, o:1, dev:sdb2
md: unbind<sda3>
md: export_rdev(sda3)
ata1: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
ata1: irq_stat 0x00000040, connection status changed
ata1: SError: { CommWake DevExch }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-8: ST3160316AS, JC4B, max UDMA/133
ata1.00: 312581808 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata1.00: configured for UDMA/133
ata1: EH complete
scsi 0:0:0:0: Direct-Access ATA ST3160316AS JC4B PQ: 0 ANSI: 5
sd 0:0:0:0: Attached scsi generic sg0 type 0
sd 0:0:0:0: [sdc] 312581808 512-byte logical blocks: (160 GB/149 GiB)
sd 0:0:0:0: [sdc] Write Protect is off
sd 0:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't
support DPO or FUA
sdc: sdc1 sdc2 sdc3 sdc4
sd 0:0:0:0: [sdc] Attached SCSI disk
I'd really appreciate any guidance you could provide.
Thanks,
Alex
* Re: Help with degraded array
From: David C. Rankin @ 2013-12-14 10:38 UTC (permalink / raw)
To: mdraid
On 12/13/2013 04:11 PM, Alex wrote:
> I'd really appreciate any guidance you could provide.
>
> Thanks,
> Alex
It simply looks like the drive designation changed /dev/sda -> /dev/sdc and
mdadm kicked sda out of the array because it wasn't there. Are you not using
UUID designation to assemble the arrays? What does your /etc/mdadm.conf look like?
It looks like you could add /dev/sdc3 back into the array, let it rebuild if
necessary, then use:
mdadm --detail --scan >> /etc/mdadm.conf
to create a new UUID-based mdadm.conf that would prevent this from happening again.
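As a rough sketch, that recovery sequence might look like the following (device
names are assumed from the fdisk output above -- verify with --examine before
running anything, since the designation may change again at reboot):

```shell
# Confirm the partition really carries the array's superblock (UUID should
# match the one reported by mdadm -D /dev/md0)
mdadm --examine /dev/sdc3

# Re-add the member partition to the degraded array and watch the rebuild
mdadm --manage /dev/md0 --add /dev/sdc3
watch cat /proc/mdstat

# Once the rebuild completes, append a UUID-based ARRAY line
mdadm --detail --scan >> /etc/mdadm.conf
```

Because the ARRAY line is identified by UUID rather than device name, a future
sda/sdc swap would no longer prevent assembly.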
--
David C. Rankin, J.D.,P.E.
* Re: Help with degraded array
From: Alex @ 2013-12-14 17:40 UTC (permalink / raw)
To: David C. Rankin, Linux RAID
Hi,
> It simply looks like the drive designation changed /dev/sda -> /dev/sdc and
> mdadm kicked sda out of the array because it wasn't there. Are you not using
> UUID designation to assemble the arrays? What does your /etc/mdadm.conf look like?
>
> It looks like you could add /dev/sdc3 back into the array, let it rebuild if
> necessary, then use:
>
> mdadm --detail --scan >> /etc/mdadm.conf
>
> To create a new UUID based mdadm.conf that would prevent this from happening again.
# cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 level=raid1 num-devices=2
UUID=99acf2a0:afa1266c:b870423d:f06e4009
So I would then use "mdadm --add /dev/md0 /dev/sdc3" to add it,
correct? It's not necessary to first fail the device?
Thanks again,
Alex
* Re: Help with degraded array
From: David C. Rankin @ 2013-12-14 18:25 UTC (permalink / raw)
To: mdraid
On 12/14/2013 11:40 AM, Alex wrote:
> # cat /etc/mdadm.conf
>
> # mdadm.conf written out by anaconda
> MAILADDR root
> AUTO +imsm +1.x -all
> ARRAY /dev/md0 level=raid1 num-devices=2
> UUID=99acf2a0:afa1266c:b870423d:f06e4009
>
> So I would then use "mdadm --add /dev/md0 /dev/sdc3" to add it,
> correct? It's not necessary to first fail the device?
From your earlier post:
# mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Wed Mar 21 12:31:23 2012
Raid Level : raid1
Array Size : 148478904 (141.60 GiB 152.04 GB)
Used Dev Size : 148478904 (141.60 GiB 152.04 GB)
Raid Devices : 2
Total Devices : 1
^^^^^^^^^^^^^^^^^^^^^
Persistence : Superblock is persistent
<snip>
Number Major Minor RaidDevice State
0 0 0 0 removed
^^^^^^^^^^^^^^^^^^^^^^
1 8 18 1 active sync /dev/sdb2
The drive at /dev/sda3 has already been failed/removed. The reason was:
Buffer I/O error on device sda2, logical block 65536
lost page write due to I/O error on sda2
JBD2: Error -5 detected when updating journal superblock for sda2-8.
What I would do is run smartctl -t short /dev/sdc, then smartctl -a /dev/sdc,
and make sure the drive is reported as PASSED and not in imminent failure. Then
run fsck on /dev/sdc2 (the partition reporting the I/O error on what was sda2 --
DO NOT fsck /dev/sdc3).
Then I would reboot to see whether the drive designation reverts to sda and to
give mdadm a chance to reassemble the array automatically. If that did not work,
I would check the drive designation (sda/sdc) and try the "--add" as you
specified; if that failed, use "--add --force".
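The checks above might be sketched as follows (assuming the drive is currently
/dev/sdc -- adjust to whatever designation the box shows at the time):

```shell
# Kick off a short SMART self-test; it runs in the background on the drive,
# typically finishing within a couple of minutes
smartctl -t short /dev/sdc
sleep 180

# Read back the overall verdict and the attributes that most often signal
# a failing disk
smartctl -a /dev/sdc | grep -E 'overall-health|Reallocated|Current_Pending'

# Check only the filesystem that reported the journal error; it must be
# unmounted first. Do NOT touch the raid member partition.
umount /dev/sdc2
fsck -f /dev/sdc2
```

A PASSED verdict with zero reallocated/pending sectors would point at a
transient link problem (cable, controller) rather than the disk itself.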
Good luck.
--
David C. Rankin, J.D.,P.E.
* Re: Help with degraded array
From: Alex @ 2013-12-15 0:34 UTC (permalink / raw)
To: David C. Rankin, Linux RAID
Hi,
> What I would do is run smartctl -t short /dev/sdc, then smartctl -a /dev/sdc
> and make sure the drive was reported as PASSED and not in imminent failure. Then
> run fsck on /dev/sdc2 (the partition reporting the I/O error on what was sda2 --
> DO NOT fsck /dev/sdc3.
I ran a short (which succeeded) then a long (which also succeeded)
test on /dev/sdc.
I realized that /dev/sdc2 is a non-raid partition mounted as /boot, so
it can't be checked with fsck while mounted.
This box is in a remote datacenter. Do you still think it's okay to reboot it remotely?
Thanks,
Alex
* Re: Help with degraded array
From: Ken Drummond @ 2013-12-15 10:14 UTC (permalink / raw)
To: Linux RAID
On 15/12/2013 10:34 AM, Alex wrote:
> Hi,
>
>> What I would do is run smartctl -t short /dev/sdc, then smartctl -a /dev/sdc
>> and make sure the drive was reported as PASSED and not in imminent failure. Then
>> run fsck on /dev/sdc2 (the partition reporting the I/O error on what was sda2 --
>> DO NOT fsck /dev/sdc3.
> I ran a short (which succeeded) then a long (which also succeeded)
> test on /dev/sdc.
>
> I realized that /dev/sdc2 is a non-raid partition mounted as /boot, so
> it can't be checked.
>
> This box is in a remote datacenter. Still think it's okay to boot remotely?
>
> Thanks,
> Alex
You could also try running a non-destructive badblocks check on the
partition ("badblocks -nsv /dev/sdc3") to see if it fails again; it could
be the cable rather than the disk that is at fault. I would expect that
a reboot would set the drive back to sda.
* Re: Help with degraded array
From: David C. Rankin @ 2013-12-16 23:21 UTC (permalink / raw)
To: mdraid
On 12/14/2013 06:34 PM, Alex wrote:
> I ran a short (which succeeded) then a long (which also succeeded)
> test on /dev/sdc.
>
> I realized that /dev/sdc2 is a non-raid partition mounted as /boot, so
> it can't be checked.
>
> This box is in a remote datacenter. Still think it's okay to boot remotely?
>
> Thanks,
> Alex
Alex,
I think the drive is fine. It looks like a momentary hiccup (bad
block/cylinder, etc.). If it tested OK with smartctl, then there is no reason to
think it is dying/dead.
You can always just stop the array and then attempt to start (--assemble) it
before you reboot. That is probably a conservative approach to take before
rebooting. Of course things can always go haywire and fail, but I have boxes I
manage remotely and routinely reboot without issue. It obviously booted fine
when the reassignment of sda->sdc occurred.
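That stop/assemble test might look like this (a sketch only -- it assumes
nothing on /dev/md0 is mounted; --stop will refuse if the array is in use, so
it is not applicable when the root filesystem lives on the array):

```shell
# Stop the array; this fails safely if any filesystem on it is still mounted
mdadm --stop /dev/md0

# Reassemble from the superblocks, using the ARRAY lines in /etc/mdadm.conf
mdadm --assemble --scan

# Confirm the array came back and check its state
cat /proc/mdstat
```

If assembly succeeds here, a full reboot is much less likely to leave the box
unreachable.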
Good luck.
--
David C. Rankin, J.D.,P.E.
* Re: Help with degraded array
From: Alex @ 2013-12-18 16:49 UTC (permalink / raw)
To: David C. Rankin, Linux RAID
Hi,
> The drive at /dev/sda3 has already been failed/removed. The reason was:
>
> Buffer I/O error on device sda2, logical block 65536
> lost page write due to I/O error on sda2
> JBD2: Error -5 detected when updating journal superblock for sda2-8.
>
> What I would do is run smartctl -t short /dev/sdc, then smartctl -a /dev/sdc
> and make sure the drive was reported as PASSED and not in imminent failure. Then
> run fsck on /dev/sdc2 (the partition reporting the I/O error on what was sda2 --
> DO NOT fsck /dev/sdc3.
Okay, ran a non-destructive badblocks after the smartctl test and it
passed. Rebooted and it switched back to sda from sdc. Ran "mdadm
--add /dev/md0 /dev/sda3" and it's now rebuilding.
I've done all this before, but somehow got a little nervous with it
being 3000 miles away and at a customer's location :-)
Thanks so much, everyone, for all your help.
Regards,
Alex