Re: raid disk failure, options?

public inbox for linux-scsi@vger.kernel.org

From: Thomas Fjellstrom <tfjellstrom@shaw.ca>
To: Justin Piszcz <jpiszcz@lucidpixels.com>
Cc: linux-raid@vger.kernel.org, linux-scsi <linux-scsi@vger.kernel.org>
Subject: Re: raid disk failure, options?
Date: Sun, 1 Nov 2009 17:45:47 -0600
Message-ID: <200911011645.47649.tfjellstrom@shaw.ca>
In-Reply-To: <alpine.DEB.2.00.0911011818190.15187@p34.internal.lan>

On Sun November 1 2009, Justin Piszcz wrote:
> On Sun, 1 Nov 2009, Thomas Fjellstrom wrote:
> > My main raid array just had a disk failure. I tried to hot remove the
> > device, and use the scsi bus rescan sysfs entries, but it seems to fail
> > on IDENTIFY.
> >
> > Can I assume my disk is dead?
> >
> >
> > [5015721.851044] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> > [5015721.851089] ata3.00: irq_stat 0x40000001
> > [5015721.851124] ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> > [5015721.851125]          res 71/04:03:80:01:32/00:00:00:00:00/e0 Emask 0x1 (device error)
> > [5015721.851193] ata3.00: status: { DRDY DF ERR }
> > [5015721.851225] ata3.00: error: { ABRT }
> > [5015726.848684] ata3.00: qc timeout (cmd 0xec)
> > [5015726.848729] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x5)
> > [5015726.848763] ata3.00: revalidation failed (errno=-5)
> > [5015726.848798] ata3: hard resetting link
> > [5015734.501527] ata3: softreset failed (device not ready)
> > [5015734.501565] ata3: failed due to HW bug, retry pmp=0
> > [5015734.665530] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > [5015734.707085] ata3.00: both IDENTIFYs aborted, assuming NODEV
> > [5015734.707089] ata3.00: revalidation failed (errno=-2)
> > [5015739.664923] ata3: hard resetting link
> > [5015740.148277] ata3: softreset failed (device not ready)
> > [5015740.148314] ata3: failed due to HW bug, retry pmp=0
> > [5015740.313532] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > [5015740.337129] ata3.00: both IDENTIFYs aborted, assuming NODEV
> > [5015740.337132] ata3.00: revalidation failed (errno=-2)
> > [5015740.337167] ata3.00: disabled
> > [5015740.337231] ata3: EH complete
> > [5015740.337275] sd 2:0:0:0: [sdc] Unhandled error code
> > [5015740.337308] sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
> > [5015740.337372] end_request: I/O error, dev sdc, sector 1250258495
> > [5015740.337410] end_request: I/O error, dev sdc, sector 1250258495
> > [5015740.337445] md: super_written gets error=-5, uptodate=0
> > [5015740.337479] raid5: Disk failure on sdc1, disabling device.
> > [5015740.337480] raid5: Operation continuing on 3 devices.
> > [5015740.337569] sd 2:0:0:0: [sdc] Unhandled error code
> > [5015740.337601] sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
> > [5015740.337665] end_request: I/O error, dev sdc, sector 480014231
> > [5015740.337710] sd 2:0:0:0: [sdc] Unhandled error code
> > [5015740.337742] sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
> > [5015740.337806] end_request: I/O error, dev sdc, sector 1186573399
> > [5015740.337840] sd 2:0:0:0: [sdc] Unhandled error code
> > [5015740.337872] sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
> > [5015740.337936] end_request: I/O error, dev sdc, sector 404014999
> > [5015740.371191] RAID5 conf printout:
> > [5015740.371226]  --- rd:4 wd:3
> > [5015740.371258]  disk 0, o:0, dev:sdc1
> > [5015740.371290]  disk 1, o:1, dev:sda1
> > [5015740.371322]  disk 2, o:1, dev:sdb1
> > [5015740.371353]  disk 3, o:1, dev:sdd1
> > [5015740.393516] RAID5 conf printout:
> > [5015740.393551]  --- rd:4 wd:3
> > [5015740.393583]  disk 1, o:1, dev:sda1
> > [5015740.393615]  disk 2, o:1, dev:sdb1
> > [5015740.393647]  disk 3, o:1, dev:sdd1
> >
> > ran: echo x > /sys/bus/scsi/devices/2\:0\:0\:0/delete
> >
> > [5016224.932073] sd 2:0:0:0: [sdc] Synchronizing SCSI cache
> > [5016224.932150] sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
> > [5016224.932216] sd 2:0:0:0: [sdc] Stopping disk
> > [5016224.933192] sd 2:0:0:0: [sdc] START_STOP FAILED
> > [5016224.933227] sd 2:0:0:0: [sdc] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK,SUGGEST_OK
> >
> > ran: echo "0 0 0" > /sys/class/scsi_host/host2/scan
> >
> > [5016463.173706] ata3: hard resetting link
> > [5016463.657520] ata3: softreset failed (device not ready)
> > [5016463.657557] ata3: failed due to HW bug, retry pmp=0
> > [5016463.821535] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > [5016463.842475] ata3.00: both IDENTIFYs aborted, assuming NODEV
> > [5016463.842492] ata3: EH complete
> >
> > To be honest, I've been expecting this, I just had no idea which drive
> > was going to fail. For the past 6-12 months I've been hearing this
> > rather loud clicking noise coming from that machine, but I could never
> > pin it down, it only happened a couple times a day (and it wasn't heads
> > parking).
> >
> > I'm tempted to try and reboot the machine, to see if the disk comes
> > back. But I'm worried the array might not come back (for whatever
> > reason).
> 
> Looks like it, but show the SMART stats as well; they may show what is
> failing and why. If not, try running a short SMART self-test and see
> what error it reports, or whether it passes.
> 
> smartctl -a /dev/disk
> 
> or smartctl -d (cciss,3ware),drive -a /dev/device
>

smartctl fails; even adding -T permissive doesn't help. At this point the
disk is completely inaccessible to the OS. Also, after I removed the disk
through the sysfs interface it won't come back, so there is no /dev node
for it anymore and I can't run smartctl on it again. (I've lost the earlier
output, which was just smartctl saying it couldn't read from the device.)
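
For anyone following along, the rescan step and the "is there still a
device node" check can be sketched as a small shell script. This is only a
sketch, not what was run above: the function names rescan_all_hosts and
disk_present and the SYSFS_ROOT override are made up for illustration, and
it writes the wildcard form "- - -" (all channels, targets and LUNs) rather
than the explicit "0 0 0" used earlier. Writing to the real scan files
needs root, and a rescan cannot bring back a drive that no longer answers
IDENTIFY.

```shell
#!/bin/sh
# Sketch: ask every SCSI/libata host to rescan, then check for a disk.
# SYSFS_ROOT is a hypothetical override so the functions can be tried
# against a scratch directory instead of the real /sys.

rescan_all_hosts() {
    root="${SYSFS_ROOT:-/sys}"
    for scan in "$root"/class/scsi_host/host*/scan; do
        [ -e "$scan" ] || continue   # glob matched nothing
        # "- - -" = wildcard channel, target and LUN
        echo "- - -" > "$scan"
    done
}

disk_present() {
    # $1 is a kernel disk name such as sdc
    [ -e "${SYSFS_ROOT:-/sys}/block/$1" ]
}

# Example (as root):
#   rescan_all_hosts
#   disk_present sdc && echo "sdc is back" || echo "sdc still missing"
```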

> Justin.
> 
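
On the reboot worry: before restarting it may help to record which arrays
are running degraded. The following is a minimal sketch that reads the
[total/active] counters md prints in /proc/mdstat (for example [4/3] for a
four-disk array with three working members). The function name
degraded_arrays and the MDSTAT override are hypothetical, added only so the
parsing can be tried on a saved copy of the file.

```shell
#!/bin/sh
# Sketch: list md arrays whose [total/active] counters differ.
# MDSTAT is a hypothetical override; the default is the live /proc/mdstat.

degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember the current array
        match($0, /\[[0-9]+\/[0-9]+\]/) {      # counter such as [4/3]
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[1] != n[2])
                print name, n[2] "/" n[1]      # name active/total
        }
    ' "${MDSTAT:-/proc/mdstat}"
}

# Example:
#   degraded_arrays
```

Each output line names one degraded array followed by its active and total
member counts; no output means every array reports all members present.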


-- 
Thomas Fjellstrom
tfjellstrom@shaw.ca

