From: Carsten Aulbert <Carsten.Aulbert@aei.mpg.de>
To: Linux RAID <linux-raid@vger.kernel.org>
Subject: Recovering from two almost simultaneously failed devices in RAID1
Date: Sat, 10 Aug 2013 18:29:46 +0200
Message-ID: <52066A7A.5050007@aei.mpg.de>
Hi there
I fear one of our mainboards did not play nicely with our SSDs in RAID1
configuration:
mdadm --detail /dev/md2
/dev/md2:
Version : 1.2
Creation Time : Fri Jul 27 11:58:50 2012
Raid Level : raid1
Array Size : 250050533 (238.47 GiB 256.05 GB)
Used Dev Size : 250050533 (238.47 GiB 256.05 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Sat Aug 10 14:58:30 2013
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
    Number   Major   Minor   RaidDevice   State
       0       8      49        0         active sync    /dev/sdd1
       1       0       0        1         removed
       1       8      33        -         faulty spare   /dev/sdc1
It seems both drives ran into problems at around the same time: sdc was
taken offline first, but then sdd also reported errors (see the log at
the end of this email).
The ext4 filesystem on top of the array of course had no way of coping
with this, other than remounting read-only.
The big questions of course are:
(a) how to retrieve as much data as possible from the disks
(b) how to revive the RAID array again
Any thoughts on what I should try first?
To tackle (a), I think I'll start with ddrescue, so that I have a copy
to fall back on in case I make a mistake later on.
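A minimal sketch of that ddrescue step, assuming GNU ddrescue is
installed and a separate, healthy disk with enough free space is mounted
at /mnt/rescue. That path, the image and map file names, and the DRY_RUN
switch are all hypothetical choices for illustration. By default the
script only prints the commands it would run:

```shell
#!/bin/sh
# Sketch only: image both RAID1 members with GNU ddrescue before touching
# the array. The device names come from the mdadm output above; DEST is an
# assumed mount point on a different, healthy disk.
DEST="${DEST:-/mnt/rescue}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

rescue() {
    src="$1"
    name="$(basename "$src")"
    # Pass 1: copy everything that reads cleanly, skip scraping bad areas (-n).
    run ddrescue -n "$src" "$DEST/$name.img" "$DEST/$name.map"
    # Pass 2: return to the bad areas with direct access (-d), 3 retry passes.
    run ddrescue -d -r3 "$src" "$DEST/$name.img" "$DEST/$name.map"
}

# sdd1 first: md kept it active after sdc1 had already been failed.
for dev in /dev/sdd1 /dev/sdc1; do
    rescue "$dev"
done
```

Because both passes share the same map file, the second pass (or a rerun
after an interruption) picks up where the first one stopped instead of
rereading the whole device. Any later experiments could then be made on
copies of the images rather than on the SSDs themselves.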
Cheers
Carsten
Here's the start of the log:
Aug 10 14:57:30 gitmaster kernel: [10731321.352291] ata3.00: exception
Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Aug 10 14:57:30 gitmaster kernel: [10731321.352350] ata3.00: failed
command: WRITE FPDMA QUEUED
Aug 10 14:57:30 gitmaster kernel: [10731321.352380] ata3.00: cmd
61/02:00:47:00:00/00:00:00:00:00/40 tag 0 ncq 1024 out
Aug 10 14:57:30 gitmaster kernel: [10731321.352380] res
40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 10 14:57:30 gitmaster kernel: [10731321.352469] ata3.00: status: {
DRDY }
Aug 10 14:57:30 gitmaster kernel: [10731321.352495] ata3: hard resetting
link
Aug 10 14:57:30 gitmaster kernel: [10731321.352528] ata4.00: exception
Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Aug 10 14:57:30 gitmaster kernel: [10731321.352574] ata4.00: failed
command: WRITE FPDMA QUEUED
Aug 10 14:57:30 gitmaster kernel: [10731321.352604] ata4.00: cmd
61/02:00:47:00:00/00:00:00:00:00/40 tag 0 ncq 1024 out
Aug 10 14:57:30 gitmaster kernel: [10731321.352605] res
40/00:00:47:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Aug 10 14:57:30 gitmaster kernel: [10731321.352695] ata4.00: status: {
DRDY }
Aug 10 14:57:30 gitmaster kernel: [10731321.352721] ata4: hard resetting
link
Aug 10 14:57:35 gitmaster kernel: [10731326.709171] ata3: link is slow
to respond, please be patient (ready=0)
Aug 10 14:57:35 gitmaster kernel: [10731326.721137] ata4: link is slow
to respond, please be patient (ready=0)
Aug 10 14:57:40 gitmaster kernel: [10731331.354487] ata3: COMRESET
failed (errno=-16)
Aug 10 14:57:40 gitmaster kernel: [10731331.354518] ata3: hard resetting
link
Aug 10 14:57:40 gitmaster kernel: [10731331.370448] ata4: COMRESET
failed (errno=-16)
Aug 10 14:57:40 gitmaster kernel: [10731331.370480] ata4: hard resetting
link
Aug 10 14:57:45 gitmaster kernel: [10731336.715383] ata3: link is slow
to respond, please be patient (ready=0)
Aug 10 14:57:45 gitmaster kernel: [10731336.735346] ata4: link is slow
to respond, please be patient (ready=0)
Aug 10 14:57:50 gitmaster kernel: [10731341.360692] ata3: COMRESET
failed (errno=-16)
Aug 10 14:57:50 gitmaster kernel: [10731341.360723] ata3: hard resetting
link
Aug 10 14:57:50 gitmaster kernel: [10731341.388654] ata4: COMRESET
failed (errno=-16)
Aug 10 14:57:50 gitmaster kernel: [10731341.388686] ata4: hard resetting
link
Aug 10 14:57:55 gitmaster kernel: [10731346.721587] ata3: link is slow
to respond, please be patient (ready=0)
Aug 10 14:57:55 gitmaster kernel: [10731346.749571] ata4: link is slow
to respond, please be patient (ready=0)
Aug 10 14:58:01 gitmaster /USR/SBIN/CRON[10885]: (root) CMD (cd
/srv/gitorious && rake ultrasphinx:index RAILS_ENV=production >
/dev/null 2>&1)
Aug 10 14:58:25 gitmaster kernel: [10731376.344429] ata3: COMRESET
failed (errno=-16)
Aug 10 14:58:25 gitmaster kernel: [10731376.344464] ata3: limiting SATA
link speed to 1.5 Gbps
Aug 10 14:58:25 gitmaster kernel: [10731376.344497] ata3: hard resetting
link
Aug 10 14:58:25 gitmaster kernel: [10731376.424371] ata4: COMRESET
failed (errno=-16)
Aug 10 14:58:25 gitmaster kernel: [10731376.424403] ata4: limiting SATA
link speed to 1.5 Gbps
Aug 10 14:58:25 gitmaster kernel: [10731376.424436] ata4: hard resetting
link
Aug 10 14:58:30 gitmaster kernel: [10731381.365521] ata3: COMRESET
failed (errno=-16)
Aug 10 14:58:30 gitmaster kernel: [10731381.365554] ata3: reset failed,
giving up
Aug 10 14:58:30 gitmaster kernel: [10731381.365585] ata3.00: disabled
Aug 10 14:58:30 gitmaster kernel: [10731381.365610] ata3.00: device
reported invalid CHS sector 0
Aug 10 14:58:30 gitmaster kernel: [10731381.365643] ata3: EH complete
Aug 10 14:58:30 gitmaster kernel: [10731381.365675] sd 2:0:0:0: [sdc]
Unhandled error code
Aug 10 14:58:30 gitmaster kernel: [10731381.365701] sd 2:0:0:0: [sdc]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 10 14:58:30 gitmaster kernel: [10731381.365748] sd 2:0:0:0: [sdc]
CDB: Write(10): 2a 00 00 00 00 47 00 00 02 00
Aug 10 14:58:30 gitmaster kernel: [10731381.365816] end_request: I/O
error, dev sdc, sector 71
Aug 10 14:58:30 gitmaster kernel: [10731381.365844] end_request: I/O
error, dev sdc, sector 71
Aug 10 14:58:30 gitmaster kernel: [10731381.365871] md: super_written
gets error=-5, uptodate=0
Aug 10 14:58:30 gitmaster kernel: [10731381.365900] md/raid1:md2: Disk
failure on sdc1, disabling device.
Aug 10 14:58:30 gitmaster kernel: [10731381.365900] md/raid1:md2:
Operation continuing on 1 devices.
Aug 10 14:58:30 gitmaster kernel: [10731381.453474] ata4: COMRESET
failed (errno=-16)
Aug 10 14:58:30 gitmaster kernel: [10731381.453505] ata4: reset failed,
giving up
Aug 10 14:58:30 gitmaster kernel: [10731381.453536] ata4.00: disabled
Aug 10 14:58:30 gitmaster kernel: [10731381.453565] ata4: EH complete
Aug 10 14:58:30 gitmaster kernel: [10731381.453596] sd 3:0:0:0: [sdd]
Unhandled error code
Aug 10 14:58:30 gitmaster kernel: [10731381.453621] sd 3:0:0:0: [sdd]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 10 14:58:30 gitmaster kernel: [10731381.453669] sd 3:0:0:0: [sdd]
CDB: Write(10): 2a 00 00 00 00 47 00 00 02 00
Aug 10 14:58:30 gitmaster kernel: [10731381.453737] end_request: I/O
error, dev sdd, sector 71
Aug 10 14:58:30 gitmaster kernel: [10731381.453765] end_request: I/O
error, dev sdd, sector 71
Aug 10 14:58:30 gitmaster kernel: [10731381.453792] md: super_written
gets error=-5, uptodate=0
Aug 10 14:58:30 gitmaster kernel: [10731381.453867] sd 3:0:0:0: [sdd]
Unhandled error code
Aug 10 14:58:30 gitmaster kernel: [10731381.453894] sd 3:0:0:0: [sdd]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 10 14:58:30 gitmaster kernel: [10731381.453941] sd 3:0:0:0: [sdd]
CDB: Write(10): 2a 00 00 00 00 47 00 00 02 00
Aug 10 14:58:30 gitmaster kernel: [10731381.454010] end_request: I/O
error, dev sdd, sector 71
Aug 10 14:58:30 gitmaster kernel: [10731381.454036] end_request: I/O
error, dev sdd, sector 71
Aug 10 14:58:30 gitmaster kernel: [10731381.454064] md: super_written
gets error=-5, uptodate=0
Aug 10 14:58:30 gitmaster kernel: [10731381.454136] RAID1 conf printout:
Aug 10 14:58:30 gitmaster kernel: [10731381.454140] --- wd:1 rd:2
Aug 10 14:58:30 gitmaster kernel: [10731381.454143] disk 0, wo:0, o:1,
dev:sdd1
Aug 10 14:58:30 gitmaster kernel: [10731381.454146] disk 1, wo:1, o:0,
dev:sdc1
Aug 10 14:58:30 gitmaster kernel: [10731381.477438] RAID1 conf printout:
Aug 10 14:58:30 gitmaster kernel: [10731381.477442] --- wd:1 rd:2
Aug 10 14:58:30 gitmaster kernel: [10731381.477446] disk 0, wo:0, o:1,
dev:sdd1
Aug 10 14:58:30 gitmaster kernel: [10731381.477477] sd 3:0:0:0: [sdd]
Unhandled error code
Aug 10 14:58:30 gitmaster kernel: [10731381.477514] sd 3:0:0:0: [sdd]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 10 14:58:30 gitmaster kernel: [10731381.477562] sd 3:0:0:0: [sdd]
CDB: Write(10): 2a 00 0e c7 da 6f 00 00 18 00
Aug 10 14:58:30 gitmaster kernel: [10731381.477630] end_request: I/O
error, dev sdd, sector 247978607
Aug 10 14:58:30 gitmaster kernel: [10731381.477728] Aborting journal on
device md2-8.
Aug 10 14:58:30 gitmaster kernel: [10731381.477774] sd 3:0:0:0: [sdd]
Unhandled error code
Aug 10 14:58:30 gitmaster kernel: [10731381.477802] sd 3:0:0:0: [sdd]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 10 14:58:30 gitmaster kernel: [10731381.477851] sd 3:0:0:0: [sdd]
CDB: Write(10): 2a 00 0e c4 08 3f 00 00 08 00
Aug 10 14:58:30 gitmaster kernel: [10731381.477922] end_request: I/O
error, dev sdd, sector 247728191
Aug 10 14:58:30 gitmaster kernel: [10731381.477944] sd 3:0:0:0: [sdd]
Unhandled error code
Aug 10 14:58:30 gitmaster kernel: [10731381.477945] sd 3:0:0:0: [sdd]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 10 14:58:30 gitmaster kernel: [10731381.477947] sd 3:0:0:0: [sdd]
CDB: Write(10): 2a 00 00 00 08 3f 00 00 08 00
Aug 10 14:58:30 gitmaster kernel: [10731381.477950] end_request: I/O
error, dev sdd, sector 2111
Aug 10 14:58:30 gitmaster kernel: [10731381.477982] Buffer I/O error on
device md2, logical block 0
Aug 10 14:58:30 gitmaster kernel: [10731381.477983] lost page write due
to I/O error on md2
Aug 10 14:58:30 gitmaster kernel: [10731381.478011] EXT4-fs error
(device md2): ext4_journal_start_sb:327: Detected aborted journal
Aug 10 14:58:30 gitmaster kernel: [10731381.478013] EXT4-fs (md2):
Remounting filesystem read-only
Aug 10 14:58:30 gitmaster kernel: [10731381.478014] EXT4-fs (md2):
previous I/O error to superblock detected
Aug 10 14:58:30 gitmaster kernel: [10731381.478052] sd 3:0:0:0: [sdd]
Unhandled error code
Aug 10 14:58:30 gitmaster kernel: [10731381.478054] sd 3:0:0:0: [sdd]
Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Aug 10 14:58:30 gitmaster kernel: [10731381.478055] sd 3:0:0:0: [sdd]
CDB: Write(10): 2a 00 00 00 08 3f 00 00 08 00
Aug 10 14:58:30 gitmaster kernel: [10731381.478059] end_request: I/O
error, dev sdd, sector 2111
Aug 10 14:58:30 gitmaster kernel: [10731381.478078] Buffer I/O error on
device md2, logical block 0
Aug 10 14:58:30 gitmaster kernel: [10731381.478079] lost page write due
to I/O error on md2
Aug 10 14:58:30 gitmaster kernel: [10731381.485182] Buffer I/O error on
device md2, logical block 30965760
Aug 10 14:58:30 gitmaster kernel: [10731381.485184] lost page write due
to I/O error on md2
Aug 10 14:58:30 gitmaster kernel: [10731381.485190] JBD2: I/O error
detected when updating journal superblock for md2-8.
Aug 10 14:58:30 gitmaster mdadm[1470]: Fail event detected on md device
/dev/md/2, component device /dev/sdc1
--
Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
Callinstrasse 38, 30167 Hannover, Germany
phone/fax: +49 511 762-17185 / -17193
https://wiki.atlas.aei.uni-hannover.de/foswiki/bin/view/ATLAS/WebHome