linux-raid.vger.kernel.org archive mirror
* What just happened to my disks/RAID5 array?
@ 2011-09-13  8:27 Johannes Truschnigg
  2011-09-13 11:37 ` Phil Turmel
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Truschnigg @ 2011-09-13  8:27 UTC (permalink / raw)
  To: linux-raid

[-- Attachment #1: Type: text/plain, Size: 935 bytes --]

Dear list members,

my server at home just mailed in multiple FAIL events from members of 
the RAID5 array in it. I won't be able to get to the machine during the 
next ten or so hours, but I'd like to be prepared as best as I can when 
I face the disaster that apparently struck. I attached the relevant 
dmesg excerpt, as well as the current mdstat contents. Theories 
explaining what could have happened - and how to deal with such a 
scenario - are highly appreciated, as only some of the data on the array 
is actually backed up elsewhere. If you need any additional information 
about the system or its setup, please ask right away!

I do have SSH access to the box.

Thanks for your support!
-- 
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www: http://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp: johannes@truschnigg.info

Please do not bother me with HTML-eMail or attachments. Thank you.

[-- Attachment #2: mdstat.txt --]
[-- Type: text/plain, Size: 266 bytes --]

Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md0 : active raid5 sdf[5](F) sdd[2](F) sde[1](F)
      5860548608 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/0] [_____]
      bitmap: 2/11 pages [8KB], 65536KB chunk

unused devices: <none>

[-- Attachment #3: raidfail.txt --]
[-- Type: text/plain, Size: 26998 bytes --]

[147245.851744] ata7: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0xe frozen
[147245.851752] ata7: irq_stat 0x00400000, PHY RDY changed
[147245.851761] ata7: SError: { PHYRdyChg 10B8B Dispar LinkSeq TrStaTrns }
[147245.851774] ata7: hard resetting link
[147246.568754] ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147246.575779] ata7.00: failed to set xfermode (err_mask=0x100)
[147251.568674] ata7: hard resetting link
[147251.895632] ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147251.909913] ata7.00: configured for UDMA/133
[147251.909925] ata7: EH complete
[147260.340033] ata7: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0xe frozen
[147260.340041] ata7: irq_stat 0x00400000, PHY RDY changed
[147260.340050] ata7: SError: { PHYRdyChg 10B8B Dispar LinkSeq TrStaTrns }
[147260.340063] ata7: hard resetting link
[147261.418971] ata7: SATA link down (SStatus 20 SControl 300)
[147266.418724] ata7: hard resetting link
[147266.738832] ata7: SATA link down (SStatus 20 SControl 300)
[147266.738844] ata7: limiting SATA link speed to 1.5 Gbps
[147271.738667] ata7: hard resetting link
[147272.058700] ata7: SATA link down (SStatus 20 SControl 310)
[147272.058712] ata7.00: disabled
[147272.058722] ata7: limiting SATA link speed to 1.5 Gbps
[147272.058739] ata7: EH complete
[147272.058759] ata7.00: detaching (SCSI 6:0:0:0)
[147272.072254] sd 6:0:0:0: [sdf] Synchronizing SCSI cache
[147272.072326] sd 6:0:0:0: [sdf]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[147272.072335] sd 6:0:0:0: [sdf] Stopping disk
[147272.072352] sd 6:0:0:0: [sdf] START_STOP FAILED
[147272.072357] sd 6:0:0:0: [sdf]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[147272.086129] md/raid:md0: Disk failure on sdf, disabling device.
[147272.086133] md/raid:md0: Operation continuing on 4 devices.
[147272.087498] ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x1980000 action 0x6 frozen
[147272.087507] ata5.00: irq_stat 0x08000000, interface fatal error
[147272.087516] ata5: SError: { 10B8B Dispar LinkSeq TrStaTrns }
[147272.087524] ata5.00: failed command: WRITE DMA
[147272.087538] ata5.00: cmd ca/00:06:10:00:00/00:00:00:00:00/e0 tag 0 dma 3072 out
[147272.087542]          res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
[147272.087549] ata5.00: status: { DRDY }
[147272.087561] ata5: hard resetting link
[147272.087580] ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x1980000 action 0x6 frozen
[147272.087586] ata4.00: irq_stat 0x08000000, interface fatal error
[147272.087593] ata4: SError: { 10B8B Dispar LinkSeq TrStaTrns }
[147272.087599] ata4.00: failed command: WRITE DMA
[147272.087612] ata4.00: cmd ca/00:06:10:00:00/00:00:00:00:00/e0 tag 0 dma 3072 out
[147272.087616]          res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
[147272.087622] ata4.00: status: { DRDY }
[147272.087630] ata4: hard resetting link
[147272.405565] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147272.618863] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147272.625883] ata4.00: failed to set xfermode (err_mask=0x100)
[147277.618667] ata4: hard resetting link
[147277.618706] ata5.00: qc timeout (cmd 0xec)
[147277.618719] ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[147277.618727] ata5.00: revalidation failed (errno=-5)
[147277.618737] ata5: hard resetting link
[147277.938774] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147277.952986] ata4.00: configured for UDMA/133
[147277.953008] ata4: EH complete
[147282.952226] ata5: link is slow to respond, please be patient (ready=0)
[147287.645624] ata5: COMRESET failed (errno=-16)
[147287.645633] ata5: hard resetting link
[147287.965323] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147292.968826] ata5.00: qc timeout (cmd 0x27)
[147292.968838] ata5.00: failed to read native max address (err_mask=0x4)
[147292.968844] ata5.00: HPA support seems broken, skipping HPA handling
[147292.968851] ata5.00: revalidation failed (errno=-5)
[147292.968858] ata5: limiting SATA link speed to 1.5 Gbps
[147292.968868] ata5: hard resetting link
[147293.288784] ata5: SATA link up <unknown> (SStatus 103 SControl 310)
[147302.438718] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x1980000 action 0x6 frozen
[147302.438729] ata6: SError: { 10B8B Dispar LinkSeq TrStaTrns }
[147302.438737] ata6.00: failed command: WRITE DMA
[147302.438752] ata6.00: cmd ca/00:06:10:00:00/00:00:00:00:00/e0 tag 0 dma 3072 out
[147302.438756]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[147302.438763] ata6.00: status: { DRDY }
[147302.438774] ata6: hard resetting link
[147302.438798] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x1980000 action 0x6 frozen
[147302.438809] ata3: SError: { 10B8B Dispar LinkSeq TrStaTrns }
[147302.438818] ata3.00: failed command: WRITE DMA
[147302.438833] ata3.00: cmd ca/00:06:10:00:00/00:00:00:00:00/e0 tag 0 dma 3072 out
[147302.438836]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[147302.438843] ata3.00: status: { DRDY }
[147302.438855] ata3: hard resetting link
[147302.758863] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147302.762125] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147302.773317] ata3.00: configured for UDMA/133
[147302.773328] ata3.00: device reported invalid CHS sector 0
[147302.773342] ata3: EH complete
[147302.775379] ata6.00: configured for UDMA/133
[147302.775387] ata6.00: device reported invalid CHS sector 0
[147302.775397] ata6: EH complete
[147303.288709] ata5.00: qc timeout (cmd 0xec)
[147303.288718] ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[147303.288725] ata5.00: revalidation failed (errno=-5)
[147303.288730] ata5.00: disabled
[147303.288748] ata5: hard resetting link
[147303.608971] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[147303.608987] ata5: EH complete
[147303.609011] sd 4:0:0:0: [sdd] Unhandled error code
[147303.609017] sd 4:0:0:0: [sdd]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[147303.609026] sd 4:0:0:0: [sdd] CDB: Write(10): 2a 00 00 00 00 10 00 00 06 00
[147303.609042] end_request: I/O error, dev sdd, sector 16
[147303.609050] end_request: I/O error, dev sdd, sector 16
[147303.609055] md: super_written gets error=-5, uptodate=0
[147303.609064] md/raid:md0: Disk failure on sdd, disabling device.
[147303.609067] md/raid:md0: Operation continuing on 3 devices.
[147338.034739] ata5: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0xe frozen
[147338.034747] ata5: irq_stat 0x00400000, PHY RDY changed
[147338.034755] ata5: SError: { PHYRdyChg 10B8B Dispar LinkSeq TrStaTrns }
[147338.034772] ata5: hard resetting link
[147338.752214] ata5: SATA link down (SStatus 10 SControl 300)
[147338.752231] ata5: EH complete
[147338.752249] ata5.00: detaching (SCSI 4:0:0:0)
[147338.762446] sd 4:0:0:0: [sdd] Synchronizing SCSI cache
[147338.762686] sd 4:0:0:0: [sdd]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[147338.762695] sd 4:0:0:0: [sdd] Stopping disk
[147338.762712] sd 4:0:0:0: [sdd] START_STOP FAILED
[147338.762717] sd 4:0:0:0: [sdd]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[147347.452182] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x1980000 action 0x6 frozen
[147347.452193] ata4: SError: { 10B8B Dispar LinkSeq TrStaTrns }
[147347.452200] ata4.00: failed command: FLUSH CACHE EXT
[147347.452214] ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[147347.452217]          res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[147347.452225] ata4.00: status: { DRDY }
[147347.452236] ata4: hard resetting link
[147352.825504] ata4: link is slow to respond, please be patient (ready=0)
[147357.465530] ata4: COMRESET failed (errno=-16)
[147357.465540] ata4: hard resetting link
[147357.785657] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147357.792665] ata4.00: n_sectors mismatch 2930277168 != 1927137455439861512
[147357.792672] ata4.00: revalidation failed (errno=-19)
[147357.792678] ata4: limiting SATA link speed to 1.5 Gbps
[147362.785605] ata4: hard resetting link
[147363.105612] ata4: SATA link up <unknown> (SStatus 103 SControl 310)
[147363.112206] ata4.00: failed to read native max address (err_mask=0x100)
[147363.112212] ata4.00: HPA support seems broken, skipping HPA handling
[147363.112219] ata4.00: revalidation failed (errno=-5)
[147363.112225] ata4.00: disabled
[147368.105572] ata4: hard resetting link
[147368.425538] ata4: SATA link up <unknown> (SStatus 103 SControl 300)
[147368.432154] ata4.00: ATA-14: SAMSUNG HD154UI                        `, 1AG01118, max MWDMA1
[147368.432163] ata4.00: 2509505962 sectors, multi 0: LBA NCQ (depth 31)
[147368.432171] ata4.00: applying bridge limits
[147368.432941] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x100)
[147368.432947] ata4.00: revalidation failed (errno=-5)
[147373.425473] ata4: hard resetting link
[147373.745335] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147378.745337] ata4.00: qc timeout (cmd 0xec)
[147378.745348] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[147378.745355] ata4.00: revalidation failed (errno=-5)
[147378.745361] ata4: limiting SATA link speed to 1.5 Gbps
[147378.745369] ata4: hard resetting link
[147379.065602] ata4: SATA link up <unknown> (SStatus 103 SControl 310)
[147389.065604] ata4.00: qc timeout (cmd 0xec)
[147389.065616] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[147389.065623] ata4.00: revalidation failed (errno=-5)
[147389.065629] ata4.00: disabled
[147389.065657] ata4: hard resetting link
[147389.598721] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[147389.598766] sd 3:0:0:0: rejecting I/O to offline device
[147389.598780] ata4: EH complete
[147389.598801] sd 3:0:0:0: rejecting I/O to offline device
[147389.598813] end_request: I/O error, dev sdc, sector 8
[147389.598820] md: super_written gets error=-5, uptodate=0
[147389.598828] md/raid:md0: Disk failure on sdc, disabling device.
[147389.598831] md/raid:md0: Operation continuing on 2 devices.
[147389.598880] sd 3:0:0:0: rejecting I/O to offline device
[147389.598895] sd 3:0:0:0: rejecting I/O to offline device
[147389.598904] sd 3:0:0:0: [sdc] READ CAPACITY(16) failed
[147389.598910] sd 3:0:0:0: [sdc]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[147389.598919] sd 3:0:0:0: [sdc] Sense not available.
[147389.598928] sd 3:0:0:0: rejecting I/O to offline device
[147389.598941] sd 3:0:0:0: rejecting I/O to offline device
[147389.598953] sd 3:0:0:0: rejecting I/O to offline device
[147389.598962] sd 3:0:0:0: [sdc] READ CAPACITY failed
[147389.599002] sd 3:0:0:0: [sdc]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[147389.599010] sd 3:0:0:0: [sdc] Sense not available.
[147389.599021] sd 3:0:0:0: rejecting I/O to offline device
[147389.599034] sd 3:0:0:0: rejecting I/O to offline device
[147389.599063] sd 3:0:0:0: rejecting I/O to offline device
[147389.599079] ata3.00: exception Emask 0x12 SAct 0x0 SErr 0x1980400 action 0x6 frozen
[147389.599088] sd 3:0:0:0: rejecting I/O to offline device
[147389.599098] ata3.00: irq_stat 0x08000000, interface fatal error
[147389.599105] sd 3:0:0:0: [sdc] Asking for cache data failed
[147389.599113] sd 3:0:0:0: [sdc] Assuming drive cache: write through
[147389.599125] ata3: SError: { Proto 10B8B Dispar LinkSeq TrStaTrns }
[147389.599133] sdc: detected capacity change from 1500301910016 to 0
[147389.599145] ata3.00: failed command: WRITE DMA
[147389.599163] ata3.00: cmd ca/00:02:08:00:00/00:00:00:00:00/e0 tag 0 dma 1024 out
[147389.599167]          res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x12 (ATA bus error)
[147389.599174] ata3.00: status: { DRDY }
[147389.599186] ata3: hard resetting link
[147389.599249] ata4.00: detaching (SCSI 3:0:0:0)
[147389.612289] sd 3:0:0:0: [sdc] Stopping disk
[147389.612515] sd 3:0:0:0: [sdc] START_STOP FAILED
[147389.612521] sd 3:0:0:0: [sdc]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[147390.135576] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147390.149879] ata3.00: configured for UDMA/133
[147390.149902] ata3: EH complete
[147393.089812] ata4: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0xe frozen
[147393.089819] ata4: irq_stat 0x00400000, PHY RDY changed
[147393.089828] ata4: SError: { PHYRdyChg 10B8B Dispar LinkSeq TrStaTrns }
[147393.089843] ata4: limiting SATA link speed to 1.5 Gbps
[147393.089851] ata4: hard resetting link
[147394.562403] ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0xe frozen
[147394.562409] ata6.00: irq_stat 0x00400000, PHY RDY changed
[147394.562417] ata6: SError: { PHYRdyChg 10B8B Dispar LinkSeq TrStaTrns }
[147394.562423] ata6.00: failed command: FLUSH CACHE EXT
[147394.562438] ata6.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[147394.562441]          res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
[147394.562448] ata6.00: status: { DRDY }
[147394.562457] ata6: hard resetting link
[147395.338671] ata4: COMRESET failed (errno=-32)
[147395.338678] ata4: reset failed (errno=-32), retrying in 8 secs
[147400.295627] ata6: link is slow to respond, please be patient (ready=0)
[147403.088895] ata4: hard resetting link
[147404.615638] ata6: COMRESET failed (errno=-16)
[147404.615646] ata6: hard resetting link
[147407.225318] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147407.239058] ata6.00: failed to IDENTIFY (I/O error, err_mask=0x100)
[147407.239065] ata6.00: revalidation failed (errno=-5)
[147412.225599] ata6: hard resetting link
[147412.545517] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147412.559752] ata6.00: configured for UDMA/133
[147412.559759] ata6.00: retrying FLUSH 0xea Emask 0x10
[147412.560140] ata6.00: device reported invalid CHS sector 0
[147412.560156] ata6: EH complete
[147412.636929] RAID conf printout:
[147412.636938]  --- level:5 rd:5 wd:2
[147412.636946]  disk 0, o:0, dev:sdc
[147412.636951]  disk 1, o:1, dev:sde
[147412.636956]  disk 2, o:0, dev:sdd
[147412.636961]  disk 3, o:1, dev:sdb
[147412.636966]  disk 4, o:0, dev:sdf
[147412.637000] RAID conf printout:
[147412.637006]  --- level:5 rd:5 wd:2
[147412.637012]  disk 0, o:0, dev:sdc
[147412.637016]  disk 1, o:1, dev:sde
[147412.637022]  disk 2, o:0, dev:sdd
[147412.637026]  disk 3, o:1, dev:sdb
[147412.637041] RAID conf printout:
[147412.637045]  --- level:5 rd:5 wd:2
[147412.637050]  disk 0, o:0, dev:sdc
[147412.637054]  disk 1, o:1, dev:sde
[147412.637059]  disk 2, o:0, dev:sdd
[147412.637063]  disk 3, o:1, dev:sdb
[147412.639165] RAID conf printout:
[147412.639170]  --- level:5 rd:5 wd:2
[147412.639175]  disk 0, o:0, dev:sdc
[147412.639180]  disk 1, o:1, dev:sde
[147412.639185]  disk 3, o:1, dev:sdb
[147412.639200] RAID conf printout:
[147412.639204]  --- level:5 rd:5 wd:2
[147412.639208]  disk 0, o:0, dev:sdc
[147412.639213]  disk 1, o:1, dev:sde
[147412.639217]  disk 3, o:1, dev:sdb
[147412.639247] RAID conf printout:
[147412.639252]  --- level:5 rd:5 wd:2
[147412.639257]  disk 1, o:1, dev:sde
[147412.639262]  disk 3, o:1, dev:sdb
[147412.647225] md: unbind<sdc>
[147412.655439] md: export_rdev(sdc)
[147413.132017] ata4: COMRESET failed (errno=-16)
[147413.132034] ata4: hard resetting link
[147413.852109] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[147413.859114] ata4.00: ATA-7: SAMSUNG HD154UI, 1AG01118, max UDMA7
[147413.859124] ata4.00: 2930277168 sectors, multi 0: LBA48 NCQ (depth 31/32)
[147413.866547] ata4.00: configured for UDMA/133
[147413.866561] ata4: EH complete
[147413.866746] scsi 3:0:0:0: Direct-Access     ATA      SAMSUNG HD154UI  1AG0 PQ: 0 ANSI: 5
[147413.867111] sd 3:0:0:0: [sdc] 2930277168 512-byte logical blocks: (1.50 TB/1.36 TiB)
[147413.867121] sd 3:0:0:0: Attached scsi generic sg2 type 0
[147413.867226] sd 3:0:0:0: [sdc] Write Protect is off
[147413.867234] sd 3:0:0:0: [sdc] Mode Sense: 00 3a 00 00
[147413.867374] sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[147413.868029] sdc: detected capacity change from 0 to 1500301910016
[147413.883293]  sdc: unknown partition table
[147413.883663] sd 3:0:0:0: [sdc] Attached SCSI disk
[147421.162867] ata6: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0xe frozen
[147421.162875] ata6: irq_stat 0x00400000, PHY RDY changed
[147421.162884] ata6: SError: { PHYRdyChg 10B8B Dispar LinkSeq TrStaTrns }
[147421.162897] ata6: hard resetting link
[147427.855375] ata6: link is slow to respond, please be patient (ready=0)
[147431.215600] ata6: COMRESET failed (errno=-16)
[147431.215610] ata6: hard resetting link
[147431.935560] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147432.783678] ata3: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0xe frozen
[147432.783684] ata3: irq_stat 0x00400000, PHY RDY changed
[147432.783693] ata3: SError: { PHYRdyChg 10B8B Dispar LinkSeq TrStaTrns }
[147432.783706] ata3: hard resetting link
[147436.935545] ata6.00: qc timeout (cmd 0xec)
[147436.935559] ata6.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[147436.935566] ata6.00: revalidation failed (errno=-5)
[147436.935576] ata6: hard resetting link
[147437.468869] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147437.475816] ata6.00: both IDENTIFYs aborted, assuming NODEV
[147437.475822] ata6.00: revalidation failed (errno=-2)
[147442.468949] ata6: hard resetting link
[147442.788695] ata3: COMRESET failed (errno=-16)
[147442.788706] ata3: hard resetting link
[147442.788731] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147442.803092] ata6.00: failed to read native max address (err_mask=0x100)
[147442.803099] ata6.00: HPA support seems broken, skipping HPA handling
[147442.803106] ata6.00: revalidation failed (errno=-5)
[147442.803112] ata6.00: disabled
[147442.803121] ata6: limiting SATA link speed to 1.5 Gbps
[147442.805300] sd 5:0:0:0: rejecting I/O to offline device
[147442.805331] ata6: hard resetting link
[147442.805347] sd 5:0:0:0: rejecting I/O to offline device
[147442.805365] sd 5:0:0:0: rejecting I/O to offline device
[147442.805379] sd 5:0:0:0: rejecting I/O to offline device
[147442.805390] sd 5:0:0:0: [sde] READ CAPACITY(16) failed
[147442.805399] sd 5:0:0:0: [sde]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[147442.805412] sd 5:0:0:0: [sde] Sense not available.
[147442.805424] sd 5:0:0:0: rejecting I/O to offline device
[147442.805438] sd 5:0:0:0: rejecting I/O to offline device
[147442.805453] sd 5:0:0:0: rejecting I/O to offline device
[147442.805464] sd 5:0:0:0: [sde] READ CAPACITY failed
[147442.805472] sd 5:0:0:0: [sde]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[147442.805485] sd 5:0:0:0: [sde] Sense not available.
[147442.805498] sd 5:0:0:0: rejecting I/O to offline device
[147442.805512] sd 5:0:0:0: rejecting I/O to offline device
[147442.805530] sd 5:0:0:0: rejecting I/O to offline device
[147442.805545] sd 5:0:0:0: rejecting I/O to offline device
[147442.805554] sd 5:0:0:0: [sde] Asking for cache data failed
[147442.805560] sd 5:0:0:0: [sde] Assuming drive cache: write through
[147442.805572] sde: detected capacity change from 1500301910016 to 0
[147443.125470] ata6: SATA link up <unknown> (SStatus 103 SControl 310)
[147443.508925] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147444.412118] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x1980000 action 0x6 frozen
[147444.412129] ata4: SError: { 10B8B Dispar LinkSeq TrStaTrns }
[147444.412137] ata4.00: failed command: READ FPDMA QUEUED
[147444.412152] ata4.00: cmd 60/08:00:20:7b:a8/00:00:ae:00:00/40 tag 0 ncq 4096 in
[147444.412156]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[147444.412163] ata4.00: status: { DRDY }
[147444.412175] ata4: hard resetting link
[147444.945585] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[147444.960590] ata4.00: configured for UDMA/133
[147444.960601] ata4.00: device reported invalid CHS sector 0
[147444.960614] ata4: EH complete
[147448.125388] ata6.00: qc timeout (cmd 0xec)
[147448.125399] ata6.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[147448.125408] ata6: hard resetting link
[147448.508741] ata3.00: qc timeout (cmd 0xec)
[147448.508750] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[147448.508757] ata3.00: revalidation failed (errno=-5)
[147448.508763] ata3: hard resetting link
[147448.658874] ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[147448.658928] ata6.00: failed to IDENTIFY (I/O error, err_mask=0x100)
[147453.658890] ata6: hard resetting link
[147453.768760] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[147453.781812] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x100)
[147453.781820] ata3.00: revalidation failed (errno=-5)
[147453.781827] ata3: limiting SATA link speed to 1.5 Gbps
[147458.768931] ata3: hard resetting link
[147458.878822] ata6: SATA link down (SStatus 10 SControl 310)
[147458.878840] ata6: EH complete
[147458.878859] ata6.00: detaching (SCSI 5:0:0:0)
[147458.889079] sd 5:0:0:0: [sde] Stopping disk
[147458.889130] sd 5:0:0:0: [sde] START_STOP FAILED
[147458.889135] sd 5:0:0:0: [sde]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[147458.897620] md/raid:md0: Disk failure on sde, disabling device.
[147458.897625] md/raid:md0: Operation continuing on 1 devices.
[147459.088896] ata3: SATA link down (SStatus 20 SControl 310)
[147459.088910] ata3.00: disabled
[147459.088928] ata3: EH complete
[147459.088943] sd 2:0:0:0: rejecting I/O to offline device
[147459.088974] sd 2:0:0:0: rejecting I/O to offline device
[147459.088984] md: super_written gets error=-5, uptodate=0
[147459.088993] md/raid:md0: Disk failure on sdb, disabling device.
[147459.088996] md/raid:md0: Operation continuing on 0 devices.
[147459.089022] ata3.00: detaching (SCSI 2:0:0:0)
[147459.089130] RAID conf printout:
[147459.089139]  --- level:5 rd:5 wd:0
[147459.089145]  disk 1, o:0, dev:sde
[147459.089150]  disk 3, o:0, dev:sdb
[147459.098771] RAID conf printout:
[147459.098780]  --- level:5 rd:5 wd:0
[147459.098787]  disk 3, o:0, dev:sdb
[147459.098801] RAID conf printout:
[147459.098804]  --- level:5 rd:5 wd:0
[147459.098809]  disk 3, o:0, dev:sdb
[147459.102318] sd 2:0:0:0: [sdb] Synchronizing SCSI cache
[147459.102416] sd 2:0:0:0: [sdb]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[147459.102425] sd 2:0:0:0: [sdb] Stopping disk
[147459.103462] sd 2:0:0:0: [sdb] START_STOP FAILED
[147459.103469] sd 2:0:0:0: [sdb]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[147459.108690] RAID conf printout:
[147459.108697]  --- level:5 rd:5 wd:0
[147459.116041] md: unbind<sdb>
[147459.125635] md: export_rdev(sdb)
[147475.345396] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x180000 action 0x6 frozen
[147475.345407] ata4: SError: { 10B8B Dispar }
[147475.345415] ata4.00: failed command: READ FPDMA QUEUED
[147475.345430] ata4.00: cmd 60/08:00:20:7b:a8/00:00:ae:00:00/40 tag 0 ncq 4096 in
[147475.345433]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[147475.345440] ata4.00: status: { DRDY }
[147475.345453] ata4: hard resetting link
[147475.908916] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[147475.908931] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x40)
[147475.908937] ata4.00: revalidation failed (errno=-5)
[147480.908915] ata4: hard resetting link
[147481.495438] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[147481.495451] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x40)
[147481.495457] ata4.00: revalidation failed (errno=-5)
[147486.495645] ata4: hard resetting link
[147486.815525] ata4: SATA link down (SStatus 10 SControl 310)
[147486.815536] ata4.00: disabled
[147486.815552] ata4.00: device reported invalid CHS sector 0
[147486.815575] sd 3:0:0:0: [sdc]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[147486.815583] sd 3:0:0:0: [sdc]  Sense Key : Aborted Command [current] [descriptor]
[147486.815593] Descriptor sense data with sense descriptors (in hex):
[147486.815598]         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
[147486.815613]         00 00 00 00 
[147486.815620] sd 3:0:0:0: [sdc]  Add. Sense: No additional sense information
[147486.815629] sd 3:0:0:0: [sdc] CDB: Read(10): 28 00 ae a8 7b 20 00 00 08 00
[147486.815645] end_request: I/O error, dev sdc, sector 2930277152
[147486.815654] Buffer I/O error on device sdc, logical block 366284644
[147486.815685] ata4: EH complete
[147486.815713] ata4.00: detaching (SCSI 3:0:0:0)
[147486.828971] sd 3:0:0:0: [sdc] Synchronizing SCSI cache
[147486.829049] sd 3:0:0:0: [sdc]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[147486.829058] sd 3:0:0:0: [sdc] Stopping disk
[147486.830153] sd 3:0:0:0: [sdc] START_STOP FAILED
[147486.830160] sd 3:0:0:0: [sdc]  Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[147486.976111] ata4: exception Emask 0x10 SAct 0x0 SErr 0x19d0000 action 0xe frozen
[147486.976119] ata4: irq_stat 0x00400000, PHY RDY changed
[147486.976128] ata4: SError: { PHYRdyChg CommWake 10B8B Dispar LinkSeq TrStaTrns }
[147486.976143] ata4: limiting SATA link speed to 1.5 Gbps
[147486.976151] ata4: hard resetting link
[147488.175607] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[147488.175622] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x40)
[147493.175412] ata4: hard resetting link
[147493.708805] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[147493.865505] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x100)
[147498.708711] ata4: hard resetting link
[147499.242063] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[147499.242078] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x40)
[147504.242060] ata4: hard resetting link
[147504.562106] ata4: SATA link down (SStatus 10 SControl 310)
[147504.562122] ata4: EH complete
[155065.805782] ata7: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0xe frozen
[155065.805790] ata7: irq_stat 0x00400000, PHY RDY changed
[155065.805798] ata7: SError: { PHYRdyChg 10B8B Dispar LinkSeq TrStaTrns }
[155065.805815] ata7: hard resetting link
[155066.524554] ata7: SATA link down (SStatus 0 SControl 300)
[155066.524569] ata7: EH complete
[155449.780452] ata7: exception Emask 0x10 SAct 0x0 SErr 0x1990000 action 0xe frozen
[155449.780461] ata7: irq_stat 0x00400000, PHY RDY changed
[155449.780469] ata7: SError: { PHYRdyChg 10B8B Dispar LinkSeq TrStaTrns }
[155449.780486] ata7: hard resetting link
[155450.498168] ata7: SATA link down (SStatus 100 SControl 300)
[155450.498182] ata7: EH complete
[162311.577143] EXT4-fs warning (device dm-0): ext4_end_bio:259: I/O error writing to inode 3424267 (offset 0 size 4096 starting block 876653877)
[162311.577809] EXT4-fs warning (device dm-0): ext4_end_bio:259: I/O error writing to inode 3424266 (offset 0 size 4096 starting block 876648988)
[162317.344302] Aborting journal on device dm-0-8.
[162317.344353] Buffer I/O error on device dm-0, logical block 731938816
[162317.344360] lost page write due to I/O error on dm-0
[162317.344378] JBD2: I/O error detected when updating journal superblock for dm-0-8.



* Re: What just happened to my disks/RAID5 array?
  2011-09-13  8:27 What just happened to my disks/RAID5 array? Johannes Truschnigg
@ 2011-09-13 11:37 ` Phil Turmel
  2011-09-13 18:56   ` Johannes Truschnigg
  0 siblings, 1 reply; 12+ messages in thread
From: Phil Turmel @ 2011-09-13 11:37 UTC (permalink / raw)
  To: Johannes Truschnigg; +Cc: linux-raid

Good Morning Johannes,

On 09/13/2011 04:27 AM, Johannes Truschnigg wrote:
> Dear list members,
> 
> my server at home just mailed in multiple FAIL events from members of
> the RAID5 array in it. I won't be able to get to the machine during
> the next ten or so hours, but I'd like to be prepared as best as I
> can when I face the disaster that apparently struck. I attached the
> relevant dmesg excerpt, as well as the current mdstat contents.
> Theories explaining what could have happened - and how to deal with
> such a scenario - are highly appreciated, as only some of the data on
> the array is actually backed up elsewhere. If you need any additional
> information about the system or its setup, please ask right away!
> 
> I do have SSH access to the box.

From a brief review of your dmesg, it all looks like hardware.  Some ideas come to mind:

1)  Controller failure.
2)  Power supply failure (possibly partial failure of a multi-rail PS).
3)  Cooling failure.

Simultaneous failure of that many devices strains credulity, so I doubt you've lost your array.  One possible variant of "2" would be a failed drive that draws enough current to drop the voltage to its sibling drives.

Since some drives are still "alive", they'll have newer event counts than the devices that went offline.  When you fix the root cause, you may need to use "--assemble --force" to get mdadm to restart your array.
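Once the root cause is fixed, the event-count check and forced assembly Phil describes typically look like the sketch below. This is a hedged outline, not a tested recovery script: the md0 name and sd[b-f] device names are taken from the mdstat/dmesg above, but device names frequently shift after a reboot, so verify them first.

```shell
# Compare event counts across the members before forcing anything;
# a small spread means little risk of stale data being assembled in:
for d in /dev/sd[b-f]; do
    echo "== $d =="
    mdadm --examine "$d" | grep -E 'Event|Update Time'
done

# Ensure the dead array is stopped, then force-assemble from the
# surviving superblocks; mdadm picks the freshest metadata and bumps
# members whose event counts are close enough:
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sd[b-f]
```

After a forced assembly, running a read-only filesystem check before mounting read-write is a prudent extra step.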

The output of "lsdrv" [1] would be helpful in offering more specific advice, along with "mdadm -D" of the array and "mdadm -E" of all of its components (when you get them back).
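Gathering that output for the list might look like the following sketch (array and device names assumed from the logs above; redirecting to files makes the results easy to attach to a reply):

```shell
# Array-level view: state, UUID, and per-member roles
mdadm --detail /dev/md0 > md0-detail.txt

# Per-member superblock view: event counts and update times,
# appended for each component device:
for d in /dev/sd[b-f]; do
    mdadm --examine "$d" >> md0-examine.txt
done
```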

HTH,

Phil

[1] http://github.com/pturmel/lsdrv


* Re: What just happened to my disks/RAID5 array?
  2011-09-13 11:37 ` Phil Turmel
@ 2011-09-13 18:56   ` Johannes Truschnigg
  2011-09-14 11:41     ` Phil Turmel
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Truschnigg @ 2011-09-13 18:56 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 1904 bytes --]

Hi Phil,

first of all, thanks for replying and providing both technical and moral
support ;) As it turned out today, I won't be able to get my hands on
the box for at least another 12 hours, so for now I can still only 
speculate about what happened (at the physical/hardware level, that is).

On 09/13/2011 01:37 PM, Phil Turmel wrote:
> Simultaneous failure of that many devices strains credulity, so I
> doubt you've lost your array.  One possible variant of "2" would be a
> failed drive that draws enough current to drop the voltage to its
> sibling drives.

All the drives are located in separate hot-swap trays with a full,
unoccupied 5.25" slot in between them. Unless my apartment caught
fire with half the drives roasting in it, I think bad cooling can be 
ruled out - the drives never went above 40°C even with all case fans 
turned off.

The controller seems alive still - lsdrv (output attached) shows that
the kernel still has some of the component devices registered.

> Since some drives are still "alive", they'll have newer event counts
>  than the devices that went offline.  When you fix the root cause,
> you may need to use "--assemble --force" to get mdadm to restart your
> array.

I see - I don't have the interim storage capacity to dump the drives
before trying that - is there any advice you can offer on doing this
assembly procedure in the safest way possible?

> The output of "lsdrv" [1] would be helpful in offering more specific
>  advice, along with "mdadm -D" of the array and "mdadm -E" of all of
>  its components (when you get them back).

I will provide the components' info asap.

Thanks very much for sharing your input and expertise!

-- 
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www: http://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp: johannes@truschnigg.info

Please do not bother me with HTML-eMail or attachments. Thank you.

[-- Attachment #2: lsdrv.txt --]
[-- Type: text/plain, Size: 884 bytes --]

PCI [pata_amd] 00:06.0 IDE interface: nVidia Corporation MCP78S [GeForce 8200] IDE (rev a1)
 ├─scsi 0:0:0:0 ATA TRANSCEND {20090625_D40D51BB}
 │  └─sda: [8:0] Partitioned (dos) 1.87g
 │     └─sda1: [8:1] (ext2) 1.87g 'VIRTUE' {ff586bcd-b1fd-4c08-a0ea-08e2e1c7b8f9}
 │        ├─Mounted as /dev/root @ /
 │        └─Mounted as /dev/root @ /srv/web/virtue
 └─scsi 1:x:x:x [Empty]
PCI [ahci] 00:09.0 SATA controller: nVidia Corporation MCP78S [GeForce 8200] AHCI Controller (rev a2)
 ├─scsi 2:x:x:x [Empty]
 ├─scsi 3:x:x:x [Empty]
 └─scsi 7:x:x:x [Empty]
Other Block Devices
 ├─dm-0: [253:0] (ext4) 5.46t 'MAIN_STORAGE' {aff33f2a-1dac-47e5-a9ed-05e24d3bda15}
 │  ├─Mounted as /dev/mapper/VG_STORAGE-LV_MAIN @ /media/virtue_main
 │  └─Mounted as /dev/mapper/VG_STORAGE-LV_MAIN @ /srv/files
 ├─md0: [9:0] Empty/Unknown 5.46t


[-- Attachment #3: md0-examine.txt --]
[-- Type: text/plain, Size: 979 bytes --]

/dev/md0:
        Version : 1.2
  Creation Time : Tue Dec 21 10:25:32 2010
     Raid Level : raid5
     Array Size : 5860548608 (5589.05 GiB 6001.20 GB)
  Used Dev Size : 1465137152 (1397.26 GiB 1500.30 GB)
   Raid Devices : 5
  Total Devices : 3
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Tue Sep 13 10:15:49 2011
          State : active, FAILED
 Active Devices : 0
Working Devices : 0
 Failed Devices : 3
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       0        0        1      removed
       2       0        0        2      removed
       3       0        0        3      removed
       4       0        0        4      removed

       1       8       64        -      faulty spare
       2       8       48        -      faulty spare
       5       8       80        -      faulty spare


* Re: What just happened to my disks/RAID5 array?
  2011-09-13 18:56   ` Johannes Truschnigg
@ 2011-09-14 11:41     ` Phil Turmel
  2011-09-14 18:17       ` Johannes Truschnigg
  0 siblings, 1 reply; 12+ messages in thread
From: Phil Turmel @ 2011-09-14 11:41 UTC (permalink / raw)
  To: Johannes Truschnigg; +Cc: linux-raid

Good Morning Johannes,

Sorry about the delay...  worked late yesterday.

On 09/13/2011 02:56 PM, Johannes Truschnigg wrote:
> The controller seems alive still - lsdrv (output attached) shows that
> the kernel still has some of the component devices registered.

Actually, it doesn't.  None of the /dev/md0 components are present.  Ditto for the "mdadm -D" report.

There are also too few open controller ports shown in lsdrv to account for the five missing raid drives.  That strongly suggests that you've been using an add-on controller or port multiplier, and that controller has died.  A complete dmesg (from boot) would provide the details of the missing controller.  At least ports "scsi 4:x:x:x", "scsi 5:x:x:x", and "scsi 6:x:x:x" must have existed from boot, as they were interleaved with 2, 3, and 7.

>> Since some drives are still "alive", they'll have newer event counts
>>  than the devices that went offline.  When you fix the root cause,
>> you may need to use "--assemble --force" to get mdadm to restart your
>> array.
> 
> I see - I don't have the interim storage capacity to dump the drives
> before trying that - is there any advice you can offer on doing this
> assembly procedure in the safest way possible?

"--assemble" is safe in all known cases.  Use it first.  With the whole controller gone, you probably have consistent event counts after all, and --assemble should just work.  "--assemble --force" is somewhat less safe, but I wouldn't hesitate to use it in a situation where the drives truly dropped out together.  You'll likely find some problems with fsck if files were actively being written when the array dropped out, but the vast majority of your filesystem(s) should be safe.
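A quick way to check whether the event counts actually diverged - and hence whether --force is likely to be needed - is to pull the "Events" lines out of saved `mdadm -E` dumps. The sketch below uses a stand-in sample file (name and contents are illustrative only, loosely modelled on this thread):

```shell
# Compare the per-member event counts before deciding between a plain
# --assemble and --assemble --force.  One distinct value across all
# members means plain --assemble should work; more than one means the
# devices fell out at different times and --force may be required.
cat > /tmp/md0-events.txt <<'EOF'
/dev/sdb:
         Events : 3926
/dev/sde:
         Events : 3929
EOF

distinct=$(awk '/Events/ { print $3 }' /tmp/md0-events.txt | sort -u | wc -l)
if [ "$distinct" -eq 1 ]; then
    echo "event counts consistent: plain --assemble should work"
else
    echo "event counts differ: --assemble --force may be needed"
fi
```

In practice you would feed this the real `mdadm -E` output for each member rather than the sample here-doc.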

Other procedures are progressively less safe.  I prefer to not offer specifics until you've hooked your drives back up, and generated fresh "lsdrv" and "mdadm" reports.

>> The output of "lsdrv" [1] would be helpful in offering more specific
>>  advice, along with "mdadm -D" of the array and "mdadm -E" of all of
>>  its components (when you get them back).
> 
> I will provide the components' info asap.
> 
> Thanks very much for sharing your input and expertise!

You're welcome.

Phil


* Re: What just happened to my disks/RAID5 array?
  2011-09-14 11:41     ` Phil Turmel
@ 2011-09-14 18:17       ` Johannes Truschnigg
  2011-09-14 19:19         ` Phil Turmel
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Truschnigg @ 2011-09-14 18:17 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 3089 bytes --]

Hello again Phil (and of course also possible bystanders :))!

On 09/14/2011 01:41 PM, Phil Turmel wrote:
> Good Morning Johannes,
> 
> Sorry about the delay...  worked late yesterday.

Really no need to be sorry about anything; actually I'm perfectly aware
that I'm not entitled to any kind of your support, and I greatly
appreciate it whenever you volunteer to share your insights with me. So
let me say thank you very, very much for getting back to me again in
this regard!

>> The controller seems alive still - lsdrv (output attached) shows that
>> the kernel still has some of the component devices registered.
> 
> Actually, it doesn't.  None of the /dev/md0 components are present. 
> Ditto for the "mdadm -D" report.

You are right; none of the disks were present once I got to the machine.
The lvm and fs on top seemed rather confused about what happened, and I
went on to kill all processes with file handles open on the fs in
question, unmounted the fs, and rebooted. The board's BIOS took an
awkwardly long time when scanning for SATA devices on the SB's ports,
but in the end showed all of them in the POST screen. After booting the
kernel, one of the drives popped out rather early in the process (about
two or three seconds after the kernel picked it up), and all subsequent
reboots (even when disconnecting the failed and/or all but one drive(s))
make the box hang indefinitely upon POSTing and scanning the SATA
controller. My guess is that the board/controller is fried.

> [...] "--assemble" is safe in all known cases.  Use it first.  With 
> the whole controller gone, you probably have consistent event counts 
> after all, and --assemble should just work.  "--assemble --force" is 
> somewhat less safe, but I wouldn't hesitate to use it in a situation 
> where the drives truly dropped out together.  You'll likely find some
> problems with fsck if files were actively being written when the 
> array dropped out, but the vast majority of your filesystem(s) should
> be safe.

Thanks, I will try that as soon as I can get my hands onto a machine
with enough free SATA ports - I might have to replace the whole system
(at least board, CPU and RAM) and will have to do some research before
settling for specific hardware. I can do without that part of my data
for a few days, probably even weeks, but losing it forever would be hard
to swallow still.

> Other procedures are progressively less safe.  I prefer to not offer 
> specifics until you've hooked your drives back up, and generated 
> fresh "lsdrv" and "mdadm" reports.

I promise I'll get back to the list if --assemble doesn't do its deed
right away once I got a system put together that can handle all the
array's member devices.

Again, thank you very much for your time and sharing your expertise!

-- 
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www: http://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp: johannes@truschnigg.info

Please do not bother me with HTML-eMail or attachments. Thank you.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 262 bytes --]


* Re: What just happened to my disks/RAID5 array?
  2011-09-14 18:17       ` Johannes Truschnigg
@ 2011-09-14 19:19         ` Phil Turmel
  2012-01-06 10:51           ` Johannes Truschnigg
  0 siblings, 1 reply; 12+ messages in thread
From: Phil Turmel @ 2011-09-14 19:19 UTC (permalink / raw)
  To: Johannes Truschnigg; +Cc: linux-raid

On 09/14/2011 02:17 PM, Johannes Truschnigg wrote:
> Hello again Phil (and of course also possible bystanders :))!

[...]

> I promise I'll get back to the list if --assemble doesn't do its deed
> right away once I got a system put together that can handle all the
> array's member devices.

OK.

> Again, thank you very much for your time and sharing your expertise!

You're welcome.

Phil


* Re: What just happened to my disks/RAID5 array?
  2011-09-14 19:19         ` Phil Turmel
@ 2012-01-06 10:51           ` Johannes Truschnigg
  2012-01-06 13:16             ` Phil Turmel
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Truschnigg @ 2012-01-06 10:51 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid


[-- Attachment #1.1: Type: text/plain, Size: 3629 bytes --]

Hello again Phil and everyone else who's having a peek,

you see, I finally had the chance to migrate all the disks to a new machine,
and figured I'd try my luck at getting back the data on my precious array.
It's been a while since I had access to it, but having that data available all
the time is not as important as having it at all, as I use the box mostly to
store old(er) backups. I definitely would like to have them back at some point
in time, however ;)

So yesterday, I upgraded all the software on the boot drive (running Gentoo),
and now I have Kernel 3.2.0 and mdadm 3.1.5, and all the drives attached to an
AMD SB850 in AHCI mode. Drive-wise, everything looks as expected - all device
nodes are there, fdisk reports the correct size, and SMART data can be read
w/o problems. Assembling the array, however, fails, and I promised in a
previous mail in this thread that I would come back to the list and post the
info I got before venturing forth. Well, here I am now:

I have the array in stopped state, so /proc/mdstat contains no arrays at this
time. Now I run the following command which yields this output:

--- snip ---
# mdadm -v --assemble -u "19e260e6:db3cad86:0541487d:a1bae605" /dev/md0 
mdadm: looking for devices for /dev/md0
mdadm: cannot open device /dev/sda1: Device or resource busy
mdadm: /dev/sda1 has wrong uuid.
mdadm: cannot open device /dev/sda: Device or resource busy
mdadm: /dev/sda has wrong uuid.
mdadm: /dev/sdf is identified as a member of /dev/md0, slot 3.
mdadm: /dev/sde is identified as a member of /dev/md0, slot 1.
mdadm: /dev/sdd is identified as a member of /dev/md0, slot 2.
mdadm: /dev/sdc is identified as a member of /dev/md0, slot 4.
mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0.
mdadm: added /dev/sdb to /dev/md0 as 0
mdadm: added /dev/sdd to /dev/md0 as 2
mdadm: added /dev/sdf to /dev/md0 as 3
mdadm: added /dev/sdc to /dev/md0 as 4
mdadm: added /dev/sde to /dev/md0 as 1
mdadm: /dev/md0 assembled from 2 drives - not enough to start the array.
--- snip ---


It seems that mdadm would be able to identify all five original components of
my array, but later decides that it found only two of them, and therefore
can't start the array. /proc/mdstat, at this point in time, shows the
following:

--- snip ---
md0 : inactive sde[1](S) sdc[5](S) sdf[3](S) sdd[2](S) sdb[0](S)
      7325687800 blocks super 1.2
--- snip ---

The (S) should indicate the component being marked as "spare", right?
(mdstat really should have a manpage with a short overview of the most
commonly observed abbreviations, symbols and terms - I guess I'll
volunteer if you don't tell me that's already documented somewhere.)

Shall I just try "-A --force", which is supposed to kick the array enough to
start again? Or is there anything else you could and would recommend before
resorting to that?

One thing I forgot to mention is that I cannot guarantee that the order of the
drives is still the same as it was in the old box (device node names for the
component disks could have changed), but I'm convinced that's not a problem
and I mention it only for the sake of completeness.

I have attached a file with the output of `mdadm -E` for each of the
components for your viewing pleasure - thanks in advance for anyone's time and
effort who's looking into this!

-- 
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   http://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp:  johannes@truschnigg.info

Please do not bother me with HTML-eMail or attachments. Thank you.

[-- Attachment #1.2: mdadm-examine-disks.txt --]
[-- Type: text/plain, Size: 4447 bytes --]

# mdadm -E /dev/sdb
/dev/sdb:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 19e260e6:db3cad86:0541487d:a1bae605
           Name : virtue:0  (local to host virtue)
  Creation Time : Tue Dec 21 10:25:32 2010
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 2930275120 (1397.26 GiB 1500.30 GB)
     Array Size : 11721097216 (5589.05 GiB 6001.20 GB)
  Used Dev Size : 2930274304 (1397.26 GiB 1500.30 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : a1a06197:c5a7727d:5f527b15:01941ba2

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Sep 11 20:11:23 2011
       Checksum : 29ad30c - correct
         Events : 3926

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAAAA ('A' == active, '.' == missing)


# mdadm -E /dev/sdc
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 19e260e6:db3cad86:0541487d:a1bae605
           Name : virtue:0  (local to host virtue)
  Creation Time : Tue Dec 21 10:25:32 2010
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 2930275120 (1397.26 GiB 1500.30 GB)
     Array Size : 11721097216 (5589.05 GiB 6001.20 GB)
  Used Dev Size : 2930274304 (1397.26 GiB 1500.30 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 67a52c17:6f69b41c:696ce995:5b845991

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Sep 11 20:11:23 2011
       Checksum : 9153a3e4 - correct
         Events : 3926

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 4
   Array State : AAAAA ('A' == active, '.' == missing)


# mdadm -E /dev/sdd
/dev/sdd:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 19e260e6:db3cad86:0541487d:a1bae605
           Name : virtue:0  (local to host virtue)
  Creation Time : Tue Dec 21 10:25:32 2010
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 2930275120 (1397.26 GiB 1500.30 GB)
     Array Size : 11721097216 (5589.05 GiB 6001.20 GB)
  Used Dev Size : 2930274304 (1397.26 GiB 1500.30 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 697750ec:0391a119:1bac7a98:1cb374d6

Internal Bitmap : 8 sectors from superblock
    Update Time : Sun Sep 11 20:11:23 2011
       Checksum : ab110beb - correct
         Events : 3926

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAAAA ('A' == active, '.' == missing)


# mdadm -E /dev/sde
/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 19e260e6:db3cad86:0541487d:a1bae605
           Name : virtue:0  (local to host virtue)
  Creation Time : Tue Dec 21 10:25:32 2010
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 2930275120 (1397.26 GiB 1500.30 GB)
     Array Size : 11721097216 (5589.05 GiB 6001.20 GB)
  Used Dev Size : 2930274304 (1397.26 GiB 1500.30 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : bd1fc7fb:00ed1072:1fd7d01a:415255a0

Internal Bitmap : 8 sectors from superblock
    Update Time : Tue Sep 13 06:07:24 2011
       Checksum : 5f2bb793 - correct
         Events : 3929

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : .A.A. ('A' == active, '.' == missing)


# mdadm -E /dev/sdf
/dev/sdf:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x1
     Array UUID : 19e260e6:db3cad86:0541487d:a1bae605
           Name : virtue:0  (local to host virtue)
  Creation Time : Tue Dec 21 10:25:32 2010
     Raid Level : raid5
   Raid Devices : 5

 Avail Dev Size : 2930275120 (1397.26 GiB 1500.30 GB)
     Array Size : 11721097216 (5589.05 GiB 6001.20 GB)
  Used Dev Size : 2930274304 (1397.26 GiB 1500.30 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 2a781a8a:ed3a6a97:29df74d2:bfcbe831

Internal Bitmap : 8 sectors from superblock
    Update Time : Tue Sep 13 06:07:24 2011
       Checksum : 5c0fdf77 - correct
         Events : 3929

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : .A.A. ('A' == active, '.' == missing)

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]


* Re: What just happened to my disks/RAID5 array?
  2012-01-06 10:51           ` Johannes Truschnigg
@ 2012-01-06 13:16             ` Phil Turmel
  2012-01-06 13:46               ` Johannes Truschnigg
  0 siblings, 1 reply; 12+ messages in thread
From: Phil Turmel @ 2012-01-06 13:16 UTC (permalink / raw)
  To: Johannes Truschnigg; +Cc: linux-raid

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Good Morning, Johannes,

On 01/06/2012 05:51 AM, Johannes Truschnigg wrote:
> Hello again Phil and everyone else who's having a peek,
> 
> you see, I finally had the chance to migrate all the disks to a new machine,
> and figured I'd try my luck at getting back the data on my precious array.
> It's been a while since I had access to it, but having that data available all
> the time is not as important as having it at all, as I use the box mostly to
> store old(er) backups. I definitely would like to have them back at some point
> in time, however ;)
> 
> So yesterday, I upgraded all the software on the boot drive (running Gentoo),
> and now I have Kernel 3.2.0 and mdadm 3.1.5, and all the drives attached to an
> AMD SB850 in AHCI mode. Drive-wise, everything looks as expected - all device
> nodes are there, fdisk reports the correct size, and SMART data can be read
> w/o problems. Assembling the array, however, fails, and I promised in a
> previous mail in this thread that I would come back to the list and post the
> info I got before venturing forth. Well, here I am now:

Warning!  I saw a bug report on LKML yesterday involving LVM and the brand new
kernel v3.2, so you might want to pull back.  v3.1.5 was known good in that
report.

> I have the array in stopped state, so /proc/mdstat contains no arrays at this
> time. Now I run the following command which yields this output:
> 
> --- snip ---
> # mdadm -v --assemble -u "19e260e6:db3cad86:0541487d:a1bae605" /dev/md0 
> mdadm: looking for devices for /dev/md0
> mdadm: cannot open device /dev/sda1: Device or resource busy
> mdadm: /dev/sda1 has wrong uuid.
> mdadm: cannot open device /dev/sda: Device or resource busy
> mdadm: /dev/sda has wrong uuid.

I'm guessing that /dev/sda contains your boot and root filesystems, and that
this isn't an error.

> mdadm: /dev/sdf is identified as a member of /dev/md0, slot 3.
> mdadm: /dev/sde is identified as a member of /dev/md0, slot 1.
> mdadm: /dev/sdd is identified as a member of /dev/md0, slot 2.
> mdadm: /dev/sdc is identified as a member of /dev/md0, slot 4.
> mdadm: /dev/sdb is identified as a member of /dev/md0, slot 0.
> mdadm: added /dev/sdb to /dev/md0 as 0
> mdadm: added /dev/sdd to /dev/md0 as 2
> mdadm: added /dev/sdf to /dev/md0 as 3
> mdadm: added /dev/sdc to /dev/md0 as 4
> mdadm: added /dev/sde to /dev/md0 as 1
> mdadm: /dev/md0 assembled from 2 drives - not enough to start the array.
> --- snip ---

Those slot numbers are *really* important.

> It seems that mdadm would be able to identify all five original components of
> my array, but later decides that it found only two of them, and therefore
> can't start the array. /proc/mdstat, at this point in time, shows the
> following:
> 
> --- snip ---
> md0 : inactive sde[1](S) sdc[5](S) sdf[3](S) sdd[2](S) sdb[0](S)
>       7325687800 blocks super 1.2
> --- snip ---
> 
> The (S) should indicate the component being marked as "spare", right?
> (mdstat really should have a manpage with a short overview of the most
> commonly observed abbreviations, symbols and terms - I guess I'll
> volunteer if you don't tell me that's already documented somewhere.)
> 
> Shall I just try "-A --force", which is supposed to kick the array enough to
> start again? Or is there anything else you could and would recommend before
> resorting to that?

Yes, --assemble --force.
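Spelled out as commands, that would be roughly the following - a sketch only, not a verified transcript; the UUID is the one from the earlier assemble attempt in this thread:

```shell
# Stop the inactive, half-assembled array first, then force assembly,
# and verify the result before mounting anything on top of it.
mdadm --stop /dev/md0
mdadm --assemble --force -u 19e260e6:db3cad86:0541487d:a1bae605 /dev/md0
cat /proc/mdstat
mdadm -D /dev/md0
```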

> One thing I forgot to mention is that I cannot guarantee that the order of the
> drives is still the same as it was in the old box (device node names for the
> component disks could have changed), but I'm convinced that's not a problem
> and I mention it only for the sake of completeness.

May I suggest getting an lsdrv [1] report, which will give you the serial numbers
of your disks versus the device assignments, for later reference.  And again
after it's all running, for completeness.
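A minimal fallback for the same serial-number record, assuming smartmontools is installed (a sketch; the device range is taken from this thread's setup):

```shell
# Map each array member's device node to its drive serial number,
# so the physical drives can be identified even if node names shift.
for d in /dev/sd[b-f]; do
    printf '%s: ' "$d"
    smartctl -i "$d" | awk -F': *' '/Serial Number/ { print $2 }'
done
```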

> I have attached a file with the output of `mdadm -E` for each of the
> components for your viewing pleasure - thanks in advance for anyone's time and
> effort who's looking into this!

HTH,

Phil

[1] http://github.com/pturmel/lsdrv
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk8G9A0ACgkQBP+iHzflm3BlUACcCoUX1YdI0vM/GmNITIRAXz5q
EsIAn3FDUd92X4CG8YPNWEpc/2AC/icG
=R2R2
-----END PGP SIGNATURE-----


* Re: What just happened to my disks/RAID5 array?
  2012-01-06 13:16             ` Phil Turmel
@ 2012-01-06 13:46               ` Johannes Truschnigg
  2012-01-06 14:51                 ` Phil Turmel
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Truschnigg @ 2012-01-06 13:46 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid


[-- Attachment #1.1: Type: text/plain, Size: 1615 bytes --]

Good morning Phil,

thanks for tuning in again! :)

On Fri, Jan 06, 2012 at 08:16:00AM -0500, Phil Turmel wrote:
> Warning!  I saw a bug report on LKML yesterday involving LVM and the brand new
> kernel v3.2, so you might want to pull back.  v3.1.5 was known good in that
> report.

Ok, thanks for the warning - will try to find the thread where the bug is
described, and consider downgrading to 3.1.x instead!

> I'm guessing that /dev/sda contains your boot and root filesystems, and that
> this isn't an error.

You are correct, sorry I did not mention that initially. These devices are not
supposed to end up as parts of any md arrays.

> Those slot numbers are *really* important.

What is the significance of the individual slot numbers there?


> Yes, --assemble --force.

Ok, will fire that command as soon as I checked out the LVM bug you mentioned.

> May I suggest getting an lsdrv [1] report, which will give you the serial numbers
> of your disks versus the device assignments, for later reference.  And again
> after it's all running, for completeness.

I have attached a file with the relevant lsdrv output (great little program,
btw) to this message - should I expect changes to it after my array is up
(well, except for the components no longer being reported as spares, of course)?

> HTH,

It sure did - thanks a bunch! :)

-- 
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   http://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp:  johannes@truschnigg.info

Please do not bother me with HTML-eMail or attachments. Thank you.

[-- Attachment #1.2: lsdrv-output.txt --]
[-- Type: text/plain, Size: 1899 bytes --]

# ./lsdrv 
PCI [pata_via] 04:00.0 IDE interface: VIA Technologies, Inc. VT6415 PATA IDE Host Controller
├scsi 0:0:0:0 ATA TRANSCEND {20090625_D40D51BB}
│└sda 1.87g [8:0] Partitioned (dos)
│ └sda1 1.87g [8:1] ext2 'VIRTUE' {ff586bcd-b1fd-4c08-a0ea-08e2e1c7b8f9}
│  └Mounted as /dev/root @ /
└scsi 1:x:x:x [Empty]
PCI [ahci] 00:11.0 SATA controller: Advanced Micro Devices [AMD] nee ATI SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40)
├scsi 2:0:0:0 ATA SAMSUNG HD154UI {S1XWJDWS803396}
│└sdb 1.36t [8:16] MD raid5 (none/5) (w/ sdd,sdf,sdc,sde) spare 'virtue:0' {19e260e6-db3c-ad86-0541-487da1bae605}
│ └md0 0.00k [9:0] MD vNone None (None) inactive, None (None) 0/sec
│                  Empty/Unknown
├scsi 3:0:0:0 ATA SAMSUNG HD154UI {S1XWJDWZ419775}
│└sdc 1.36t [8:32] MD raid5 (none/5) (w/ sdb,sdd,sdf,sde) spare 'virtue:0' {19e260e6-db3c-ad86-0541-487da1bae605}
│ └md0 0.00k [9:0] MD vNone None (None) inactive, None (None) 0/sec
│                  Empty/Unknown
├scsi 4:0:0:0 ATA SAMSUNG HD154UI {S1XWJDWS803467}
│└sdd 1.36t [8:48] MD raid5 (none/5) (w/ sdb,sdf,sdc,sde) spare 'virtue:0' {19e260e6-db3c-ad86-0541-487da1bae605}
│ └md0 0.00k [9:0] MD vNone None (None) inactive, None (None) 0/sec
│                  Empty/Unknown
├scsi 5:0:0:0 ATA SAMSUNG HD154UI {S1XWJDWS803469}
│└sde 1.36t [8:64] MD raid5 (none/5) (w/ sdb,sdd,sdf,sdc) spare 'virtue:0' {19e260e6-db3c-ad86-0541-487da1bae605}
│ └md0 0.00k [9:0] MD vNone None (None) inactive, None (None) 0/sec
│                  Empty/Unknown
├scsi 6:0:0:0 ATA SAMSUNG HD154UI {S1XWJDWS803405}
│└sdf 1.36t [8:80] MD raid5 (none/5) (w/ sdb,sdd,sdc,sde) spare 'virtue:0' {19e260e6-db3c-ad86-0541-487da1bae605}
│ └md0 0.00k [9:0] MD vNone None (None) inactive, None (None) 0/sec
│                  Empty/Unknown
└scsi 7:x:x:x [Empty]

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]


* Re: What just happened to my disks/RAID5 array?
  2012-01-06 13:46               ` Johannes Truschnigg
@ 2012-01-06 14:51                 ` Phil Turmel
  2012-01-06 15:28                   ` Johannes Truschnigg
  0 siblings, 1 reply; 12+ messages in thread
From: Phil Turmel @ 2012-01-06 14:51 UTC (permalink / raw)
  To: Johannes Truschnigg; +Cc: linux-raid

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/06/2012 08:46 AM, Johannes Truschnigg wrote:
> Good morning Phil,
> 
> thanks for tuning in again! :)
> 
> On Fri, Jan 06, 2012 at 08:16:00AM -0500, Phil Turmel wrote:
>> Warning!  I saw a bug report on LKML yesterday involving LVM and the brand new
>> kernel v3.2, so you might want to pull back.  v3.1.5 was known good in that
>> report.
> 
> Ok, thanks for the warning - will try to find the thread where the bug is
> described, and consider downgrading to 3.1.x instead!

https://lkml.org/lkml/2012/1/5/76

Note that it involves ext4 and LVM snapshots, so may not apply to you.  But the
bug hasn't been clearly identified, so my paranoid approach to kernels would
keep me off of it. (On production systems, of course.  I regularly run -rc
kernels on my laptop.)

>> I'm guessing that /dev/sda contains your boot and root filesystems, and that
>> this isn't an error.
> 
> You are correct, sorry I did not mention that initially. These devices are not
> supposed to end up as parts of any md arrays.
> 
>> Those slot numbers are *really* important.
> 
> What is the significance of the individual slot numbers there?

You would need them if you ever ran into some catastrophic problem where
"--create --assume-clean" was needed.
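To illustrate why: the slot numbers fix the device order that a last-resort re-creation would have to reproduce exactly. With the slots reported earlier in this thread (0=sdb, 1=sde, 2=sdd, 3=sdf, 4=sdc) and the parameters from the mdadm -E dumps, such a command would look roughly like this - emphatically a sketch, and not something to run unless forced assembly has failed and the data is otherwise lost:

```shell
# LAST RESORT ONLY: re-create the array metadata in place, telling md
# the data is already consistent.  Device order MUST match the original
# slot assignments, and level/chunk/layout/metadata MUST match the old
# superblocks, or the data will be scrambled.
mdadm --create /dev/md0 --assume-clean \
      --metadata=1.2 --level=5 --raid-devices=5 \
      --chunk=512 --layout=left-symmetric \
      /dev/sdb /dev/sde /dev/sdd /dev/sdf /dev/sdc
```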

>> Yes, --assemble --force.
> 
> Ok, will fire that command as soon as I checked out the LVM bug you mentioned.

Actually, given the /proc/mdstat contents you reported, you might be able to just
do "mdadm --run /dev/md0"

>> May I suggest getting an lsdrv [1] report, which will give you the serial numbers
>> of your disks versus the device assignments, for later reference.  And again
>> after it's all running, for completeness.
> 
> I have attached a file with the relevant lsdrv (great little program btw) to
> this message - should I expect changes to it after my array is up (well,
> except for the components not to be recognized as spare, of course)?

The output will change substantially, as it attempts to map the entire
storage tree, reporting all serials, uuids, and labels, along with other
useful information.  I have a number of future enhancements in mind, but
it's a spare-time project.  I'm glad you like it, though.

You don't need the report for this incident, though.  Just stick it with your
backups.  And make a new one any time you change your setup.

Phil
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.18 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk8HCnIACgkQBP+iHzflm3BN/gCfawVFbIvdeHhC9Z1BkTu6VcWu
7VMAn3SmwyK3WCdF0Fl+z1QNSCMrWu63
=sYlI
-----END PGP SIGNATURE-----


* Re: What just happened to my disks/RAID5 array?
  2012-01-06 14:51                 ` Phil Turmel
@ 2012-01-06 15:28                   ` Johannes Truschnigg
  2012-01-07 14:23                     ` John Robinson
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Truschnigg @ 2012-01-06 15:28 UTC (permalink / raw)
  To: Phil Turmel; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 2419 bytes --]

On Fri, Jan 06, 2012 at 09:51:32AM -0500, Phil Turmel wrote:
> [...]
> https://lkml.org/lkml/2012/1/5/76
> 
> Note that it involves ext4 and LVM snapshots, so may not apply to you.  But the
> bug hasn't been clearly identified, so my paranoid approach to kernels would
> keep me off of it. (On production systems, of course.  I regularly run -rc
> kernels on my laptop.)

I arrived at the same conclusion, and decided to stick with 3.2.0 for now.

> [...]
> You would need them if you ever ran into some catastrophic problem where
> "--create --assume-clean" was needed.

Thanks, I will keep that information in a secure place then.

> [...]
> Actually, given the /proc/mdstat contents you reported, you might be able to just
> do "mdadm --run /dev/md0"

Probably, yes. I stopped the array, --assemble --force'd it, and everything
was up again. The LVM metadata proved consistent, and a read-only fsck of the
filesystem on top of the only LV defined on the array told me that there were
some (very) minor fs inconsistencies. They were corrected by a second fsck
run, and everything seems perfectly fine again :) I don't keep checksums of
the inodes' data, but I tested with a bunch of archives stored in the fs, and
there were no CRC errors or the like to be found.
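For the archives, the recovery sequence described above boils down to roughly
the following. (A sketch only: the md member list is taken from the mdstat
posted earlier in this thread and may be incomplete, and the VG/LV names are
placeholders - substitute your own. Run the read-only fsck first and inspect
its output before letting fsck write anything.)

```shell
# Stop the failed-out array, then force-assemble it from its members.
# Member names (sdd/sde/sdf) are from the earlier mdstat; adjust to taste.
mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sdd /dev/sde /dev/sdf

# Re-scan LVM metadata and activate the volume group
# ("myvg"/"mylv" are placeholder names).
vgscan
vgchange -ay myvg

# Check the filesystem read-only first; repair only if damage looks minor.
fsck -n /dev/myvg/mylv
fsck /dev/myvg/mylv
```

Phil's simpler suggestion (`mdadm --run /dev/md0`) would likely have worked
too, since the array was merely marked failed rather than actually degraded.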

> The output will change substantially, as it attempts to map the entire
> storage tree, reporting all serials, uuids, and labels, along with other
> useful information.  I have a number of future enhancements in mind, but
> it's a spare-time project.  I'm glad you like it, though.

I noticed it did, yes - it's even more nifty than I thought ;) One thing I
noticed: it dies with an uncaught exception (sorry, I had noted the traceback
down in a file in /tmp, but lost it when rebooting) if an LVM VG/LV shows up
in `lvs` that isn't "active", i.e. has no corresponding device nodes in /dev;
you may want to look into that some time.

> You don't need the report for this incident, though.  Just stick it with your
> backups.  And make a new one any time you change your setup.

Will do. Thanks again for your guidance and support! Have a nice day!

-- 
with best regards:
- Johannes Truschnigg ( johannes@truschnigg.info )

www:   http://johannes.truschnigg.info/
phone: +43 650 2 133337
xmpp:  johannes@truschnigg.info

Please do not bother me with HTML-eMail or attachments. Thank you.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]


* Re: What just happened to my disks/RAID5 array?
  2012-01-06 15:28                   ` Johannes Truschnigg
@ 2012-01-07 14:23                     ` John Robinson
  0 siblings, 0 replies; 12+ messages in thread
From: John Robinson @ 2012-01-07 14:23 UTC (permalink / raw)
  To: Johannes Truschnigg; +Cc: linux-raid

On 06/01/2012 15:28, Johannes Truschnigg wrote:
> On Fri, Jan 06, 2012 at 09:51:32AM -0500, Phil Turmel wrote:
>> [...]
>> You would need them if you ever ran into some catastrophic problem where
>> "--create --assume-clean" was needed.
>
> Thanks, I will keep that information in a secure place then.

You already did - there are now thousands of copies of it on the 'net, 
including dozens of web archives of this list :-)

Cheers,

John.



end of thread, other threads:[~2012-01-07 14:23 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-13  8:27 What just happened to my disks/RAID5 array? Johannes Truschnigg
2011-09-13 11:37 ` Phil Turmel
2011-09-13 18:56   ` Johannes Truschnigg
2011-09-14 11:41     ` Phil Turmel
2011-09-14 18:17       ` Johannes Truschnigg
2011-09-14 19:19         ` Phil Turmel
2012-01-06 10:51           ` Johannes Truschnigg
2012-01-06 13:16             ` Phil Turmel
2012-01-06 13:46               ` Johannes Truschnigg
2012-01-06 14:51                 ` Phil Turmel
2012-01-06 15:28                   ` Johannes Truschnigg
2012-01-07 14:23                     ` John Robinson
