* Reconstruct a RAID 6 that has failed in a non typical manner [not found] <404650428.13997384.1446132658661.JavaMail.zimbra@inria.fr> @ 2015-10-29 15:59 ` Clement Parisot 2015-10-30 18:31 ` Phil Turmel 0 siblings, 1 reply; 8+ messages in thread From: Clement Parisot @ 2015-10-29 15:59 UTC (permalink / raw) To: linux-raid Hi everyone, we've got a problem with our old RAID 6. root@ftalc2.nancy.grid5000.fr(physical):~# uname -a Linux ftalc2.nancy.grid5000.fr 2.6.32-5-amd64 #1 SMP Mon Sep 23 22:14:43 UTC 2013 x86_64 GNU/Linux root@ftalc2.nancy.grid5000.fr(physical):~# cat /etc/debian_version 6.0.8 root@ftalc2.nancy.grid5000.fr(physical):~# mdadm -V mdadm - v3.1.4 - 31st August 2010 After electrical maintenance, 2 of our HDDs went into a failed state. An alert was sent that said everything was reconstructing. g5kadmin@ftalc2.nancy.grid5000.fr(physical):~$ cat /proc/mdstat Personalities : [raid1] [raid6] [raid5] [raid4] md2 : active raid6 sda[0] sdp[15] sdo[14] sdn[13] sdm[12] sdl[11] sdk[18] sdj[9] sdi[8] sdh[16] sdg[6] sdf[5] sde[4] sdd[17] sdc[2] sdb[1](F) 13666978304 blocks super 1.2 level 6, 128k chunk, algorithm 2 [16/15] [U_UUUUUUUUUUUUUU] [>....................] resync = 0.0% (916936/976212736) finish=16851.9min speed=964K/sec md1 : active raid1 sdq2[0] sdr2[2] 312276856 blocks super 1.2 [2/2] [UU] [===>.................] resync = 18.4% (57566208/312276856) finish=83.2min speed=50956K/sec md0 : active raid1 sdq1[0] sdr1[2] 291828 blocks super 1.2 [2/2] [UU] unused devices: <none> md1 reconstruction works, but md2 failed as a 3rd HDD seems to be broken. A new disk has been successfully added to replace a failed one. All of the disks of md2 changed to Spare state. We rebooted the server but it was worse. The mdadm --detail command shows that 13 disks are left in the array and 3 are removed. /dev/md2: Version : 1.2 Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Used Dev Size : 976212736 (930.99 GiB 999.64 GB) Raid Devices : 16 Total Devices : 13 Persistence : Superblock is persistent Update Time : Wed Oct 28 13:46:13 2015 State : active, FAILED, Not Started Active Devices : 13 Working Devices : 13 Failed Devices : 0 Spare Devices : 0 Layout : left-symmetric Chunk Size : 128K Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 Events : 5834052 Number Major Minor RaidDevice State 0 0 0 0 removed 1 0 0 1 removed 2 8 16 2 active sync /dev/sdb 17 8 32 3 active sync /dev/sdc 4 8 48 4 active sync /dev/sdd 5 8 64 5 active sync /dev/sde 6 0 0 6 removed 16 8 96 7 active sync /dev/sdg 8 8 112 8 active sync /dev/sdh 9 8 128 9 active sync /dev/sdi 18 8 144 10 active sync /dev/sdj 11 8 160 11 active sync /dev/sdk 13 8 192 13 active sync /dev/sdm 14 8 208 14 active sync /dev/sdn As you can see, the RAID is in "active, FAILED, Not Started" state. We tried to add the new disk and re-add the previously removed disks, as they appear to have no errors. 2/3 of the disks should still contain the data, and we want to recover it. But there is a problem: devices /dev/sda and /dev/sdf can't be re-added: mdadm: failed to add /dev/sda to /dev/md/2: Device or resource busy mdadm: failed to add /dev/sdf to /dev/md/2: Device or resource busy mdadm: /dev/md/2 assembled from 13 drives and 1 spare - not enough to start the array. I tried the procedure from the RAID_Recovery wiki: mdadm --assemble --force /dev/md2 /dev/sda /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp but it failed.
mdadm: failed to add /dev/sdg to /dev/md2: Device or resource busy mdadm: failed to RUN_ARRAY /dev/md2: Input/output error mdadm: Not enough devices to start the array. Any help or tips on how to better diagnose the situation or solve it would be highly appreciated :-) Thanks in advance, Best regards, Clément and Marc -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Reconstruct a RAID 6 that has failed in a non typical manner 2015-10-29 15:59 ` Reconstruct a RAID 6 that has failed in a non typical manner Clement Parisot @ 2015-10-30 18:31 ` Phil Turmel 2015-11-05 10:35 ` Clement Parisot 0 siblings, 1 reply; 8+ messages in thread From: Phil Turmel @ 2015-10-30 18:31 UTC (permalink / raw) To: Clement Parisot, linux-raid Good afternoon, Clement, Marc, On 10/29/2015 11:59 AM, Clement Parisot wrote: > we've got a problem with our old RAID 6. > After electrical maintenance, 2 of our HDDs went into a failed state. An alert was sent that said everything was reconstructing. > md1 reconstruction works, but md2 failed as a 3rd HDD seems to be broken. A new disk has been successfully added to replace a failed one. > All of the disks of md2 changed to Spare state. We rebooted the server but it was worse. > As you can see, the RAID is in "active, FAILED, Not Started" state. We tried to add the new disk and re-add the previously removed disks, as they appear to have no errors. > 2/3 of the disks should still contain the data, and we want to recover it. Your subject is inaccurate. You've described a situation that is extraordinarily common when using green drives. Or any modern desktop drive -- they aren't rated for use in raid arrays. Please read the references in the post-script. > I tried the procedure from the RAID_Recovery wiki: > mdadm --assemble --force /dev/md2 /dev/sda /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp > but it failed. > mdadm: failed to add /dev/sdg to /dev/md2: Device or resource busy > mdadm: failed to RUN_ARRAY /dev/md2: Input/output error > mdadm: Not enough devices to start the array. Did you run "mdadm --stop /dev/md2" first? That would explain the "busy" reports. Before proceeding, please supply more information: for x in /dev/sd[a-p] ; do mdadm -E $x ; smartctl -i -A -l scterc $x ; done Paste the output inline in your response. Phil [1] http://marc.info/?l=linux-raid&m=139050322510249&w=2 [2] http://marc.info/?l=linux-raid&m=135863964624202&w=2 [3] http://marc.info/?l=linux-raid&m=135811522817345&w=1 [4] http://marc.info/?l=linux-raid&m=133761065622164&w=2 [5] http://marc.info/?l=linux-raid&m=132477199207506 [6] http://marc.info/?l=linux-raid&m=133665797115876&w=2 [7] http://marc.info/?l=linux-raid&m=142487508806844&w=3 [8] http://marc.info/?l=linux-raid&m=144535576302583&w=2 ^ permalink raw reply [flat|nested] 8+ messages in thread
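For readers following the thread, the stop-and-retry sequence Phil is hinting at looks roughly like this. It is only a sketch, and the member list has to be adapted to the array actually being recovered:

  # release the members still claimed by the half-assembled, not-started array
  mdadm --stop /dev/md2
  # then retry the forced assembly with the members believed to hold data
  mdadm --assemble --force --verbose /dev/md2 /dev/sd[a-p]

The "Device or resource busy" errors quoted above are what appears when the old, never-stopped array is still holding some of its member devices.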
* Re: Reconstruct a RAID 6 that has failed in a non typical manner 2015-10-30 18:31 ` Phil Turmel @ 2015-11-05 10:35 ` Clement Parisot 2015-11-05 13:34 ` Phil Turmel 0 siblings, 1 reply; 8+ messages in thread From: Clement Parisot @ 2015-11-05 10:35 UTC (permalink / raw) To: Phil Turmel; +Cc: linux-raid Hello, First of all, thanks for your answer. Here is an update of what we did: We were surprised to see two drives that were announced in 'failed' state back in 'working order' after a reboot. At least they were not considered in failed state anymore. So we tried something a bit tricky. We removed the drive we changed and re-introduced the old one (supposed to be broken). Thanks to this, we were able to re-create the array with "mdadm --assemble --force /dev/md2", restart the volume group and mount read-only the logical volume. Sadly, trying to rsync data into a safer place, most of it failed with I/O errors, often ending up killing the array. We still have two drives that were not physically removed, and so theoretically still contain data, but they appear as spares in mdadm --examine, probably because of the 're-add' attempt we made. > Your subject is inaccurate. You've described a situation that is > extraordinarily common when using green drives. Or any modern desktop > drive -- they aren't rated for use in raid arrays. Please read the > references in the post-script. After reading your links, it seems that indeed, the situation we are experiencing is what is described in link [3] or link [6]. > Did you run "mdadm --stop /dev/md2" first? That would explain the > "busy" reports. Yes, we did. This is why the 'busy' is surprising. It seems to come from the drives: # mdadm --verbose --assemble /dev/md2 [...] mdadm: /dev/sdp is identified as a member of /dev/md2, slot 15. mdadm: /dev/sdo is identified as a member of /dev/md2, slot 14. mdadm: /dev/sdn is identified as a member of /dev/md2, slot 13. mdadm: /dev/sdm is identified as a member of /dev/md2, slot 12. mdadm: /dev/sdl is identified as a member of /dev/md2, slot 11. mdadm: /dev/sdk is identified as a member of /dev/md2, slot 10. mdadm: /dev/sdj is identified as a member of /dev/md2, slot 9. mdadm: /dev/sdi is identified as a member of /dev/md2, slot 8. mdadm: /dev/sdh is identified as a member of /dev/md2, slot 7. mdadm: /dev/sdg is identified as a member of /dev/md2, slot -1. mdadm: /dev/sdf is identified as a member of /dev/md2, slot 5. mdadm: /dev/sde is identified as a member of /dev/md2, slot 4. mdadm: /dev/sdc is identified as a member of /dev/md2, slot 2. mdadm: /dev/sdd is identified as a member of /dev/md2, slot 3. mdadm: /dev/sdb is identified as a member of /dev/md2, slot -1. mdadm: /dev/sda is identified as a member of /dev/md2, slot -1.
mdadm: no uptodate device for slot 0 of /dev/md2 mdadm: no uptodate device for slot 1 of /dev/md2 mdadm: added /dev/sdd to /dev/md2 as 3 mdadm: added /dev/sde to /dev/md2 as 4 mdadm: added /dev/sdf to /dev/md2 as 5 mdadm: no uptodate device for slot 6 of /dev/md2 mdadm: added /dev/sdh to /dev/md2 as 7 mdadm: added /dev/sdi to /dev/md2 as 8 mdadm: added /dev/sdj to /dev/md2 as 9 mdadm: added /dev/sdk to /dev/md2 as 10 mdadm: added /dev/sdl to /dev/md2 as 11 mdadm: added /dev/sdm to /dev/md2 as 12 mdadm: added /dev/sdn to /dev/md2 as 13 mdadm: added /dev/sdo to /dev/md2 as 14 mdadm: added /dev/sdp to /dev/md2 as 15 mdadm: added /dev/sdg to /dev/md2 as -1 mdadm: failed to add /dev/sdb to /dev/md2: Device or resource busy mdadm: failed to add /dev/sda to /dev/md2: Device or resource busy > Before proceeding, please supply more information: > > for x in /dev/sd[a-p] ; mdadm -E $x ; smartctl -i -A -l scterc $x ; done > > Paste the output inline in your response. I couldn't get smartctl to work successfully. The version supported on debian squeeze doesn't support aacraid. I tried from a chroot in a debootstrap with a more recent debian version, but only got: # smartctl --all -d aacraid,0,0,0 /dev/sda smartctl 6.4 2014-10-07 r4002 [x86_64-linux-2.6.32-5-amd64] (local build) Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org Smartctl open device: /dev/sda [aacraid_disk_00_00_0] [SCSI/SAT] failed: INQUIRY [SAT]: aacraid result: 0.0 = 22/0 Here is the output for mdadm -E: $ for x in /dev/sd[a-p] ; do sudo mdadm -E $x ; done /dev/sda: Magic : a92b4efc Version : 1.2 Array UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Raid Devices : 16 Avail Dev Size : 1952425984 (930.99 GiB 999.64 GB) Array Size : 27333956608 (13033.85 GiB 13994.99 GB) Used Dev Size : 1952425472 (930.99 GiB 999.64 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : 27a0fe11:278b30d3:3251ee70:66b015d0 Update Time : Wed Oct 28 13:46:13 2015 Checksum : 5b99bd5 - correct Events : 0 Layout : left-symmetric Chunk Size : 128K Device Role : spare Array State : ..AAAA.AAAAAAAAA ('A' == active, '.' == missing) /dev/sdb: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Raid Devices : 16 Avail Dev Size : 1952425984 (930.99 GiB 999.64 GB) Array Size : 27333956608 (13033.85 GiB 13994.99 GB) Used Dev Size : 1952425472 (930.99 GiB 999.64 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : b58fb9e7:72e48374:44a9862c:5b8de755 Update Time : Wed Nov 4 10:31:19 2015 Checksum : be982cb8 - correct Events : 5834314 Layout : left-symmetric Chunk Size : 128K Device Role : Active device 2 Array State : .AAA.A.AAAAAAAAA ('A' == active, '.' == missing) /dev/sdc: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 :mdadm: No md superblock detected on /dev/sdd. 
Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Raid Devices : 16 Avail Dev Size : 1952425984 (930.99 GiB 999.64 GB) Array Size : 27333956608 (13033.85 GiB 13994.99 GB) Used Dev Size : 1952425472 (930.99 GiB 999.64 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : 1aff07a9:0ac3fa0c:6bb5e685:bac7893e Update Time : Wed Nov 4 10:31:19 2015 Checksum : 5a5fc14a - correct Events : 5834314 Layout : left-symmetric Chunk Size : 128K Device Role : Active device 3 Array State : .AAA.A.AAAAAAAAA ('A' == active, '.' == missing) /dev/sde: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Raid Devices : 16 Avail Dev Size : 1952425984 (930.99 GiB 999.64 GB) Array Size : 27333956608 (13033.85 GiB 13994.99 GB) Used Dev Size : 1952425472 (930.99 GiB 999.64 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : 30bfa9d2:2a483372:5a489324:c2f5f729 Update Time : Wed Nov 4 10:31:19 2015 Checksum : 7354c76b - correct Events : 5834314 Layout : left-symmetric Chunk Size : 128K Device Role : Active device 5 Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Raid Devices : 16 Avail Dev Size : 1952425984 (930.99 GiB 999.64 GB) Array Size : 27333956608 (13033.85 GiB 13994.99 GB) Used Dev Size : 1952425472 (930.99 GiB 999.64 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : 93fd1f09:6ca19143:002a3e5c:17813675 Update Time : Wed Oct 28 13:46:13 2015 Checksum : fdacb903 - correct Events : 0 Layout : left-symmetric Chunk Size : 128K Device Role : spare Array State : ..AAAA.AAAAAAAAA ('A' == active, '.' == missing) /dev/sdg: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Raid Devices : 16 Avail Dev Size : 1952425472 (930.99 GiB 999.64 GB) Array Size : 27333956608 (13033.85 GiB 13994.99 GB) Data Offset : 512 sectors Super Offset : 8 sectors State : clean Device UUID : d656d255:5ece759c:2deca760:3ae659c3 Update Time : Wed Nov 4 10:31:19 2015 Checksum : f636719b - correct Events : 5834314 Layout : left-symmetric Chunk Size : 128K Device Role : Active device 7 Array State : .AAA.A.AAAAAAAAA ('A' == active, '.' == missing) /dev/sdh: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Raid Devices : 16 Avail Dev Size : 1952425984 (930.99 GiB 999.64 GB) Array Size : 27333956608 (13033.85 GiB 13994.99 GB) Used Dev Size : 1952425472 (930.99 GiB 999.64 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : d93661b8:40996a0b:b373cfd8:df0e2bd6 Update Time : Wed Nov 4 10:31:19 2015 Checksum : 52b2d4a4 - correct Events : 5834314 Layout : left-symmetric Chunk Size : 128K Device Role : Active device 8 Array State : .AAA.A.AAAAAAAAA ('A' == active, '.' 
== missing) /dev/sdi: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Raid Devices : 16 Avail Dev Size : 1952425984 (930.99 GiB 999.64 GB) Array Size : 27333956608 (13033.85 GiB 13994.99 GB) Used Dev Size : 1952425472 (930.99 GiB 999.64 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : cf9d8d29:42956b39:79841196:9d3281e4 Update Time : Wed Nov 4 10:31:19 2015 Checksum : bd786c40 - correct Events : 5834314 Layout : left-symmetric Chunk Size : 128K Device Role : Active device 9 Array State : .AAA.A.AAAAAAAAA ('A' == active, '.' == missing) /dev/sdj: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Raid Devices : 16 Avail Dev Size : 1952425984 (930.99 GiB 999.64 GB) Array Size : 27333956608 (13033.85 GiB 13994.99 GB) Used Dev Size : 1952425472 (930.99 GiB 999.64 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : d9ae5754:4b1fffcb:b76d34e4:fed2f192 Update Time : Wed Nov 4 10:31:19 2015 Checksum : 776990dc - correct Events : 5834314 Layout : left-symmetric Chunk Size : 128K Device Role : Active device 10 Array State : .AAA.A.AAAAAAAAA ('A' == active, '.' == missing) /dev/sdk: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Raid Devices : 16 Avail Dev Size : 1952425984 (930.99 GiB 999.64 GB) Array Size : 27333956608 (13033.85 GiB 13994.99 GB) Used Dev Size : 1952425472 (930.99 GiB 999.64 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : e44e950f:09456ec5:35463869:13663a98 Update Time : Wed Nov 4 10:31:19 2015 Checksum : b662c230 - correct Events : 5834314 Layout : left-symmetric Chunk Size : 128K Device Role : Active device 11 Array State : .AAA.A.AAAAAAAAA ('A' == active, '.' == missing) /dev/sdl: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Raid Devices : 16 Avail Dev Size : 1952425984 (930.99 GiB 999.64 GB) Array Size : 27333956608 (13033.85 GiB 13994.99 GB) Used Dev Size : 1952425472 (930.99 GiB 999.64 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : 51b3c930:27332156:535ec2d3:a77cc127 Update Time : Wed Nov 4 10:31:19 2015 Checksum : 625b436e - correct Events : 5834314 Layout : left-symmetric Chunk Size : 128K Device Role : Active device 12 Array State : .AAA.A.AAAAAAAAA ('A' == active, '.' 
== missing) /dev/sdm: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Raid Devices : 16 Avail Dev Size : 1952425984 (930.99 GiB 999.64 GB) Array Size : 27333956608 (13033.85 GiB 13994.99 GB) Used Dev Size : 1952425472 (930.99 GiB 999.64 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : 83fa2210:26f430cf:6ef35e86:13be77c8 Update Time : Wed Nov 4 10:31:19 2015 Checksum : e172228 - correct Events : 5834314 Layout : left-symmetric Chunk Size : 128K Device Role : Active device 13 Array State : .AAA.A.AAAAAAAAA ('A' == active, '.' == missing) /dev/sdn: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Raid Devices : 16 Avail Dev Size : 1952425984 (930.99 GiB 999.64 GB) Array Size : 27333956608 (13033.85 GiB 13994.99 GB) Used Dev Size : 1952425472 (930.99 GiB 999.64 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : 6700962b:ed334ee5:98e00751:79f25fb9 Update Time : Wed Nov 4 10:31:19 2015 Checksum : fb388963 - correct Events : 5834314 Layout : left-symmetric Chunk Size : 128K Device Role : Active device 14 Array State : .AAA.A.AAAAAAAAA ('A' == active, '.' == missing) /dev/sdo: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Raid Devices : 16 Avail Dev Size : 1952425984 (930.99 GiB 999.64 GB) Array Size : 27333956608 (13033.85 GiB 13994.99 GB) Used Dev Size : 1952425472 (930.99 GiB 999.64 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : 9b099832:da80cf49:d62f76d9:7681a6a5 Update Time : Wed Nov 4 10:31:19 2015 Checksum : db70bdc0 - correct Events : 5834314 Layout : left-symmetric Chunk Size : 128K Device Role : Active device 15 Array State : .AAA.A.AAAAAAAAA ('A' == active, '.' == missing) /dev/sdp: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 2d0b91e8:a0b10f4c:3fa285f9:3198a918 Name : ftalc2.nancy.grid5000.fr:2 (local to host ftalc2.nancy.grid5000.fr) Creation Time : Tue Oct 2 16:28:23 2012 Raid Level : raid6 Raid Devices : 16 Avail Dev Size : 1952425984 (930.99 GiB 999.64 GB) Array Size : 27333956608 (13033.85 GiB 13994.99 GB) Used Dev Size : 1952425472 (930.99 GiB 999.64 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : clean Device UUID : df2bcc6a:5d7e060c:6ab4ac39:b11a631f Update Time : Wed Nov 4 10:31:19 2015 Checksum : afcefb47 - correct Events : 5834314 Layout : left-symmetric Chunk Size : 128K Device Role : Active device 1 Array State : .AAA.A.AAAAAAAAA ('A' == active, '.' == missing) Regards, Clément and Marc -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
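The fields worth comparing across the long dump above are the update times, event counts and device roles. A quick way to tabulate them (a sketch, assuming the /dev/sd[a-p] naming still holds) is:

  for x in /dev/sd[a-p]; do
      printf '%-10s ' "$x"
      # keep only the fields that decide whether a forced assembly can succeed
      mdadm -E "$x" | grep -E 'Update Time|Events|Device Role' | tr -s ' ' | tr '\n' ' '
      echo
  done

Members whose event count matches the majority (5834314 here) are candidates for --assemble --force; the two showing an event count of 0 and a 'spare' role are the ones that lost their data role during the failed re-add.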
* Re: Reconstruct a RAID 6 that has failed in a non typical manner 2015-11-05 10:35 ` Clement Parisot @ 2015-11-05 13:34 ` Phil Turmel 2015-11-17 12:30 ` Marc Pinhede 2015-12-21 3:40 ` NeilBrown 0 siblings, 2 replies; 8+ messages in thread From: Phil Turmel @ 2015-11-05 13:34 UTC (permalink / raw) To: Clement Parisot; +Cc: linux-raid Good morning Clément, Marc, On 11/05/2015 05:35 AM, Clement Parisot wrote: > We were surprised to see two drives that were announced in 'failed' > state back in 'working order' after a reboot. At least they were not > considered in failed state anymore. So we tried something a bit > tricky. > We removed the drive we changed and re-introduced the old one > (supposed to be broken). > Thanks to this, we were able to re-create the array with "mdadm > --assemble --force /dev/md2", restart the volume group and mount > read-only the logical volume. Strictly speaking, you didn't re-create the array; you simply re-assembled it. The terminology is important here. Re-creating an array is much more dangerous. > Sadly, trying to rsync data into a safer place, most of it failed > with I/O errors, often ending up killing the array. Yes, with latent Unrecoverable Read Errors, you will need properly working redundancy and no timeout mismatches. I recommend you repeatedly use --assemble --force to restore your array, skip the last file that failed, and continue copying critical files as much as possible. You should at least run this command every reboot until you replace your drives or otherwise script the work-arounds: for x in /sys/block/*/device/timeout ; do echo 180 > $x ; done > We still have two drives that were not physically removed, and so > theoretically still contain data, but they appear as spares in mdadm > --examine, probably because of the 're-add' attempt we made. The only way to activate these, I think, is to re-create your array. That is a last resort after you've copied everything possible with the forced assembly state. >> Your subject is inaccurate. You've described a situation that is >> extraordinarily common when using green drives. Or any modern >> desktop drive -- they aren't rated for use in raid arrays. Please >> read the references in the post-script. > After reading your links, it seems that indeed, the situation we > are experiencing is what is described in link [3] or link [6]. >> Did you run "mdadm --stop /dev/md2" first? That would explain the >> "busy" reports. [trim /] There's *something* holding access to sda and sdb -- please obtain and run "lsdrv" [1] and post its output. >> Before proceeding, please supply more information: >> >> for x in /dev/sd[a-p] ; do mdadm -E $x ; smartctl -i -A -l scterc $x ; >> done >> >> Paste the output inline in your response. > > > I couldn't get smartctl to work successfully. The version supported > on debian squeeze doesn't support aacraid. > I tried from a chroot in a debootstrap with a more recent debian > version, but only got: > > # smartctl --all -d aacraid,0,0,0 /dev/sda > smartctl 6.4 2014-10-07 r4002 [x86_64-linux-2.6.32-5-amd64] (local > build) > Copyright (C) 2002-14, Bruce Allen, Christian Franke, > www.smartmontools.org > > Smartctl open device: /dev/sda [aacraid_disk_00_00_0] [SCSI/SAT] > failed: INQUIRY [SAT]: aacraid result: 0.0 = 22/0 It's possible the 0,0,0 isn't correct. The output of lsdrv would help with this. Also, please use the smartctl options I requested. '--all' omits the scterc information I want to see, and shows a bunch of data I don't need to see.
If you want all possible data for your own use, '-x' is the correct option. [trim /] It's very important that we get a map of drive serial numbers to current device names and the "Device Role" from "mdadm --examine". As an alternative, post the output of "ls -l /dev/disk/by-id/". This is critical information for any future re-create attempts. The rest of the information from smartctl is important, and you should upgrade your system to a level that supports it, but it can wait for later. It might be best to boot into a newer environment strictly for this recovery task. Newer kernels and utilities have more bugfixes and are much more robust in emergencies. I normally use SystemRescueCD [2] for emergencies like this. Phil [1] https://github.com/pturmel/lsdrv [2] http://www.sysresccd.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
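One way to script that work-around so it survives reboots on this Debian squeeze system is to re-apply it from /etc/rc.local (assuming rc.local is still executed at boot, which is the squeeze default). A minimal sketch:

  # added to /etc/rc.local, before the final "exit 0":
  # give drives without working SCT ERC enough time to finish their internal
  # error recovery before the kernel gives up on the command and resets the link
  for x in /sys/block/*/device/timeout ; do
      echo 180 > "$x"
  done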
* Re: Reconstruct a RAID 6 that has failed in a non typical manner 2015-11-05 13:34 ` Phil Turmel @ 2015-11-17 12:30 ` Marc Pinhede 2015-11-17 13:25 ` Phil Turmel 2015-12-21 3:40 ` NeilBrown 1 sibling, 1 reply; 8+ messages in thread From: Marc Pinhede @ 2015-11-17 12:30 UTC (permalink / raw) To: Phil Turmel; +Cc: Clement Parisot, linux-raid Hello, Thanks for your answer. Update since our last mail: We saved a lot of data thanks to long and boring rsyncs, with countless reboots: during rsync, sometimes a drive was suddenly considered in 'failed' state by the array. The array was still active (with 13 or 12 / 16 disks) but 100% of files failed with I/O after that. We were then forced to reboot, reassemble the array and restart rsync. During those long operations, we have been advised to re-tighten our storage bay's screws (carri bay). And this is where the magic happened. After screwing them back on, no more problems with drives being considered failed. We only had 4 file copy failures with I/O errors, but they didn't correspond to a drive failing in the array (still working with 14/16 drives). We can't guarantee that the problem is fixed, but we moved from about 10 reboots a day to 5 days of work without problems. We now plan to reset and re-introduce one by one the two drives that were not recognized by the array, and let the array synchronize, rewriting data on those drives. Does it sound like a good idea to you, or do you think it may fail due to some errors? > Yes, with latent Unrecoverable Read Errors, you will need properly > working redundancy and no timeout mismatches. I recommend you > repeatedly use --assemble --force to restore your array, skip the last > file that failed, and continue copying critical files as much as possible. > > You should at least run this command every reboot until you replace your > drives or otherwise script the work-arounds: > > for x in /sys/block/*/device/timeout ; do echo 180 > $x ; done Thanks for the tip. We ran it at every reboot, but we still had failures. > > We still have two drives that were not physically removed, and so > > theoretically still contain data, but they appear as spares in mdadm > > --examine, probably because of the 're-add' attempt we made. > > The only way to activate these, I think, is to re-create your array. > That is a last resort after you've copied everything possible with the > forced assembly state. We will keep this as a last resort, but with the updates above, we should not have to use it. > >> Did you run "mdadm --stop /dev/md2" first? That would explain the > >> "busy" reports. > > [trim /] > > There's *something* holding access to sda and sdb -- please obtain and > run "lsdrv" [1] and post its output. 
> PCI [aacraid] 01:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09) ├scsi 0:0:0:0 Adaptec LogicalDrv 0 {6F7C0529} │└sda 930.99g [8:0] MD raid6 (16) inactive 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} ├scsi 0:0:2:0 Adaptec LogicalDrv 2 {81A40529} │└sdb 930.99g [8:16] MD raid6 (2/16) (w/ sdc,sdd,sde,sdg,sdh,sdi,sdj,sdk,sdl,sdm,sdn,sdo,sdp) in_sync 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} │ └md2 12.73t [9:2] MD v1.2 raid6 (16) clean DEGRADEDx2, 128k Chunk {2d0b91e8:a0b10f4c:3fa285f9:3198a918} │ │ PV LVM2_member 12.70t used, 33.84g free {G8XPQ1-E3y0-82Wz-UUpg-hGWC-UvHm-pAbi30} │ └VG baie 12.73t 33.84g free {7krzHX-Lz48-7ibY-RKTb-IZaX-zZlz-8ju8MM} │ ├dm-3 4.50t [253:3] LV data1 ext4 {83ddded0-d457-4fdc-8eab-9fbb2c195bdc} │ │└Mounted as /dev/mapper/baie-data1 @ /export/data1 │ ├dm-4 200.00g [253:4] LV grid5000 ext4 {c442ffe7-b34d-42c8-800d-ba21bf2ed8ec} │ │└Mounted as /dev/mapper/baie-grid5000 @ /export/grid5000 │ └dm-2 8.00t [253:2] LV home ext4 {c4ebcfd0-e5c2-4420-8a03-d0d5799cf747} │ └Mounted as /dev/mapper/baie-home @ /export/home ├scsi 0:0:3:0 Adaptec LogicalDrv 3 {156214AB} │└sdc 930.99g [8:32] MD raid6 (3/16) (w/ sdb,sdd,sde,sdg,sdh,sdi,sdj,sdk,sdl,sdm,sdn,sdo,sdp) in_sync 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} │ └md2 12.73t [9:2] MD v1.2 raid6 (16) clean DEGRADEDx2, 128k Chunk {2d0b91e8:a0b10f4c:3fa285f9:3198a918} │ PV LVM2_member 12.70t used, 33.84g free {G8XPQ1-E3y0-82Wz-UUpg-hGWC-UvHm-pAbi30} ├scsi 0:0:4:0 Adaptec LogicalDrv 4 {82C40529} │└sdd 930.99g [8:48] MD raid6 (4/16) (w/ sdb,sdc,sde,sdg,sdh,sdi,sdj,sdk,sdl,sdm,sdn,sdo,sdp) in_sync 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} │ └md2 12.73t [9:2] MD v1.2 raid6 (16) clean DEGRADEDx2, 128k Chunk {2d0b91e8:a0b10f4c:3fa285f9:3198a918} │ PV LVM2_member 12.70t used, 33.84g free {G8XPQ1-E3y0-82Wz-UUpg-hGWC-UvHm-pAbi30} ├scsi 0:0:5:0 Adaptec LogicalDrv 5 {8F341529} │└sde 930.99g [8:64] MD raid6 (5/16) (w/ sdb,sdc,sdd,sdg,sdh,sdi,sdj,sdk,sdl,sdm,sdn,sdo,sdp) in_sync 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} │ └md2 12.73t [9:2] MD v1.2 raid6 (16) clean DEGRADEDx2, 128k Chunk {2d0b91e8:a0b10f4c:3fa285f9:3198a918} │ PV LVM2_member 12.70t used, 33.84g free {G8XPQ1-E3y0-82Wz-UUpg-hGWC-UvHm-pAbi30} ├scsi 0:0:6:0 Adaptec LogicalDrv 6 {5E4C1529} │└sdf 930.99g [8:80] MD raid6 (16) inactive 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} ├scsi 0:0:7:0 Adaptec LogicalDrv 7 {FF88E4AC} │└sdg 930.99g [8:96] MD raid6 (7/16) (w/ sdb,sdc,sdd,sde,sdh,sdi,sdj,sdk,sdl,sdm,sdn,sdo,sdp) in_sync 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} │ └md2 12.73t [9:2] MD v1.2 raid6 (16) clean DEGRADEDx2, 128k Chunk {2d0b91e8:a0b10f4c:3fa285f9:3198a918} │ PV LVM2_member 12.70t used, 33.84g free {G8XPQ1-E3y0-82Wz-UUpg-hGWC-UvHm-pAbi30} ├scsi 0:0:8:0 Adaptec LogicalDrv 8 {84B41529} │└sdh 930.99g [8:112] MD raid6 (8/16) (w/ sdb,sdc,sdd,sde,sdg,sdi,sdj,sdk,sdl,sdm,sdn,sdo,sdp) in_sync 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} │ └md2 12.73t [9:2] MD v1.2 raid6 (16) clean DEGRADEDx2, 128k Chunk {2d0b91e8:a0b10f4c:3fa285f9:3198a918} │ PV LVM2_member 12.70t used, 33.84g free {G8XPQ1-E3y0-82Wz-UUpg-hGWC-UvHm-pAbi30} ├scsi 0:0:9:0 Adaptec LogicalDrv 9 {70C41529} │└sdi 930.99g [8:128] MD raid6 (9/16) (w/ sdb,sdc,sdd,sde,sdg,sdh,sdj,sdk,sdl,sdm,sdn,sdo,sdp) in_sync 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} │ └md2 12.73t [9:2] MD v1.2 raid6 (16) clean 
DEGRADEDx2, 128k Chunk {2d0b91e8:a0b10f4c:3fa285f9:3198a918} │ PV LVM2_member 12.70t used, 33.84g free {G8XPQ1-E3y0-82Wz-UUpg-hGWC-UvHm-pAbi30} ├scsi 0:0:10:0 Adaptec LogicalDrv 10 {897976AC} │└sdj 930.99g [8:144] MD raid6 (10/16) (w/ sdb,sdc,sdd,sde,sdg,sdh,sdi,sdk,sdl,sdm,sdn,sdo,sdp) in_sync 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} │ └md2 12.73t [9:2] MD v1.2 raid6 (16) clean DEGRADEDx2, 128k Chunk {2d0b91e8:a0b10f4c:3fa285f9:3198a918} │ PV LVM2_member 12.70t used, 33.84g free {G8XPQ1-E3y0-82Wz-UUpg-hGWC-UvHm-pAbi30} ├scsi 0:0:11:0 Adaptec LogicalDrv 11 {6DEC1529} │└sdk 930.99g [8:160] MD raid6 (11/16) (w/ sdb,sdc,sdd,sde,sdg,sdh,sdi,sdj,sdl,sdm,sdn,sdo,sdp) in_sync 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} │ └md2 12.73t [9:2] MD v1.2 raid6 (16) clean DEGRADEDx2, 128k Chunk {2d0b91e8:a0b10f4c:3fa285f9:3198a918} │ PV LVM2_member 12.70t used, 33.84g free {G8XPQ1-E3y0-82Wz-UUpg-hGWC-UvHm-pAbi30} ├scsi 0:0:12:0 Adaptec LogicalDrv 12 {71142529} │└sdl 930.99g [8:176] MD raid6 (12/16) (w/ sdb,sdc,sdd,sde,sdg,sdh,sdi,sdj,sdk,sdm,sdn,sdo,sdp) in_sync 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} │ └md2 12.73t [9:2] MD v1.2 raid6 (16) clean DEGRADEDx2, 128k Chunk {2d0b91e8:a0b10f4c:3fa285f9:3198a918} │ PV LVM2_member 12.70t used, 33.84g free {G8XPQ1-E3y0-82Wz-UUpg-hGWC-UvHm-pAbi30} ├scsi 0:0:13:0 Adaptec LogicalDrv 13 {14242529} │└sdm 930.99g [8:192] MD raid6 (13/16) (w/ sdb,sdc,sdd,sde,sdg,sdh,sdi,sdj,sdk,sdl,sdn,sdo,sdp) in_sync 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} │ └md2 12.73t [9:2] MD v1.2 raid6 (16) clean DEGRADEDx2, 128k Chunk {2d0b91e8:a0b10f4c:3fa285f9:3198a918} │ PV LVM2_member 12.70t used, 33.84g free {G8XPQ1-E3y0-82Wz-UUpg-hGWC-UvHm-pAbi30} ├scsi 0:0:14:0 Adaptec LogicalDrv 14 {2D382529} │└sdn 930.99g [8:208] MD raid6 (14/16) (w/ sdb,sdc,sdd,sde,sdg,sdh,sdi,sdj,sdk,sdl,sdm,sdo,sdp) in_sync 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} │ └md2 12.73t [9:2] MD v1.2 raid6 (16) clean DEGRADEDx2, 128k Chunk {2d0b91e8:a0b10f4c:3fa285f9:3198a918} │ PV LVM2_member 12.70t used, 33.84g free {G8XPQ1-E3y0-82Wz-UUpg-hGWC-UvHm-pAbi30} ├scsi 0:0:15:0 Adaptec LogicalDrv 15 {B4542529} │└sdo 930.99g [8:224] MD raid6 (15/16) (w/ sdb,sdc,sdd,sde,sdg,sdh,sdi,sdj,sdk,sdl,sdm,sdn,sdp) in_sync 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} │ └md2 12.73t [9:2] MD v1.2 raid6 (16) clean DEGRADEDx2, 128k Chunk {2d0b91e8:a0b10f4c:3fa285f9:3198a918} │ PV LVM2_member 12.70t used, 33.84g free {G8XPQ1-E3y0-82Wz-UUpg-hGWC-UvHm-pAbi30} └scsi 0:0:16:0 Adaptec LogicalDrv 1 {8E940529} └sdp 930.99g [8:240] MD raid6 (1/16) (w/ sdb,sdc,sdd,sde,sdg,sdh,sdi,sdj,sdk,sdl,sdm,sdn,sdo) in_sync 'ftalc2.nancy.grid5000.fr:2' {2d0b91e8-a0b1-0f4c-3fa2-85f93198a918} └md2 12.73t [9:2] MD v1.2 raid6 (16) clean DEGRADEDx2, 128k Chunk {2d0b91e8:a0b10f4c:3fa285f9:3198a918} PV LVM2_member 12.70t used, 33.84g free {G8XPQ1-E3y0-82Wz-UUpg-hGWC-UvHm-pAbi30} PCI [ahci] 00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA AHCI Controller (rev 09) ├scsi 1:0:0:0 ATA Hitachi HDP72503 {GEAC34RF2T8SLA} │└sdq 298.09g [65:0] Partitioned (dos) │ ├sdq1 285.00m [65:1] MD raid1 (0/2) (w/ sdr1) in_sync 'ftalc2:0' {791b53cf-4800-7f45-1dc0-ae5f8cedc958} │ │└md0 284.99m [9:0] MD v1.2 raid1 (2) clean {791b53cf:48007f45:1dc0ae5f:8cedc958} │ │ │ ext3 {135f2572-81a4-462f-8ce6-11ee0c9a8074} │ │ └Mounted as /dev/md0 @ /boot │ └sdq2 297.81g [65:2] MD raid1 (0/2) (w/ sdr2) in_sync 'ftalc2:1' 
{819ab09a-8402-6762-9e1f-6278f5bbda51} │ └md1 297.81g [9:1] MD v1.2 raid1 (2) clean {819ab09a:84026762:9e1f6278:f5bbda51} │ │ PV LVM2_member 22.24g used, 275.57g free {XGX5zq-EcVb-nbK7-BKc6-cxMy-7oe0-B5DKJW} │ └VG rootvg 297.81g 275.57g free {oWuOGP-c6Bt-lreb-YWwf-Kkwt-eqUG-fmgRuf} │ ├dm-0 4.66g [253:0] LV dom0-root ext3 {dbf8f715-dc51-40a2-9d7d-db2d24cc3aba} │ │└Mounted as /dev/mapper/rootvg-dom0--root @ / │ ├dm-1 1.86g [253:1] LV dom0-swap swap {82f0fe85-34ae-4da7-afb3-e161396a3494} │ ├dm-6 952.00m [253:6] LV dom0-tmp ext3 {31585de5-61d1-4e7b-977d-ba6df01b3a4a} │ │└Mounted as /dev/mapper/rootvg-dom0--tmp @ /tmp │ ├dm-5 4.79g [253:5] LV dom0-var ext3 {c0826eb6-e535-4d57-a501-9dfb503732e0} │ │└Mounted as /dev/mapper/rootvg-dom0--var @ /var │ └dm-7 10.00g [253:7] LV false_root ext4 {519238c6-22d4-4d1b-88ed-9af71aed8a88} ├scsi 2:0:0:0 ATA Hitachi HDP72503 {GEAC34RF2T8G0A} │└sdr 298.09g [65:16] Partitioned (dos) │ ├sdr1 285.00m [65:17] MD raid1 (1/2) (w/ sdq1) in_sync 'ftalc2:0' {791b53cf-4800-7f45-1dc0-ae5f8cedc958} │ │└md0 284.99m [9:0] MD v1.2 raid1 (2) clean {791b53cf:48007f45:1dc0ae5f:8cedc958} │ │ ext3 {135f2572-81a4-462f-8ce6-11ee0c9a8074} │ └sdr2 297.81g [65:18] MD raid1 (1/2) (w/ sdq2) in_sync 'ftalc2:1' {819ab09a-8402-6762-9e1f-6278f5bbda51} │ └md1 297.81g [9:1] MD v1.2 raid1 (2) clean {819ab09a:84026762:9e1f6278:f5bbda51} │ PV LVM2_member 22.24g used, 275.57g free {XGX5zq-EcVb-nbK7-BKc6-cxMy-7oe0-B5DKJW} ├scsi 3:x:x:x [Empty] ├scsi 4:x:x:x [Empty] ├scsi 5:x:x:x [Empty] └scsi 6:x:x:x [Empty] PCI [ata_piix] 00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE Controller (rev 09) ├scsi 7:x:x:x [Empty] └scsi 8:x:x:x [Empty] Other Block Devices ├loop0 0.00k [7:0] Empty/Unknown ├loop1 0.00k [7:1] Empty/Unknown ├loop2 0.00k [7:2] Empty/Unknown ├loop3 0.00k [7:3] Empty/Unknown ├loop4 0.00k [7:4] Empty/Unknown ├loop5 0.00k [7:5] Empty/Unknown ├loop6 0.00k [7:6] Empty/Unknown └loop7 0.00k [7:7] Empty/Unknown > >> Before proceeding, please supply more information: > >> > >> for x in /dev/sd[a-p] ; mdadm -E $x ; smartctl -i -A -l scterc $x ; > >> done > >> > >> Paste the output inline in your response. > > > > > > I couldn't get smartctl to work successfully. The version supported > > on debian squeeze doesn't support aacraid. > > > I tried from a chroot in a debootstrap with a more recent debian > > version, but only got: > > > > # smartctl --all -d aacraid,0,0,0 /dev/sda > > > smartctl 6.4 2014-10-07 r4002 [x86_64-linux-2.6.32-5-amd64] (local > > build) > > > Copyright (C) 2002-14, Bruce Allen, Christian Franke, > > www.smartmontools.org > > > > Smartctl open device: /dev/sda [aacraid_disk_00_00_0] [SCSI/SAT] > > failed: INQUIRY [SAT]: aacraid result: 0.0 = 22/0 > > It's possible the 0,0,0 isn't correct. The output of lsdrv would help > with this. > > Also, please use the smartctl options I requested. '--all' omits the > scterc information I want to see, and shows a bunch of data I don't need > to see. If you want all possible data for your own use, '-x' is the > correct option. Yes, I will use this option to filter if I get smartctl to work. > > [trim /] > > It's very important that we get a map of drive serial numbers to current > device names and the "Device Role" from "mdadm --examine". As an > alternative, post the output of "ls -l /dev/disk/by-id/". This is > critical information for any future re-create attempts. 
lrwxrwxrwx 1 root root 9 Nov 12 10:19 ata-Hitachi_HDP725032GLA360_GEAC34RF2T8G0A -> ../../sdr lrwxrwxrwx 1 root root 10 Nov 12 10:19 ata-Hitachi_HDP725032GLA360_GEAC34RF2T8G0A-part1 -> ../../sdr1 lrwxrwxrwx 1 root root 10 Nov 12 10:19 ata-Hitachi_HDP725032GLA360_GEAC34RF2T8G0A-part2 -> ../../sdr2 lrwxrwxrwx 1 root root 9 Nov 12 10:19 ata-Hitachi_HDP725032GLA360_GEAC34RF2T8SLA -> ../../sdq lrwxrwxrwx 1 root root 10 Nov 12 10:19 ata-Hitachi_HDP725032GLA360_GEAC34RF2T8SLA-part1 -> ../../sdq1 lrwxrwxrwx 1 root root 10 Nov 12 10:19 ata-Hitachi_HDP725032GLA360_GEAC34RF2T8SLA-part2 -> ../../sdq2 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-name-baie-data1 -> ../../dm-3 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-name-baie-grid5000 -> ../../dm-4 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-name-baie-home -> ../../dm-2 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-name-rootvg-dom0--root -> ../../dm-0 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-name-rootvg-dom0--swap -> ../../dm-1 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-name-rootvg-dom0--tmp -> ../../dm-6 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-name-rootvg-dom0--var -> ../../dm-5 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-name-rootvg-false_root -> ../../dm-7 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-uuid-LVM-7krzHXLz487ibYRKTbIZaXzZlz8ju8MM4QRfpRFoJ9EJDP7Nar3SLNj53t7urGbk -> ../../dm-4 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-uuid-LVM-7krzHXLz487ibYRKTbIZaXzZlz8ju8MMICvtF5UTbncSUMC9f0PyK5zHGmmEa8GD -> ../../dm-2 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-uuid-LVM-7krzHXLz487ibYRKTbIZaXzZlz8ju8MMkzJJGdeMc0QDg4B1r2hsq5bCnS7Ktk4u -> ../../dm-3 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-uuid-LVM-oWuOGPc6BtlrebYWwfKkwteqUGfmgRufCqs0FclHYC6O5RNOSEpeRZ3xJ3kXCOG0 -> ../../dm-7 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-uuid-LVM-oWuOGPc6BtlrebYWwfKkwteqUGfmgRufGm4mzDQtuUTShTEyWgXEo8BXt1d2S4Qu -> ../../dm-1 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-uuid-LVM-oWuOGPc6BtlrebYWwfKkwteqUGfmgRufMGhnq5OTr3pyXgyc2CqDE5ibq9xaOSUf -> ../../dm-5 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-uuid-LVM-oWuOGPc6BtlrebYWwfKkwteqUGfmgRufOD5FJuWOVLYk7wnRPOvlQOLEb0zffl2X -> ../../dm-0 lrwxrwxrwx 1 root root 10 Nov 12 10:19 dm-uuid-LVM-oWuOGPc6BtlrebYWwfKkwteqUGfmgRufuMkGACbZV71GDBcRVxXnAMf7NkWFWezw -> ../../dm-6 lrwxrwxrwx 1 root root 9 Nov 12 10:19 md-name-ftalc2:0 -> ../../md0 lrwxrwxrwx 1 root root 9 Nov 12 10:19 md-name-ftalc2:1 -> ../../md1 lrwxrwxrwx 1 root root 9 Nov 12 10:19 md-name-ftalc2.nancy.grid5000.fr:2 -> ../../md2 lrwxrwxrwx 1 root root 9 Nov 12 10:19 md-uuid-2d0b91e8:a0b10f4c:3fa285f9:3198a918 -> ../../md2 lrwxrwxrwx 1 root root 9 Nov 12 10:19 md-uuid-791b53cf:48007f45:1dc0ae5f:8cedc958 -> ../../md0 lrwxrwxrwx 1 root root 9 Nov 12 10:19 md-uuid-819ab09a:84026762:9e1f6278:f5bbda51 -> ../../md1 lrwxrwxrwx 1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_0_6F7C0529 -> ../../sda lrwxrwxrwx 1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_10_897976AC -> ../../sdj lrwxrwxrwx 1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_11_6DEC1529 -> ../../sdk lrwxrwxrwx 1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_12_71142529 -> ../../sdl lrwxrwxrwx 1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_13_14242529 -> ../../sdm lrwxrwxrwx 1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_14_2D382529 -> ../../sdn lrwxrwxrwx 1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_15_B4542529 -> ../../sdo lrwxrwxrwx 1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_1_8E940529 -> ../../sdp lrwxrwxrwx 1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_2_81A40529 -> ../../sdb lrwxrwxrwx 
1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_3_156214AB -> ../../sdc lrwxrwxrwx 1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_4_82C40529 -> ../../sdd lrwxrwxrwx 1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_5_8F341529 -> ../../sde lrwxrwxrwx 1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_6_5E4C1529 -> ../../sdf lrwxrwxrwx 1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_7_FF88E4AC -> ../../sdg lrwxrwxrwx 1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_8_84B41529 -> ../../sdh lrwxrwxrwx 1 root root 9 Nov 17 10:18 scsi-SAdaptec_LogicalDrv_9_70C41529 -> ../../sdi lrwxrwxrwx 1 root root 9 Nov 12 10:19 scsi-SATA_Hitachi_HDP7250_GEAC34RF2T8G0A -> ../../sdr lrwxrwxrwx 1 root root 10 Nov 12 10:19 scsi-SATA_Hitachi_HDP7250_GEAC34RF2T8G0A-part1 -> ../../sdr1 lrwxrwxrwx 1 root root 10 Nov 12 10:19 scsi-SATA_Hitachi_HDP7250_GEAC34RF2T8G0A-part2 -> ../../sdr2 lrwxrwxrwx 1 root root 9 Nov 12 10:19 scsi-SATA_Hitachi_HDP7250_GEAC34RF2T8SLA -> ../../sdq lrwxrwxrwx 1 root root 10 Nov 12 10:19 scsi-SATA_Hitachi_HDP7250_GEAC34RF2T8SLA-part1 -> ../../sdq1 lrwxrwxrwx 1 root root 10 Nov 12 10:19 scsi-SATA_Hitachi_HDP7250_GEAC34RF2T8SLA-part2 -> ../../sdq2 lrwxrwxrwx 1 root root 9 Nov 12 10:19 wwn-0x5000cca34de737a4 -> ../../sdr lrwxrwxrwx 1 root root 10 Nov 12 10:19 wwn-0x5000cca34de737a4-part1 -> ../../sdr1 lrwxrwxrwx 1 root root 10 Nov 12 10:19 wwn-0x5000cca34de737a4-part2 -> ../../sdr2 lrwxrwxrwx 1 root root 9 Nov 12 10:19 wwn-0x5000cca34de738cd -> ../../sdq lrwxrwxrwx 1 root root 10 Nov 12 10:19 wwn-0x5000cca34de738cd-part1 -> ../../sdq1 lrwxrwxrwx 1 root root 10 Nov 12 10:19 wwn-0x5000cca34de738cd-part2 -> ../../sdq2 It seems that the mapping changes at each reboot (two drives that host the operating system had different names across reboots). Since we re-tightened the screws, we haven't rebooted, though. > The rest of the information from smartctl is important, and you should > upgrade your system to a level that supports it, but it can wait for later. > > It might be best to boot into a newer environment strictly for this > recovery task. Newer kernels and utilities have more bugfixes and are > much more robust in emergencies. I normally use SystemRescueCD [2] for > emergencies like this. Ok, if I get stuck on some operations, I'll try with SystemRescueCD. Regards, Clément and Marc -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
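Because the kernel names move around between boots, it can help to snapshot the serial-to-name-to-role mapping while the array is still readable. A sketch (the output path is only an example):

  {
      date
      ls -l /dev/disk/by-id/
      for x in /dev/sd[a-p]; do
          echo "== $x =="
          mdadm -E "$x" | grep -E 'Device UUID|Device Role|Events'
      done
  } > /root/md2-disk-map.txt

A copy of that file kept off the machine is exactly the information a later re-create attempt would need.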
* Re: Reconstruct a RAID 6 that has failed in a non typical manner 2015-11-17 12:30 ` Marc Pinhede @ 2015-11-17 13:25 ` Phil Turmel 0 siblings, 0 replies; 8+ messages in thread From: Phil Turmel @ 2015-11-17 13:25 UTC (permalink / raw) To: Marc Pinhede; +Cc: Clement Parisot, linux-raid Good morning Marc, Clément, On 11/17/2015 07:30 AM, Marc Pinhede wrote: > Hello, > > Thanks for your answer. Update since our last mail: We saved a lot of > data thanks to long and boring rsyncs, with countless reboots: during > rsync, sometimes a drive was suddenly considered in 'failed' state by > the array. The array was still active (with 13 or 12 / 16 disks) but > 100% of files failed with I/O after that. We were then forced to > reboot, reassemble the array and restart rsync. Yes, a miserable task on a large array. Good to know you saved most (?) of your data. > During those long operations, we have been advised to re-tighten our > storage bay's screws (carri bay). And this is where the magic > happened. After screwing them back on, no more problems with drives > being considered failed. We only had 4 file copy failures with I/O > errors, but they didn't correspond to a drive failing in the array > (still working with 14/16 drives). > We can't guarantee that the problem is fixed, but we moved from about > 10 reboots a day to 5 days of work without problems. Very good news. Finding a root cause for a problem greatly raises the odds that future efforts will succeed. > We now plan to reset and re-introduce one by one the two drives that > were not recognized by the array, and let the array synchronize, > rewriting data on those drives. Does it sound like a good idea to > you, or do you think it may fail due to some errors? Since you've identified a real hardware issue that impacted the entire array, I wouldn't trust it until every drive is thoroughly wiped and retested. Use "badblocks -w -p 2" or similar. Then construct a new array and restore your saved data. [trim /] >> It's very important that we get a map of drive serial numbers to >> current device names and the "Device Role" from "mdadm --examine". >> As an alternative, post the output of "ls -l /dev/disk/by-id/". >> This is critical information for any future re-create attempts. If you look closely at the lsdrv output, you'll see it successfully acquired drive serial numbers for all drives. However, they are reported as Adaptec Logical drives -- these might be generated by the adaptec firmware, not the real serial numbers. > It seems that the mapping changes at each reboot (two drives that > host the operating system had different names across reboots). Since > we re-tightened the screws, we haven't rebooted, though. Device names are dependent on device discovery order, which can change somewhat randomly. What I've seen with lsdrv is that order doesn't change within a single controller -- the scsi addresses {host:bus:target:lun} have consistent bus:target:lun for a given port on a controller. I don't have much experience with adaptec devices, so I'd be curious if it holds true for them. >> The rest of the information from smartctl is important, and you >> should upgrade your system to a level that supports it, but it can >> wait for later. Consider compiling a local copy of the latest smartctl instead of using a chroot. Supply the scsi address shown in lsdrv to the -d aacraid, option. 
Regards, Phil -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
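A sketch of the retest pass described above, one member at a time. The badblocks run is destructive, and the mapping of the lsdrv scsi address onto the three numbers of the aacraid option is an assumption to be checked against the smartctl man page:

  # DESTRUCTIVE: write-mode surface test; every sector of the disk is overwritten.
  # -w writes and verifies test patterns, -p 2 repeats until two consecutive
  # passes find no new bad blocks, -s shows progress.
  badblocks -w -p 2 -s /dev/sdX

  # query one disk behind the Adaptec controller; H, L and ID are taken from
  # the lsdrv scsi address of that disk (the ordering is a guess, see smartctl(8))
  smartctl -i -A -l scterc -d aacraid,H,L,ID /dev/sdX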
* Re: Reconstruct a RAID 6 that has failed in a non typical manner 2015-11-05 13:34 ` Phil Turmel 2015-11-17 12:30 ` Marc Pinhede @ 2015-12-21 3:40 ` NeilBrown 2015-12-21 12:20 ` Phil Turmel 1 sibling, 1 reply; 8+ messages in thread From: NeilBrown @ 2015-12-21 3:40 UTC (permalink / raw) To: Phil Turmel, Clement Parisot; +Cc: linux-raid On Fri, Nov 06 2015, Phil Turmel wrote: > > for x in /sys/block/*/device/timeout ; do echo 180 > $x ; done > Would it make sense for mdadm to automagically do something like this? i.e. whenever it adds a device to an array (with redundancy) it writes 180 (or something configurable) to the 'timeout' file if there is one? Why do we pick 180? Can this cause problems on some drives? Thanks, NeilBrown ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Reconstruct a RAID 6 that has failed in a non typical manner 2015-12-21 3:40 ` NeilBrown @ 2015-12-21 12:20 ` Phil Turmel 0 siblings, 0 replies; 8+ messages in thread From: Phil Turmel @ 2015-12-21 12:20 UTC (permalink / raw) To: NeilBrown, Clement Parisot; +Cc: linux-raid Good morning Neil, On 12/20/2015 10:40 PM, NeilBrown wrote: > On Fri, Nov 06 2015, Phil Turmel wrote: >> >> for x in /sys/block/*/device/timeout ; do echo 180 > $x ; done >> > > Would it make sense for mdadm to automagically do something like this? > i.e. whenever it adds a device to an array (with redundancy) it write > 180 (or something configurable) to the 'timeout' file if there is one? Yes, I've been thinking this should be automagic, but I'm not sure if it really belongs at the MD layer. > Why do we pick 180? I empirically determined that 120 was sufficient on the Seagate drives that kicked my tail when I first figured this out. Someone else (I'm afraid I don't remember) found that to be not quite enough and suggested 180. > Can this cause problems on some drives? Not that I'm aware of, but it does make for rather troublesome *application* stalls. Considering that this aggressively long error recovery behavior is *intended* for desktop drives or any non-redundant usage, I believe linux shouldn't time out at 30 seconds by default. It cuts off any opportunity for these drives to report a good sector that is reconstructed in more than 30 seconds. Meanwhile, any device that *does* support scterc and/or has scterc enabled out of the gate arguably should have a timeout just a few seconds longer than the larger of the two error recovery settings. I propose: 1) The kernel default timeout be set to 180 (or some number cooperatively established with the drive manufacturers.) 2) the initial probe sequence that retrieves the drive's parameter pages also pick up the SCT page and if ERC is enabled, adjust the timeout downward. I believe these capabilities should be reflected in sysfs for use by udev. 3) mdadm should inspect member device ERC capabilities during creation and assembly and enable it for drives that have it available but disabled. In light of your maintainership notice, I will pursue this directly. Phil ^ permalink raw reply [flat|nested] 8+ messages in thread
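Until something along those lines lands in the kernel or mdadm, the policy can be approximated from userspace at boot. A sketch, assuming drives that smartctl can address directly (no aacraid pass-through as was needed earlier in this thread) and using the 7-second / 180-second values discussed above:

  for dev in /dev/sd[a-z]; do
      [ -b "$dev" ] || continue
      if smartctl -l scterc,70,70 "$dev" > /dev/null 2>&1; then
          # drive accepted SCT ERC: it now gives up on a bad sector after 7 seconds,
          # comfortably inside the kernel's default 30-second command timeout
          echo "$dev: SCT ERC set to 7.0 seconds"
      else
          # no SCT ERC: raise the kernel timeout instead, so the drive's long
          # internal retries cannot trigger a link reset and an array ejection
          echo 180 > "/sys/block/${dev##*/}/device/timeout"
          echo "$dev: no SCT ERC, kernel command timeout raised to 180 seconds"
      fi
  done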
Thread overview: 8+ messages
[not found] <404650428.13997384.1446132658661.JavaMail.zimbra@inria.fr>
2015-10-29 15:59 ` Reconstruct a RAID 6 that has failed in a non typical manner Clement Parisot
2015-10-30 18:31 ` Phil Turmel
2015-11-05 10:35 ` Clement Parisot
2015-11-05 13:34 ` Phil Turmel
2015-11-17 12:30 ` Marc Pinhede
2015-11-17 13:25 ` Phil Turmel
2015-12-21 3:40 ` NeilBrown
2015-12-21 12:20 ` Phil Turmel