* RAID6 Array crash during reshape.....now will not re-assemble. @ 2016-03-02 3:46 Another Sillyname 2016-03-02 13:20 ` Wols Lists ` (2 more replies) 0 siblings, 3 replies; 23+ messages in thread From: Another Sillyname @ 2016-03-02 3:46 UTC (permalink / raw) To: Linux-RAID I have a 30TB RAID6 array using 7 x 6TB drives that I wanted to migrate to RAID5 to take one of the drives offline and use in a new array for a migration. sudo mdadm --grow /dev/md127 --level=raid5 --raid-device=6 --backup-file=mdadm_backupfile I watched this using cat /proc/mdstat and even after an hour the percentage of the reshape was still 0.0%. I know from previous experience that reshaping can be slow, but did not expect it to be this slow frankly. But erring on the side of caution I decided to leave the array for 12 hours and see what was happening then. Sure enough, 12 hours later cat /proc/mdstat still shows reshape at 0.0% Looking at CPU usage the reshape process is using 0% of the CPU. So reading a bit more......if you reboot a server the reshape should continue. Reboot..... Array will not come back online at all. Bring the server up without the array trying to automount. cat /proc/mdstat shows the array offline. Personalities : md127 : inactive sdf1[2](S) sde1[3](S) sdg1[0](S) sdb1[8](S) sdh1[7](S) sdc1[1](S) sdd1[6](S) 41022733300 blocks super 1.2 unused devices: <none> Try to reassemble the array. >sudo mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 mdadm: /dev/sdg1 is busy - skipping mdadm: /dev/sdh1 is busy - skipping mdadm: Merging with already-assembled /dev/md/server187.internallan.com:1 mdadm: Failed to restore critical section for reshape, sorry. Possibly you needed to specify the --backup-file Have no idea where the server187 stuff has come from. stop the array. >sudo mdadm --stop /dev/md127 try to re-assemble >sudo mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 mdadm: Failed to restore critical section for reshape, sorry. Possibly you needed to specify the --backup-file try to re-assemble using the backup file >sudo mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 --backup-file=mdadm_backupfile mdadm: Failed to restore critical section for reshape, sorry. have a look at the individual drives >sudo mdadm --examine /dev/sd[b-h]1 /dev/sdb1: Magic : a92b4efc Version : 1.2 Feature Map : 0x5 Array UUID : da29a06f:f8cf1409:bc52afb2:6945ba08 Name : server187.internallan.com:1 Creation Time : Sun May 10 14:47:51 2015 Raid Level : raid6 Raid Devices : 7 Avail Dev Size : 11720780943 (5588.90 GiB 6001.04 GB) Array Size : 29301952000 (27944.52 GiB 30005.20 GB) Used Dev Size : 11720780800 (5588.90 GiB 6001.04 GB) Data Offset : 262144 sectors Super Offset : 8 sectors Unused Space : before=262056 sectors, after=143 sectors State : clean Device UUID : 1152bdeb:15546156:1918b67d:37d68b1f Internal Bitmap : 8 sectors from superblock Reshape pos'n : 0 New Layout : left-symmetric-6 Update Time : Wed Mar 2 01:19:42 2016 Bad Block Log : 512 entries available at offset 72 sectors Checksum : 3a66db58 - correct Events : 369282 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 4 Array State : AAAAAAA ('A' == active, '.' 
== missing, 'R' == replacing) /dev/sdc1: Magic : a92b4efc Version : 1.2 Feature Map : 0x5 Array UUID : da29a06f:f8cf1409:bc52afb2:6945ba08 Name : server187.internallan.com:1 Creation Time : Sun May 10 14:47:51 2015 Raid Level : raid6 Raid Devices : 7 Avail Dev Size : 11720780943 (5588.90 GiB 6001.04 GB) Array Size : 29301952000 (27944.52 GiB 30005.20 GB) Used Dev Size : 11720780800 (5588.90 GiB 6001.04 GB) Data Offset : 262144 sectors Super Offset : 8 sectors Unused Space : before=262056 sectors, after=143 sectors State : clean Device UUID : 140e09af:56e14b4e:5035d724:c2005f0b Internal Bitmap : 8 sectors from superblock Reshape pos'n : 0 New Layout : left-symmetric-6 Update Time : Wed Mar 2 01:19:42 2016 Bad Block Log : 512 entries available at offset 72 sectors Checksum : 88916c56 - correct Events : 369282 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 1 Array State : AAAAAAA ('A' == active, '.' == missing, 'R' == replacing) /dev/sdd1: Magic : a92b4efc Version : 1.2 Feature Map : 0x5 Array UUID : da29a06f:f8cf1409:bc52afb2:6945ba08 Name : server187.internallan.com:1 Creation Time : Sun May 10 14:47:51 2015 Raid Level : raid6 Raid Devices : 7 Avail Dev Size : 11720780943 (5588.90 GiB 6001.04 GB) Array Size : 29301952000 (27944.52 GiB 30005.20 GB) Used Dev Size : 11720780800 (5588.90 GiB 6001.04 GB) Data Offset : 262144 sectors Super Offset : 8 sectors Unused Space : before=262056 sectors, after=143 sectors State : clean Device UUID : a50dd0a1:eeb0b3df:76200476:818e004d Internal Bitmap : 8 sectors from superblock Reshape pos'n : 0 New Layout : left-symmetric-6 Update Time : Wed Mar 2 01:19:42 2016 Bad Block Log : 512 entries available at offset 72 sectors Checksum : 9f8eb46a - correct Events : 369282 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 6 Array State : AAAAAAA ('A' == active, '.' == missing, 'R' == replacing) /dev/sde1: Magic : a92b4efc Version : 1.2 Feature Map : 0x5 Array UUID : da29a06f:f8cf1409:bc52afb2:6945ba08 Name : server187.internallan.com:1 Creation Time : Sun May 10 14:47:51 2015 Raid Level : raid6 Raid Devices : 7 Avail Dev Size : 11720780943 (5588.90 GiB 6001.04 GB) Array Size : 29301952000 (27944.52 GiB 30005.20 GB) Used Dev Size : 11720780800 (5588.90 GiB 6001.04 GB) Data Offset : 262144 sectors Super Offset : 8 sectors Unused Space : before=262056 sectors, after=143 sectors State : clean Device UUID : 7d0b65b3:d2ba2023:4625c287:1db2de9b Internal Bitmap : 8 sectors from superblock Reshape pos'n : 0 New Layout : left-symmetric-6 Update Time : Wed Mar 2 01:19:42 2016 Bad Block Log : 512 entries available at offset 72 sectors Checksum : 552ce48f - correct Events : 369282 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 3 Array State : AAAAAAA ('A' == active, '.' 
== missing, 'R' == replacing) /dev/sdf1: Magic : a92b4efc Version : 1.2 Feature Map : 0x5 Array UUID : da29a06f:f8cf1409:bc52afb2:6945ba08 Name : server187.internallan.com:1 Creation Time : Sun May 10 14:47:51 2015 Raid Level : raid6 Raid Devices : 7 Avail Dev Size : 11720780943 (5588.90 GiB 6001.04 GB) Array Size : 29301952000 (27944.52 GiB 30005.20 GB) Used Dev Size : 11720780800 (5588.90 GiB 6001.04 GB) Data Offset : 262144 sectors Super Offset : 8 sectors Unused Space : before=262056 sectors, after=143 sectors State : clean Device UUID : cda4f5e5:a489dbb9:5c1ab6a0:b257c984 Internal Bitmap : 8 sectors from superblock Reshape pos'n : 0 New Layout : left-symmetric-6 Update Time : Wed Mar 2 01:19:42 2016 Bad Block Log : 512 entries available at offset 72 sectors Checksum : 2056e75c - correct Events : 369282 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 2 Array State : AAAAAAA ('A' == active, '.' == missing, 'R' == replacing) /dev/sdg1: Magic : a92b4efc Version : 1.2 Feature Map : 0x5 Array UUID : da29a06f:f8cf1409:bc52afb2:6945ba08 Name : server187.internallan.com:1 Creation Time : Sun May 10 14:47:51 2015 Raid Level : raid6 Raid Devices : 7 Avail Dev Size : 11720780943 (5588.90 GiB 6001.04 GB) Array Size : 29301952000 (27944.52 GiB 30005.20 GB) Used Dev Size : 11720780800 (5588.90 GiB 6001.04 GB) Data Offset : 262144 sectors Super Offset : 8 sectors Unused Space : before=262056 sectors, after=143 sectors State : clean Device UUID : df5af6ce:9017c863:697da267:046c9709 Internal Bitmap : 8 sectors from superblock Reshape pos'n : 0 New Layout : left-symmetric-6 Update Time : Wed Mar 2 01:19:42 2016 Bad Block Log : 512 entries available at offset 72 sectors Checksum : fefea2b5 - correct Events : 369282 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 0 Array State : AAAAAAA ('A' == active, '.' == missing, 'R' == replacing) /dev/sdh1: Magic : a92b4efc Version : 1.2 Feature Map : 0x5 Array UUID : da29a06f:f8cf1409:bc52afb2:6945ba08 Name : server187.internallan.com:1 Creation Time : Sun May 10 14:47:51 2015 Raid Level : raid6 Raid Devices : 7 Avail Dev Size : 11720780943 (5588.90 GiB 6001.04 GB) Array Size : 29301952000 (27944.52 GiB 30005.20 GB) Used Dev Size : 11720780800 (5588.90 GiB 6001.04 GB) Data Offset : 262144 sectors Super Offset : 8 sectors Unused Space : before=262056 sectors, after=143 sectors State : clean Device UUID : 9d98af83:243c3e02:94de20c7:293de111 Internal Bitmap : 8 sectors from superblock Reshape pos'n : 0 New Layout : left-symmetric-6 Update Time : Wed Mar 2 01:19:42 2016 Bad Block Log : 512 entries available at offset 72 sectors Checksum : b9f6375e - correct Events : 369282 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 5 Array State : AAAAAAA ('A' == active, '.' == missing, 'R' == replacing) As all the drives are showing Reshape pos'n 0 I'm assuming the reshape never got started (even though cat /proc/mdstat showed the array reshaping)? So now I'm well out of my comfort zone so instead of flapping around have decided to sleep for a few hours before revisiting this. Any help and guidance would be appreciated, the drives showing clean gives me comfort that the data is likely intact and complete (crossed fingers) however I can't re-assemble the array as I keep getting the 'critical information for reshape, sorry' warning. Help??? ^ permalink raw reply [flat|nested] 23+ messages in thread
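A note for anyone following along: before acting on any of the replies below, the current state can be captured with read-only commands so nothing is lost between attempts. This is only a sketch; the device names are taken from the post above and the output file names are arbitrary.

  cat /proc/mdstat > mdstat-before.txt
  sudo mdadm --examine /dev/sd[b-h]1 > examine-before.txt    # the superblock dump shown above
  sudo dmesg | grep -iE 'md|raid' > dmesg-md.txt             # kernel messages from the stalled reshape
  sudo mdadm --detail /dev/md127 > detail-before.txt 2>&1    # only useful while the inactive array node exists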
* Re: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-02 3:46 RAID6 Array crash during reshape.....now will not re-assemble Another Sillyname @ 2016-03-02 13:20 ` Wols Lists [not found] ` <CAOS+5GHof=F94x58SKqFojV26hGpDSLF85dFfm8Xc6M43sN6jA@mail.gmail.com> 2016-03-05 10:47 ` Andreas Klauer 2016-03-09 0:23 ` NeilBrown 2 siblings, 1 reply; 23+ messages in thread From: Wols Lists @ 2016-03-02 13:20 UTC (permalink / raw) To: Another Sillyname, Linux-RAID On 02/03/16 03:46, Another Sillyname wrote: > Any help and guidance would be appreciated, the drives showing clean > gives me comfort that the data is likely intact and complete (crossed > fingers) however I can't re-assemble the array as I keep getting the > 'critical information for reshape, sorry' warning. > > Help??? Someone else will chip in what to do, but this doesn't seem alarming at all. Reshapes stuck at zero is a recent bug, but all the data is probably safe and sound. Wait for one of the experts to chip in what to do, but you might find mdadm --resume --invalid-backup will get it going again. Otherwise it's likely to be an "upgrade your kernel and mdadm" job ... Cheers, Wol ^ permalink raw reply [flat|nested] 23+ messages in thread
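As far as I can tell there is no literal --resume option in mdadm; the documented pieces closest to Wol's suggestion are the --invalid-backup modifier for --assemble and the --continue modifier for --grow. A rough sketch of how they might be invoked, assuming mdadm 3.3 or later and the device names from the original post (check the man page for the installed version before running anything; --force may also be required alongside --invalid-backup):

  sudo mdadm --stop /dev/md127
  sudo mdadm --assemble /dev/md127 /dev/sd[b-h]1 \
       --backup-file=mdadm_backupfile --invalid-backup
  # If it assembles but the reshape stays parked at 0.0%, a stalled reshape
  # can sometimes be restarted with:
  sudo mdadm --grow --continue /dev/md127 --backup-file=mdadm_backupfile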
* Fwd: RAID6 Array crash during reshape.....now will not re-assemble. [not found] ` <CAOS+5GHof=F94x58SKqFojV26hGpDSLF85dFfm8Xc6M43sN6jA@mail.gmail.com> @ 2016-03-02 13:42 ` Another Sillyname 2016-03-02 15:59 ` Another Sillyname 0 siblings, 1 reply; 23+ messages in thread From: Another Sillyname @ 2016-03-02 13:42 UTC (permalink / raw) To: Linux-RAID Kernel is latest Fedora x86_64 4.3.5-300, can't get too much newer then that (latest is 4.4.x), mdadm is 3.3.4-2. I agree that the data is likely still intact, doesn't stop me being nervous till I see it though!! On 2 March 2016 at 13:20, Wols Lists <antlists@youngman.org.uk> wrote: > On 02/03/16 03:46, Another Sillyname wrote: >> Any help and guidance would be appreciated, the drives showing clean >> gives me comfort that the data is likely intact and complete (crossed >> fingers) however I can't re-assemble the array as I keep getting the >> 'critical information for reshape, sorry' warning. >> >> Help??? > > Someone else will chip in what to do, but this doesn't seem alarming at > all. Reshapes stuck at zero is a recent bug, but all the data is > probably safe and sound. > > Wait for one of the experts to chip in what to do, but you might find > mdadm --resume --invalid-backup will get it going again. > > Otherwise it's likely to be an "upgrade your kernel and mdadm" job ... > > Cheers, > Wol ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-02 13:42 ` Fwd: " Another Sillyname @ 2016-03-02 15:59 ` Another Sillyname 2016-03-03 11:37 ` Another Sillyname 0 siblings, 1 reply; 23+ messages in thread From: Another Sillyname @ 2016-03-02 15:59 UTC (permalink / raw) To: Linux-RAID I've found out more info....and now have a theory.......but do not know how best to proceed. >sudo mdadm -A --scan --verbose mdadm: looking for devices for further assembly mdadm: No super block found on /dev/sdh (Expected magic a92b4efc, got 00000000) mdadm: no RAID superblock on /dev/sdh mdadm: No super block found on /dev/sdg (Expected magic a92b4efc, got 00000000) mdadm: no RAID superblock on /dev/sdg mdadm: No super block found on /dev/sdf (Expected magic a92b4efc, got 00000000) mdadm: no RAID superblock on /dev/sdf mdadm: No super block found on /dev/sde (Expected magic a92b4efc, got 00000000) mdadm: no RAID superblock on /dev/sde mdadm: No super block found on /dev/sdd (Expected magic a92b4efc, got 00000000) mdadm: no RAID superblock on /dev/sdd mdadm: No super block found on /dev/sdc (Expected magic a92b4efc, got 00000000) mdadm: no RAID superblock on /dev/sdc mdadm: No super block found on /dev/sdb (Expected magic a92b4efc, got 00000000) mdadm: no RAID superblock on /dev/sdb mdadm: No super block found on /dev/sda6 (Expected magic a92b4efc, got 00000000) mdadm: no RAID superblock on /dev/sda6 mdadm: No super block found on /dev/sda5 (Expected magic a92b4efc, got 75412023) mdadm: no RAID superblock on /dev/sda5 mdadm: /dev/sda4 is too small for md: size is 2 sectors. mdadm: no RAID superblock on /dev/sda4 mdadm: No super block found on /dev/sda3 (Expected magic a92b4efc, got 00000401) mdadm: no RAID superblock on /dev/sda3 mdadm: No super block found on /dev/sda2 (Expected magic a92b4efc, got 00000401) mdadm: no RAID superblock on /dev/sda2 mdadm: No super block found on /dev/sda1 (Expected magic a92b4efc, got 0000007e) mdadm: no RAID superblock on /dev/sda1 mdadm: No super block found on /dev/sda (Expected magic a92b4efc, got e71e974a) mdadm: no RAID superblock on /dev/sda mdadm: /dev/sdh1 is identified as a member of /dev/md/server187:1, slot 5. mdadm: /dev/sdg1 is identified as a member of /dev/md/server187:1, slot 0. mdadm: /dev/sdf1 is identified as a member of /dev/md/server187:1, slot 2. mdadm: /dev/sde1 is identified as a member of /dev/md/server187:1, slot 3. mdadm: /dev/sdd1 is identified as a member of /dev/md/server187:1, slot 6. mdadm: /dev/sdc1 is identified as a member of /dev/md/server187:1, slot 1. mdadm: /dev/sdb1 is identified as a member of /dev/md/server187:1, slot 4. mdadm: /dev/md/server187:1 has an active reshape - checking if critical section needs to be restored mdadm: Failed to find backup of critical section mdadm: Failed to restore critical section for reshape, sorry. Possibly you needed to specify the --backup-file mdadm: looking for devices for further assembly mdadm: No arrays found in config file or automatically As I stated in my original posting I do not know where the server187 stuff came from when I tried the original assemble and two of the drives (sdg & sdh) reported as busy. So my theory is this...... This 30TB array has been up and active since about August 2015, fully functional without any major issues, except performance was sometimes a bit iffy. 
It is possible that drives sdg and sdh were used in a temporary box in a different array that was only active for about 10 days, before they were moved to the new 30TB array that was cleanly built. That array may well have been called server187 (it was a temp box so no reason to remember it). When the reshape of the current array 'died' during initialisation or immediately thereafter, even though cat /proc/mdstat showed the reshape active after 12 hours it was still stuck on 0.0%. When the machine was rebooted and the array didn't come up...is it possible that drives sdh and sdg still thought they were in the old server187 array and that is why they reported themselves busy? I'm not sure why this would happen, but am just theorising. When I tried the assemble command where it reported it was merging with the already existing server187 array, even though there wasn't/isn't a server187 array as prior to that assemble cat /proc/mdstat reported the offline md127 array. Somehow therefore the array names have got confused/transposed and that's why the backup file is now not seen as the correct one? This would seem to be borne out by all the drives now seeing themselves as part of server187 array rather then md127 array and also the reshape seems to be attached to this server187 array. I still believe/hope the data is all still intact and complete, however I am averse to just hacking around using google to 'try commands' hoping I hit a solution before someone with much more experience casts an eye over this to give me a little guidance. Help!! On 2 March 2016 at 13:42, Another Sillyname <anothersname@googlemail.com> wrote: > Kernel is latest Fedora x86_64 4.3.5-300, can't get too much newer > then that (latest is 4.4.x), mdadm is 3.3.4-2. > > I agree that the data is likely still intact, doesn't stop me being > nervous till I see it though!! > > > > On 2 March 2016 at 13:20, Wols Lists <antlists@youngman.org.uk> wrote: >> On 02/03/16 03:46, Another Sillyname wrote: >>> Any help and guidance would be appreciated, the drives showing clean >>> gives me comfort that the data is likely intact and complete (crossed >>> fingers) however I can't re-assemble the array as I keep getting the >>> 'critical information for reshape, sorry' warning. >>> >>> Help??? >> >> Someone else will chip in what to do, but this doesn't seem alarming at >> all. Reshapes stuck at zero is a recent bug, but all the data is >> probably safe and sound. >> >> Wait for one of the experts to chip in what to do, but you might find >> mdadm --resume --invalid-backup will get it going again. >> >> Otherwise it's likely to be an "upgrade your kernel and mdadm" job ... >> >> Cheers, >> Wol ^ permalink raw reply [flat|nested] 23+ messages in thread
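One read-only way to test the stale-metadata theory, using nothing beyond mdadm --examine: compare the Array UUID, Name, metadata Version and Update Time reported on the bare devices and on the partitions. A leftover superblock from the old temporary array would show up as a second, different Array UUID. (The verbose scan above already reports no superblock on the bare /dev/sd[b-h] devices, so this mainly confirms that the only metadata present is the set quoted in the first message.)

  for d in /dev/sd[b-h] /dev/sd[b-h]1; do
      echo "== $d"
      sudo mdadm --examine "$d" 2>/dev/null | grep -E 'Array UUID|Name|Version|Update Time'
  done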
* Re: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-02 15:59 ` Another Sillyname @ 2016-03-03 11:37 ` Another Sillyname 2016-03-03 12:56 ` Wols Lists 0 siblings, 1 reply; 23+ messages in thread From: Another Sillyname @ 2016-03-03 11:37 UTC (permalink / raw) To: Linux-RAID Just to add a bit more to this..... It looks like the backup file is just full of EOLs (I haven't looked at it with a bit editor admittedly). So I'm absolutely stuck now and would really appreciate some help. I'd even be happy to just bring the array up in readonly mode and transfer the data off, but it will NOT let me reassemble the array without the 'need data backup file to finish reshape, sorry' error and will not reassemble. Anyone? On 2 March 2016 at 15:59, Another Sillyname <anothersname@googlemail.com> wrote: > I've found out more info....and now have a theory.......but do not > know how best to proceed. > >>sudo mdadm -A --scan --verbose > > mdadm: looking for devices for further assembly > mdadm: No super block found on /dev/sdh (Expected magic a92b4efc, got 00000000) > mdadm: no RAID superblock on /dev/sdh > mdadm: No super block found on /dev/sdg (Expected magic a92b4efc, got 00000000) > mdadm: no RAID superblock on /dev/sdg > mdadm: No super block found on /dev/sdf (Expected magic a92b4efc, got 00000000) > mdadm: no RAID superblock on /dev/sdf > mdadm: No super block found on /dev/sde (Expected magic a92b4efc, got 00000000) > mdadm: no RAID superblock on /dev/sde > mdadm: No super block found on /dev/sdd (Expected magic a92b4efc, got 00000000) > mdadm: no RAID superblock on /dev/sdd > mdadm: No super block found on /dev/sdc (Expected magic a92b4efc, got 00000000) > mdadm: no RAID superblock on /dev/sdc > mdadm: No super block found on /dev/sdb (Expected magic a92b4efc, got 00000000) > mdadm: no RAID superblock on /dev/sdb > mdadm: No super block found on /dev/sda6 (Expected magic a92b4efc, got 00000000) > mdadm: no RAID superblock on /dev/sda6 > mdadm: No super block found on /dev/sda5 (Expected magic a92b4efc, got 75412023) > mdadm: no RAID superblock on /dev/sda5 > mdadm: /dev/sda4 is too small for md: size is 2 sectors. > mdadm: no RAID superblock on /dev/sda4 > mdadm: No super block found on /dev/sda3 (Expected magic a92b4efc, got 00000401) > mdadm: no RAID superblock on /dev/sda3 > mdadm: No super block found on /dev/sda2 (Expected magic a92b4efc, got 00000401) > mdadm: no RAID superblock on /dev/sda2 > mdadm: No super block found on /dev/sda1 (Expected magic a92b4efc, got 0000007e) > mdadm: no RAID superblock on /dev/sda1 > mdadm: No super block found on /dev/sda (Expected magic a92b4efc, got e71e974a) > mdadm: no RAID superblock on /dev/sda > mdadm: /dev/sdh1 is identified as a member of /dev/md/server187:1, slot 5. > mdadm: /dev/sdg1 is identified as a member of /dev/md/server187:1, slot 0. > mdadm: /dev/sdf1 is identified as a member of /dev/md/server187:1, slot 2. > mdadm: /dev/sde1 is identified as a member of /dev/md/server187:1, slot 3. > mdadm: /dev/sdd1 is identified as a member of /dev/md/server187:1, slot 6. > mdadm: /dev/sdc1 is identified as a member of /dev/md/server187:1, slot 1. > mdadm: /dev/sdb1 is identified as a member of /dev/md/server187:1, slot 4. > mdadm: /dev/md/server187:1 has an active reshape - checking if > critical section needs to be restored > mdadm: Failed to find backup of critical section > mdadm: Failed to restore critical section for reshape, sorry. 
> Possibly you needed to specify the --backup-file > mdadm: looking for devices for further assembly > mdadm: No arrays found in config file or automatically > > As I stated in my original posting I do not know where the server187 > stuff came from when I tried the original assemble and two of the > drives (sdg & sdh) reported as busy. > > So my theory is this...... > > This 30TB array has been up and active since about August 2015, fully > functional without any major issues, except performance was sometimes > a bit iffy. > > It is possible that drives sdg and sdh were used in a temporary box in > a different array that was only active for about 10 days, before they > were moved to the new 30TB array that was cleanly built. That array > may well have been called server187 (it was a temp box so no reason to > remember it). > > When the reshape of the current array 'died' during initialisation or > immediately thereafter, even though cat /proc/mdstat showed the > reshape active after 12 hours it was still stuck on 0.0%. > > When the machine was rebooted and the array didn't come up...is it > possible that drives sdh and sdg still thought they were in the old > server187 array and that is why they reported themselves busy? I'm > not sure why this would happen, but am just theorising. > > When I tried the assemble command where it reported it was merging > with the already existing server187 array, even though there > wasn't/isn't a server187 array as prior to that assemble cat > /proc/mdstat reported the offline md127 array. > > Somehow therefore the array names have got confused/transposed and > that's why the backup file is now not seen as the correct one? This > would seem to be borne out by all the drives now seeing themselves as > part of server187 array rather then md127 array and also the reshape > seems to be attached to this server187 array. > > I still believe/hope the data is all still intact and complete, > however I am averse to just hacking around using google to 'try > commands' hoping I hit a solution before someone with much more > experience casts an eye over this to give me a little guidance. > > Help!! > > > > On 2 March 2016 at 13:42, Another Sillyname <anothersname@googlemail.com> wrote: >> Kernel is latest Fedora x86_64 4.3.5-300, can't get too much newer >> then that (latest is 4.4.x), mdadm is 3.3.4-2. >> >> I agree that the data is likely still intact, doesn't stop me being >> nervous till I see it though!! >> >> >> >> On 2 March 2016 at 13:20, Wols Lists <antlists@youngman.org.uk> wrote: >>> On 02/03/16 03:46, Another Sillyname wrote: >>>> Any help and guidance would be appreciated, the drives showing clean >>>> gives me comfort that the data is likely intact and complete (crossed >>>> fingers) however I can't re-assemble the array as I keep getting the >>>> 'critical information for reshape, sorry' warning. >>>> >>>> Help??? >>> >>> Someone else will chip in what to do, but this doesn't seem alarming at >>> all. Reshapes stuck at zero is a recent bug, but all the data is >>> probably safe and sound. >>> >>> Wait for one of the experts to chip in what to do, but you might find >>> mdadm --resume --invalid-backup will get it going again. >>> >>> Otherwise it's likely to be an "upgrade your kernel and mdadm" job ... >>> >>> Cheers, >>> Wol ^ permalink raw reply [flat|nested] 23+ messages in thread
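A quick, non-destructive way to see what is actually in the backup file, rather than judging it in a text editor (file name as given to the original --grow command):

  ls -l mdadm_backupfile
  hexdump -C mdadm_backupfile | head -n 20
  # A file consisting only of zero bytes prints a single line of 00s followed
  # by a lone "*" repeat marker, which would be consistent with the reshape
  # never having copied any stripes into it.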
* Re: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-03 11:37 ` Another Sillyname @ 2016-03-03 12:56 ` Wols Lists [not found] ` <CAOS+5GH1Rcu8zGk1dQ+aSNmVzjo=irH65KfPuq1ZGruzqX_=vg@mail.gmail.com> 0 siblings, 1 reply; 23+ messages in thread From: Wols Lists @ 2016-03-03 12:56 UTC (permalink / raw) To: Another Sillyname, Linux-RAID On 03/03/16 11:37, Another Sillyname wrote: > I'd even be happy to just bring the array up in readonly mode and > transfer the data off, but it will NOT let me reassemble the array > without the 'need data backup file to finish reshape, sorry' error and > will not reassemble. Seeing as no-one else has joined in, search the list archives for that error message, and you should get plenty of hits ... Cheers, Wol ^ permalink raw reply [flat|nested] 23+ messages in thread
* Fwd: RAID6 Array crash during reshape.....now will not re-assemble. [not found] ` <CAOS+5GH1Rcu8zGk1dQ+aSNmVzjo=irH65KfPuq1ZGruzqX_=vg@mail.gmail.com> @ 2016-03-03 14:07 ` Another Sillyname 2016-03-03 17:48 ` Sarah Newman 0 siblings, 1 reply; 23+ messages in thread From: Another Sillyname @ 2016-03-03 14:07 UTC (permalink / raw) To: Linux-RAID Plenty of hits....but no clear fixes and as I said I'm not willing to 'try' things from google hits until someone with better insight can give me a view. Trying things is fine in many instances, but not with a 20+TB data set as I'm sure you can understand. On 3 March 2016 at 12:56, Wols Lists <antlists@youngman.org.uk> wrote: > On 03/03/16 11:37, Another Sillyname wrote: >> I'd even be happy to just bring the array up in readonly mode and >> transfer the data off, but it will NOT let me reassemble the array >> without the 'need data backup file to finish reshape, sorry' error and >> will not reassemble. > > Seeing as no-one else has joined in, search the list archives for that > error message, and you should get plenty of hits ... > > Cheers, > Wol ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Fwd: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-03 14:07 ` Fwd: " Another Sillyname @ 2016-03-03 17:48 ` Sarah Newman 2016-03-03 17:59 ` Another Sillyname 0 siblings, 1 reply; 23+ messages in thread From: Sarah Newman @ 2016-03-03 17:48 UTC (permalink / raw) To: Another Sillyname, Linux-RAID On 03/03/2016 06:07 AM, Another Sillyname wrote: > Plenty of hits....but no clear fixes and as I said I'm not willing to > 'try' things from google hits until someone with better insight can > give me a view. > > Trying things is fine in many instances, but not with a 20+TB data set > as I'm sure you can understand. > I have not tried this personally, but it may be of interest: https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file I found this explanation of the dmsetup commands to be more easy to follow: http://www.flaterco.com/kb/sandbox.html ^ permalink raw reply [flat|nested] 23+ messages in thread
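For reference, the overlay approach in the first link boils down to putting a throw-away copy-on-write layer over each member so that assembly experiments cannot write to the real disks. Per member it is roughly the following; the names here are illustrative and the wiki page has the full, scripted version:

  truncate -s 4G /tmp/overlay-sdb1                      # sparse file that absorbs any writes
  loop=$(sudo losetup -f --show /tmp/overlay-sdb1)
  size=$(sudo blockdev --getsz /dev/sdb1)
  echo "0 $size snapshot /dev/sdb1 $loop N 8" | sudo dmsetup create overlay-sdb1
  # ...then run the mdadm --assemble attempts against /dev/mapper/overlay-*
  # instead of /dev/sd?1.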
* Re: Fwd: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-03 17:48 ` Sarah Newman @ 2016-03-03 17:59 ` Another Sillyname 2016-03-03 20:47 ` John Stoffel 0 siblings, 1 reply; 23+ messages in thread From: Another Sillyname @ 2016-03-03 17:59 UTC (permalink / raw) To: Linux-RAID Sarah Thanks for the suggestion. I'd read that a couple of days back and while it's an interesting idea I don't believe it will address my specific issue of the array not allowing re-assembly (even with force and read only flags set) as mdadm reports the 'need backup file for reshape, sorry' error no matter what I've tried. Even trying the above flags with invalid-backup does not work so I need someone to have a eureka moment and say "....it's this....". I believe that mdadm 3.3 incorporated the recover during reshape functionality, however I've read elsewhere it only applies to expansion into a new drive...I was going RAID6 to RAID5 (even though it never really got started) and the backup file looks like 20mb of EOLs. So at he moment I'm pretty much stuck unless someone can tell me how to clear down the reshape flag, even in read only mode so I can copy the data off. On 3 March 2016 at 17:48, Sarah Newman <srn@prgmr.com> wrote: > On 03/03/2016 06:07 AM, Another Sillyname wrote: >> Plenty of hits....but no clear fixes and as I said I'm not willing to >> 'try' things from google hits until someone with better insight can >> give me a view. >> >> Trying things is fine in many instances, but not with a 20+TB data set >> as I'm sure you can understand. >> > > I have not tried this personally, but it may be of interest: > https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file I found this explanation > of the dmsetup commands to be more easy to follow: http://www.flaterco.com/kb/sandbox.html > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Fwd: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-03 17:59 ` Another Sillyname @ 2016-03-03 20:47 ` John Stoffel 2016-03-03 22:19 ` Another Sillyname 0 siblings, 1 reply; 23+ messages in thread From: John Stoffel @ 2016-03-03 20:47 UTC (permalink / raw) To: Another Sillyname; +Cc: Linux-RAID Have you tried pulling down the latest version of mdadm from Neil's site with: git clone git://neil.brown.name/mdadm/ mdadm cd mdadm ./configure make and seeing if that custom build does the trick for you? I know he's done some newer patches which might help in this case. Another> I'd read that a couple of days back and while it's an interesting idea Another> I don't believe it will address my specific issue of the array not Another> allowing re-assembly (even with force and read only flags set) as Another> mdadm reports the 'need backup file for reshape, sorry' error no Another> matter what I've tried. Another> Even trying the above flags with invalid-backup does not work so I Another> need someone to have a eureka moment and say "....it's this....". Another> I believe that mdadm 3.3 incorporated the recover during reshape Another> functionality, however I've read elsewhere it only applies to Another> expansion into a new drive...I was going RAID6 to RAID5 (even though Another> it never really got started) and the backup file looks like 20mb of Another> EOLs. Another> So at he moment I'm pretty much stuck unless someone can tell me how Another> to clear down the reshape flag, even in read only mode so I can copy Another> the data off. Another> On 3 March 2016 at 17:48, Sarah Newman <srn@prgmr.com> wrote: >> On 03/03/2016 06:07 AM, Another Sillyname wrote: >>> Plenty of hits....but no clear fixes and as I said I'm not willing to >>> 'try' things from google hits until someone with better insight can >>> give me a view. >>> >>> Trying things is fine in many instances, but not with a 20+TB data set >>> as I'm sure you can understand. >>> >> >> I have not tried this personally, but it may be of interest: >> https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file I found this explanation >> of the dmsetup commands to be more easy to follow: http://www.flaterco.com/kb/sandbox.html >> Another> -- Another> To unsubscribe from this list: send the line "unsubscribe linux-raid" in Another> the body of a message to majordomo@vger.kernel.org Another> More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
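If anyone goes this route, the resulting binary can be run straight from the build directory, so the packaged mdadm stays untouched (same arguments as the earlier attempts, just with ./mdadm in front):

  ./mdadm --version          # confirm it is the freshly built copy
  sudo ./mdadm --assemble /dev/md127 /dev/sd[b-h]1 --backup-file=mdadm_backupfile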
* Re: Fwd: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-03 20:47 ` John Stoffel @ 2016-03-03 22:19 ` Another Sillyname 2016-03-03 22:42 ` John Stoffel 0 siblings, 1 reply; 23+ messages in thread From: Another Sillyname @ 2016-03-03 22:19 UTC (permalink / raw) To: John Stoffel; +Cc: Linux-RAID John Thanks for the suggestion but that's still 'trying' things rather then an analytical approach. I also do not want to reboot this machine until I absolutely have to incase I need to capture any data needed to identify and thereby resolve the problem. Given I'm not getting much joy here I think I'll have to post a bug tomorrow and see where that goes. Thanks again. On 3 March 2016 at 20:47, John Stoffel <john@stoffel.org> wrote: > > Have you tried pulling down the latest version of mdadm from Neil's > site with: > > git clone git://neil.brown.name/mdadm/ mdadm > cd mdadm > ./configure > make > > and seeing if that custom build does the trick for you? I know he's > done some newer patches which might help in this case. > > > Another> I'd read that a couple of days back and while it's an interesting idea > Another> I don't believe it will address my specific issue of the array not > Another> allowing re-assembly (even with force and read only flags set) as > Another> mdadm reports the 'need backup file for reshape, sorry' error no > Another> matter what I've tried. > > Another> Even trying the above flags with invalid-backup does not work so I > Another> need someone to have a eureka moment and say "....it's this....". > > Another> I believe that mdadm 3.3 incorporated the recover during reshape > Another> functionality, however I've read elsewhere it only applies to > Another> expansion into a new drive...I was going RAID6 to RAID5 (even though > Another> it never really got started) and the backup file looks like 20mb of > Another> EOLs. > > Another> So at he moment I'm pretty much stuck unless someone can tell me how > Another> to clear down the reshape flag, even in read only mode so I can copy > Another> the data off. > > > Another> On 3 March 2016 at 17:48, Sarah Newman <srn@prgmr.com> wrote: >>> On 03/03/2016 06:07 AM, Another Sillyname wrote: >>>> Plenty of hits....but no clear fixes and as I said I'm not willing to >>>> 'try' things from google hits until someone with better insight can >>>> give me a view. >>>> >>>> Trying things is fine in many instances, but not with a 20+TB data set >>>> as I'm sure you can understand. >>>> >>> >>> I have not tried this personally, but it may be of interest: >>> https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file I found this explanation >>> of the dmsetup commands to be more easy to follow: http://www.flaterco.com/kb/sandbox.html >>> > Another> -- > Another> To unsubscribe from this list: send the line "unsubscribe linux-raid" in > Another> the body of a message to majordomo@vger.kernel.org > Another> More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Fwd: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-03 22:19 ` Another Sillyname @ 2016-03-03 22:42 ` John Stoffel 2016-03-04 19:01 ` Another Sillyname 0 siblings, 1 reply; 23+ messages in thread From: John Stoffel @ 2016-03-03 22:42 UTC (permalink / raw) To: Another Sillyname; +Cc: John Stoffel, Linux-RAID Another> Thanks for the suggestion but that's still 'trying' things Another> rather then an analytical approach. Well... since Neil is the guy who knows the code, and I've been several emails in the past about re-shapes gone wrong, and pulling down Neil's latest version was the solution. So that's what I'd go with. Another> I also do not want to reboot this machine until I absolutely Another> have to incase I need to capture any data needed to identify Another> and thereby resolve the problem. Reboot won't make a difference, all the data is on the disks. Another> Given I'm not getting much joy here I think I'll have to post Another> a bug tomorrow and see where that goes. I'd also argue that removing a disk from a RAID6 of 30Tb in size is crazy, but you know the risks I'm sure. It might have been better to just fail one disk, then zero it's super-block and use that new disk formatted by hand into a plain xfs or ext4 filesystem for you travels. Then when done, you'd just re-add the disk into the array and let it rebuild the second parity stripes. Also, I jsut dug into my archives, have you tried: --assemble --update=revert-reshape on your array? John ^ permalink raw reply [flat|nested] 23+ messages in thread
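Spelled out, the borrow-one-disk route John describes would look roughly like the following; sdX1 is a placeholder for whichever member is chosen, and this is a sketch rather than a tested recipe:

  sudo mdadm /dev/md127 --fail /dev/sdX1 --remove /dev/sdX1
  sudo mdadm --zero-superblock /dev/sdX1     # only once it is out of the array
  sudo mkfs.xfs /dev/sdX1                    # temporary standalone filesystem
  # ...and when finished with it, return it to the now-degraded RAID6:
  sudo mdadm /dev/md127 --add /dev/sdX1
  # The suggestion from the archives, written out as a full command line:
  sudo mdadm --assemble /dev/md127 /dev/sd[b-h]1 --update=revert-reshape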
* Re: Fwd: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-03 22:42 ` John Stoffel @ 2016-03-04 19:01 ` Another Sillyname 2016-03-04 19:11 ` Alireza Haghdoost 0 siblings, 1 reply; 23+ messages in thread From: Another Sillyname @ 2016-03-04 19:01 UTC (permalink / raw) To: John Stoffel; +Cc: Linux-RAID Hi John Yes I had already tried the revert-reshape option with no effect. It was when I found that option I also found the comment suggesting it only applied to reshapes that are growing rather then shrinking. Thanks for the suggestion but I'm still stuck and there is no bug tracker on the mdadm git website so I have to wait here. Ho Huum On 3 March 2016 at 22:42, John Stoffel <john@stoffel.org> wrote: > > Another> Thanks for the suggestion but that's still 'trying' things > Another> rather then an analytical approach. > > Well... since Neil is the guy who knows the code, and I've been > several emails in the past about re-shapes gone wrong, and pulling > down Neil's latest version was the solution. So that's what I'd go > with. > > Another> I also do not want to reboot this machine until I absolutely > Another> have to incase I need to capture any data needed to identify > Another> and thereby resolve the problem. > > Reboot won't make a difference, all the data is on the disks. > > Another> Given I'm not getting much joy here I think I'll have to post > Another> a bug tomorrow and see where that goes. > > I'd also argue that removing a disk from a RAID6 of 30Tb in size is > crazy, but you know the risks I'm sure. > > It might have been better to just fail one disk, then zero it's > super-block and use that new disk formatted by hand into a plain xfs > or ext4 filesystem for you travels. Then when done, you'd just re-add > the disk into the array and let it rebuild the second parity stripes. > > Also, I jsut dug into my archives, have you tried: > > --assemble --update=revert-reshape > > on your array? > > John ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Fwd: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-04 19:01 ` Another Sillyname @ 2016-03-04 19:11 ` Alireza Haghdoost 2016-03-04 20:30 ` Another Sillyname 0 siblings, 1 reply; 23+ messages in thread From: Alireza Haghdoost @ 2016-03-04 19:11 UTC (permalink / raw) To: Another Sillyname; +Cc: John Stoffel, Linux-RAID On Fri, Mar 4, 2016 at 1:01 PM, Another Sillyname <anothersname@googlemail.com> wrote: > > > Thanks for the suggestion but I'm still stuck and there is no bug > tracker on the mdadm git website so I have to wait here. > > Ho Huum > > Looks like it is going to be a long wait. I think you are waiting to do something that might not be inplace/available at all. That thing is the capability to reset reshape flag when the array metadata is not consistent. You had an old array in two of these drives and it seems mdadm confused when it observes the drives metadata are not consistent. Hope someone chip in some tricks to do so without a need to develop such a functionality in mdadm. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Fwd: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-04 19:11 ` Alireza Haghdoost @ 2016-03-04 20:30 ` Another Sillyname 2016-03-04 21:02 ` Alireza Haghdoost 0 siblings, 1 reply; 23+ messages in thread From: Another Sillyname @ 2016-03-04 20:30 UTC (permalink / raw) To: Alireza Haghdoost; +Cc: John Stoffel, Linux-RAID That's possibly true, however there are lessons to be learnt here even if my array is not recoverable. I don't know the process order of doing a reshape....but I would suspect it's something along the lines of. Examine existing array. Confirm command can be run against existing array configuration (i.e. It's a valid command for this array setup). Do backup file (if specified) Set reshape flag high Start reshape I would suggest.... There needs to be another step in the process Before 'Set reshape flag high' that the backup file needs to be checked for consistency. My backup file appears to be just full of EOLs (now for all I know the backup file actually gets 'created' during the process and therefore starts out as EOLs). But once the flag is set high you are then committing the array before you know if the backup is good. Also The drives in this array had been working correctly for 6 months and undergone a number of reboots. If, as we are theorising, there was some metadata from a previous array setup on two of the drives that as a result of the reshape somehow became the 'valid' metadata regarding those two drives RAID status then I would suggest that during any mdadm raid create process there is an extensive and thorough check of any drives being used to identify and remove any possible previously existing RAID metadata information...thus making the drives 'clean'. On 4 March 2016 at 19:11, Alireza Haghdoost <alireza@cs.umn.edu> wrote: > On Fri, Mar 4, 2016 at 1:01 PM, Another Sillyname > <anothersname@googlemail.com> wrote: >> >> >> Thanks for the suggestion but I'm still stuck and there is no bug >> tracker on the mdadm git website so I have to wait here. >> >> Ho Huum >> >> > > Looks like it is going to be a long wait. I think you are waiting to > do something that might not be inplace/available at all. That thing is > the capability to reset reshape flag when the array metadata is not > consistent. You had an old array in two of these drives and it seems > mdadm confused when it observes the drives metadata are not > consistent. > > Hope someone chip in some tricks to do so without a need to develop > such a functionality in mdadm. ^ permalink raw reply [flat|nested] 23+ messages in thread
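The 'make the drives clean' step being argued for does already exist as a manual operation; whether mdadm --create should force it is a separate question. For reference, and only for a drive that is no longer part of any array whose data matters:

  sudo mdadm --examine /dev/sdX1           # confirm what, if anything, is on it first
  sudo mdadm --zero-superblock /dev/sdX1   # removes the md superblock only
  sudo wipefs -a /dev/sdX1                 # clears any remaining filesystem/RAID signatures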
* Re: Fwd: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-04 20:30 ` Another Sillyname @ 2016-03-04 21:02 ` Alireza Haghdoost 2016-03-04 21:52 ` Another Sillyname 0 siblings, 1 reply; 23+ messages in thread From: Alireza Haghdoost @ 2016-03-04 21:02 UTC (permalink / raw) To: Another Sillyname; +Cc: John Stoffel, Linux-RAID On Fri, Mar 4, 2016 at 2:30 PM, Another Sillyname <anothersname@googlemail.com> wrote: > That's possibly true, however there are lessons to be learnt here even > if my array is not recoverable. > > I don't know the process order of doing a reshape....but I would > suspect it's something along the lines of. > > Examine existing array. > Confirm command can be run against existing array configuration (i.e. > It's a valid command for this array setup). > Do backup file (if specified) > Set reshape flag high > Start reshape > > I would suggest.... > > There needs to be another step in the process > > Before 'Set reshape flag high' that the backup file needs to be > checked for consistency. > > My backup file appears to be just full of EOLs (now for all I know the > backup file actually gets 'created' during the process and therefore > starts out as EOLs). But once the flag is set high you are then > committing the array before you know if the backup is good. > > Also > > The drives in this array had been working correctly for 6 months and > undergone a number of reboots. > > If, as we are theorising, there was some metadata from a previous > array setup on two of the drives that as a result of the reshape > somehow became the 'valid' metadata regarding those two drives RAID > status then I would suggest that during any mdadm raid create process > there is an extensive and thorough check of any drives being used to > identify and remove any possible previously existing RAID metadata > information...thus making the drives 'clean'. > > > > > > > On 4 March 2016 at 19:11, Alireza Haghdoost <alireza@cs.umn.edu> wrote: >> On Fri, Mar 4, 2016 at 1:01 PM, Another Sillyname >> <anothersname@googlemail.com> wrote: >>> >>> >>> Thanks for the suggestion but I'm still stuck and there is no bug >>> tracker on the mdadm git website so I have to wait here. >>> >>> Ho Huum >>> >>> >> >> Looks like it is going to be a long wait. I think you are waiting to >> do something that might not be inplace/available at all. That thing is >> the capability to reset reshape flag when the array metadata is not >> consistent. You had an old array in two of these drives and it seems >> mdadm confused when it observes the drives metadata are not >> consistent. >> >> Hope someone chip in some tricks to do so without a need to develop >> such a functionality in mdadm. Do you know the metadata version that is used on those two drives ? For example, if the version is < 1.0 then we could easily erase the old metadata since it has been recorded in the end of the drive. Newer metada versions after 1.0 are stored in the beginning of the drive. Therefore, there is no risk to erase your current array metadata ! ^ permalink raw reply [flat|nested] 23+ messages in thread
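On the version question: the --examine output earlier in the thread already shows 'Version : 1.2' for the current array, and the verbose scan found no leftover superblock on the bare devices. For reference, 0.90 and 1.0 superblocks sit near the end of the device, 1.1 at the very start, and 1.2 at 4 KiB from the start, which matches the 'Super Offset : 8 sectors' shown above. A read-only check on one member looks like:

  sudo mdadm --examine /dev/sdg1 | grep -E 'Version|Super Offset|Data Offset'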
* Re: Fwd: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-04 21:02 ` Alireza Haghdoost @ 2016-03-04 21:52 ` Another Sillyname 2016-03-04 22:07 ` John Stoffel 0 siblings, 1 reply; 23+ messages in thread From: Another Sillyname @ 2016-03-04 21:52 UTC (permalink / raw) To: Alireza Haghdoost, Linux-RAID I have no clue, they were used in a temporary system for 10 days about 8 months ago, they were then used in the new array that was built back in August. Even if the metadata was removed from those two drives the 'merge' that happened, without warning or requiring verification, seems to now have 'contaminated' all the drives possibly. I'm still reasonably convinced the data is there and intact, just need an analytical approach to how to recover it. On 4 March 2016 at 21:02, Alireza Haghdoost <alireza@cs.umn.edu> wrote: > On Fri, Mar 4, 2016 at 2:30 PM, Another Sillyname > <anothersname@googlemail.com> wrote: >> That's possibly true, however there are lessons to be learnt here even >> if my array is not recoverable. >> >> I don't know the process order of doing a reshape....but I would >> suspect it's something along the lines of. >> >> Examine existing array. >> Confirm command can be run against existing array configuration (i.e. >> It's a valid command for this array setup). >> Do backup file (if specified) >> Set reshape flag high >> Start reshape >> >> I would suggest.... >> >> There needs to be another step in the process >> >> Before 'Set reshape flag high' that the backup file needs to be >> checked for consistency. >> >> My backup file appears to be just full of EOLs (now for all I know the >> backup file actually gets 'created' during the process and therefore >> starts out as EOLs). But once the flag is set high you are then >> committing the array before you know if the backup is good. >> >> Also >> >> The drives in this array had been working correctly for 6 months and >> undergone a number of reboots. >> >> If, as we are theorising, there was some metadata from a previous >> array setup on two of the drives that as a result of the reshape >> somehow became the 'valid' metadata regarding those two drives RAID >> status then I would suggest that during any mdadm raid create process >> there is an extensive and thorough check of any drives being used to >> identify and remove any possible previously existing RAID metadata >> information...thus making the drives 'clean'. >> >> >> >> >> >> >> On 4 March 2016 at 19:11, Alireza Haghdoost <alireza@cs.umn.edu> wrote: >>> On Fri, Mar 4, 2016 at 1:01 PM, Another Sillyname >>> <anothersname@googlemail.com> wrote: >>>> >>>> >>>> Thanks for the suggestion but I'm still stuck and there is no bug >>>> tracker on the mdadm git website so I have to wait here. >>>> >>>> Ho Huum >>>> >>>> >>> >>> Looks like it is going to be a long wait. I think you are waiting to >>> do something that might not be inplace/available at all. That thing is >>> the capability to reset reshape flag when the array metadata is not >>> consistent. You had an old array in two of these drives and it seems >>> mdadm confused when it observes the drives metadata are not >>> consistent. >>> >>> Hope someone chip in some tricks to do so without a need to develop >>> such a functionality in mdadm. > > Do you know the metadata version that is used on those two drives ? > For example, if the version is < 1.0 then we could easily erase the > old metadata since it has been recorded in the end of the drive. 
Newer > metada versions after 1.0 are stored in the beginning of the drive. > > Therefore, there is no risk to erase your current array metadata ! ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Fwd: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-04 21:52 ` Another Sillyname @ 2016-03-04 22:07 ` John Stoffel 2016-03-05 10:28 ` Another Sillyname 0 siblings, 1 reply; 23+ messages in thread From: John Stoffel @ 2016-03-04 22:07 UTC (permalink / raw) To: Another Sillyname; +Cc: Alireza Haghdoost, Linux-RAID Can you post the output of mdadm -E /dev/sd?1 for all your drives? And did you pull down the latest version of mdadm from neil's repo and build it and use that to undo the re-shape? John Another> I have no clue, they were used in a temporary system for 10 days about Another> 8 months ago, they were then used in the new array that was built back Another> in August. Another> Even if the metadata was removed from those two drives the 'merge' Another> that happened, without warning or requiring verification, seems to now Another> have 'contaminated' all the drives possibly. Another> I'm still reasonably convinced the data is there and intact, just need Another> an analytical approach to how to recover it. Another> On 4 March 2016 at 21:02, Alireza Haghdoost <alireza@cs.umn.edu> wrote: >> On Fri, Mar 4, 2016 at 2:30 PM, Another Sillyname >> <anothersname@googlemail.com> wrote: >>> That's possibly true, however there are lessons to be learnt here even >>> if my array is not recoverable. >>> >>> I don't know the process order of doing a reshape....but I would >>> suspect it's something along the lines of. >>> >>> Examine existing array. >>> Confirm command can be run against existing array configuration (i.e. >>> It's a valid command for this array setup). >>> Do backup file (if specified) >>> Set reshape flag high >>> Start reshape >>> >>> I would suggest.... >>> >>> There needs to be another step in the process >>> >>> Before 'Set reshape flag high' that the backup file needs to be >>> checked for consistency. >>> >>> My backup file appears to be just full of EOLs (now for all I know the >>> backup file actually gets 'created' during the process and therefore >>> starts out as EOLs). But once the flag is set high you are then >>> committing the array before you know if the backup is good. >>> >>> Also >>> >>> The drives in this array had been working correctly for 6 months and >>> undergone a number of reboots. >>> >>> If, as we are theorising, there was some metadata from a previous >>> array setup on two of the drives that as a result of the reshape >>> somehow became the 'valid' metadata regarding those two drives RAID >>> status then I would suggest that during any mdadm raid create process >>> there is an extensive and thorough check of any drives being used to >>> identify and remove any possible previously existing RAID metadata >>> information...thus making the drives 'clean'. >>> >>> >>> >>> >>> >>> >>> On 4 March 2016 at 19:11, Alireza Haghdoost <alireza@cs.umn.edu> wrote: >>>> On Fri, Mar 4, 2016 at 1:01 PM, Another Sillyname >>>> <anothersname@googlemail.com> wrote: >>>>> >>>>> >>>>> Thanks for the suggestion but I'm still stuck and there is no bug >>>>> tracker on the mdadm git website so I have to wait here. >>>>> >>>>> Ho Huum >>>>> >>>>> >>>> >>>> Looks like it is going to be a long wait. I think you are waiting to >>>> do something that might not be inplace/available at all. That thing is >>>> the capability to reset reshape flag when the array metadata is not >>>> consistent. You had an old array in two of these drives and it seems >>>> mdadm confused when it observes the drives metadata are not >>>> consistent. 
>>>> >>>> Hope someone chip in some tricks to do so without a need to develop >>>> such a functionality in mdadm. >> >> Do you know the metadata version that is used on those two drives ? >> For example, if the version is < 1.0 then we could easily erase the >> old metadata since it has been recorded in the end of the drive. Newer >> metada versions after 1.0 are stored in the beginning of the drive. >> >> Therefore, there is no risk to erase your current array metadata ! Another> -- Another> To unsubscribe from this list: send the line "unsubscribe linux-raid" in Another> the body of a message to majordomo@vger.kernel.org Another> More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Fwd: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-04 22:07 ` John Stoffel @ 2016-03-05 10:28 ` Another Sillyname 0 siblings, 0 replies; 23+ messages in thread From: Another Sillyname @ 2016-03-05 10:28 UTC (permalink / raw) To: John Stoffel; +Cc: Alireza Haghdoost, Linux-RAID John As I said in a previous reply I'm not willing to just 'try' things (such as using a later mdadm) as in my opinion that's not an analytical approach and nothing will be learnt from a success. I want to understand both why this happened and also what specifically needs to be done to recover it (if it is a later version of mdadm what in that later version addesses this problem), only then can any subsequent user with a similar problem be able to to follow this example to fix their array. I'd already posted the mdadm examine in the OP, I've copied the original OP below again for completeness. Thanks for your thoughts. The original post. ----------------------------------------------------------------------------- I have a 30TB RAID6 array using 7 x 6TB drives that I wanted to migrate to RAID5 to take one of the drives offline and use in a new array for a migration. sudo mdadm --grow /dev/md127 --level=raid5 --raid-device=6 --backup-file=mdadm_backupfile I watched this using cat /proc/mdstat and even after an hour the percentage of the reshape was still 0.0%. I know from previous experience that reshaping can be slow, but did not expect it to be this slow frankly. But erring on the side of caution I decided to leave the array for 12 hours and see what was happening then. Sure enough, 12 hours later cat /proc/mdstat still shows reshape at 0.0% Looking at CPU usage the reshape process is using 0% of the CPU. So reading a bit more......if you reboot a server the reshape should continue. Reboot..... Array will not come back online at all. Bring the server up without the array trying to automount. cat /proc/mdstat shows the array offline. Personalities : md127 : inactive sdf1[2](S) sde1[3](S) sdg1[0](S) sdb1[8](S) sdh1[7](S) sdc1[1](S) sdd1[6](S) 41022733300 blocks super 1.2 unused devices: <none> Try to reassemble the array. >sudo mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 mdadm: /dev/sdg1 is busy - skipping mdadm: /dev/sdh1 is busy - skipping mdadm: Merging with already-assembled /dev/md/server187.internallan.com:1 mdadm: Failed to restore critical section for reshape, sorry. Possibly you needed to specify the --backup-file Have no idea where the server187 stuff has come from. stop the array. >sudo mdadm --stop /dev/md127 try to re-assemble >sudo mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 mdadm: Failed to restore critical section for reshape, sorry. Possibly you needed to specify the --backup-file try to re-assemble using the backup file >sudo mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 --backup-file=mdadm_backupfile mdadm: Failed to restore critical section for reshape, sorry. 
have a look at the individual drives >sudo mdadm --examine /dev/sd[b-h]1 /dev/sdb1: Magic : a92b4efc Version : 1.2 Feature Map : 0x5 Array UUID : da29a06f:f8cf1409:bc52afb2:6945ba08 Name : server187.internallan.com:1 Creation Time : Sun May 10 14:47:51 2015 Raid Level : raid6 Raid Devices : 7 Avail Dev Size : 11720780943 (5588.90 GiB 6001.04 GB) Array Size : 29301952000 (27944.52 GiB 30005.20 GB) Used Dev Size : 11720780800 (5588.90 GiB 6001.04 GB) Data Offset : 262144 sectors Super Offset : 8 sectors Unused Space : before=262056 sectors, after=143 sectors State : clean Device UUID : 1152bdeb:15546156:1918b67d:37d68b1f Internal Bitmap : 8 sectors from superblock Reshape pos'n : 0 New Layout : left-symmetric-6 Update Time : Wed Mar 2 01:19:42 2016 Bad Block Log : 512 entries available at offset 72 sectors Checksum : 3a66db58 - correct Events : 369282 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 4 Array State : AAAAAAA ('A' == active, '.' == missing, 'R' == replacing) /dev/sdc1: Magic : a92b4efc Version : 1.2 Feature Map : 0x5 Array UUID : da29a06f:f8cf1409:bc52afb2:6945ba08 Name : server187.internallan.com:1 Creation Time : Sun May 10 14:47:51 2015 Raid Level : raid6 Raid Devices : 7 Avail Dev Size : 11720780943 (5588.90 GiB 6001.04 GB) Array Size : 29301952000 (27944.52 GiB 30005.20 GB) Used Dev Size : 11720780800 (5588.90 GiB 6001.04 GB) Data Offset : 262144 sectors Super Offset : 8 sectors Unused Space : before=262056 sectors, after=143 sectors State : clean Device UUID : 140e09af:56e14b4e:5035d724:c2005f0b Internal Bitmap : 8 sectors from superblock Reshape pos'n : 0 New Layout : left-symmetric-6 Update Time : Wed Mar 2 01:19:42 2016 Bad Block Log : 512 entries available at offset 72 sectors Checksum : 88916c56 - correct Events : 369282 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 1 Array State : AAAAAAA ('A' == active, '.' == missing, 'R' == replacing) /dev/sdd1: Magic : a92b4efc Version : 1.2 Feature Map : 0x5 Array UUID : da29a06f:f8cf1409:bc52afb2:6945ba08 Name : server187.internallan.com:1 Creation Time : Sun May 10 14:47:51 2015 Raid Level : raid6 Raid Devices : 7 Avail Dev Size : 11720780943 (5588.90 GiB 6001.04 GB) Array Size : 29301952000 (27944.52 GiB 30005.20 GB) Used Dev Size : 11720780800 (5588.90 GiB 6001.04 GB) Data Offset : 262144 sectors Super Offset : 8 sectors Unused Space : before=262056 sectors, after=143 sectors State : clean Device UUID : a50dd0a1:eeb0b3df:76200476:818e004d Internal Bitmap : 8 sectors from superblock Reshape pos'n : 0 New Layout : left-symmetric-6 Update Time : Wed Mar 2 01:19:42 2016 Bad Block Log : 512 entries available at offset 72 sectors Checksum : 9f8eb46a - correct Events : 369282 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 6 Array State : AAAAAAA ('A' == active, '.' 
== missing, 'R' == replacing) /dev/sde1: Magic : a92b4efc Version : 1.2 Feature Map : 0x5 Array UUID : da29a06f:f8cf1409:bc52afb2:6945ba08 Name : server187.internallan.com:1 Creation Time : Sun May 10 14:47:51 2015 Raid Level : raid6 Raid Devices : 7 Avail Dev Size : 11720780943 (5588.90 GiB 6001.04 GB) Array Size : 29301952000 (27944.52 GiB 30005.20 GB) Used Dev Size : 11720780800 (5588.90 GiB 6001.04 GB) Data Offset : 262144 sectors Super Offset : 8 sectors Unused Space : before=262056 sectors, after=143 sectors State : clean Device UUID : 7d0b65b3:d2ba2023:4625c287:1db2de9b Internal Bitmap : 8 sectors from superblock Reshape pos'n : 0 New Layout : left-symmetric-6 Update Time : Wed Mar 2 01:19:42 2016 Bad Block Log : 512 entries available at offset 72 sectors Checksum : 552ce48f - correct Events : 369282 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 3 Array State : AAAAAAA ('A' == active, '.' == missing, 'R' == replacing) /dev/sdf1: Magic : a92b4efc Version : 1.2 Feature Map : 0x5 Array UUID : da29a06f:f8cf1409:bc52afb2:6945ba08 Name : server187.internallan.com:1 Creation Time : Sun May 10 14:47:51 2015 Raid Level : raid6 Raid Devices : 7 Avail Dev Size : 11720780943 (5588.90 GiB 6001.04 GB) Array Size : 29301952000 (27944.52 GiB 30005.20 GB) Used Dev Size : 11720780800 (5588.90 GiB 6001.04 GB) Data Offset : 262144 sectors Super Offset : 8 sectors Unused Space : before=262056 sectors, after=143 sectors State : clean Device UUID : cda4f5e5:a489dbb9:5c1ab6a0:b257c984 Internal Bitmap : 8 sectors from superblock Reshape pos'n : 0 New Layout : left-symmetric-6 Update Time : Wed Mar 2 01:19:42 2016 Bad Block Log : 512 entries available at offset 72 sectors Checksum : 2056e75c - correct Events : 369282 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 2 Array State : AAAAAAA ('A' == active, '.' == missing, 'R' == replacing) /dev/sdg1: Magic : a92b4efc Version : 1.2 Feature Map : 0x5 Array UUID : da29a06f:f8cf1409:bc52afb2:6945ba08 Name : server187.internallan.com:1 Creation Time : Sun May 10 14:47:51 2015 Raid Level : raid6 Raid Devices : 7 Avail Dev Size : 11720780943 (5588.90 GiB 6001.04 GB) Array Size : 29301952000 (27944.52 GiB 30005.20 GB) Used Dev Size : 11720780800 (5588.90 GiB 6001.04 GB) Data Offset : 262144 sectors Super Offset : 8 sectors Unused Space : before=262056 sectors, after=143 sectors State : clean Device UUID : df5af6ce:9017c863:697da267:046c9709 Internal Bitmap : 8 sectors from superblock Reshape pos'n : 0 New Layout : left-symmetric-6 Update Time : Wed Mar 2 01:19:42 2016 Bad Block Log : 512 entries available at offset 72 sectors Checksum : fefea2b5 - correct Events : 369282 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 0 Array State : AAAAAAA ('A' == active, '.' 
== missing, 'R' == replacing) /dev/sdh1: Magic : a92b4efc Version : 1.2 Feature Map : 0x5 Array UUID : da29a06f:f8cf1409:bc52afb2:6945ba08 Name : server187.internallan.com:1 Creation Time : Sun May 10 14:47:51 2015 Raid Level : raid6 Raid Devices : 7 Avail Dev Size : 11720780943 (5588.90 GiB 6001.04 GB) Array Size : 29301952000 (27944.52 GiB 30005.20 GB) Used Dev Size : 11720780800 (5588.90 GiB 6001.04 GB) Data Offset : 262144 sectors Super Offset : 8 sectors Unused Space : before=262056 sectors, after=143 sectors State : clean Device UUID : 9d98af83:243c3e02:94de20c7:293de111 Internal Bitmap : 8 sectors from superblock Reshape pos'n : 0 New Layout : left-symmetric-6 Update Time : Wed Mar 2 01:19:42 2016 Bad Block Log : 512 entries available at offset 72 sectors Checksum : b9f6375e - correct Events : 369282 Layout : left-symmetric Chunk Size : 512K Device Role : Active device 5 Array State : AAAAAAA ('A' == active, '.' == missing, 'R' == replacing) As all the drives are showing Reshape pos'n 0 I'm assuming the reshape never got started (even though cat /proc/mdstat showed the array reshaping)? So now I'm well out of my comfort zone so instead of flapping around have decided to sleep for a few hours before revisiting this. Any help and guidance would be appreciated, the drives showing clean gives me comfort that the data is likely intact and complete (crossed fingers) however I can't re-assemble the array as I keep getting the 'critical information for reshape, sorry' warning. Help??? -------------------------------------------------------------------------------------------------------- On 4 March 2016 at 22:07, John Stoffel <john@stoffel.org> wrote: > > Can you post the output of mdadm -E /dev/sd?1 for all your drives? > And did you pull down the latest version of mdadm from neil's repo and > build it and use that to undo the re-shape? > > John > > > Another> I have no clue, they were used in a temporary system for 10 days about > Another> 8 months ago, they were then used in the new array that was built back > Another> in August. > > Another> Even if the metadata was removed from those two drives the 'merge' > Another> that happened, without warning or requiring verification, seems to now > Another> have 'contaminated' all the drives possibly. > > Another> I'm still reasonably convinced the data is there and intact, just need > Another> an analytical approach to how to recover it. > > > > Another> On 4 March 2016 at 21:02, Alireza Haghdoost <alireza@cs.umn.edu> wrote: >>> On Fri, Mar 4, 2016 at 2:30 PM, Another Sillyname >>> <anothersname@googlemail.com> wrote: >>>> That's possibly true, however there are lessons to be learnt here even >>>> if my array is not recoverable. >>>> >>>> I don't know the process order of doing a reshape....but I would >>>> suspect it's something along the lines of. >>>> >>>> Examine existing array. >>>> Confirm command can be run against existing array configuration (i.e. >>>> It's a valid command for this array setup). >>>> Do backup file (if specified) >>>> Set reshape flag high >>>> Start reshape >>>> >>>> I would suggest.... >>>> >>>> There needs to be another step in the process >>>> >>>> Before 'Set reshape flag high' that the backup file needs to be >>>> checked for consistency. >>>> >>>> My backup file appears to be just full of EOLs (now for all I know the >>>> backup file actually gets 'created' during the process and therefore >>>> starts out as EOLs). 
But once the flag is set high you are then >>>> committing the array before you know if the backup is good. >>>> >>>> Also >>>> >>>> The drives in this array had been working correctly for 6 months and >>>> undergone a number of reboots. >>>> >>>> If, as we are theorising, there was some metadata from a previous >>>> array setup on two of the drives that as a result of the reshape >>>> somehow became the 'valid' metadata regarding those two drives RAID >>>> status then I would suggest that during any mdadm raid create process >>>> there is an extensive and thorough check of any drives being used to >>>> identify and remove any possible previously existing RAID metadata >>>> information...thus making the drives 'clean'. >>>> >>>> >>>> >>>> >>>> >>>> >>>> On 4 March 2016 at 19:11, Alireza Haghdoost <alireza@cs.umn.edu> wrote: >>>>> On Fri, Mar 4, 2016 at 1:01 PM, Another Sillyname >>>>> <anothersname@googlemail.com> wrote: >>>>>> >>>>>> >>>>>> Thanks for the suggestion but I'm still stuck and there is no bug >>>>>> tracker on the mdadm git website so I have to wait here. >>>>>> >>>>>> Ho Huum >>>>>> >>>>>> >>>>> >>>>> Looks like it is going to be a long wait. I think you are waiting to >>>>> do something that might not be inplace/available at all. That thing is >>>>> the capability to reset reshape flag when the array metadata is not >>>>> consistent. You had an old array in two of these drives and it seems >>>>> mdadm confused when it observes the drives metadata are not >>>>> consistent. >>>>> >>>>> Hope someone chip in some tricks to do so without a need to develop >>>>> such a functionality in mdadm. >>> >>> Do you know the metadata version that is used on those two drives ? >>> For example, if the version is < 1.0 then we could easily erase the >>> old metadata since it has been recorded in the end of the drive. Newer >>> metada versions after 1.0 are stored in the beginning of the drive. >>> >>> Therefore, there is no risk to erase your current array metadata ! > Another> -- > Another> To unsubscribe from this list: send the line "unsubscribe linux-raid" in > Another> the body of a message to majordomo@vger.kernel.org > Another> More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 23+ messages in thread
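As an aside on the stale-metadata theory being discussed above: before a drive that once belonged to another array is reused, its old superblock can be checked and, if genuinely defunct, cleared. A minimal sketch (the device name is a placeholder; --zero-superblock destroys md metadata, so never run it on a member of the array you are trying to recover):

  # Read-only checks: what md metadata and other signatures are on the partition?
  mdadm --examine /dev/sdX1
  wipefs -n /dev/sdX1        # -n = report only, do not erase anything

  # Only once you are certain the superblock belongs to a defunct array:
  # mdadm --zero-superblock /dev/sdX1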
* Re: RAID6 Array crash during reshape.....now will not re-assemble.
  2016-03-02  3:46 RAID6 Array crash during reshape.....now will not re-assemble Another Sillyname
  2016-03-02 13:20 ` Wols Lists
@ 2016-03-05 10:47 ` Andreas Klauer
  2016-03-09  0:23 ` NeilBrown
  2 siblings, 0 replies; 23+ messages in thread
From: Andreas Klauer @ 2016-03-05 10:47 UTC (permalink / raw)
  To: Another Sillyname; +Cc: Linux-RAID

On Wed, Mar 02, 2016 at 03:46:48AM +0000, Another Sillyname wrote:
> As all the drives are showing Reshape pos'n 0 I'm assuming the reshape
> never got started (even though cat /proc/mdstat showed the array
> reshaping)?

I fixed such a thing by editing the RAID metadata so that it is no longer
in the reshape state... incomplete instructions:
https://bpaste.net/show/2231e697431d

The example above used a 1.0 superblock, I think; in order to adapt this
successfully to your situation you should refer to
https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#The_version-1_superblock_format_on-disk_layout

If current mdadm/kernel has a way to get out of this ditch directly, of
course that would be so much better...

Apart from that, the other alternative that comes to mind is using
--create, but for that to be successful you have to be sure to get all
variables right (superblock version, data offset, raid level, chunk size,
disk order, ...), use --assume-clean and/or missing disks to prevent
resyncs, and verify the results in read-only mode.

Regards
Andreas

^ permalink raw reply	[flat|nested] 23+ messages in thread
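To make the --create alternative above concrete, a heavily hedged sketch of what it could look like for this particular array, using only parameters visible in the --examine output earlier in the thread (metadata 1.2, 512K chunk, data offset 262144 sectors = 128M, device order taken from the Device Role fields: sdg1, sdc1, sdf1, sde1, sdb1, sdh1, sdd1). This is a last resort, not a recommendation: a wrong parameter writes wrong metadata, so it should only ever be attempted against overlays or dumped copies, never the live members:

  # DANGER: run only against overlays/copies, never the real drives.
  # Requires an mdadm new enough to accept --data-offset on create.
  mdadm --create /dev/md127 --assume-clean \
        --metadata=1.2 --level=6 --raid-devices=7 --chunk=512 \
        --data-offset=128M \
        /dev/sdg1 /dev/sdc1 /dev/sdf1 /dev/sde1 /dev/sdb1 /dev/sdh1 /dev/sdd1

  # Verify without writing before trusting anything:
  mdadm --readonly /dev/md127
  fsck -n /dev/md127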
* Re: RAID6 Array crash during reshape.....now will not re-assemble.
  2016-03-02  3:46 RAID6 Array crash during reshape.....now will not re-assemble Another Sillyname
  2016-03-02 13:20 ` Wols Lists
  2016-03-05 10:47 ` Andreas Klauer
@ 2016-03-09  0:23 ` NeilBrown
  2016-03-12 11:38 ` Another Sillyname
  2 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2016-03-09  0:23 UTC (permalink / raw)
  To: Another Sillyname, Linux-RAID

[-- Attachment #1: Type: text/plain, Size: 7510 bytes --]

On Wed, Mar 02 2016, Another Sillyname wrote:

> I have a 30TB RAID6 array using 7 x 6TB drives that I wanted to
> migrate to RAID5 to take one of the drives offline and use in a new
> array for a migration.
>
> sudo mdadm --grow /dev/md127 --level=raid5 --raid-device=6
> --backup-file=mdadm_backupfile

First observation: don't use --backup-file unless mdadm tells you that
you have to.  A new mdadm on a new kernel with newly created arrays
doesn't need a backup file at all.  Your array is sufficiently newly
created, and I think your mdadm/kernel are new enough too.  Note in the
--examine output:

> Unused Space : before=262056 sectors, after=143 sectors

This means there is (nearly) 128M of free space at the start of each
device.  md can perform the reshape by copying a few chunks down into
this space, then the next few chunks into the space just freed, then the
next few chunks ... and so on.  No backup file needed.  That is,
providing the chunk size is quite a bit smaller than the space, and your
512K chunk size certainly is.

A reshape which increases the size of the array needs 'before' space, a
reshape which decreases the size of the array needs 'after' space.  A
reshape which doesn't change the size of the array (like yours) can use
either.

>
> I watched this using cat /proc/mdstat and even after an hour the
> percentage of the reshape was still 0.0%.

A more useful number to watch is the (xxx/yyy) after the percentage.
The first number should change at least every few seconds.

>
> Reboot.....
>
> Array will not come back online at all.
>
> Bring the server up without the array trying to automount.
>
> cat /proc/mdstat shows the array offline.
>
> Personalities :
> md127 : inactive sdf1[2](S) sde1[3](S) sdg1[0](S) sdb1[8](S)
> sdh1[7](S) sdc1[1](S) sdd1[6](S)
> 41022733300 blocks super 1.2
>
> unused devices: <none>
>
> Try to reassemble the array.
>
>>sudo mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1
> mdadm: /dev/sdg1 is busy - skipping
> mdadm: /dev/sdh1 is busy - skipping
> mdadm: Merging with already-assembled /dev/md/server187.internallan.com:1

It looks like you are getting races with udev.  mdadm is detecting the
race and says that it is "Merging" rather than creating a separate array,
but still the result isn't very useful...

When you run "mdadm --assemble /dev/md127 ...." mdadm notices that
/dev/md127 already exists but isn't active, so it stops it properly so
that all the devices become available to be assembled.  As the devices
become available they tell udev "Hey, I've changed status" and udev says
"Hey, you look like part of an md array, let's put you back
together".... or something like that.  I might have the details a little
wrong - it is a while since I looked at this.  Anyway, it seems that udev
called "mdadm -I" to put some of the devices together, so they were busy
when your "mdadm --assemble" looked at them.

> mdadm: Failed to restore critical section for reshape, sorry.
> Possibly you needed to specify the --backup-file
>
> Have no idea where the server187 stuff has come from.

That is in the 'Name' field in the metadata, which must have been put
there when the array was created:

> Name : server187.internallan.com:1
> Creation Time : Sun May 10 14:47:51 2015

It is possible to change it after-the-fact, but unlikely unless someone
explicitly tried.  It doesn't really matter how it got there, as all the
devices are the same.  When "mdadm -I /dev/sdb1" etc. is run by udev,
mdadm needs to deduce a name for the array.  It looks in the Name field
and creates

  /dev/md/server187.internallan.com:1

>
> stop the array.
>
>>sudo mdadm --stop /dev/md127
>
> try to re-assemble
>
>>sudo mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1
>
> mdadm: Failed to restore critical section for reshape, sorry.
> Possibly you needed to specify the --backup-file
>
> try to re-assemble using the backup file
>
>>sudo mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 --backup-file=mdadm_backupfile
>
> mdadm: Failed to restore critical section for reshape, sorry.

As you have noted elsewhere, the backup file contains nothing useful.
That is causing the problem.

When an in-place reshape like yours (not changing the size of the array,
just changing the configuration) starts, the sequence is something like:

 - make sure reshape doesn't progress at all (set md/sync_max to zero)
 - tell the kernel about the new shape of the array
 - start the reshape (this won't make any progress, but will update the
   metadata)
 Start:
 - suspend user-space writes to the next few stripes
 - read the next few stripes and write to the backup file
 - tell the kernel that it is allowed to progress to the end of those
   'few stripes'
 - wait for the kernel to do that
 - invalidate the backup
 - resume user-space writes to those next few stripes
 - goto Start

(the process is actually 'double-buffered' so it is more complex, but
this gives the idea close enough)

If the system crashes or is shut down, on restart the kernel cannot know
if the "next few stripes" started reshaping or not, so it depends on
mdadm to load the backup file, check if there is valid data, and write
it out.

I suspect that part of the problem is that mdadm --grow doesn't
initialize the backup file in quite the right way, so when mdadm
--assemble looks at it, it doesn't see "nothing has been written yet"
but instead sees "confusion" and gives up.

If you --stop and then run the same --assemble command, including the
--backup-file, but this time add --invalid-backup (a bit like Wol
suggested), it should assemble and restart the reshape.  --invalid-backup
tells mdadm "I know the backup file is invalid, I know that means there
could be inconsistent data which won't be restored, but I know what is
going on and I'm willing to take that risk.  Just don't restore anything,
it'll be fine.  Really."

I don't actually recommend doing that though.

It would be better to revert the current reshape and start again with no
--backup-file.  This will use the new mechanism of changing the "Data
Offset", which is easier to work with and should be faster.

If you have the very latest mdadm (3.4) you can add
--update=revert-reshape together with --invalid-backup, and in your case
this will cancel the reshape and let you start again.

You can test this out fairly safely if you want to.

  mkdir /tmp/foo
  mdadm --dump /tmp/foo /dev/....     (list of all devices in the array)

This will create sparse files in /tmp/foo containing just the md metadata
from those devices.  Use "losetup /dev/loop0 /tmp/foo/sdb1" etc. to
create a loop-back device for each of those files (there are multiple
hard links to each file - just choose one each).  Then you can experiment
with mdadm on those /dev/loopXX files to see what happens.

Once you have the array reverted, you can start a new --grow, but don't
specify a --backup-file.  That should DoTheRightThing.

This still leaves the question of why it didn't start a reshape in the
first place.  If someone would like to experiment (probably with
loop-back files) and produce a test case that reliably (or even just
occasionally) hangs, then I'm happy to have a look at it.

It also doesn't answer the question of why mdadm doesn't create the
backup file in a format that it knows is safe to ignore.  Maybe someone
could look into that.

Good luck :-)

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread
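One small addition to Neil's point about watching the (xxx/yyy) counter rather than the percentage: progress can also be read directly from sysfs, which makes it obvious whether a reshape is actually moving. A minimal sketch using the md device name from this thread (the sysfs attributes are standard md ones, but check they exist on your kernel):

  # The block counter in /proc/mdstat should advance every few seconds.
  watch -n 5 cat /proc/mdstat

  # Equivalent information via sysfs:
  cat /sys/block/md127/md/sync_action        # should say "reshape" while reshaping
  cat /sys/block/md127/md/sync_completed     # "done / total" in sectors
  cat /sys/block/md127/md/reshape_position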
* Re: RAID6 Array crash during reshape.....now will not re-assemble. 2016-03-09 0:23 ` NeilBrown @ 2016-03-12 11:38 ` Another Sillyname 2016-03-14 1:08 ` NeilBrown 0 siblings, 1 reply; 23+ messages in thread From: Another Sillyname @ 2016-03-12 11:38 UTC (permalink / raw) To: NeilBrown; +Cc: Linux-RAID Neil Thanks for the insight, much appreciated. I've tried what you suggested and still get stuck. >:losetup /dev/loop0 /tmp/foo/sdb1 >:losetup /dev/loop1 /tmp/foo/sdc1 >:losetup /dev/loop2 /tmp/foo/sdd1 >:losetup /dev/loop3 /tmp/foo/sde1 >:losetup /dev/loop4 /tmp/foo/sdf1 >:losetup /dev/loop5 /tmp/foo/sdg1 >:losetup /dev/loop6 /tmp/foo/sdh1 >:mdadm --assemble --force --update=revert-reshape --invalid-backup /dev/md127 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4 /dev/loop5 /dev/loop6 mdadm: /dev/md127: Need a backup file to complete reshape of this array. mdadm: Please provided one with "--backup-file=..." mdadm: (Don't specify --update=revert-reshape again, that part succeeded.) As you can see it 'seems' to have accepted the revert command, but even though I've told it the backup is invalid it's still insisting on the backup being made available. Any further thoughts or insights would be gratefully received. On 9 March 2016 at 00:23, NeilBrown <nfbrown@novell.com> wrote: > On Wed, Mar 02 2016, Another Sillyname wrote: > >> I have a 30TB RAID6 array using 7 x 6TB drives that I wanted to >> migrate to RAID5 to take one of the drives offline and use in a new >> array for a migration. >> >> sudo mdadm --grow /dev/md127 --level=raid5 --raid-device=6 >> --backup-file=mdadm_backupfile > > First observation: Don't use --backup-file unless mdadm tell you that > you have to. New mdadm on new kernel with newly create arrays don't > need a backup file at all. Your array is sufficiently newly created and > I think your mdadm/kernel are new enough too. Note in the --examine output: > >> Unused Space : before=262056 sectors, after=143 sectors > > This means there is (nearly) 128M of free space in the start of each > device. md can perform the reshape by copying a few chunks down into > this space, then the next few chunks into the space just freed, then the > next few chunks ... and so on. No backup file needed. That is > providing the chunk size is quite a bit smaller than the space, and your > 512K chunk size certainly is. > > A reshape which increases the size of the array needs 'before' space, a > reshape which decreases the size of the array needs 'after' space. A > reshape which doesn't change the size of the array (like yours) can use > either. > >> >> I watched this using cat /proc/mdstat and even after an hour the >> percentage of the reshape was still 0.0%. > > A more useful number to watch is the (xxx/yyy) after the percentage. > The first number should change at least every few seconds. > >> >> Reboot..... >> >> Array will not come back online at all. >> >> Bring the server up without the array trying to automount. >> >> cat /proc/mdstat shows the array offline. >> >> Personalities : >> md127 : inactive sdf1[2](S) sde1[3](S) sdg1[0](S) sdb1[8](S) >> sdh1[7](S) sdc1[1](S) sdd1[6](S) >> 41022733300 blocks super 1.2 >> >> unused devices: <none> >> >> Try to reassemble the array. 
>> >>>sudo mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 >> mdadm: /dev/sdg1 is busy - skipping >> mdadm: /dev/sdh1 is busy - skipping >> mdadm: Merging with already-assembled /dev/md/server187.internallan.com:1 > > It looks like you are getting races with udev. mdadm is detecting the > race and says that it is "Merging" rather than creating a separate array > but still the result isn't very useful... > > > When you run "mdadm --assemble /dev/md127 ...." mdadm notices that /dev/md127 > already exists but isn't active, so it stops it properly so that all the > devices become available to be assembled. > As the devices become available they tell udev "Hey, I've changed > status" and udev says "Hey, you look like part of an md array, let's put > you back together".... or something like that. I might have the details > a little wrong - it is a while since I looked at this. > Anyway it seems that udev called "mdadm -I" to put some of the devices > together so they were busy when your "mdadm --assemble" looked at them. > > >> mdadm: Failed to restore critical section for reshape, sorry. >> Possibly you needed to specify the --backup-file >> >> >> Have no idea where the server187 stuff has come from. > > That is in the 'Name' field in the metadata, which must have been put > there when the array was created >> Name : server187.internallan.com:1 >> Creation Time : Sun May 10 14:47:51 2015 > > It is possible to change it after-the-fact, but unlikely unless someone > explicitly tried. > I doesn't really matter how it got there as all the devices are the > same. > When "mdadm -I /dev/sdb1" etc is run by udev, mdadm needs to deduce a > name for the array. It looks in the Name filed and creates > > /dev/md/server187.internallan.com:1 > >> >> stop the array. >> >>>sudo mdadm --stop /dev/md127 >> >> try to re-assemble >> >>>sudo mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 >> >> mdadm: Failed to restore critical section for reshape, sorry. >> Possibly you needed to specify the --backup-file >> >> >> try to re-assemble using the backup file >> >>>sudo mdadm --assemble /dev/md127 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 --backup-file=mdadm_backupfile >> >> mdadm: Failed to restore critical section for reshape, sorry. > > As you have noted else where, the backup file contains nothing useful. > That is causing the problem. > > When an in-place reshape like yours (not changing the size of the array, > just changing the configuration) starts the sequence is something like: > > - make sure reshape doesn't progress at all (set md/sync_max to zero) > - tell the kernel about the new shape of the array > - start the reshape (this won't make any progress, but will update the > metadata) > Start: > - suspend user-space writes to the next few stripes > - read the next few stripes and write to the backup file > - tell the kernel that it is allowed to progress to the end of those > 'few stripes' > - wait for the kernel to do that > - invalidate the backup > - resume user-space writes to those next few stripes > - goto Start > > (the process is actually 'double-buffered' so it is more complex, but > this gives the idea close enough) > > If the system crashes or is shut down, on restart the kernel cannot know > if the "next few stripes" started reshaping or not, so it depends on > mdadm to load the backup file, check if there is valid data, and write > it out. 
> > I suspect that part of the problem is that mdadm --grow doesn't initialize the > backup file in quite the right way, so when mdadm --assemble looks at it > it doesn't see "Nothing has been written yet" but instead sees > "confusion" and gives up. > > If you --stop and then run the same --assemble command, including the > --backup, but this time add --invalid-backup (a bit like Wol > suggested) it should assemble and restart the reshape. --invalid-backup > tells mdadm "I know the backup file is invalid, I know that means there > could be inconsistent data which won't be restored, but I know what is > going on and I'm willing to take that risk. Just don't restore anything, > it'll be find. Really". > > I don't actually recommend doing that though. > > It would be better to revert the current reshape and start again with no > --backup file. This will use the new mechanism of changing the "Data > Offset" which is easier to work with and should be faster. > > If you have the very latest mdadm (3.4) you can add > --update=revert-reshape together with --invalid-backup and in your case > this will cancel the reshape and let you start again. > > You can test this out fairly safely if you want to. > > mkdir /tmp/foo > mdadm --dump /tmp/foo /dev/.... list of all devices in the array > > This will create sparse files in /tmp/foo containing just the md > metadata from those devices. Use "losetup /dev/loop0 /tmp/foo/sdb1" etc > to create loop-back device for all those files (there are multiple hard > links to each file - just choose 1 each). > Then you can experiment with mdadm on those /dev/loopXX files to see > what happens. > > Once you have the array reverted, you can start a new --grow, but don't > specify a --backup file. That should DoTheRightThing. > > This still leaves the question of why it didn't start a reshape in the > first place. If someone would like to experiment (probably with > loop-back files) and produce a test case that reliably (or even just > occasionally) hangs, then I'm happy to have a look at it. > > It also doesn't answer the question of why mdadm doesn't create the > backup file in a format that it knows is safe to ignore. Maybe someone > could look into that. > > > Good luck :-) > > NeilBrown ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: RAID6 Array crash during reshape.....now will not re-assemble.
  2016-03-12 11:38 ` Another Sillyname
@ 2016-03-14  1:08 ` NeilBrown
  0 siblings, 0 replies; 23+ messages in thread
From: NeilBrown @ 2016-03-14  1:08 UTC (permalink / raw)
  To: Another Sillyname; +Cc: Linux-RAID

[-- Attachment #1: Type: text/plain, Size: 1252 bytes --]

On Sat, Mar 12 2016, Another Sillyname wrote:

> Neil
>
> Thanks for the insight, much appreciated.
>
> I've tried what you suggested and still get stuck.
>
>>:losetup /dev/loop0 /tmp/foo/sdb1
>>:losetup /dev/loop1 /tmp/foo/sdc1
>>:losetup /dev/loop2 /tmp/foo/sdd1
>>:losetup /dev/loop3 /tmp/foo/sde1
>>:losetup /dev/loop4 /tmp/foo/sdf1
>>:losetup /dev/loop5 /tmp/foo/sdg1
>>:losetup /dev/loop6 /tmp/foo/sdh1
>
>>:mdadm --assemble --force --update=revert-reshape --invalid-backup /dev/md127 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4 /dev/loop5 /dev/loop6
> mdadm: /dev/md127: Need a backup file to complete reshape of this array.
> mdadm: Please provided one with "--backup-file=..."
> mdadm: (Don't specify --update=revert-reshape again, that part succeeded.)
>
> As you can see it 'seems' to have accepted the revert command, but
> even though I've told it the backup is invalid it's still insisting on
> the backup being made available.
>
> Any further thoughts or insights would be gratefully received.

Try giving it a backup file too.  It doesn't matter what the contents of
the file are because you have told it the file is invalid.  But I guess
it thinks it might still need a file to write new backups to temporarily.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread
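A minimal sketch of what Neil suggests here, following on from the loop-device commands quoted above: supply a throwaway backup file and drop --update=revert-reshape, since the quoted mdadm output says that update has already been applied. The file name and size are arbitrary assumptions; --invalid-backup tells mdadm not to restore anything from it:

  # Create a throwaway backup file (contents irrelevant).
  truncate -s 64M /root/dummy-backup

  mdadm --assemble --force --invalid-backup --backup-file=/root/dummy-backup \
        /dev/md127 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3 \
        /dev/loop4 /dev/loop5 /dev/loop6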
end of thread, other threads:[~2016-03-14  1:08 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-02  3:46 RAID6 Array crash during reshape.....now will not re-assemble Another Sillyname
2016-03-02 13:20 ` Wols Lists
     [not found] ` <CAOS+5GHof=F94x58SKqFojV26hGpDSLF85dFfm8Xc6M43sN6jA@mail.gmail.com>
2016-03-02 13:42 ` Fwd: " Another Sillyname
2016-03-02 15:59 ` Another Sillyname
2016-03-03 11:37 ` Another Sillyname
2016-03-03 12:56 ` Wols Lists
     [not found] ` <CAOS+5GH1Rcu8zGk1dQ+aSNmVzjo=irH65KfPuq1ZGruzqX_=vg@mail.gmail.com>
2016-03-03 14:07 ` Fwd: " Another Sillyname
2016-03-03 17:48 ` Sarah Newman
2016-03-03 17:59 ` Another Sillyname
2016-03-03 20:47 ` John Stoffel
2016-03-03 22:19 ` Another Sillyname
2016-03-03 22:42 ` John Stoffel
2016-03-04 19:01 ` Another Sillyname
2016-03-04 19:11 ` Alireza Haghdoost
2016-03-04 20:30 ` Another Sillyname
2016-03-04 21:02 ` Alireza Haghdoost
2016-03-04 21:52 ` Another Sillyname
2016-03-04 22:07 ` John Stoffel
2016-03-05 10:28 ` Another Sillyname
2016-03-05 10:47 ` Andreas Klauer
2016-03-09  0:23 ` NeilBrown
2016-03-12 11:38 ` Another Sillyname
2016-03-14  1:08 ` NeilBrown