* mdadm-grow-continue service crashing (similar to "raid5 reshape is stuck" thread from May)
  From: Edward Kuns @ 2015-07-12  6:02 UTC
  To: linux-raid

I experienced a total drive failure. Looking into it, I discovered that
the particular hard drive model that failed is a particularly bad one, so
I replaced not only the failed drive but also another of the same model.
In the process, I ran into a problem where, on reboot, the RAID device
was inactive. I finally found a solution to my problem in the earlier
thread "raid5 reshape is stuck" that started on 15 May.

By the way, I am on Fedora 21:

> rpm -q mdadm
mdadm-3.3.2-1.fc21.x86_64
> uname -srvmpio
Linux 4.0.4-202.fc21.x86_64 #1 SMP Wed May 27 22:28:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

The short version of the story is that I replaced the dead drive and let
the raid5 partition rebuild. Then I added a new drive and let the
partition rebuild. Then I removed the not-yet-dead drive, and here is
where I ran into the same problem as the other poster. Basically, I did
this to replace the still-working-but-suspect device, after the partition
completed rebuilding when I replaced the actually-dead drive:

mdadm --manage /dev/md125 --add /dev/sdf1
mdadm --grow --raid-devices=5 /dev/md125

... wait for the rebuild to complete

mdadm --fail /dev/md125 /dev/sdd2
mdadm --remove /dev/md125 /dev/sdd2
mdadm --grow --raid-devices=4 /dev/md125

mdadm: this change will reduce the size of the array.
       use --grow --array-size first to truncate array.
       e.g. mdadm --grow /dev/md125 --array-size 118964736

mdadm --grow /dev/md125 --array-size 118964736
mdadm --grow --raid-devices=4 /dev/md125

... this failed with a mysterious complaint about my first partition
(Cannot set new_offset). Research got me to try:

mdadm --grow --raid-devices=4 /dev/md125 --backup-file /root/md125.backup

... here everything ground to a halt. The reshape was at 0% and there was
no disk activity.

The solution was to edit /lib/systemd/system/mdadm-grow-continue@.service
to look like this. (It was important that the backup file was placed in
/tmp and not in /root or anywhere else; SELinux allowed mdadm to create a
file in /tmp but not anywhere else I tried.)

# This file is part of mdadm.
#
# mdadm is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.

[Unit]
Description=Manage MD Reshape on /dev/%I
DefaultDependencies=no

[Service]
ExecStart=/usr/sbin/mdadm --grow --continue /dev/%I --backup-file=/tmp/raid-backup-file
StandardInput=null
#StandardOutput=null
#StandardError=null
KillMode=none

I had to comment out the standard out and error lines to see why the
service was failing. I was pulling out my hair. The raid device failed to
initialize, so my computer dumped me into runlevel 1.

When the process finished after the above fix, I ended up in a weird
state:

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       17        1      active sync   /dev/sdb1
       5       8       33        2      active sync   /dev/sdc1
       6       0        0        6      removed

       6       8       49        -      spare   /dev/sdd1

but that is probably a result of what I tried to bring it back. I could
"stop" the raid and manually recreate it, and the filesystems on it were
fine. But it wouldn't come up without me doing that.

I'm going to try to fail and re-add that disk again and see if it works
now that it was able to complete a sync. I did a fail, remove, and add on
/dev/sdd1 and it very quickly synced and came into service. The command
"mdadm --detail /dev/md125" now shows a happy raid5 with four partitions
in it, all "active sync".

So all I had to do was add --backup-file to the command to "grow" down to
4 devices, and also to mdadm-grow-continue@.service. I thought I'd let
you know, in particular, that adding --backup-file=/tmp/raid-backup-file
to the service file worked to get the process unstuck, and that due to
SELinux it must be in /tmp. Also, should the "Cannot set new_offset"
complaint maybe suggest trying again with a backup file?

Eddie

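[Note: the unit under /lib/systemd/system belongs to the mdadm package
and can be overwritten by an update. The same change can be carried as a
systemd drop-in override instead; a rough sketch, reusing the backup-file
path from the message above:

    # /etc/systemd/system/mdadm-grow-continue@.service.d/backup-file.conf
    [Service]
    # clear the packaged ExecStart, then supply one with a backup file
    ExecStart=
    ExecStart=/usr/sbin/mdadm --grow --continue /dev/%I --backup-file=/tmp/raid-backup-file

    # then: systemctl daemon-reload

The drop-in survives package updates and keeps the original unit file
intact for comparison.]
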
* Re: mdadm-grow-continue service crashing (similar to "raid5 reshape is stuck" thread from May)
  From: Phil Turmel @ 2015-07-12 13:45 UTC
  To: Edward Kuns, linux-raid

Hi Edward,

On 07/12/2015 02:02 AM, Edward Kuns wrote:

[trim /]

> The short version of the story is that I replaced the dead drive and
> let the raid5 partition rebuild. Then I added a new drive and let the
> partition rebuild. Then I removed the not-yet-dead drive, and here is
> where I ran into the same problem as the other poster. Basically, I
> did this to replace the still-working-but-suspect device, after the
> partition completed rebuilding when I replaced the actually-dead
> drive:
>
> mdadm --manage /dev/md125 --add /dev/sdf1
> mdadm --grow --raid-devices=5 /dev/md125
>
> ... wait for the rebuild to complete
>
> mdadm --fail /dev/md125 /dev/sdd2
> mdadm --remove /dev/md125 /dev/sdd2
> mdadm --grow --raid-devices=4 /dev/md125
>
> mdadm: this change will reduce the size of the array.
>        use --grow --array-size first to truncate array.
>        e.g. mdadm --grow /dev/md125 --array-size 118964736
>
> mdadm --grow /dev/md125 --array-size 118964736
> mdadm --grow --raid-devices=4 /dev/md125
>
> ... this failed with a mysterious complaint about my first partition
> (Cannot set new_offset). Research got me to try:
>
> mdadm --grow --raid-devices=4 /dev/md125 --backup-file /root/md125.backup

Why were you using --grow for these operations only to reverse it? This
is dangerous if you have a layer or filesystem on your array that
doesn't support shrinking. None of the --grow operations were necessary
in this sequence to achieve the end result of replacing disks.

> ... here everything ground to a halt. The reshape was at 0% and there
> was no disk activity.
>
> The solution was to edit
> /lib/systemd/system/mdadm-grow-continue@.service to look like this.
> (It was important that the backup file was placed in /tmp and not in
> /root or anywhere else; SELinux allowed mdadm to create a file in
> /tmp but not anywhere else I tried.)

I'm not an SELinux guy, so I can't help with the rest, but you should
know that many modern distros delete /tmp on reboot and/or play games
with namespaces to isolate different users' /tmp spaces.

[trim /]

> I did a fail, remove, and add on /dev/sdd1 and it very quickly synced
> and came into service. The command "mdadm --detail /dev/md125" now
> shows a happy raid5 with four partitions in it, all "active sync".

These are the only operations you should have done in the first place.
Although I would have put the --add first, so the --fail operation
would have triggered a rebuild onto the spare right away. At no point
should you have changed the number of raid devices.

And for the still-running but suspect drive, the --replace operation
would have been the right choice, again, after --add of a spare.

HTH,

Phil

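[Note: as a concrete illustration of the sequence Phil describes, a
rough sketch reusing the device names from the original post (which
partition maps to the suspect drive is assumed here):

    # add the replacement as a spare first, then fail the suspect
    # member; md rebuilds onto the spare right away
    mdadm --manage /dev/md125 --add /dev/sdf1
    mdadm --manage /dev/md125 --fail /dev/sdd2
    mdadm --manage /dev/md125 --remove /dev/sdd2

    # or, with a kernel and mdadm recent enough to support it, still
    # after the --add of the spare, copy the member directly while the
    # array stays fully redundant throughout
    mdadm /dev/md125 --replace /dev/sdd2

Neither variant ever changes the array size or the number of raid
devices.]
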
* Re: mdadm-grow-continue service crashing (similar to "raid5 reshape is stuck" thread from May)
  From: Edward Kuns @ 2015-07-12 19:24 UTC
  To: Phil Turmel; +Cc: linux-raid

On Sun, Jul 12, 2015 at 8:45 AM, Phil Turmel <philip@turmel.org> wrote:

> Why were you using --grow for these operations only to reverse it?
> This is dangerous if you have a layer or filesystem on your array
> that doesn't support shrinking. None of the --grow operations were
> necessary in this sequence to achieve the end result of replacing
> disks.

[snip]

> At no point should you have changed the number of raid devices.

[snip]

> And for the still-running but suspect drive, the --replace operation
> would have been the right choice, again, after --add of a spare.

I didn't mention the steps I did to replace the failed drive because
that went flawlessly. I did a fail and remove on it to be sure, but got
complaints that it was already failed/removed. When I did an add for
the replacement drive, it came in and synced automatically. I only ran
into trouble trying to replace the "not yet dead but suspect" drive. I
was following examples on the Internet, and the example I was following
was clearly a bad one. The examples I found didn't suggest the
--replace option. This is ultimately my fault for not being familiar
enough with this. Now I know better.

FWIW, I had LVM on top of the raid5, with two partitions (/var and an
extra storage one) on the LVM. (I think there is some spare space too.)
The goal, of course, is being able to survive any single-drive failure,
which I did.

You said this is dangerous. I went from 4->5 and then immediately 5->4
drives. I didn't expand the LVM on the raid5, and the replacement
partition was a little bigger than the original. Next time, I'll use
--replace, obviously. I just want to understand why it is dangerous. As
long as the replacement partition is as big as the one it is replacing,
isn't this just extra work, and more chance of running into problems
like the one I ran into? But other than that, it shouldn't risk the
actual data stored on the RAID, should it?

> many modern distros delete /tmp on reboot and/or play games with
> namespaces to isolate different users' /tmp spaces.

So if the machine crashes during a rebuild, you may lose that backup
file, depending on the distro. OK. Is there a better solution to this?
Unfortunately, at the time of the failed shrink (the rebuild that
failed to start), stdout and stderr were not going to
/var/log/messages, so I have no idea what the complaint was at that
time. Does this service send so much output to stdout/stderr that it's
useful to suppress it? If I'd seen something in /var/log/messages, it
would have been more clear that there was a service with a complaint
that was the cause of the rebuild failing to start. I wouldn't have
done as much thrashing trying to figure out why.

> These are the only operations you should have done in the first
> place. Although I would have put the --add first, so the --fail
> operation would have triggered a rebuild onto the spare right away.

I did the fail/remove/add at the very end, after replacing the dead
drive, after finally completing the "don't do it this way again"
grow-to-5-then-shrink-to-4 process to replace the not-yet-dead drive.
After the shrink finally completed, the new 4th drive showed as a spare
and removed at the same time, i.e., this dump from my first email:

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       17        1      active sync   /dev/sdb1
       5       8       33        2      active sync   /dev/sdc1
       6       0        0        6      removed

       6       8       49        -      spare   /dev/sdd1

Doing a fail, then remove, then add on that 4th partition (sdd1)
brought it back and it very quickly synced. I did a forced fsck on both
partitions to be sure, and both were clean.

Thanks

Eddie

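[Note: spelled out, the recovery step described above would look
roughly like this, with the same device names:

    mdadm --manage /dev/md125 --fail /dev/sdd1
    mdadm --manage /dev/md125 --remove /dev/sdd1
    mdadm --manage /dev/md125 --add /dev/sdd1
    cat /proc/mdstat        # watch the (quick) resync back to clean
]
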
* Re: mdadm-grow-continue service crashing (similar to "raid5 reshape is stuck" thread from May)
  From: Phil Turmel @ 2015-07-13 13:54 UTC
  To: Edward Kuns; +Cc: linux-raid, NeilBrown

Hi Eddie,

On 07/12/2015 03:24 PM, Edward Kuns wrote:
> On Sun, Jul 12, 2015 at 8:45 AM, Phil Turmel <philip@turmel.org> wrote:
>> Why were you using --grow for these operations only to reverse it?
>> This is dangerous if you have a layer or filesystem on your array
>> that doesn't support shrinking. None of the --grow operations were
>> necessary in this sequence to achieve the end result of replacing
>> disks.
> [snip]
>> At no point should you have changed the number of raid devices.
> [snip]
>> And for the still-running but suspect drive, the --replace operation
>> would have been the right choice, again, after --add of a spare.
>
> I didn't mention the steps I did to replace the failed drive because
> that went flawlessly. I did a fail and remove on it to be sure, but
> got complaints that it was already failed/removed. When I did an add
> for the replacement drive, it came in and synced automatically. I
> only ran into trouble trying to replace the "not yet dead but
> suspect" drive. I was following examples on the Internet, and the
> example I was following was clearly a bad one. The examples I found
> didn't suggest the --replace option. This is ultimately my fault for
> not being familiar enough with this. Now I know better.

Even without the --replace operation, --grow should never have been
used. On older kernels without support for --replace, the correct
operation is --add spare, then --fail, --remove.

> FWIW, I had LVM on top of the raid5, with two partitions (/var and an
> extra storage one) on the LVM. (I think there is some spare space
> too.) The goal, of course, is being able to survive any single-drive
> failure, which I did.
>
> You said this is dangerous. I went from 4->5 and then immediately
> 5->4 drives. I didn't expand the LVM on the raid5, and the
> replacement partition was a little bigger than the original. Next
> time, I'll use --replace, obviously. I just want to understand why it
> is dangerous. As long as the replacement partition is as big as the
> one it is replacing, isn't this just extra work, and more chance of
> running into problems like the one I ran into? But other than that,
> it shouldn't risk the actual data stored on the RAID, should it?

In theory, no. But the --grow operation has to move virtually every
data block to a new location, and in your case, then back to its
original location. Lots of unnecessary data movement that has a low
but non-zero error rate.

Also, the complex operations in --grow have produced somewhat more
than their fair share of mdadm bugs. Stuck reshapes are usually
recoverable, but typically only with assistance from this list. Drive
failures during reshapes can be particularly sticky, especially when
the failure is of the device holding a critical section backup.

>> many modern distros delete /tmp on reboot and/or play games with
>> namespaces to isolate different users' /tmp spaces.
>
> So if the machine crashes during a rebuild, you may lose that backup
> file, depending on the distro. OK. Is there a better solution to
> this? Unfortunately, at the time of the failed shrink (the rebuild
> that failed to start), stdout and stderr were not going to
> /var/log/messages, so I have no idea what the complaint was at that
> time. Does this service send so much output to stdout/stderr that
> it's useful to suppress it? If I'd seen something in
> /var/log/messages, it would have been more clear that there was a
> service with a complaint that was the cause of the rebuild failing to
> start. I wouldn't have done as much thrashing trying to figure out
> why.

I don't use systemd so can't advise on this. Without systemd, mdadm
just runs mdmon in the background and it all just works.

>> These are the only operations you should have done in the first
>> place. Although I would have put the --add first, so the --fail
>> operation would have triggered a rebuild onto the spare right away.
>
> I did the fail/remove/add at the very end, after replacing the dead
> drive, after finally completing the "don't do it this way again"
> grow-to-5-then-shrink-to-4 process to replace the not-yet-dead drive.
> After the shrink finally completed, the new 4th drive showed as a
> spare and removed at the same time, i.e., this dump from my first
> email:

Growing and shrinking didn't do anything to replace your suspect drive.
It just moved the data blocks around on your other drives, all while
not redundant.

>     Number   Major   Minor   RaidDevice State
>        0       8        2        0      active sync   /dev/sda2
>        1       8       17        1      active sync   /dev/sdb1
>        5       8       33        2      active sync   /dev/sdc1
>        6       0        0        6      removed
>
>        6       8       49        -      spare   /dev/sdd1

It seems there is a corner case at the completion of a shrink, where
one device becomes a spare and the new spare doesn't trigger the
recovery code to pull it into service.

Probably never noticed because reshaping a degraded array is
*uncommon*. :-)

This one is for Neil, I think...

Phil

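[Note: for anyone who lands in the same state, the array's view of what
it is doing can be inspected before deciding how to nudge it; a sketch,
assuming md125 as above:

    cat /proc/mdstat
    mdadm --detail /dev/md125
    cat /sys/block/md125/md/sync_action   # "idle" with a spare attached
                                          # matches the corner case above

If it really is stuck like this, the fail/remove/add of the spare that
Eddie describes earlier in the thread is one way to kick recovery off.]
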
* Re: mdadm-grow-continue service crashing (similar to "raid5 reshape is stuck" thread from May)
  From: Edward Kuns @ 2015-07-13 21:38 UTC
  To: Phil Turmel; +Cc: linux-raid, NeilBrown

On Mon, Jul 13, 2015 at 8:54 AM, Phil Turmel <philip@turmel.org> wrote:
> Hi Eddie,
> On older kernels without support for --replace, the correct
> operation is --add spare, then --fail, --remove.

Makes sense. That was my original plan, since I didn't know about the
replace option. Doing otherwise was a bad decision on my part. To make
sure I understand this:

1) If you start out with a 4-drive healthy raid5 array and do add /
   fail / remove, the "fail" step immediately removes that drive from
   being an active participant in the array and causes the new drive
   to be populated with data recalculated from parity, right?

2) The new drive will sit in the array as a "spare" until it is
   needed, which doesn't happen until the "fail" step?

And,

3) The "replace" option, instead, does the logical equivalent of
   moving all the data off one drive onto a spare but doesn't involve
   the other drives in a parity recalculation?

>> it shouldn't risk the actual data stored on the RAID, should it?
>
> In theory, no. But the --grow operation has to move virtually every
> data block to a new location, and in your case, then back to its
> original location. Lots of unnecessary data movement that has a low
> but non-zero error rate.
>
> Also, the complex operations in --grow have produced somewhat more
> than their fair share of mdadm bugs. Stuck reshapes are usually
> recoverable, but typically only with assistance from this list.
> Drive failures during reshapes can be particularly sticky, especially
> when the failure is of the device holding a critical section backup.

That all makes perfect sense, thanks.

> I don't use systemd so can't advise on this. Without systemd, mdadm
> just runs mdmon in the background and it all just works.

I can't exactly say I use it by choice. I'd change distros, but that
would only delay the inevitable.

> Growing and shrinking didn't do anything to replace your suspect
> drive. It just moved the data blocks around on your other drives, all
> while not redundant.

I'm confused here. I started the grow 4->5 with a healthy raid5 with 4
drives. One of the four drives was "suspect" in that I expect it to
fail at some point in the near future -- but it hadn't yet failed. I
thought this grow would give me a raid with four data drives + one
parity drive, all working. (And it seemed to.) And then I could fail
the suspect drive and go back down to three data drives + one parity.
The final output of the shrink certainly agrees with what you say, but
I clearly don't understand it. I don't understand how going from 4
healthy drives to 5 healthy drives, and then failing and removing one
of them and shrinking back down to 4 drives, ended up with 3 good and
one spare. But that is what happened.

> It seems there is a corner case at the completion of a shrink, where
> one device becomes a spare and the new spare doesn't trigger the
> recovery code to pull it into service.
>
> Probably never noticed because reshaping a degraded array is
> *uncommon*. :-)

It would be nice if my error in judgement helps save someone else in
the future! If there is any data I can gather from my server that will
help, I can get it. Although I won't be reproducing this experiment
any time in the future on a server that has any data I care about.

But note that I didn't reshape a degraded array. I reshaped a healthy
array and ended up with a degraded one.

Eddie

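[Note: the direct copy Eddie describes in question (3) is essentially
how --replace behaves; a sketch, assuming a spare has already been
added to the array:

    mdadm --manage /dev/md125 --add /dev/sdf1          # new disk sits as a spare
    mdadm /dev/md125 --replace /dev/sdd1 --with /dev/sdf1

    # /dev/sdd1 stays an active member while its contents are copied
    # onto /dev/sdf1; parity reconstruction from the other members is
    # only needed for any sectors that fail to read

The --with clause is optional; without it, md picks an available
spare.]
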
* Re: mdadm-grow-continue service crashing (similar to "raid5 reshape is stuck" thread from May)
  From: Wols Lists @ 2015-07-13 18:37 UTC
  To: Edward Kuns, Phil Turmel; +Cc: linux-raid

On 12/07/15 20:24, Edward Kuns wrote:
>> many modern distros delete /tmp on reboot and/or play games with
>> namespaces to isolate different users' /tmp spaces.
>
> So if the machine crashes during a rebuild, you may lose that backup
> file, depending on the distro. OK.

Please note that this is the DEFINED behaviour of /tmp, so it has a
very high probability of happening.

If you want temporary data to survive a reboot, put it in /var/tmp.

Oh - and if SELinux only lets you put it in /tmp, what happens if you
don't have a separate /tmp partition? You can't put the backup file on
the partition you are rebuilding, and SELinux won't let you put it
anywhere else? That's a big disaster in the making ...

Cheers,

Wol

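[Note: a sketch of what that looks like in practice, assuming /var/tmp
(or whichever reboot-surviving filesystem is chosen) is not itself on
the array being reshaped:

    mdadm --grow --raid-devices=4 /dev/md125 \
          --backup-file=/var/tmp/md125-reshape.backup

    # and the matching path in mdadm-grow-continue@.service:
    # ExecStart=/usr/sbin/mdadm --grow --continue /dev/%I --backup-file=/var/tmp/md125-reshape.backup
]
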
* Re: mdadm-grow-continue service crashing (similar to "raid5 reshape is stuck" thread from May)
  From: Edward Kuns @ 2015-07-13 22:07 UTC
  To: Wols Lists; +Cc: Phil Turmel, linux-raid

On Mon, Jul 13, 2015 at 1:37 PM, Wols Lists <antlists@youngman.org.uk> wrote:
> Please note that this is the DEFINED behaviour of /tmp, so it has a
> very high probability of happening.
>
> If you want temporary data to survive a reboot, put it in /var/tmp.

OK. Looking more carefully, I see my "/tmp" partition is of type tmpfs.
So yes, on reboot it would have been totally clean, exactly as you say.
In my case, my /var partition was on the raid5 being reshaped, so
/var/tmp wasn't an option for me.

> Oh - and if SELinux only lets you put it in /tmp, what happens if you
> don't have a separate /tmp partition? You can't put the backup file
> on the partition you are rebuilding, and SELinux won't let you put it
> anywhere else? That's a big disaster in the making ...

Well, SELinux will let you put it anywhere that is labeled to allow it.
So if it needs to go on some folder that isn't labeled properly, then
some labeling needs to be done. I didn't want to deal with the (minor)
hassle of creating a label, or of having to understand what label
would be appropriate, so I wanted to find a folder that was already
allowed by the existing labeling. So I tried a bunch of folders in
succession until one worked.

What is the interaction between the backup file I had to specify in
/lib/systemd/system/mdadm-grow-continue@.service and the backup file I
had to specify on the command line to do the --grow to shrink the
array? It kind of looks like the backup file on the "mdadm" command
line doesn't really matter, except that I had to specify one: "mdadm
--grow --raid-devices=4 /dev/md125" wouldn't *try* to start without a
backup file specified, but then just crashed in the
mdadm-grow-continue service. Specifying a (different) backup file
there and restarting the service got the reshape to complete.

This raises a big question with SELinux. When a backup file is truly
needed, mdadm needs the ability to write the backup file to more than
one partition (not at a time, but in general), depending on which raid
device is being modified. This means that some custom labeling may
need to be done, either in advance to prepare for recovery or on the
fly in a recovery situation.

Eddie

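[Note: for the labeling question, the standard SELinux tooling will
show what mdadm was denied and what label a candidate directory
carries; a sketch (the directory path is illustrative, and any type the
tools suggest should be checked against the local policy):

    ausearch -m avc -c mdadm --start recent                 # show the denials
    ls -Zd /var/lib/mdadm-backup                            # current label of a candidate dir
    ausearch -m avc -c mdadm --start recent | audit2allow   # what the policy is missing
]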