* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Neil Brown @ 2006-07-30 23:20 UTC
  To: Alexandre Oliva; Cc: Andrew Morton, linux-kernel, linux-raid

[linux-raid added to cc.
 Background: a patch was submitted to remove the current hard limit of
 127 partitions that can be auto-detected - a limit set by the
 detected_devices array in md.c.]

My first inclination is not to fix this problem.

I consider md auto-detect to be a legacy feature. I don't use it and I
recommend that other people don't use it. However I cannot justify
removing it, so it stays there. Having this limitation could be seen
as a good motivation for some more users to stop using it.

Why not use auto-detect? I have three issues with it.

1/ It just isn't "right". We don't mount filesystems from partitions
   just because they have type 'Linux'. We don't enable swap on
   partitions just because they have type 'Linux swap'. So why do we
   assemble md/raid arrays from partitions that have type 'Linux raid
   autodetect'?

2/ It can cause problems when moving devices. If you have two
   machines, both with an 'md0' array, and you move the drives from
   one to the other - say because the first lost a power supply - and
   then reboot the machine that received the drives, which array gets
   assembled as 'md0'? You might be lucky, you might not. This isn't
   purely theoretical - there have been pleas for help on linux-raid
   resulting from exactly this, though they have been few.

3/ The information redundancy can cause a problem when it gets out of
   sync, i.e. you add a partition to a raid array without setting the
   partition type to 'fd'. This works, but on the next reboot the
   partition doesn't get added back into the array and you have to add
   it back manually. This too is not purely theory - it has been
   reported slightly more often than '2'.

So my preferred solution to the problem is to tell people not to use
autodetect. Quite possibly this should be documented in the code, and
maybe we should even print a KERN_INFO message if more than 64 devices
are autodetected.

Now one doesn't always go with one's first inclination, so I should
discuss the approaches to fixing the problem, should a fix really be
the right thing to do.

The idea of having a generic notifier is, I think, missing the point -
so I'd better explain what the point is. The kernel already has the
hotplug mechanism for alerting user-space about new partitions, and
I'm sure kernel clients could hook into it somehow. But even if they
could, md wouldn't want to: md doesn't really want to know about new
partitions. It simply wants a list of all partitions which are of type
'autodetect'. Getting a list of all partitions is quite easy
(/proc/partitions does it, so it cannot be hard). But 'struct
hd_struct' doesn't record the partition type at all. It has spare
bits, so it could (->policy is currently a 'ro' flag; a little work
would allow multiple flags there). The point of the current
detected_devices array is precisely to record which partitions have
that type.

If we were to 'fix' this problem, I think the cleanest approach (which
I haven't actually coded, so it might not work...) would be to define
a new flag in hd_struct->policy to say whether the partition type
suggested auto-detect, and get partitions/check.c to set this.
Then have md iterate over all partitions looking for this flag.

That could be considered intrusive to bits of the kernel which don't
need to be intruded into, and it runs the risk of someone wanting to
expose the flag to user-space (within /proc/partitions or /sys), which
I would be against. But otherwise it would be a fairly clean approach.

The minimal (non-empty) approach of replacing the array with a linked
list, as the original patch did, would also be quite reasonable. Maybe
not as clean as a flag in hd_struct, but maybe we don't need this code
to be particularly clean(?).

So: do you *really* need to *fix* this, or can you just use 'mdadm' to
assemble your arrays instead?

NeilBrown
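For illustration, a minimal sketch of the flag-in-hd_struct approach
Neil outlines - not code from the thread; the flag names, helper
functions and call sites below are all assumptions:

    /* Sketch only: hd_struct->policy today holds just the read-only
     * bit; HD_POLICY_RAID_AUTODETECT and both helpers are made up. */
    #define HD_POLICY_RO              0x1   /* existing 'ro' meaning */
    #define HD_POLICY_RAID_AUTODETECT 0x2   /* hypothetical new flag */

    extern void md_import_autodetected(dev_t dev); /* hypothetical hook */

    /* fs/partitions/check.c (sketch): remember partition-table type
     * 0xfd ('Linux raid autodetect') as each partition is registered. */
    static void note_raid_partition(struct gendisk *disk, int part,
                                    int sys_ind)
    {
            if (sys_ind == 0xfd)
                    disk->part[part - 1]->policy |= HD_POLICY_RAID_AUTODETECT;
    }

    /* drivers/md/md.c (sketch): at autorun time, walk a disk's
     * partitions and hand the flagged ones to md, with no fixed-size
     * detected_devices array in between. */
    static void autorun_flagged(struct gendisk *disk)
    {
            int part;

            for (part = 1; part < disk->minors; part++) {
                    struct hd_struct *p = disk->part[part - 1];

                    if (p && (p->policy & HD_POLICY_RAID_AUTODETECT))
                            md_import_autodetected(MKDEV(disk->major,
                                            disk->first_minor + part));
            }
    }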
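The minimal linked-list alternative is similarly small. Roughly - the
'before' half follows md.c of this era; the node type in the 'after'
half is a guess at the shape of such a patch, not the patch as
submitted:

    /* drivers/md/md.c today: a fixed-size array caps autodetection
     * at 127 partitions. */
    static dev_t detected_devices[128];
    static int dev_cnt;

    void md_autodetect_dev(dev_t dev)
    {
            if (dev_cnt >= 0 && dev_cnt < 127)
                    detected_devices[dev_cnt++] = dev;
    }

    /* Unbounded alternative: queue each detected partition on a list. */
    struct detected_devices_node {
            struct list_head list;
            dev_t dev;
    };
    static LIST_HEAD(all_detected_devices);

    void md_autodetect_dev(dev_t dev)
    {
            struct detected_devices_node *node;

            node = kzalloc(sizeof(*node), GFP_KERNEL);
            if (node) {
                    node->dev = dev;
                    list_add_tail(&node->list, &all_detected_devices);
            }
    }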
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Helge Hafting @ 2006-07-31 16:34 UTC
  To: Neil Brown; Cc: Alexandre Oliva, Andrew Morton, linux-kernel, linux-raid

On Mon, Jul 31, 2006 at 09:20:58AM +1000, Neil Brown wrote:
>
> My first inclination is not to fix this problem.
>
> I consider md auto-detect to be a legacy feature.
> I don't use it and I recommend that other people don't use it.
> However I cannot justify removing it, so it stays there.
> Having this limitation could be seen as a good motivation for some
> more users to stop using it.
>
> Why not use auto-detect?
[Arguments deleted]

Well, if autodetection is removed, what is then the preferred way of
booting off a raid-1 device? Kernel parameters? An initrd with mdadm
just for this?

Some people want to do even partition detection from initrd, in order
to have a smaller kernel. We aren't there yet, though.

Autodetect is nice from an administrator's viewpoint - compile it in
and it "just works". The trouble when you connect an array from some
other machine is to be expected, but that isn't exactly everyday
stuff.

Helge Hafting
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Alexandre Oliva @ 2006-07-31 20:27 UTC
  To: Neil Brown; Cc: Andrew Morton, linux-kernel, linux-raid

On Jul 30, 2006, Neil Brown <neilb@suse.de> wrote:

> 1/ It just isn't "right". We don't mount filesystems from partitions
> just because they have type 'Linux'. We don't enable swap on
> partitions just because they have type 'Linux swap'. So why do we
> assemble md/raid from partitions that have type 'Linux raid
> autodetect'?

For a similar reason to why vgscan finds and attempts to use any
partitions that have the appropriate type/signature (the difference
being that raid auto-detect looks at the actual partition type,
whereas vgscan looks at the actual data, just like mdadm, IIRC): when
you have to bootstrap from an initrd, you don't want to be forced to
have the correct data in the initrd image, since then any
reconfiguration requires the info to be introduced into the initrd
image before the machine goes down. Sometimes, especially in case of
disk failures, you just can't do that.

> 2/ It can cause problems when moving devices.

It can, indeed, and it has caused such problems for me before, but
they're the exception, not the rule, and one should optimize for the
rule, not the exception.

> 3/ The information redundancy can cause a problem when it gets out
> of sync, i.e. you add a partition to a raid array without setting
> the partition type to 'fd'. This works, but on the next reboot the
> partition doesn't get added back into the array and you have to add
> it back manually.
> This too is not purely theory - it has been reported slightly more
> often than '2'.

This has happened to me as well, and I remember it was extremely
confusing when it first happened :-) But that's an argument to change
the behavior so as to look for the superblock instead of trusting the
partition type, not an argument to remove the auto-detection feature.

And then, the reliance on partition type has been useful at times as
well, when I explicitly did *not* want a certain raid device or raid
member to be brought up at boot.

> So my preferred solution to the problem is to tell people not to use
> autodetect. Quite possibly this should be documented in the code,
> and maybe even have a KERN_INFO message if more than 64 devices are
> autodetected.

I wouldn't have a problem with that, since then distros would probably
switch to a more recommended mechanism that works just as well, i.e.,
ideally without requiring initrd regeneration after reconfigurations
such as adding one more raid device to the logical volume group
containing the root filesystem.

> If we were to 'fix' this problem, I think the cleanest approach
> (which I haven't actually coded, so it might not work...) would be
> to define a new flag to go in hd_struct->policy to say if the
> partition type suggested auto-detect, and get partitions/check.c to
> set this. Then have md iterate all partitions looking for this flag.

AFAICT we'd still need a list or an array, since we add stuff back to
the list in various situations.

> So: Do you *really* need to *fix* this, or can you just use 'mdadm'
> to assemble your arrays instead?

I'm not sure.
I'd expect not to need it, but the limited feature currently in place
- which initrd uses to bring up the raid1 devices containing the
physical volumes that form the volume group holding my root logical
volume - also brings up various raid6 physical volumes that form an
unrelated volume group, and it does so in such a way that the last of
them, containing the 128th fd-type partition in the box, ends up being
left out. So the raid device it's a member of is brought up either
degraded or missing the spare member, neither of which is good.

I don't know that I can easily get initrd to replace nash's
raidautorun with mdadm unless mdadm has a mode to bring up any arrays
it can find, as opposed to bringing up a specific array out of a given
list of members or scanning for members. Either way, this won't fix
problem 2) that you mentioned, but requiring initrd regeneration after
extending the volume group containing the root device is another
problem that the current modes of operation of mdadm AFAIK won't
contemplate, so switching to it will trade one problem for another,
and the latter is IMHO more common than the former.

--
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Secretary for FSF Latin America        http://www.fsfla.org/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: David Greaves @ 2006-07-31 21:48 UTC
  To: Alexandre Oliva; Cc: Neil Brown, Andrew Morton, linux-kernel, linux-raid

Alexandre Oliva wrote:
> On Jul 30, 2006, Neil Brown <neilb@suse.de> wrote:
>
>> 1/ It just isn't "right". We don't mount filesystems from
>> partitions just because they have type 'Linux'. We don't enable
>> swap on partitions just because they have type 'Linux swap'. So why
>> do we assemble md/raid from partitions that have type 'Linux raid
>> autodetect'?
>
> For a similar reason to why vgscan finds and attempts to use any
> partitions that have the appropriate type/signature [...]: when you
> have to bootstrap from an initrd, you don't want to be forced to
> have the correct data in the initrd image, since then any
> reconfiguration requires the info to be introduced into the initrd
> image before the machine goes down. Sometimes, especially in case of
> disk failures, you just can't do that.

This debate is not about generic autodetection - a good thing (tm) -
but about in-kernel vs userspace autodetection. Your example supports
Neil's case - the proposal is to use initrd to run mdadm, which then
(kinda) does what vgscan does.

>> So my preferred solution to the problem is to tell people not to use
(in kernel)
>> autodetect. Quite possibly this should be documented in the code,
>> and maybe even have a KERN_INFO message if more than 64 devices are
>> autodetected.
>
> I wouldn't have a problem with that, since then distros would
> probably switch to a more recommended mechanism that works just as
> well, i.e., ideally without requiring initrd regeneration after
> reconfigurations such as adding one more raid device to the logical
> volume group containing the root filesystem.

That's supported in today's mdadm: look at --uuid and --name.

>> So: Do you *really* need to *fix* this, or can you just use 'mdadm'
>> to assemble your arrays instead?
>
> I'm not sure. I'd expect not to need it, but the limited feature
> currently in place - which initrd uses to bring up the raid1 devices
> containing the physical volumes that form the volume group holding
> my root logical volume - also brings up various raid6 physical
> volumes that form an unrelated volume group, and it does so in such
> a way that the last of them, containing the 128th fd-type partition
> in the box, ends up being left out. So the raid device it's a member
> of is brought up either degraded or missing the spare member,
> neither of which is good.
>
> I don't know that I can easily get initrd to replace nash's
> raidautorun with mdadm unless mdadm has a mode to bring up any
> arrays it can find, as opposed to bringing up a specific array out
> of a given list of members or scanning for members. Either way, this
> won't fix problem 2) that you mentioned, but requiring initrd
> regeneration after extending the volume group containing the root
> device is another problem that the current modes of operation of
> mdadm AFAIK won't contemplate, so switching to it will trade one
> problem for another, and the latter is IMHO more common than the
> former.
I think you should name your raid1 (maybe "hostname-root") and use
initrd to bring it up by --name, using:

  mdadm --assemble --scan --config partitions --name hostname-root

It could also, later in the boot process, bring up "hostname-raid6" by
--name too:

  mdadm --assemble --scan --config partitions --name hostname-raid6

David
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Alexandre Oliva @ 2006-08-01 2:20 UTC
  To: David Greaves; Cc: Neil Brown, Andrew Morton, linux-kernel, linux-raid

On Jul 31, 2006, David Greaves <david@dgreaves.com> wrote:
> Alexandre Oliva wrote:
>> in the initrd image, since then any reconfiguration requires the
>> info to be introduced into the initrd image before the machine goes
>> down. Sometimes, especially in case of disk failures, you just
>> can't do that.
>
> Your example supports Neil's case - the proposal is to use initrd to
> run mdadm, which then (kinda) does what vgscan does.

If mdadm can indeed scan all partitions to bring up all raid devices
in them, like nash's raidautorun does, great. I'll give that a try,
since Neil suggested it should already work in the version of mdadm
that I have here. I didn't get that impression while skimming through
the man page, but upon closer inspection now I see it's all there.
Oops :-)

>> I wouldn't have a problem with that, since then distros would
>> probably switch to a more recommended mechanism that works just as
>> well, i.e., ideally without requiring initrd regeneration after
>> reconfigurations such as adding one more raid device to the logical
>> volume group containing the root filesystem.
>
> That's supported in today's mdadm.
> look at --uuid and --name

--uuid and --name won't help at all. I'm talking about adding raid
physical volumes to a volume group, which means a new uuid and name,
so whatever is already in initrd won't get it. The command line Neil
posted should take care of that, though.

Even if the root device doesn't use the newly-added physical volume,
initrd's vgscan needs to find *all* physical volumes in the volume
group; otherwise the volume group will be started in `degraded' mode,
i.e., with the missing physical volumes mapped to a device-mapper node
that will produce I/O errors on access, IIRC, and everything else
read-only, without any way to switch to read-write when the remaining
devices are made available - which is arguably a missing feature in
the LVM subsystem.

--
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Secretary for FSF Latin America        http://www.fsfla.org/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Michael Tokarev @ 2006-08-01 8:28 UTC
  To: Alexandre Oliva; Cc: David Greaves, Neil Brown, Andrew Morton, linux-kernel, linux-raid

Alexandre Oliva wrote:
[]
> If mdadm can indeed scan all partitions to bring up all raid devices
> in them, like nash's raidautorun does, great. I'll give that a try,

Never, ever, try to do that (again). Mdadm (or vgscan, or whatever)
should NOT assemble ALL arrays found, but only those which it has been
told to assemble. This is it again: you bring another disk into a
system (a disk which comes from another machine), and mdadm finds
FOREIGN arrays and brings them up as /dev/md0, where YOUR root
filesystem should be. That's what the 'homehost' option is for, for
example.

If initrd has to be reconfigured after some changes (be it raid
arrays, LVM volumes, hostname, whatever) - I for one am fine with
that. Hopefully no one will argue that if you forgot to install an MBR
onto your replacement drive, it was entirely your own fault that your
system became unbootable, after all ;)

/mjt
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Alexandre Oliva @ 2006-08-01 21:24 UTC
  To: Michael Tokarev; Cc: David Greaves, Neil Brown, Andrew Morton, linux-kernel, linux-raid

On Aug 1, 2006, Michael Tokarev <mjt@tls.msk.ru> wrote:
> Alexandre Oliva wrote:
> []
>> If mdadm can indeed scan all partitions to bring up all raid
>> devices in them, like nash's raidautorun does, great. I'll give
>> that a try,
>
> Never, ever, try to do that (again). Mdadm (or vgscan, or whatever)
> should NOT assemble ALL arrays found, but only those which it has
> been told to assemble. [...] That's what the 'homehost' option is
> for, for example.

Exactly. So make it /all/all local/, if you must. It's the same as far
as I'm concerned.

> If initrd has to be reconfigured after some changes (be it raid
> arrays, LVM volumes, hostname, whatever) - I for one am fine with
> that.

Feel free to be fine with it, as long as you also let me be free not
to be fine with it and to try to cut a better deal :-)

--
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Secretary for FSF Latin America        http://www.fsfla.org/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Neil Brown @ 2006-08-01 1:19 UTC
  To: Alexandre Oliva; Cc: Andrew Morton, linux-kernel, linux-raid

On Monday July 31, aoliva@redhat.com wrote:
> On Jul 30, 2006, Neil Brown <neilb@suse.de> wrote:
>
> > 1/ It just isn't "right". We don't mount filesystems from
> > partitions just because they have type 'Linux'. We don't enable
> > swap on partitions just because they have type 'Linux swap'. So
> > why do we assemble md/raid from partitions that have type 'Linux
> > raid autodetect'?
>
> For a similar reason to why vgscan finds and attempts to use any
> partitions that have the appropriate type/signature [...]: when you
> have to bootstrap from an initrd, you don't want to be forced to
> have the correct data in the initrd image, since then any
> reconfiguration requires the info to be introduced into the initrd
> image before the machine goes down. Sometimes, especially in case of
> disk failures, you just can't do that.

The initrd needs to 'know' how to find the root filesystem, whether by
devnum or uuid or whatever. In exactly the same way it needs to know
how to find the components of the root md array - uuid is the best.
There is no need to reconfigure this in the case of a disk failure.

Current mdadm will assemble arrays for you given only a hostname. You
still need to get the hostname into the initrd, but that is no
different from a root device number.

> > 2/ It can cause problems when moving devices.
>
> It can, indeed, and it has caused such problems for me before, but
> they're the exception, not the rule, and one should optimize for the
> rule, not the exception.

We aren't talking about optimisation. We are talking about whether it
actually works or not. A system that stops booting just because you
plugged a couple of extra drives in is a badly configured system.

> > 3/ The information redundancy can cause a problem when it gets out
> > of sync [...]
>
> This has happened to me as well, and I remember it was extremely
> confusing when it first happened :-) But that's an argument to
> change the behavior so as to look for the superblock instead of
> trusting the partition type, not an argument to remove the
> auto-detection feature.

As has been said, I don't want to remove auto-detection. I want to do
it right, and do it from userspace. It is in-kernel autodetection that
I have no interest in improving.

> And then, the reliance on partition type has been useful at times as
> well, when I explicitly did *not* want a certain raid device or raid
> member to be brought up at boot.

Well, at boot it should only bring up the raid array containing the
root filesystem. Everything else is best done by /etc/init.d scripts.
And you can stop those from running by booting with -s (or whatever it
is to get single-user).

> > So my preferred solution to the problem is to tell people not to
> > use autodetect. Quite possibly this should be documented in the
> > code, and maybe even have a KERN_INFO message if more than 64
> > devices are autodetected.
>
> I wouldn't have a problem with that, since then distros would
> probably switch to a more recommended mechanism that works just as
> well [...]

> > If we were to 'fix' this problem, I think the cleanest approach
> > (which I haven't actually coded, so it might not work...) would be
> > to define a new flag to go in hd_struct->policy to say if the
> > partition type suggested auto-detect, and get partitions/check.c
> > to set this. Then have md iterate all partitions looking for this
> > flag.
>
> AFAICT we'd still need a list or an array, since we add stuff back
> to the list in various situations.

No. We just need a list of partitions of the appropriate type. Taking
items off the list and putting them back on later is simply a
non-essential optimisation.

> > So: Do you *really* need to *fix* this, or can you just use
> > 'mdadm' to assemble your arrays instead?
>
> I'm not sure. I'd expect not to need it, but the limited feature
> currently in place [...] will trade one problem for another, and the
> latter is IMHO more common than the former.

Get mdadm 2.5.2 (or 2.5.3 if I get that out soon enough) and try

  mdadm --assemble --scan --homehost='<system>' --auto-update-homehost \
        --auto=yes --run

in your initrd, having set the hostname correctly first. It might do
exactly what you want.

NeilBrown
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Alexandre Oliva @ 2006-08-01 2:35 UTC
  To: Neil Brown; Cc: Andrew Morton, linux-kernel, linux-raid

On Jul 31, 2006, Neil Brown <neilb@suse.de> wrote:

> The initrd needs to 'know' how to find the root filesystem, whether
> by devnum or uuid or whatever.

Yeah, the tricky bit is the `whatever' alternative, when / is a
logical volume and you need to bring up all of the physical volumes in
order for vgscan to bring up the volume group in a usable way.

> In exactly the same way it needs to know how to find the components
> of the root md array - uuid is the best. There is no need to
> reconfigure this in the case of a disk failure.

When you add physical volumes to the volume group, you'd have to
reconfigure initrd if it weren't for mdadm's ability to scan all
partitions.

> Current mdadm will assemble arrays for you given only a hostname.
> You still need to get the hostname into the initrd, but that is no
> different from a root device number.

Yep, this should work, at least until someone changes the hostname,
creates a new array with the new option, and then gets puzzled because
only that array isn't brought up. Or, worse, does all of the above and
then rebuilds initrd ``just in case'', and then ends up unable to
reboot because the root device won't be brought up. Oops :-)

>>> 2/ It can cause problems when moving devices.
>> It can, indeed [...] one should optimize for the rule, not the
>> exception.
>
> We aren't talking about optimisation. We are talking about whether
> it actually works or not.

Yes, I'm talking about getting it to work most often in the most
common case. Obviously we can't get it to work in every possible case,
since there are various corner cases involving moving disks around,
renaming hosts and creating arrays, some of which must necessarily
fail in order for others to work. It's finding the right balance
between them that is tricky, and some people will always be unhappy
because their particularly rare case failed, even without realizing
that this was in order to enable a more common case they happened to
rely on to work.

> A system that stops booting just because you plugged a couple of
> extra drives in is a badly configured system.

I tend to agree, although I used to exercise a case that wouldn't be
covered by this new policy: I used to move a pair of raid-1 external
disks between two hosts, and have them configured to be optionally
mounted at boot, depending on whether the raid devices were in place
or not. With hostname identification, this wouldn't quite work :-)

> Well, at boot it should only bring up the raid array containing the
> root filesystem.

If all you have is in a single LVM volume group, then that must be
everything :-/

> Everything else is best done by /etc/init.d scripts. And you can
> stop those from running by booting with -s (or whatever it is to get
> single-user).

Booting into single-user mode actually attempts to mount everything
that is local, after bringing up raid devices et al, so that would be
too late.
But there's always init=/bin/bash :-)

> Get mdadm 2.5.2 (or 2.5.3 if I get that out soon enough) and try
>
>   mdadm --assemble --scan --homehost='<system>' --auto-update-homehost \
>         --auto=yes --run
>
> in your initrd, having set the hostname correctly first. It might do
> exactly what you want.

Awesome, thanks, I'd missed that in the docs. It might make sense to
spell it out as an example, instead of requiring someone to figure out
all of the bits and pieces from the extensive documentation. Not
complaining about the extent of the documentation, BTW :-)

I'll give it a try some time tomorrow, since I won't turn on that
noisy box again today; my daughter is already asleep :-)

Anyhow, unless there's a good reason to keep the code the way it is,
wasting valuable bytes of memory on the fixed-size array, I guess it
would make more sense to just merge the patch in, no? :-)

--
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Secretary for FSF Latin America        http://www.fsfla.org/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Alexandre Oliva @ 2006-08-01 3:33 UTC
  To: Neil Brown; Cc: Andrew Morton, linux-kernel, linux-raid

On Jul 31, 2006, Alexandre Oliva <aoliva@redhat.com> wrote:

>> mdadm --assemble --scan --homehost='<system>' --auto-update-homehost \
>>       --auto=yes --run
>>
>> in your initrd, having set the hostname correctly first. It might
>> do exactly what you want.
>
> I'll give it a try some time tomorrow, since I won't turn on that
> noisy box again today; my daughter is already asleep :-)

But then, I could use my own desktop to test it :-)

FWIW, here's the patch for Fedora rawhide's mkinitrd that worked for
me. I found that it worked fine even without --homehost, and even
without HOMEHOST set in mdadm.conf. I hope copying mdadm.conf to
initrd won't ever hurt - can you think of any case in which it would?

[-- Attachment: mkinitrd-mdadm.patch --]

--- /sbin/mkinitrd	2006-07-26 15:43:41.000000000 -0300
+++ /tmp/mkinitrd	2006-08-01 00:06:14.000000000 -0300
@@ -1240,10 +1240,19 @@
 emitdms
 
 if [ -n "$raiddevices" ]; then
+    if test -f /sbin/mdadm.static; then
+	if test -f /etc/mdadm.conf; then
+	    inst /etc/mdadm.conf "$MNTIMAGE/etc/mdadm.conf"
+	fi
+	inst /sbin/mdadm.static "$MNTIMAGE/sbin/mdadm"
+	emit "mkdir /dev/md"
+	emit "mdadm --quiet --assemble --scan --auto-update-homehost --auto=yes --run"
+    else
     for dev in $raiddevices; do
 	cp -a /dev/${dev} $MNTIMAGE/dev
 	emit "raidautorun /dev/${dev}"
     done
+    fi
 fi
 
 if [ -n "$vg_list" ]; then

--
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Secretary for FSF Latin America        http://www.fsfla.org/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Alexandre Oliva @ 2006-08-01 20:46 UTC
  To: Neil Brown; Cc: Andrew Morton, linux-kernel, linux-raid

On Aug 1, 2006, Alexandre Oliva <aoliva@redhat.com> wrote:

>> I'll give it a try some time tomorrow, since I won't turn on that
>> noisy box again today; my daughter is already asleep :-)
>
> But then, I could use my own desktop to test it :-)

But then, I wouldn't be testing quite the same scenario. My
boot-required RAID devices were all raid 1, whereas the larger,
separate volume group was all raid 6.

Using the mkinitrd patch that I posted before, the result was that
mdadm did try to bring up all raid devices but, because the raid456
module was not loaded in initrd, the raid devices were left inactive.
Then, when rc.sysinit tried to bring them up with mdadm -A -s, that
did nothing to the inactive devices, since they didn't have to be
assembled. Adding --run didn't help.

My current work-around is to add raid456 to initrd, but that's ugly.
Scanning /proc/mdstat for inactive devices in rc.sysinit and doing
mdadm --run on them is feasible, but it looks ugly and error-prone.

Would it be reasonable to change mdadm so as to, erhm, disassemble ;-)
the raid devices it tried to bring up but that, for whatever reason,
it couldn't activate? (say, missing module, not enough members,
whatever)

--
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Secretary for FSF Latin America        http://www.fsfla.org/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
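In principle, the "disassemble" step asked for here is a matter of
issuing md's STOP_ARRAY ioctl on each device that was assembled but
never activated. A minimal sketch of a standalone helper along those
lines - a hypothetical tool, not mdadm's actual code:

    /* stop_md.c - stop (deactivate) one md device, releasing its
     * member devices.  Usage: stop_md /dev/mdX */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/raid/md_u.h>    /* defines STOP_ARRAY */

    int main(int argc, char **argv)
    {
            int fd;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s /dev/mdX\n", argv[0]);
                    return 1;
            }

            fd = open(argv[1], O_RDONLY);
            if (fd < 0) {
                    perror(argv[1]);
                    return 1;
            }

            /* STOP_ARRAY tears the array down and releases its
             * components, whether or not the array was ever run. */
            if (ioctl(fd, STOP_ARRAY, NULL) < 0) {
                    perror("STOP_ARRAY");
                    close(fd);
                    return 1;
            }

            close(fd);
            return 0;
    }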
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Luca Berra @ 2006-08-02 6:37 UTC
  To: Alexandre Oliva; Cc: Neil Brown, Andrew Morton, linux-kernel, linux-raid

On Tue, Aug 01, 2006 at 05:46:38PM -0300, Alexandre Oliva wrote:
> Using the mkinitrd patch that I posted before, the result was that
> mdadm did try to bring up all raid devices but, because the raid456
> module was not loaded in initrd, the raid devices were left inactive.

Probably your initrd is broken; it should not even have tried to bring
up an md array that was not needed to mount root.

> Then, when rc.sysinit tried to bring them up with mdadm -A -s, that
> did nothing to the inactive devices, since they didn't have to be
> assembled. Adding --run didn't help.
>
> My current work-around is to add raid456 to initrd, but that's ugly.
> Scanning /proc/mdstat for inactive devices in rc.sysinit and doing
> mdadm --run on them is feasible, but it looks ugly and error-prone.
>
> Would it be reasonable to change mdadm so as to, erhm, disassemble
> ;-) the raid devices it tried to bring up but that, for whatever
> reason, it couldn't activate? (say, missing module, not enough
> members, whatever)

This would make sense if it were an option; patches welcome :)

L.

--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Bill Davidsen @ 2006-08-01 17:40 UTC
  To: Neil Brown; Cc: Alexandre Oliva, Andrew Morton, linux-kernel, linux-raid

Neil Brown wrote:
> [linux-raid added to cc.
>  Background: a patch was submitted to remove the current hard limit
>  of 127 partitions that can be auto-detected - a limit set by the
>  detected_devices array in md.c.]
>
> My first inclination is not to fix this problem.
>
> I consider md auto-detect to be a legacy feature.
> I don't use it and I recommend that other people don't use it.
> However I cannot justify removing it, so it stays there.
> Having this limitation could be seen as a good motivation for some
> more users to stop using it.
>
> Why not use auto-detect? I have three issues with it.
>
> 1/ It just isn't "right". We don't mount filesystems from partitions
>    just because they have type 'Linux'. We don't enable swap on
>    partitions just because they have type 'Linux swap'. So why do we
>    assemble md/raid arrays from partitions that have type 'Linux
>    raid autodetect'?

I rarely think you are totally wrong about anything RAID, but I do
believe you have missed the point of autodetect. It is intended to
work as it does now, building the array without depending on some
user-level functionality. The name "autodetect" clearly differentiates
this type from the others you mentioned; there is no implication that
swap or Linux partitions should do anything automatically.

This is not a case of my using a feature and defending it - I don't
use it currently, for all of the reasons you enumerate. That doesn't
mean I haven't used autodetect in the past, or that I won't in the
future, particularly with embedded systems.

> 2/ It can cause problems when moving devices. [...]
>
> 3/ The information redundancy can cause a problem when it gets out
>    of sync. [...]
>
> So my preferred solution to the problem is to tell people not to use
> autodetect. Quite possibly this should be documented in the code,
> and maybe even have a KERN_INFO message if more than 64 devices are
> autodetected.

I don't personally see the value of autodetect for putting together
the huge number of drives people configure. I see this as a way to
improve boot reliability; if someone needs 64 drives for root and
boot, they need to read a few essays on filesystem configuration.
However, I'm aware that there are some really bizarre special cases
out there. Maybe the limit should be in Kconfig, with a default of 16
or so.

--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Alexandre Oliva @ 2006-08-01 21:32 UTC
  To: Bill Davidsen; Cc: Neil Brown, Andrew Morton, linux-kernel, linux-raid

On Aug 1, 2006, Bill Davidsen <davidsen@tmr.com> wrote:

> I rarely think you are totally wrong about anything RAID, but I do
> believe you have missed the point of autodetect. It is intended to
> work as it does now, building the array without depending on some
> user-level functionality.

Well, it clearly depends on at least some user-level functionality
(the ioctl that triggers autodetect). Going from that to a
full-fledged mdadm doesn't sound like such a big deal to me.

> I don't personally see the value of autodetect for putting together
> the huge number of drives people configure. I see this as a way to
> improve boot reliability; if someone needs 64 drives for root and
> boot, they need to read a few essays on filesystem configuration.
> However, I'm aware that there are some really bizarre special cases
> out there.

There's LVM. If you have to keep root out of the VG just because
people say so, you lose lots of benefits from LVM, such as being able
to grow root with the system running, take snapshots of root, etc.

Sure enough, the LVM subsystem could make things better, such that one
would not need all of the PVs in the root-containing VG in order to
mount root read-write, or at all. But if you think about it, if initrd
is set up such that you only bring up the devices that hold the actual
root device within the VG, and then you change that - say by taking a
snapshot of root, moving it around, growing it, etc. - you'd be better
off if you could still boot. So you do want all of the VG members to
be around, just in case. This is trivially accomplished for regular
disks whose drivers are loaded by initrd, but for raid devices you
need to tentatively bring up every raid member you can, just in case
some piece of root is there; otherwise you may end up unable to boot.

Yes, this is an argument against root on LVM, but there are arguments
*for* root on LVM as well, and there's no reason not to support both
behaviors equally well and let people figure out what works best for
them.

--
Alexandre Oliva         http://www.lsd.ic.unicamp.br/~oliva/
Secretary for FSF Latin America        http://www.fsfla.org/
Red Hat Compiler Engineer   aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist  oliva@{lsd.ic.unicamp.br, gnu.org}
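The "ioctl that triggers autodetect" is md's RAID_AUTORUN; nash's
raidautorun command amounts to roughly the following - a sketch of the
interface from memory, not nash's actual source:

    /* Ask the kernel to assemble every partition it recorded as type
     * 0xfd during partition scanning - the in-kernel autodetect path. */
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/raid/md_u.h>    /* defines RAID_AUTORUN */

    int raidautorun(const char *md_dev)     /* e.g. "/dev/md0" */
    {
            int fd = open(md_dev, O_RDWR);

            if (fd < 0)
                    return -1;

            /* The argument selects partitioned vs. plain md devices
             * for the assembled arrays; 0 means plain md. */
            if (ioctl(fd, RAID_AUTORUN, 0) < 0) {
                    close(fd);
                    return -1;
            }

            close(fd);
            return 0;
    }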
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Luca Berra @ 2006-08-02 6:47 UTC
  To: Alexandre Oliva; Cc: Bill Davidsen, Neil Brown, Andrew Morton, linux-kernel, linux-raid

On Tue, Aug 01, 2006 at 06:32:33PM -0300, Alexandre Oliva wrote:
> Sure enough, the LVM subsystem could make things better, such that
> one would not need all of the PVs in the root-containing VG in order
> to mount root read-write, or at all. But if you think about it, if
> initrd

It shouldn't need all of the PVs; you just need the PVs where the
rootfs is.

> is set up such that you only bring up the devices that hold the
> actual root device within the VG, and then you change that - say by
> taking a snapshot of root, moving it around, growing it, etc. -
> you'd be better off if you could still boot. So you do want all of
> the VG members to be around, just in case.

In this case, just regenerate the initramfs after modifying the VG
that contains root. I am fairly sure that kernel upgrades are far more
frequent than the addition of PVs to the root VG.

> Yes, this is an argument against root on LVM, but there are
> arguments *for* root on LVM as well, and there's no reason not to
> support both behaviors equally well and let people figure out what
> works best for them.

No, this is just an argument against misusing root on LVM.

L.

--
Luca Berra -- bluca@comedia.it
Communication Media & Services S.r.l.
 /"\
 \ /     ASCII RIBBON CAMPAIGN
  X        AGAINST HTML MAIL
 / \
* Re: let md auto-detect 128+ raid members, fix potential race condition
  From: Bill Davidsen @ 2006-08-02 16:47 UTC
  To: Alexandre Oliva; Cc: Neil Brown, Andrew Morton, linux-kernel, linux-raid

Alexandre Oliva wrote:
> On Aug 1, 2006, Bill Davidsen <davidsen@tmr.com> wrote:
>
>> I rarely think you are totally wrong about anything RAID, but I do
>> believe you have missed the point of autodetect. It is intended to
>> work as it does now, building the array without depending on some
>> user-level functionality.
>
> Well, it clearly depends on at least some user-level functionality
> (the ioctl that triggers autodetect). Going from that to a
> full-fledged mdadm doesn't sound like such a big deal to me.
>
>> I don't personally see the value of autodetect for putting together
>> the huge number of drives people configure. [...]
>
> There's LVM. If you have to keep root out of the VG just because
> people say so, you lose lots of benefits from LVM, such as being
> able to grow root with the system running, take snapshots of root,
> etc.

But it's MY system. I don't have to do anything. More to the point,
growing root while the system is running is done a lot less often than
booting. In general the root f/s has very little in it, and that's a
good thing.

--
Bill Davidsen <davidsen@tmr.com>
  Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by
a normal user and is setuid root, with the "vi" line edit mode
selected, and the character set is "big5," an off-by-one error occurs
during wildcard (glob) expansion.
Thread overview: 16+ messages
[not found] <ork65veg2y.fsf@free.oliva.athome.lsd.ic.unicamp.br>
[not found] ` <20060730124139.45861b47.akpm@osdl.org>
[not found] ` <orac6qerr4.fsf@free.oliva.athome.lsd.ic.unicamp.br>
2006-07-30 23:20 ` let md auto-detect 128+ raid members, fix potential race condition Neil Brown
2006-07-31 16:34 ` Helge Hafting
2006-07-31 20:27 ` Alexandre Oliva
2006-07-31 21:48 ` David Greaves
2006-08-01 2:20 ` Alexandre Oliva
2006-08-01 8:28 ` Michael Tokarev
2006-08-01 21:24 ` Alexandre Oliva
2006-08-01 1:19 ` Neil Brown
2006-08-01 2:35 ` Alexandre Oliva
2006-08-01 3:33 ` Alexandre Oliva
2006-08-01 20:46 ` Alexandre Oliva
2006-08-02 6:37 ` Luca Berra
2006-08-01 17:40 ` Bill Davidsen
2006-08-01 21:32 ` Alexandre Oliva
2006-08-02 6:47 ` Luca Berra
2006-08-02 16:47 ` Bill Davidsen