* Re: "Enhanced" MD code avaible for review
@ 2004-03-19 20:19 Justin T. Gibbs
2004-03-23 5:05 ` Neil Brown
0 siblings, 1 reply; 38+ messages in thread
From: Justin T. Gibbs @ 2004-03-19 20:19 UTC (permalink / raw)
To: linux-raid; +Cc: linux-kernel
[ CC trimmed since all those on the CC line appear to be on the lists ... ]
Let's take a step back and focus on a few points on which we can
hopefully all agree:
o Any successful solution will have to have "meta-data modules" for
active arrays "core resident" in order to be robust. This
requirement stems from the need to avoid deadlock during error
recovery scenarios that must block "normal I/O" to the array while
meta-data operations take place.
o It is desirable for arrays to auto-assemble based on recorded
meta-data. This includes the ability to have a user hot-insert
a "cold spare", have the system recognize it as a spare (based
on the meta-data resident on it) and activate it if necessary to
restore a degraded array.
o Child devices of an array should only be accessible through the
array while the array is in a configured state (bd_claim'ed).
This avoids situations where a user can subvert the integrity of
the array by performing "rogue I/O" to an array member. (A sketch
of how such claiming might look follows this list.)
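For concreteness, claiming a member is essentially the 2.6
bd_claim()/bd_release() pair applied to the component's block_device.
The sketch below is illustrative rather than MD/EMD source (the
my_array type is invented for the example), and note, as comes up
later in the thread, that bd_claim() only excludes other exclusive
claimants such as mounts or other array drivers; it does not block
raw user-space opens of the device node.

  /*
   * Illustrative sketch only (not MD/EMD source): claim a member
   * device for the array so that other exclusive claimants
   * (filesystems mounting it, other array drivers) are refused.
   * "struct my_array" is invented for the example.
   */
  #include <linux/fs.h>

  struct my_array;    /* the array's in-core object, used as the holder */

  static int array_claim_member(struct my_array *array,
                                struct block_device *bdev)
  {
          /* a second bd_claim() with a different holder returns -EBUSY */
          return bd_claim(bdev, array);
  }

  static void array_release_member(struct block_device *bdev)
  {
          bd_release(bdev);
  }
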
Concentrating on just these three, we come to the conclusion that
whether the solution comes via "early user fs" or kernel modules,
the resident size of the solution *will* include the cost for
meta-data support. In either case, the user is able to tailor their
system to include only the support necessary for their individual
system to operate.
If we want to argue the merits of either approach based on just the
sheer size of resident code, I have little doubt that the kernel
module approach will prove smaller:
o No need for "mdadm" or some other daemon to be locked resident in
memory. This alone saves you from keeping a locked copy of klibc or
any other user libraries core resident. The kernel modules
leverage kernel APIs that already have to be core resident to
satisfy other parts of the kernel, which also helps keep their
size down.
o Initial RAM disk data can be discarded after modules are loaded at
boot time.
Putting the size argument aside for a moment, let's explore how a
userland solution could satisfy just the above three requirements.
How is meta-data updated on child members of an array while that
array is on-line? Remember that these operations occur with some
frequency. MD includes "safe-mode" support where redundant arrays
are marked clean any time writes cease for a predetermined, fairly
short, amount of time. The userland app cannot access the component
devices directly since they are bd_claim'ed. Even if that mechanism
is somehow subverted, how do we guarantee that these meta-data
writes do not cause a deadlock? In the case of a transition from
Read-only to Write mode, all writes are blocked to the array (this
must be the case for "Dirty" state to be accurate). It seems to
me that you must then provide extra code to not only pre-allocate
buffers for the userland app to do its work, but also provide a
"back-door" interface for these operations to take place.
The argument has also been made that shifting some of this code out
to a userland app "simplifies" the solution and perhaps even makes
it easier to develop. Comparing the two approaches we have:
UserFS:
o Kernel Driver + "enhanced interface to userland daemon"
o Userland Daemon (core resident)
o Userland Meta-Data modules
o Userland Management tool
- This tool needs to interface to the daemon and
perhaps also the kernel driver.
Kernel:
o Kernel RAID Transform Drivers
o Kernel Meta-Data modules
o Simple Userland Management tool with no meta-data knowledge
So two questions arise from this analysis:
1) Are meta-data modules easier to code up or more robust as user
or kernel modules? I believe that doing these outside the kernel
will make them larger and more complex while also losing the
ability to have meta-data modules weigh in on rapidly occurring
events without incurring performance tradeoffs. Regardless of
where they reside, these modules must be robust. A kernel Oops
or a segfault in the daemon is unacceptable to the end user.
Saying that a segfault is somehow less harmful than an Oops
when we're talking about the user's data completely misses the
point of why people use RAID.
2) What added complexity is incurred by supporting both a core
resident daemon as well as management interfaces to the daemon
and potentially the kernel module? I have not fully thought
through the corner cases such an approach would expose, so I
cannot quantify this cost. There are certainly more components
to get right and keep synchronized.
In the end, I find it hard to justify inventing all of the userland
machinery necessary to make this work just to keep roughly 2K
lines of code per meta-data module out of the kernel.
The ASR module, for example, which is only required by those who
need support for that meta-data type, is only 19K unstripped with
all of its debugging printks and code enabled. Are there benefits
to the userland approach that I'm missing?
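To give a feel for what such a per-format module covers, a pluggable
meta-data "personality" could boil down to an operations table along
these lines. Every name here is invented for illustration and is not
EMD's actual interface:

  /*
   * Purely illustrative: what a pluggable meta-data personality might
   * register with the core array driver.  All names are invented for
   * the example; this is not EMD's real interface.
   */
  struct my_array;
  struct block_device;

  struct metadata_ops {
          const char *name;                          /* e.g. "ddf" or "asr" */
          int  (*probe)(struct block_device *bdev);  /* recognize on-disk records */
          int  (*assemble)(struct my_array *a);      /* build in-core topology */
          int  (*mark_dirty)(struct my_array *a);    /* before the first write */
          int  (*mark_clean)(struct my_array *a);    /* safe-mode idle update */
          int  (*member_failed)(struct my_array *a, int member);
          int  (*activate_spare)(struct my_array *a, int member);
          int  (*add_member)(struct my_array *a, struct block_device *bdev);
  };

  int register_metadata(struct metadata_ops *ops);   /* hypothetical core hooks */
  void unregister_metadata(struct metadata_ops *ops);
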
--
Justin
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: "Enhanced" MD code avaible for review 2004-03-19 20:19 "Enhanced" MD code avaible for review Justin T. Gibbs @ 2004-03-23 5:05 ` Neil Brown 2004-03-23 6:23 ` Justin T. Gibbs 0 siblings, 1 reply; 38+ messages in thread From: Neil Brown @ 2004-03-23 5:05 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-raid, linux-kernel On Friday March 19, gibbs@scsiguy.com wrote: > [ CC trimmed since all those on the CC line appear to be on the lists ... ] > > Lets take a step back and focus on a few of the points to which we can > hopefully all agree: > > o Any successful solution will have to have "meta-data modules" for > active arrays "core resident" in order to be robust. This > requirement stems from the need to avoid deadlock during error > recovery scenarios that must block "normal I/O" to the array while > meta-data operations take place. I agree. 'Linear' and 'raid0' arrays don't really need metadata support in the kernel as their metadata is essentially read-only. There are interesting applications for raid1 without metadata, but I think that for all raid personalities where metadata might need to be updated in an error condition to preserve data integrity, the kernel should know enough about the metadata to perform that update. It would be nice to keep the in-kernel knowledge to a minimum, though some metadata formats probably make that hard. > > o It is desirable for arrays to auto-assemble based on recorded > meta-data. This includes the ability to have a user hot-insert > a "cold spare", have the system recognize it as a spare (based > on the meta-data resident on it) and activate it if necessary to > restore a degraded array. Certainly. It doesn't follow that the auto-assembly has to happen within the kernel. Having it all done in user-space makes it much easier to control/configure. I think the best way to describe my attitude to auto-assembly is that it could be needs-driven rather than availability-driven. needs-driven means: if the user asks to access an array that doesn't exist, then try to find the bits and assemble it. availability driven means: find all the devices that could be part of an array, and combine as many of them as possible together into arrays. Currently filesystems are needs-driven. At boot time, only to root filesystem, which has been identified somehow, gets mounted. Then the init scripts mount any others that are needed. We don't have any hunting around for filesystem superblocks and mounting the filesystems just in case they are needed. Currently partitions are (sufficiently) needs-driven. It is true that any partitionable devices has it's partitions presented. However the existence of partitions does not affect access to the whole device at all. Only once the partitions are claimed is the whole-device blocked. Providing that auto-assembly of arrays works the same way (needs driven), I am happy for arrays to auto-assemble. I happen to think this most easily done in user-space. With DDF format metadata, there is a concept of 'imported' arrays, which basically means arrays from some other controller that have been attached to the current controller. Part of my desire for needs-driven assembly is that I don't want to inadvertently assemble 'imported' arrays. A DDF controller has NVRAM or a hardcoded serial number to help avoid this. A generic Linux machine doesn't. 
I could possibly be happy with auto-assembly where a kernel parameter of DDF=xx.yy.zz was taken to mean that we "need" to assemble all DDF arrays that have a controler-id (or whatever it is) of xx.yy.zz. This is probably simple enough to live entirely in the kernel. > > o Child devices of an array should only be accessible through the > array while the array is in a configured state (bd_claim'ed). > This avoids situations where a user can subvert the integrity of > the array by performing "rogue I/O" to an array member. bd_claim doesn't and (I believe) shouldn't stop access from user-space. It does stop a number of sorts of access that would expect exclusive access. But back to your original post: I suspect there is lots of valuable stuff in your emd patch, but as you have probably gathered, big patches are not the way we work around here, and with good reason. If you would like to identify isolated pieces of functionality, create patches to implement them, and submit them for review I will be quite happy to review them and, when appropriate, forward them to Andrew/Linus. I suggest you start with less controversial changes and work your way forward. NeilBrown ^ permalink raw reply [flat|nested] 38+ messages in thread
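For what it's worth, the boot-parameter idea Neil floats above really
is small in kernel terms. A hedged sketch (the parameter and variable
names are invented for the example, and the code is not taken from
any existing driver) might be no more than:

  /*
   * Sketch of the "DDF=xx.yy.zz" boot-parameter idea -- hypothetical
   * code.  It simply records the controller identity so that
   * auto-assembly can be limited to arrays whose metadata carries it.
   */
  #include <linux/init.h>
  #include <linux/string.h>

  static char wanted_ddf_id[64];   /* controller id we may auto-assemble */

  static int __init ddf_id_setup(char *str)
  {
          strlcpy(wanted_ddf_id, str, sizeof(wanted_ddf_id));
          return 1;
  }
  __setup("ddf=", ddf_id_setup);
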
* Re: "Enhanced" MD code avaible for review 2004-03-23 5:05 ` Neil Brown @ 2004-03-23 6:23 ` Justin T. Gibbs 2004-03-24 2:26 ` Neil Brown 0 siblings, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-23 6:23 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid, linux-kernel >> o Any successful solution will have to have "meta-data modules" for >> active arrays "core resident" in order to be robust. This ... > I agree. > 'Linear' and 'raid0' arrays don't really need metadata support in the > kernel as their metadata is essentially read-only. > There are interesting applications for raid1 without metadata, but I > think that for all raid personalities where metadata might need to be > updated in an error condition to preserve data integrity, the kernel > should know enough about the metadata to perform that update. > > It would be nice to keep the in-kernel knowledge to a minimum, though > some metadata formats probably make that hard. Can you further explain why you want to limit the kernel's knowledge and where you would separate the roles between kernel and userland? In reviewing one of our typical metadata modules, perhaps 80% of the code is generic meta-data record parsing and state conversion logic that would have to be retained in the kernel to perform "minimal meta-data updates". Some high portion of this 80% (less the portion that builds the in-kernel data structures to manipulate and update meta-data) would also need to be replicated into a user-land utility for any type of separation of labor to be possible. The remaining 20% of the kernel code deals with validation of user meta-data creation requests. This code is relatively small since it leverages all of the other routines that are already required for the operational requirements of the module. Splitting the roles bring up some important issues: 1) Code duplication. Depending on the complexity of the meta-data format being supported, the amount of code duplication between userland and kernel modules may be quite large. Any time code is duplicated, the solution is prone to getting out of sync - bugs are fixed in one copy of the code but not another. 2) Solution Complexity Two entities understand how to read and manipulate the meta-data. Policies and APIs must be created to ensure that only one entity is performing operations on the meta-data at a time. This is true even if one entity is primarily a read-only "client". For example, a meta-data module may defer meta-data updates in some instances (e.g. rebuild checkpointing) until the meta-data is closed (writing the checkpoint sooner doesn't make sense considering that you should restart your scrub, rebuild or verify if the system is not safely shutdown). How does the userland client get the most up-to-date information? This is just one of the problems in this area. 3) Size Due to code duplication, the total solution will be larger in code size. What benefits of operating in userland outweigh these issues? >> o It is desirable for arrays to auto-assemble based on recorded >> meta-data. This includes the ability to have a user hot-insert >> a "cold spare", have the system recognize it as a spare (based >> on the meta-data resident on it) and activate it if necessary to >> restore a degraded array. > > Certainly. It doesn't follow that the auto-assembly has to happen > within the kernel. Having it all done in user-space makes it much > easier to control/configure. 
> > I think the best way to describe my attitude to auto-assembly is that > it could be needs-driven rather than availability-driven. > > needs-driven means: if the user asks to access an array that doesn't > exist, then try to find the bits and assemble it. > availability driven means: find all the devices that could be part of > an array, and combine as many of them as possible together into > arrays. > > Currently filesystems are needs-driven. At boot time, only to root > filesystem, which has been identified somehow, gets mounted. > Then the init scripts mount any others that are needed. > We don't have any hunting around for filesystem superblocks and > mounting the filesystems just in case they are needed. Are filesystems the correct analogy? Consider that a user's attempt to mount a filesystem by label requires that all of the "block devices" that might contain that filesystem be enumerated automatically by the system. In this respect, the system is treating an MD device in exactly the same way as a SCSI or IDE disk. The array must be exported to the system on an "availability basis" in order for the "needs-driven" features of the system to behave as expected. > Currently partitions are (sufficiently) needs-driven. It is true that > any partitionable devices has it's partitions presented. However the > existence of partitions does not affect access to the whole device at > all. Only once the partitions are claimed is the whole-device > blocked. This seems a slight digression from your earlier argument. Is your concern that the arrays are auto-enumerated, or that the act of enumerating them prevents the component devices from being accessed (due to bd_clam)? > Providing that auto-assembly of arrays works the same way (needs > driven), I am happy for arrays to auto-assemble. > I happen to think this most easily done in user-space. I don't know how to reconcile a needs based approach with system features that require arrays to be exported as soon as they are detected. > With DDF format metadata, there is a concept of 'imported' arrays, > which basically means arrays from some other controller that have been > attached to the current controller. > > Part of my desire for needs-driven assembly is that I don't want to > inadvertently assemble 'imported' arrays. > A DDF controller has NVRAM or a hardcoded serial number to help avoid > this. A generic Linux machine doesn't. > > I could possibly be happy with auto-assembly where a kernel parameter > of DDF=xx.yy.zz was taken to mean that we "need" to assemble all DDF > arrays that have a controler-id (or whatever it is) of xx.yy.zz. > > This is probably simple enough to live entirely in the kernel. The concept of "importing" an array doesn't really make sense in the case of MD's DDF. To fully take advantage of features like a controller BIOS's ability to natively boot an array, the disks for that domain must remain in that controller's domain. Determining the domain to assign to new arrays will require input from the user since there is limited topology information available to MD. The user will also have the ability to assign newly created arrays to the "MD Domain" which is not tied to any particular controller domain. ... > But back to your original post: I suspect there is lots of valuable > stuff in your emd patch, but as you have probably gathered, big > patches are not the way we work around here, and with good reason. 
> > If you would like to identify isolated pieces of functionality, create > patches to implement them, and submit them for review I will be quite > happy to review them and, when appropriate, forward them to > Andrew/Linus. > I suggest you start with less controversial changes and work your way > forward. One suggestion that was recently raised was to present these changes in the form of an alternate "EMD" driver to avoid any potential breakage of the existing MD. Do you have any opinion on this? -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-23 6:23 ` Justin T. Gibbs @ 2004-03-24 2:26 ` Neil Brown 2004-03-24 19:09 ` Matt Domsch 2004-03-25 2:21 ` Jeff Garzik 0 siblings, 2 replies; 38+ messages in thread From: Neil Brown @ 2004-03-24 2:26 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-raid, linux-kernel On Monday March 22, gibbs@scsiguy.com wrote: > >> o Any successful solution will have to have "meta-data modules" for > >> active arrays "core resident" in order to be robust. This > > ... > > > I agree. > > 'Linear' and 'raid0' arrays don't really need metadata support in the > > kernel as their metadata is essentially read-only. > > There are interesting applications for raid1 without metadata, but I > > think that for all raid personalities where metadata might need to be > > updated in an error condition to preserve data integrity, the kernel > > should know enough about the metadata to perform that update. > > > > It would be nice to keep the in-kernel knowledge to a minimum, though > > some metadata formats probably make that hard. > > Can you further explain why you want to limit the kernel's knowledge > and where you would separate the roles between kernel and userland? General caution. It is generally harder the change mistakes in the kernel than it is to change mistakes in userspace, and similarly it is easer to add functionality and configurability in userspace. A design that puts the control in userspace is therefore preferred. A design that ties you to working through a narrow user-kernel interface is disliked. A design that gives easy control to user-space, and allows the kernel to do simple things simply is probably best. I'm not particularly concerned with code size and code duplication. A clean, expressive design is paramount. > 2) Solution Complexity > > Two entities understand how to read and manipulate the meta-data. > Policies and APIs must be created to ensure that only one entity > is performing operations on the meta-data at a time. This is true > even if one entity is primarily a read-only "client". For example, > a meta-data module may defer meta-data updates in some instances > (e.g. rebuild checkpointing) until the meta-data is closed (writing > the checkpoint sooner doesn't make sense considering that you should > restart your scrub, rebuild or verify if the system is not safely > shutdown). How does the userland client get the most up-to-date > information? This is just one of the problems in this area. If the kernel and userspace both need to know about metadata, then the design must make clear how they communicate. > > > Currently partitions are (sufficiently) needs-driven. It is true that > > any partitionable devices has it's partitions presented. However the > > existence of partitions does not affect access to the whole device at > > all. Only once the partitions are claimed is the whole-device > > blocked. > > This seems a slight digression from your earlier argument. Is your > concern that the arrays are auto-enumerated, or that the act of enumerating > them prevents the component devices from being accessed (due to > bd_clam)? Primarily the latter. But also that the act of enumerating them may cause an update to an underlying devices (e.g. metadata update or resync). That is what I am particularly uncomfortable about. > > > Providing that auto-assembly of arrays works the same way (needs > > driven), I am happy for arrays to auto-assemble. > > I happen to think this most easily done in user-space. 
> > I don't know how to reconcile a needs based approach with system > features that require arrays to be exported as soon as they are > detected. > Maybe if arrays were auto-assembled in a read-only mode that guaranteed not to write to the devices *at*all* and did not bd_claim them. When they are needed (either though some explicit set-writable command or through an implicit first-write) then the underlying components are bd_claimed. If that succeeds, the array becomes "live". If it fails, it stays read-only. > > > But back to your original post: I suspect there is lots of valuable > > stuff in your emd patch, but as you have probably gathered, big > > patches are not the way we work around here, and with good reason. > > > > If you would like to identify isolated pieces of functionality, create > > patches to implement them, and submit them for review I will be quite > > happy to review them and, when appropriate, forward them to > > Andrew/Linus. > > I suggest you start with less controversial changes and work your way > > forward. > > One suggestion that was recently raised was to present these changes > in the form of an alternate "EMD" driver to avoid any potential > breakage of the existing MD. Do you have any opinion on this? Choice is good. Competition is good. I would not try to interfere with you creating a new "emd" driver that didn't interfere with "md". What Linus would think of it I really don't know. It is certainly not impossible that he would accept it. However I'm not sure that having three separate device-array systems (dm, md, emd) is actually a good idea. It would probably be really good to unite md and dm somehow, but no-one seems really keen on actually doing the work. I seriously think the best long-term approach for your emd work is to get it integrated into md. I do listen to reason and I am not completely head-strong, but I do have opinions, and you would need to put in the effort to convincing me. NeilBrown ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-24 2:26 ` Neil Brown @ 2004-03-24 19:09 ` Matt Domsch 2004-03-25 2:21 ` Jeff Garzik 1 sibling, 0 replies; 38+ messages in thread From: Matt Domsch @ 2004-03-24 19:09 UTC (permalink / raw) To: Neil Brown; +Cc: Justin T. Gibbs, linux-raid, linux-kernel On Wed, Mar 24, 2004 at 01:26:47PM +1100, Neil Brown wrote: > On Monday March 22, gibbs@scsiguy.com wrote: > > One suggestion that was recently raised was to present these changes > > in the form of an alternate "EMD" driver to avoid any potential > > breakage of the existing MD. Do you have any opinion on this? > > I seriously think the best long-term approach for your emd work is to > get it integrated into md. I do listen to reason and I am not > completely head-strong, but I do have opinions, and you would need to > put in the effort to convincing me. I completely agree that long-term, md and emd need to be the same. However, watching the pain that the IDE changes took in early 2.5, I'd like to see emd be merged alongside md for the short-term while the kinks get worked out, keeping in mind the desire to merge them together again soon as that happens. Thanks, Matt -- Matt Domsch Sr. Software Engineer, Lead Engineer Dell Linux Solutions linux.dell.com & www.dell.com/linux Linux on Dell mailing lists @ http://lists.us.dell.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-24 2:26 ` Neil Brown 2004-03-24 19:09 ` Matt Domsch @ 2004-03-25 2:21 ` Jeff Garzik 2004-03-25 18:00 ` Kevin Corry 1 sibling, 1 reply; 38+ messages in thread From: Jeff Garzik @ 2004-03-25 2:21 UTC (permalink / raw) To: Neil Brown; +Cc: Justin T. Gibbs, linux-raid, linux-kernel Neil Brown wrote: > Choice is good. Competition is good. I would not try to interfere > with you creating a new "emd" driver that didn't interfere with "md". > What Linus would think of it I really don't know. It is certainly not > impossible that he would accept it. Agreed. Independent DM efforts have already started supporting MD raid0/1 metadata from what I understand, though these efforts don't seem to post to linux-kernel or linux-raid much at all. :/ > However I'm not sure that having three separate device-array systems > (dm, md, emd) is actually a good idea. It would probably be really > good to unite md and dm somehow, but no-one seems really keen on > actually doing the work. I would be disappointed if all the work that has gone into the MD driver is simply obsoleted by new DM targets. Particularly RAID 1/5/6. You pretty much echoed my sentiments exactly... ideally md and dm can be bound much more tightly to each other. For example, convert md's raid[0156].c into device mapper targets... but indeed, nobody has stepped up to do that so far. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 2:21 ` Jeff Garzik @ 2004-03-25 18:00 ` Kevin Corry 2004-03-25 18:42 ` Jeff Garzik 2004-03-25 22:59 ` Justin T. Gibbs 0 siblings, 2 replies; 38+ messages in thread From: Kevin Corry @ 2004-03-25 18:00 UTC (permalink / raw) To: linux-kernel; +Cc: Jeff Garzik, Neil Brown, Justin T. Gibbs, linux-raid On Wednesday 24 March 2004 8:21 pm, Jeff Garzik wrote: > Neil Brown wrote: > > Choice is good. Competition is good. I would not try to interfere > > with you creating a new "emd" driver that didn't interfere with "md". > > What Linus would think of it I really don't know. It is certainly not > > impossible that he would accept it. > > Agreed. > > Independent DM efforts have already started supporting MD raid0/1 > metadata from what I understand, though these efforts don't seem to post > to linux-kernel or linux-raid much at all. :/ I post on lkml.....occasionally. :) I'm guessing you're referring to EVMS in that comment, since we have done *part* of what you just described. EVMS has always had a plugin to recognize MD devices, and has been using the MD driver for quite some time (along with using Device-Mapper for non-MD stuff). However, as of our most recent release (earlier this month), we switched to using Device-Mapper for MD RAID-linear and RAID-0 devices. Device-Mapper has always had a "linear" and a "striped" module (both required to support LVM volumes), and it was a rather trivial exercise to switch to activating these RAID devices using DM instead of MD. This decision was not based on any real dislike of the MD driver, but rather for the benefits that are gained by using Device-Mapper. In particular, Device-Mapper provides the ability to change out the device mapping on the fly, by temporarily suspending I/O, changing the table, and resuming the I/O I'm sure many of you know this already. But I'm not sure everyone fully understands how powerful a feature this is. For instance, it means EVMS can now expand RAID-linear devices online. While that particular example may not sound all that exciting, if things like RAID-1 and RAID-5 were "ported" to Device-Mapper, this feature would then allow you to do stuff like add new "active" members to a RAID-1 online (think changing from 2-way mirror to 3-way mirror). It would be possible to convert from RAID-0 to RAID-4 online simply by adding a new disk (assuming other limitations, e.g. a single stripe-zone). Unfortunately, these are things the MD driver can't do online, because you need to completely stop the MD device before making such changes (to prevent the kernel and user-space from trampling on the same metadata), and MD won't stop the device if it's open (i.e. if it's mounted or if you have other device (LVM) built on top of MD). Often times this means you need to boot to a rescue-CD to make these types of configuration changes. As for not posting this information on lkml and/or linux-raid, I do apologize if this is something you would like to have been informed of. Most of the recent mentions of EVMS on this list seem to fall on deaf ears, so I've taken that to mean the folks on the list aren't terribly interested in EVMS developments. And since EVMS is a completely user-space tool and this decision didn't affect any kernel components, I didn't think it was really relevent to mention here. We usually discuss such things on evms-devel@lists.sf.net or dm-devel@redhat.com, but I'll be happy to cross-post to lkml more often if it's something that might be pertinent. 
> > However I'm not sure that having three separate device-array systems > > (dm, md, emd) is actually a good idea. It would probably be really > > good to unite md and dm somehow, but no-one seems really keen on > > actually doing the work. > > I would be disappointed if all the work that has gone into the MD driver > is simply obsoleted by new DM targets. Particularly RAID 1/5/6. > > You pretty much echoed my sentiments exactly... ideally md and dm can > be bound much more tightly to each other. For example, convert md's > raid[0156].c into device mapper targets... but indeed, nobody has > stepped up to do that so far. We're obviously pretty keen on seeing MD and Device-Mapper "merge" at some point in the future, primarily for some of the reasons I mentioned above. Obviously linear.c and raid0.c don't really need to be ported. DM provides equivalent functionality, the discovery/activation can be driven from user-space, and no in-kernel status updating is necessary (unlike RAID-1 and -5). And we've talked for a long time about wanting to port RAID-1 and RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't started on any such work, or even had any significant discussions about *how* to do it. I can't imagine we would try this without at least involving Neil and other folks from linux-raid, since it would be nice to actually reuse as much of the existing MD code as possible (especially for RAID-5 and -6). I have no desire to try to rewrite those from scratch. Device-Mapper does currently contain a mirroring module (still just in Joe's -udm tree), which has primarily been used to provide online-move functionality in LVM2 and EVMS. They've recently added support for persistent logs, so it's possible for a mirror to survive a reboot. Of course, MD RAID-1 has some additional requirements for updating status in its superblock at runtime. I'd hope that in porting RAID-1 to DM, the core of the DM mirroring module could still be used, with the possibility of either adding MD-RAID-1-specific information to the persistent-log module, or simply as an additional log type. So, if this is the direction everyone else would like to see MD and DM take, we'd be happy to help out. -- Kevin Corry kevcorry@us.ibm.com http://evms.sourceforge.net/ ^ permalink raw reply [flat|nested] 38+ messages in thread
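For readers unfamiliar with the mechanism Kevin refers to, the
suspend / reload / resume cycle is exposed through libdevmapper
roughly as below; it is the same sequence that dmsetup drives. Error
handling is trimmed, and the device name, sizes and target parameters
are made up for the example.

  /*
   * Sketch of suspending a device-mapper device, loading a
   * replacement table, and resuming it -- the sequence that allows a
   * mapping to be changed online.  Names, sizes and parameters are
   * invented.
   */
  #include <stdint.h>
  #include <libdevmapper.h>

  static int run_task(int type, const char *name,
                      uint64_t start, uint64_t len, const char *params)
  {
          struct dm_task *dmt;
          int r;

          if (!(dmt = dm_task_create(type)))
                  return 0;
          dm_task_set_name(dmt, name);
          if (params)             /* only the RELOAD step carries a table */
                  dm_task_add_target(dmt, start, len, "linear", params);
          r = dm_task_run(dmt);
          dm_task_destroy(dmt);
          return r;
  }

  int main(void)
  {
          /* quiesce: new I/O is queued, in-flight I/O is flushed */
          run_task(DM_DEVICE_SUSPEND, "vol0", 0, 0, NULL);

          /* load a replacement table; growing a linear device would
           * add further dm_task_add_target() calls for new segments */
          run_task(DM_DEVICE_RELOAD, "vol0", 0, 2097152, "/dev/sdb 0");

          /* switch to the new table and let I/O continue */
          run_task(DM_DEVICE_RESUME, "vol0", 0, 0, NULL);
          return 0;
  }
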
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:00 ` Kevin Corry @ 2004-03-25 18:42 ` Jeff Garzik 2004-03-25 18:48 ` Jeff Garzik ` (3 more replies) 2004-03-25 22:59 ` Justin T. Gibbs 1 sibling, 4 replies; 38+ messages in thread From: Jeff Garzik @ 2004-03-25 18:42 UTC (permalink / raw) To: Kevin Corry; +Cc: linux-kernel, Neil Brown, Justin T. Gibbs, linux-raid Kevin Corry wrote: > I'm guessing you're referring to EVMS in that comment, since we have done > *part* of what you just described. EVMS has always had a plugin to recognize > MD devices, and has been using the MD driver for quite some time (along with > using Device-Mapper for non-MD stuff). However, as of our most recent release > (earlier this month), we switched to using Device-Mapper for MD RAID-linear > and RAID-0 devices. Device-Mapper has always had a "linear" and a "striped" > module (both required to support LVM volumes), and it was a rather trivial > exercise to switch to activating these RAID devices using DM instead of MD. nod > This decision was not based on any real dislike of the MD driver, but rather > for the benefits that are gained by using Device-Mapper. In particular, > Device-Mapper provides the ability to change out the device mapping on the > fly, by temporarily suspending I/O, changing the table, and resuming the I/O > I'm sure many of you know this already. But I'm not sure everyone fully > understands how powerful a feature this is. For instance, it means EVMS can > now expand RAID-linear devices online. While that particular example may not [...] Sounds interesting but is mainly an implementation detail for the purposes of this discussion... Some of this emd may want to use, for example. > As for not posting this information on lkml and/or linux-raid, I do apologize > if this is something you would like to have been informed of. Most of the > recent mentions of EVMS on this list seem to fall on deaf ears, so I've taken > that to mean the folks on the list aren't terribly interested in EVMS > developments. And since EVMS is a completely user-space tool and this > decision didn't affect any kernel components, I didn't think it was really > relevent to mention here. We usually discuss such things on > evms-devel@lists.sf.net or dm-devel@redhat.com, but I'll be happy to > cross-post to lkml more often if it's something that might be pertinent. Understandable... for the stuff that impacts MD some mention of the work, on occasion, to linux-raid and/or linux-kernel would be useful. I'm mainly looking at it from a standpoint of making sure that all the various RAID efforts are not independent of each other. > We're obviously pretty keen on seeing MD and Device-Mapper "merge" at some > point in the future, primarily for some of the reasons I mentioned above. > Obviously linear.c and raid0.c don't really need to be ported. DM provides > equivalent functionality, the discovery/activation can be driven from > user-space, and no in-kernel status updating is necessary (unlike RAID-1 and > -5). And we've talked for a long time about wanting to port RAID-1 and RAID-5 > (and now RAID-6) to Device-Mapper targets, but we haven't started on any such > work, or even had any significant discussions about *how* to do it. I can't let's have that discussion :) > imagine we would try this without at least involving Neil and other folks > from linux-raid, since it would be nice to actually reuse as much of the > existing MD code as possible (especially for RAID-5 and -6). 
I have no desire > to try to rewrite those from scratch. <cheers> > Device-Mapper does currently contain a mirroring module (still just in Joe's > -udm tree), which has primarily been used to provide online-move > functionality in LVM2 and EVMS. They've recently added support for persistent > logs, so it's possible for a mirror to survive a reboot. Of course, MD RAID-1 > has some additional requirements for updating status in its superblock at > runtime. I'd hope that in porting RAID-1 to DM, the core of the DM mirroring > module could still be used, with the possibility of either adding > MD-RAID-1-specific information to the persistent-log module, or simply as an > additional log type. WRT specific implementation, I would hope for the reverse -- that the existing, known, well-tested MD raid1 code would be used. But perhaps that's a naive impression... Folks with more knowledge of the implementation can make that call better than I. I'd like to focus on the "additional requirements" you mention, as I think that is a key area for consideration. There is a certain amount of metadata that -must- be updated at runtime, as you recognize. Over and above what MD already cares about, DDF and its cousins introduce more items along those lines: event logs, bad sector logs, controller-level metadata... these are some of the areas I think Justin/Scott are concerned about. My take on things... the configuration of RAID arrays got a lot more complex with DDF and "host RAID" in general. Association of RAID arrays based on specific hardware controllers. Silently building RAID0+1 stacked arrays out of non-RAID block devices the kernel presents. Failing over when one of the drives the kernel presents does not respond. All that just screams "do it in userland". OTOH, once the devices are up and running, kernel needs update some of that configuration itself. Hot spare lists are an easy example, but any time the state of the overall RAID array changes, some host RAID formats, more closely tied to hardware than MD, may require configuration metadata changes when some hardware condition(s) change. I respectfully disagree with the EMD folks that a userland approach is impossible, given all the failure scenarios. In a userland approach, there -will- be some duplicated metadata-management code between userland and the kernel. But for configuration _above_ the single-raid-array level, I think that's best left to userspace. There will certainly be a bit of intra-raid-array management code in the kernel, including configuration updating. I agree to its necessity... but that doesn't mean that -all- configuration/autorun stuff needs to be in the kernel. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:42 ` Jeff Garzik @ 2004-03-25 18:48 ` Jeff Garzik 2004-03-25 23:46 ` Justin T. Gibbs 2004-03-25 22:04 ` Lars Marowsky-Bree ` (2 subsequent siblings) 3 siblings, 1 reply; 38+ messages in thread From: Jeff Garzik @ 2004-03-25 18:48 UTC (permalink / raw) To: linux-kernel; +Cc: Kevin Corry, Neil Brown, Justin T. Gibbs, linux-raid Jeff Garzik wrote: > My take on things... the configuration of RAID arrays got a lot more > complex with DDF and "host RAID" in general. Association of RAID arrays > based on specific hardware controllers. Silently building RAID0+1 > stacked arrays out of non-RAID block devices the kernel presents. > Failing over when one of the drives the kernel presents does not respond. > > All that just screams "do it in userland". Just so there is no confusion... the "failing over...in userland" thing I mention is _only_ during discovery of the root disk. Similar code would need to go into the bootloader, for controllers that do not present the entire RAID array as a faked BIOS INT drive. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:48 ` Jeff Garzik @ 2004-03-25 23:46 ` Justin T. Gibbs 2004-03-26 0:01 ` Jeff Garzik 0 siblings, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-25 23:46 UTC (permalink / raw) To: Jeff Garzik, linux-kernel; +Cc: Kevin Corry, Neil Brown, linux-raid > Jeff Garzik wrote: > > Just so there is no confusion... the "failing over...in userland" thing I > mention is _only_ during discovery of the root disk. None of the solutions being talked about perform "failing over" in userland. The RAID transforms which perform this operation are kernel resident in DM, MD, and EMD. Perhaps you are talking about spare activation and rebuild? > Similar code would need to go into the bootloader, for controllers that do > not present the entire RAID array as a faked BIOS INT drive. None of the solutions presented here are attempting to make RAID transforms operate from the boot loader environment without BIOS support. I see this as a completely tangental problem to what is being discussed. -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 23:46 ` Justin T. Gibbs @ 2004-03-26 0:01 ` Jeff Garzik 2004-03-26 0:10 ` Justin T. Gibbs 0 siblings, 1 reply; 38+ messages in thread From: Jeff Garzik @ 2004-03-26 0:01 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-kernel, Kevin Corry, Neil Brown, linux-raid Justin T. Gibbs wrote: >>Jeff Garzik wrote: >> >>Just so there is no confusion... the "failing over...in userland" thing I >>mention is _only_ during discovery of the root disk. > > > None of the solutions being talked about perform "failing over" in > userland. The RAID transforms which perform this operation are kernel > resident in DM, MD, and EMD. Perhaps you are talking about spare > activation and rebuild? This is precisely why I sent the second email, and made the qualification I did :) For a "do it in userland" solution, an initrd or initramfs piece examines the system configuration, and assembles physical disks into RAID arrays based on the information it finds. I was mainly implying that an initrd solution would have to provide some primitive failover initially, before the kernel is bootstrapped... much like a bootloader that supports booting off a RAID1 array would need to do. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 0:01 ` Jeff Garzik @ 2004-03-26 0:10 ` Justin T. Gibbs 2004-03-26 0:14 ` Jeff Garzik 0 siblings, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-26 0:10 UTC (permalink / raw) To: Jeff Garzik; +Cc: linux-kernel, Kevin Corry, Neil Brown, linux-raid >> None of the solutions being talked about perform "failing over" in >> userland. The RAID transforms which perform this operation are kernel >> resident in DM, MD, and EMD. Perhaps you are talking about spare >> activation and rebuild? > > This is precisely why I sent the second email, and made the qualification > I did :) > > For a "do it in userland" solution, an initrd or initramfs piece examines > the system configuration, and assembles physical disks into RAID arrays > based on the information it finds. I was mainly implying that an initrd > solution would have to provide some primitive failover initially, before > the kernel is bootstrapped... much like a bootloader that supports booting > off a RAID1 array would need to do. "Failover" (i.e. redirecting a read to a viable member) will not occur via userland at all. The initrd solution just has to present all available members to the kernel interface performing the RAID transform. There is no need for "special failover handling" during bootstrap in either case. -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 0:10 ` Justin T. Gibbs @ 2004-03-26 0:14 ` Jeff Garzik 0 siblings, 0 replies; 38+ messages in thread From: Jeff Garzik @ 2004-03-26 0:14 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-kernel, Kevin Corry, Neil Brown, linux-raid Justin T. Gibbs wrote: >>>None of the solutions being talked about perform "failing over" in >>>userland. The RAID transforms which perform this operation are kernel >>>resident in DM, MD, and EMD. Perhaps you are talking about spare >>>activation and rebuild? >> >>This is precisely why I sent the second email, and made the qualification >>I did :) >> >>For a "do it in userland" solution, an initrd or initramfs piece examines >>the system configuration, and assembles physical disks into RAID arrays >>based on the information it finds. I was mainly implying that an initrd >>solution would have to provide some primitive failover initially, before >>the kernel is bootstrapped... much like a bootloader that supports booting >>off a RAID1 array would need to do. > > > "Failover" (i.e. redirecting a read to a viable member) will not occur > via userland at all. The initrd solution just has to present all available > members to the kernel interface performing the RAID transform. There > is no need for "special failover handling" during bootstrap in either > case. hmmm, yeah, agreed. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:42 ` Jeff Garzik 2004-03-25 18:48 ` Jeff Garzik @ 2004-03-25 22:04 ` Lars Marowsky-Bree 2004-03-26 19:19 ` Kevin Corry 2004-03-25 23:35 ` Justin T. Gibbs 2004-03-26 19:15 ` Kevin Corry 3 siblings, 1 reply; 38+ messages in thread From: Lars Marowsky-Bree @ 2004-03-25 22:04 UTC (permalink / raw) To: Jeff Garzik, Kevin Corry Cc: linux-kernel, Neil Brown, Justin T. Gibbs, linux-raid On 2004-03-25T13:42:12, Jeff Garzik <jgarzik@pobox.com> said: > >and -5). And we've talked for a long time about wanting to port RAID-1 and > >RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't started > >on any such work, or even had any significant discussions about *how* to > >do it. I can't > let's have that discussion :) Nice 2.7 material, and parts I've always wanted to work on. (Including making the entire partition scanning user-space on top of DM too.) KS material? > My take on things... the configuration of RAID arrays got a lot more > complex with DDF and "host RAID" in general. And then add all the other stuff, like scenarios where half of your RAID is "somewhere" on the network via nbd, iSCSI or whatever and all the other possible stackings... Definetely user-space material, and partly because it /needs/ to have the input from the volume managers to do the sane things. The point about this implying that the superblock parsing/updating logic needs to be duplicated between userspace and kernel land is valid too though, and I'm keen on resolving this in a way which doesn't suck... Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs | try again. fail again. fail better. Research & Development, SUSE LINUX AG \ -- Samuel Beckett - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 22:04 ` Lars Marowsky-Bree @ 2004-03-26 19:19 ` Kevin Corry 2004-03-31 17:07 ` Randy.Dunlap 0 siblings, 1 reply; 38+ messages in thread From: Kevin Corry @ 2004-03-26 19:19 UTC (permalink / raw) To: linux-kernel Cc: Lars Marowsky-Bree, Jeff Garzik, Neil Brown, Justin T. Gibbs, linux-raid On Thursday 25 March 2004 4:04 pm, Lars Marowsky-Bree wrote: > On 2004-03-25T13:42:12, > > Jeff Garzik <jgarzik@pobox.com> said: > > >and -5). And we've talked for a long time about wanting to port RAID-1 > > > and RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't > > > started on any such work, or even had any significant discussions about > > > *how* to do it. I can't > > > > let's have that discussion :) > > Nice 2.7 material, and parts I've always wanted to work on. (Including > making the entire partition scanning user-space on top of DM too.) Couldn't agree more. Whether using EVMS or kpartx or some other tool, I think we've already proved this is possible. We really only need to work on making early-userspace a little easier to use. > KS material? Sounds good to me. -- Kevin Corry kevcorry@us.ibm.com http://evms.sourceforge.net/ ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 19:19 ` Kevin Corry @ 2004-03-31 17:07 ` Randy.Dunlap 0 siblings, 0 replies; 38+ messages in thread From: Randy.Dunlap @ 2004-03-31 17:07 UTC (permalink / raw) To: Kevin Corry; +Cc: linux-kernel, lmb, jgarzik, neilb, gibbs, linux-raid On Fri, 26 Mar 2004 13:19:28 -0600 Kevin Corry wrote: | On Thursday 25 March 2004 4:04 pm, Lars Marowsky-Bree wrote: | > On 2004-03-25T13:42:12, | > | > Jeff Garzik <jgarzik@pobox.com> said: | > > >and -5). And we've talked for a long time about wanting to port RAID-1 | > > > and RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't | > > > started on any such work, or even had any significant discussions about | > > > *how* to do it. I can't | > > | > > let's have that discussion :) | > | > Nice 2.7 material, and parts I've always wanted to work on. (Including | > making the entire partition scanning user-space on top of DM too.) | | Couldn't agree more. Whether using EVMS or kpartx or some other tool, I think | we've already proved this is possible. We really only need to work on making | early-userspace a little easier to use. | | > KS material? | | Sounds good to me. Ditto. I didn't see much conclusion to this thread, other than Neil's good suggestions. (maybe on some other list that I don't read?) I wouldn't want this or any other projects to have to wait for the kernel summit. Email has worked well for many years...let's try to keep it working. :) -- ~Randy "You can't do anything without having to do something else first." -- Belefant's Law ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:42 ` Jeff Garzik 2004-03-25 18:48 ` Jeff Garzik 2004-03-25 22:04 ` Lars Marowsky-Bree @ 2004-03-25 23:35 ` Justin T. Gibbs 2004-03-26 0:13 ` Jeff Garzik 2004-03-26 19:15 ` Kevin Corry 3 siblings, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-25 23:35 UTC (permalink / raw) To: Jeff Garzik, Kevin Corry; +Cc: linux-kernel, Neil Brown, linux-raid > I respectfully disagree with the EMD folks that a userland approach is > impossible, given all the failure scenarios. I've never said that it was impossible, just unwise. I believe that a userland approach offers no benefit over allowing the kernel to perform all meta-data operations. The end result of such an approach (given feature and robustness parity with the EMD solution) is a larger resident side, code duplication, and more complicated configuration/management interfaces. -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 23:35 ` Justin T. Gibbs @ 2004-03-26 0:13 ` Jeff Garzik 2004-03-26 17:43 ` Justin T. Gibbs 0 siblings, 1 reply; 38+ messages in thread From: Jeff Garzik @ 2004-03-26 0:13 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid Justin T. Gibbs wrote: >>I respectfully disagree with the EMD folks that a userland approach is >>impossible, given all the failure scenarios. > > > I've never said that it was impossible, just unwise. I believe > that a userland approach offers no benefit over allowing the kernel > to perform all meta-data operations. The end result of such an > approach (given feature and robustness parity with the EMD solution) > is a larger resident side, code duplication, and more complicated > configuration/management interfaces. There is some code duplication, yes. But the right userspace solution does not have a larger RSS, and has _less_ complicated management interfaces. A key benefit of "do it in userland" is a clear gain in flexibility, simplicity, and debuggability (if that's a word). But it's hard. It requires some deep thinking. It's a whole lot easier to do everything in the kernel -- but that doesn't offer you the protections of userland, particularly separate address spaces from the kernel, and having to try harder to crash the kernel. :) Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 0:13 ` Jeff Garzik @ 2004-03-26 17:43 ` Justin T. Gibbs 2004-03-28 0:06 ` Lincoln Dale 2004-03-28 0:30 ` Jeff Garzik 0 siblings, 2 replies; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-26 17:43 UTC (permalink / raw) To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid >>> I respectfully disagree with the EMD folks that a userland approach is >>> impossible, given all the failure scenarios. >> >> >> I've never said that it was impossible, just unwise. I believe >> that a userland approach offers no benefit over allowing the kernel >> to perform all meta-data operations. The end result of such an >> approach (given feature and robustness parity with the EMD solution) >> is a larger resident side, code duplication, and more complicated >> configuration/management interfaces. > > There is some code duplication, yes. But the right userspace solution > does not have a larger RSS, and has _less_ complicated management > interfaces. > > A key benefit of "do it in userland" is a clear gain in flexibility, > simplicity, and debuggability (if that's a word). This is just as much hand waving as, 'All that just screams "do it in userland".' <sigh> I posted a rather detailed, technical, analysis of what I believe would be required to make this work correctly using a userland approach. The only response I've received is from Neil Brown. Please, point out, in a technical fashion, how you would address the feature set being proposed: o Rebuilds o Auto-array enumeration o Meta-data updates for topology changes (failed members, spare activation) o Meta-data updates for "safe mode" o Array creation/deletion o "Hot member addition" Only then can a true comparative analysis of which solution is "less complex", "more maintainable", and "smaller" be performed. > But it's hard. It requires some deep thinking. It's a whole lot easier > to do everything in the kernel -- but that doesn't offer you the > protections of userland, particularly separate address spaces from the > kernel, and having to try harder to crash the kernel. :) A crash in any component of a RAID solution that prevents automatic failover and rebuilds without customer intervention is unacceptable. Whether it crashes your kernel or not is really not that important other than the customer will probably notice that their data is no longer protected *sooner* if the system crashes. In other-words, the solution must be *correct* regardless of where it resides. Saying that doing a portion of it in userland allows it to safely be buggier seems a very strange argument. -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 17:43 ` Justin T. Gibbs @ 2004-03-28 0:06 ` Lincoln Dale 2004-03-30 17:54 ` Justin T. Gibbs 2004-03-28 0:30 ` Jeff Garzik 1 sibling, 1 reply; 38+ messages in thread From: Lincoln Dale @ 2004-03-28 0:06 UTC (permalink / raw) To: Justin T. Gibbs Cc: Jeff Garzik, Kevin Corry, linux-kernel, Neil Brown, linux-raid At 03:43 AM 27/03/2004, Justin T. Gibbs wrote: >I posted a rather detailed, technical, analysis of what I believe would >be required to make this work correctly using a userland approach. The >only response I've received is from Neil Brown. Please, point out, in >a technical fashion, how you would address the feature set being proposed: i'll have a go. your position is one of "put it all in the kernel". Jeff, Neil, Kevin et al is one of "it can live in userspace". to that end, i agree with the userspace approach. the way i personally believe that it SHOULD happen is that you tie your metadata format (and RAID format, if its different to others) into DM. you boot up using an initrd where you can start some form of userspace management daemon from initrd. you can have your binary (userspace) tools started from initrd which can populate the tables for all disks/filesystems, including pivoting to a new root filesystem if need-be. the only thing your BIOS/int13h redirection needs to do is be able to provide sufficient information to be capable of loading the kernel and the initial ramdisk. perhaps that means that you guys could provide enhancements to grub/lilo if they are insufficient for things like finding a secondary copy of initrd/vmlinuz. (if such issues exist, wouldn't it be better to do things the "open source way" and help improve the overall tools, if the end goal ends up being the same: enabling YOUR system to work better?) moving forward, perhaps initrd will be deprecated in favour of initramfs - but until then, there isn't any downside to this approach that i can see. with all this in mind, and the basic premise being that as a minimum, the kernel has booted, and initrd is working then answering your other points: > o Rebuilds userspace is running. rebuilds are simply a process of your userspace tools recognising that there are disk groups in a inconsistent state, and don't bring them online, but rather, do whatever is necessary to rebuild them. nothing says that you cannot have a KERNEL-space 'helper' to help do the rebuild.. > o Auto-array enumeration your userspace tool can receive notification (via udev/hotplug) when new disks/devices appear. from there, your userspace tool can read whatever metadata exists on the disk, and use that to enumerate whatever block devices exist. perhaps DM needs some hooks to be able to do this - but i believe that the DM v4 ioctls cover this already. > o Meta-data updates for topology changes (failed members, spare activation) a failed member may be as a result of a disk being pulled out. for such an event, udev/hotplug should tell your userspace daemon. a failed member may be as a result of lots of I/O errors. perhaps there is work needed in the linux block layer to indicate some form of hotplug event such as 'excessive errors', perhaps its something needed in the DM layer. in either case, it isn't out of the question that userspace can be notified. for a "spare activation", once again, that can be done entirely from userspace. > o Meta-data updates for "safe mode" seems implementation specific to me. 
> o Array creation/deletion

the short answer here is "how does one create or remove DM/LVM/MD
partitions today?"
it certainly isn't in the kernel ...

> o "Hot member addition"

this should also be possible today. i haven't looked too closely at
whether there are sufficient interfaces for quiescence of I/O or not -
but once again, if not, why not implement something that can be used
for all?

>Only then can a true comparative analysis of which solution is "less
>complex", "more maintainable", and "smaller" be performed.

there may be less lines of code involved in "entirely in kernel" for
YOUR hardware -- but what about when 4 other storage vendors come out
with such a card?

what if someone wants to use your card in conjunction with the storage
being multipathed or replicated automatically?
what about when someone wants to create snapshots for backups?

all that functionality has to then go into your EMD driver. Adaptec may
decide all that is too hard -- at which point, your product may become
obsolete as the storage paradigms have moved beyond what your EMD driver
is capable of.

if you could tie it into DM -- which i believe to be the de facto path
forward for lots of this cool functionality -- you gain this kind of
functionality gratis -- or at least with minimal effort to integrate.

better yet, Linux as a whole benefits from your involvement -- your
time/effort isn't put into something specific to your hardware -- but
rather your time/effort is put into something that can be used by all.

this conversation really sounds like the same one you had with James
about the SCSI Mid layer and why you just have to bypass items there
and do your own proprietary things.

in summary, i don't believe you should be focussing on a short-term
view of "but its more lines of code", but rather a more big-picture
view of "overall, there will be LESS lines of code" and "it will fit
better into the overall device-mapper/block-remapper functionality"
within the kernel.

cheers,

lincoln.

^ permalink raw reply [flat|nested] 38+ messages in thread
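[A rough, illustrative sketch of the hotplug-driven assembly flow described
above; it is not from the thread. "emd_scan" and its options are invented
stand-ins for whatever userland metadata tool would do the recognition;
only dmsetup is a real command, and the /sbin/hotplug calling convention
(subsystem in $1, ACTION/DEVPATH in the environment) is assumed.]

  #!/bin/sh
  # Hypothetical /sbin/hotplug handler: assemble an array when a new
  # block device appears.
  [ "$1" = "block" ] || exit 0
  [ "$ACTION" = "add" ] || exit 0

  DEV="/dev/${DEVPATH##*/}"            # e.g. /block/sdc -> /dev/sdc

  # Ask the (hypothetical) metadata scanner whether this disk completes
  # an array it knows about; exit quietly if it does not.
  ARRAY=$(emd_scan --owning-array "$DEV") || exit 0

  # The scanner emits a device-mapper table for the assembled array;
  # dmsetup create reads the table from stdin when no file is given.
  emd_scan --emit-table "$ARRAY" | dmsetup create "$ARRAY"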
* Re: "Enhanced" MD code avaible for review 2004-03-28 0:06 ` Lincoln Dale @ 2004-03-30 17:54 ` Justin T. Gibbs 0 siblings, 0 replies; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-30 17:54 UTC (permalink / raw) To: Lincoln Dale Cc: Jeff Garzik, Kevin Corry, linux-kernel, Neil Brown, linux-raid > At 03:43 AM 27/03/2004, Justin T. Gibbs wrote: >> I posted a rather detailed, technical, analysis of what I believe would >> be required to make this work correctly using a userland approach. The >> only response I've received is from Neil Brown. Please, point out, in >> a technical fashion, how you would address the feature set being proposed: > > i'll have a go. > > your position is one of "put it all in the kernel". > Jeff, Neil, Kevin et al is one of "it can live in userspace". Please don't misrepresent or over simplify my statements. What I have said is that meta-data reading and writing should occur in only one place. Since, as has already been acknowledged by many, meta-data updates are required in the kernel, that means this support should be handled in the kernel. Any other solution adds complexity and size to the solution. > to that end, i agree with the userspace approach. > the way i personally believe that it SHOULD happen is that you tie > your metadata format (and RAID format, if its different to others) into DM. Saying how you think something should happen without any technical argument for it, doesn't help me to understand the benefits of your approach. ... > perhaps that means that you guys could provide enhancements to grub/lilo > if they are insufficient for things like finding a secondary copy of > initrd/vmlinuz. (if such issues exist, wouldn't it be better to do things > the "open source way" and help improve the overall tools, if the end goal > ends up being the same: enabling YOUR system to work better?) I don't understand your argument. We have improved an already existing opensource driver to provide this functionality. This is not the OpenSource way? > then answering your other points: Again, you have presented strategies that may or may not work, but no technical arguments for their superiority over placing meta-data in the kernel. > there may be less lines of code involved in "entirely in kernel" for YOUR > hardware -- but what about when 4 other storage vendors come out with such > a card? There will be less lines of code total for any vendor that decides to add a new meta-data type. All the vendor has to do is provide a meta-data module. There are no changes to the userland utilities (they know nothing about specific meta-data formats), to the RAID transform modules, or to the core of EMD. If this were not the case, there would be little point to the EMD work. > what if someone wants to use your card in conjunction with the storage > being multipathed or replicated automatically? > what about when someone wants to create snapshots for backups? > > all that functionality has to then go into your EMD driver. No. DM already works on any block device exported to the kernel. EMD exports its devices as block devices. Thus, all of the DM functionality you are talking about is also available for EMD. -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 17:43 ` Justin T. Gibbs 2004-03-28 0:06 ` Lincoln Dale @ 2004-03-28 0:30 ` Jeff Garzik 1 sibling, 0 replies; 38+ messages in thread From: Jeff Garzik @ 2004-03-28 0:30 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid Justin T. Gibbs wrote: > o Rebuilds > 90% kernel, AFAICS, otherwise you have races with requests that the driver is actively satisfying > o Auto-array enumeration userspace > o Meta-data updates for "safe mode" unsure of the definition of safe mode > o Array creation/deletion of entire arrays? can mostly be done in userspace, but deletion also needs to update controller-wide metadata, which might be stored on active arrays. > o "Hot member addition" userspace prepares, kernel completes [moved this down in your list] > o Meta-data updates for topology changes (failed members, spare activation) [warning: this is a tangent from the userspace sub-thread/topic] the kernel, of course, must manage topology, otherwise things Don't Get Done, and requests don't do where they should. :) Part of the value of device mapper is that it provides container objects for multi-disk groups, and a common method of messing around with those container objects. You clearly recognized the same need in emd... but I don't think we want two different pieces of code doing the same basic thing. I do think that metadata management needs to be fairly cleanly separately (I like what emd did, there) such that a user needs three in-kernel pieces: * device mapper * generic raid1 engine * personality module "personality" would be where the specifics of the metadata management lived, and it would be responsible for handling the specifics of non-hot-path events that nonetheless still need to be in the kernel. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:42 ` Jeff Garzik ` (2 preceding siblings ...) 2004-03-25 23:35 ` Justin T. Gibbs @ 2004-03-26 19:15 ` Kevin Corry 2004-03-26 20:45 ` Justin T. Gibbs 3 siblings, 1 reply; 38+ messages in thread From: Kevin Corry @ 2004-03-26 19:15 UTC (permalink / raw) To: linux-kernel; +Cc: Jeff Garzik, Neil Brown, Justin T. Gibbs, linux-raid On Thursday 25 March 2004 12:42 pm, Jeff Garzik wrote: > > We're obviously pretty keen on seeing MD and Device-Mapper "merge" at > > some point in the future, primarily for some of the reasons I mentioned > > above. Obviously linear.c and raid0.c don't really need to be ported. DM > > provides equivalent functionality, the discovery/activation can be driven > > from user-space, and no in-kernel status updating is necessary (unlike > > RAID-1 and -5). And we've talked for a long time about wanting to port > > RAID-1 and RAID-5 (and now RAID-6) to Device-Mapper targets, but we > > haven't started on any such work, or even had any significant discussions > > about *how* to do it. I can't > > let's have that discussion :) Great! Where do we begin? :) > I'd like to focus on the "additional requirements" you mention, as I > think that is a key area for consideration. > > There is a certain amount of metadata that -must- be updated at runtime, > as you recognize. Over and above what MD already cares about, DDF and > its cousins introduce more items along those lines: event logs, bad > sector logs, controller-level metadata... these are some of the areas I > think Justin/Scott are concerned about. I'm sure these things could be accomodated within DM. Nothing in DM prevents having some sort of in-kernel metadata knowledge. In fact, other DM modules already do - dm-snapshot and the above mentioned dm-mirror both need to do some amount of in-kernel status updating. But I see this as completely separate from in-kernel device discovery (which we seem to agree is the wrong direction). And IMO, well designed metadata will make this "split" very obvious, so it's clear which parts of the metadata the kernel can use for status, and which parts are purely for identification (which the kernel thus ought to be able to ignore). The main point I'm trying to get across here is that DM provides a simple yet extensible kernel framework for a variety of storage management tasks, including a lot more than just RAID. I think it would be a huge benefit for the RAID drivers to make use of this framework to provide functionality beyond what is currently available. > My take on things... the configuration of RAID arrays got a lot more > complex with DDF and "host RAID" in general. Association of RAID arrays > based on specific hardware controllers. Silently building RAID0+1 > stacked arrays out of non-RAID block devices the kernel presents. By this I assume you mean RAID devices that don't contain any type of on-disk metadata (e.g. MD superblocks). I don't see this as a huge hurdle. As long as the device drivers (SCIS, IDE, etc) export the necessary identification info through sysfs, user-space tools can contain the policies necessary to allow them to detect which disks belong together in a RAID device, and then tell the kernel to activate said RAID device. This sounds a lot like how Christophe Varoqui has been doing things in his new multipath tools. > Failing over when one of the drives the kernel presents does not respond. > > All that just screams "do it in userland". 
> > OTOH, once the devices are up and running, kernel needs update some of > that configuration itself. Hot spare lists are an easy example, but any > time the state of the overall RAID array changes, some host RAID > formats, more closely tied to hardware than MD, may require > configuration metadata changes when some hardware condition(s) change. Certainly. Of course, I see things like adding and removing hot-spares and removing stale/faulty disks as something that can be driven from user-space. For example, for adding a new hot-spare, with DM it's as simple as loading a new mapping that contains the new disk, then telling DM to switch the device mapping (which implies a suspend/resume of I/O). And if necessary, such a user-space tool can be activated by hotplug events triggered by the insertion of a new disk into the system, making the process effectively transparent to the user. -- Kevin Corry kevcorry@us.ibm.com http://evms.sourceforge.net/ ^ permalink raw reply [flat|nested] 38+ messages in thread
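[To make the table-reload mechanism described above concrete, a minimal
sketch, not from the thread. Device names and sizes are made up, and the
mirror-target table syntax shown follows later device-mapper
documentation, so details may well differ from the 2004 development tree.]

  # Assumed: an existing 8388608-sector (4 GiB) 2-way mirror named "mir0"
  # on /dev/sdb1 and /dev/sdc1; /dev/sdd1 is the newly inserted spare.
  # Mirror table format (per later dm docs):
  #   start len mirror <log_type> <#log_args> <log_args...> <#legs> <dev> <off> ...
  dmsetup suspend mir0     # quiesce and flush outstanding I/O
  dmsetup reload mir0 --table \
    "0 8388608 mirror core 1 1024 3 /dev/sdb1 0 /dev/sdc1 0 /dev/sdd1 0"
  dmsetup resume mir0      # switch to the new 3-leg table

The table switch itself is the easy part; whether the new leg is then
treated as out-of-sync and recovered correctly is exactly the bookkeeping
that the rest of this sub-thread argues about.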
* Re: "Enhanced" MD code avaible for review 2004-03-26 19:15 ` Kevin Corry @ 2004-03-26 20:45 ` Justin T. Gibbs 2004-03-27 15:39 ` Kevin Corry 0 siblings, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-26 20:45 UTC (permalink / raw) To: Kevin Corry, linux-kernel; +Cc: Jeff Garzik, Neil Brown, linux-raid >> There is a certain amount of metadata that -must- be updated at runtime, >> as you recognize. Over and above what MD already cares about, DDF and >> its cousins introduce more items along those lines: event logs, bad >> sector logs, controller-level metadata... these are some of the areas I >> think Justin/Scott are concerned about. > > I'm sure these things could be accommodated within DM. Nothing in DM prevents > having some sort of in-kernel metadata knowledge. In fact, other DM modules > already do - dm-snapshot and the above mentioned dm-mirror both need to do > some amount of in-kernel status updating. But I see this as completely > separate from in-kernel device discovery (which we seem to agree is the wrong > direction). And IMO, well designed metadata will make this "split" very > obvious, so it's clear which parts of the metadata the kernel can use for > status, and which parts are purely for identification (which the kernel thus > ought to be able to ignore). We don't have control over the meta-data formats being used by the industry. Coming up with a solution that only works for "Linux Engineered Meta-data formats" removes any possibility of supporting things like DDF, Adaptec ASR, and a host of other meta-data formats that can be plugged into things like EMD. In the two cases we are supporting today with EMD, the records required for doing discovery reside in the same sectors as those that need to be updated at runtime from some "in-core" context. > The main point I'm trying to get across here is that DM provides a simple yet > extensible kernel framework for a variety of storage management tasks, > including a lot more than just RAID. I think it would be a huge benefit for > the RAID drivers to make use of this framework to provide functionality > beyond what is currently available. DM is a transform layer that has the ability to pause I/O while that transform is updated from userland. That's all it provides. As such, it is perfectly suited to some types of logical volume management applications. But that is as far as it goes. It does not have any support for doing "sync/resync/scrub" type operations or any generic support for doing anything with meta-data. In all of the examples you have presented so far, you have not explained how this part of the equation is handled. Sure, adding a member to a RAID1 is trivial. Just pause the I/O, update the transform, and let it go. Unfortunately, that new member is not in sync with the rest. The transform must be aware of this and only trust the member below the sync mark. How is this information communicated to the transform? Who updates the sync mark? Who copies the data to the new member while guaranteeing that an in-flight write does not occur to the area being synced? If you intend to add all of this to DM, then it is no longer any "simpler" or more extensible than EMD. Don't take my arguments the wrong way. I believe that DM is useful for what it was designed for: LVM. It does not, however, provide the machinery required for it to replace a generic RAID stack. Could you merge a RAID stack into DM. Sure. Its only software. 
But for it to be robust, the same types of operations MD/EMD perform in kernel space will have to be done there too. The simplicity of DM is part of why it is compelling. My belief is that merging RAID into DM will compromise this simplicity and divert DM from what it was designed to do - provide LVM transforms. As for RAID discovery, this is the trivial portion of RAID. For an extra 10% or less of code in a meta-data module, you get RAID discovery. You also get a single point of access to the meta-data, avoid duplicated code, and complex kernel/user interfaces. There seems to be a consistent feeling that it is worth compromising all of these benefits just to push this 10% of the meta-data handling code out of the kernel (and inflate it by 5 or 6 X duplicating code already in the kernel). Where are the benefits of this userland approach? -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 20:45 ` Justin T. Gibbs @ 2004-03-27 15:39 ` Kevin Corry 2004-03-28 9:11 ` [dm-devel] " christophe varoqui 2004-03-30 17:03 ` Justin T. Gibbs 0 siblings, 2 replies; 38+ messages in thread From: Kevin Corry @ 2004-03-27 15:39 UTC (permalink / raw) To: linux-kernel, Justin T. Gibbs Cc: Jeff Garzik, Neil Brown, linux-raid, dm-devel On Friday 26 March 2004 2:45 pm, Justin T. Gibbs wrote: > We don't have control over the meta-data formats being used by the > industry. Coming up with a solution that only works for "Linux Engineered > Meta-data formats" removes any possibility of supporting things like DDF, > Adaptec ASR, and a host of other meta-data formats that can be plugged into > things like EMD. In the two cases we are supporting today with EMD, the > records required for doing discovery reside in the same sectors as those > that need to be updated at runtime from some "in-core" context. Well, there's certainly no guarantee that the "industry" will get it right. In this case, it seems that they didn't. But even given that we don't have ideal metadata formats, it's still possible to do discovery and a number of other management tasks from user-space. > > The main point I'm trying to get across here is that DM provides a simple > > yet extensible kernel framework for a variety of storage management > > tasks, including a lot more than just RAID. I think it would be a huge > > benefit for the RAID drivers to make use of this framework to provide > > functionality beyond what is currently available. > > DM is a transform layer that has the ability to pause I/O while that > transform is updated from userland. That's all it provides. I think the DM developers would disagree with you on this point. > As such, > it is perfectly suited to some types of logical volume management > applications. But that is as far as it goes. It does not have any > support for doing "sync/resync/scrub" type operations or any generic > support for doing anything with meta-data. The core DM driver would not and should not be handling these operations. These are handled in modules specific to one type of mapping. There's no need for the DM core to know anything about any metadata. If one particular module (e.g. dm-mirror) needs to support one or more metadata formats, it's free to do so. On the other hand, DM *does* provide services that make "sync/resync" a great deal simpler for such a module. It provides simple services for performing synchronous or asynchronous I/O to pages or vm areas. It provides a service for performing copies from one block-device area to another. The dm-mirror module uses these for this very purpose. If we need additional "libraries" for common RAID tasks (e.g. parity calculations) we can certainly add them. > In all of the examples you > have presented so far, you have not explained how this part of the equation > is handled. Sure, adding a member to a RAID1 is trivial. Just pause the > I/O, update the transform, and let it go. Unfortunately, that new member > is not in sync with the rest. The transform must be aware of this and only > trust the member below the sync mark. How is this information communicated > to the transform? Who updates the sync mark? Who copies the data to the > new member while guaranteeing that an in-flight write does not occur to the > area being synced? 
Before the new disk is added to the raid1, user-space is responsible for
writing an initial state to that disk, effectively marking it as completely
dirty and unsynced. When the new table is loaded, part of the "resume" is
for the module to read any metadata and do any initial setup that's
necessary. In this particular example, it means the new disk would start
with all of its "regions" marked "dirty", and all the regions would need to
be synced from corresponding "clean" regions on another disk in the set.

If the previously-existing disks were part-way through a sync when the
table was switched, their metadata would indicate where the current "sync
mark" was located. The module could then continue the sync from where it
left off, including the new disk that was just added. When the sync
completed, it might have to scan back to the beginning of the new disk to
see if it had any remaining dirty regions that needed to be synced before
that disk was completely clean.

And of course the I/O-mapping path just has to be smart enough to know
which regions are dirty and avoid sending live I/O to those.

(And I'm sure Joe or Alasdair could provide a better in-depth explanation
of the current dm-mirror module than I'm trying to. This is obviously a
very high-level overview.)

This process is somewhat similar to how dm-snapshot works. If it reads an
empty header structure, it assumes it's a new snapshot, and starts with an
empty hash table. If it reads a previously existing header, it continues
to read the on-disk COW tables and constructs the necessary in-memory
hash-table to represent that initial state.

> If you intend to add all of this to DM, then it is no
> longer any "simpler" or more extensible than EMD.

Sure it is. Because very little (if any) of this needs to affect the core
DM driver, that core remains as simple and extensible as it currently is.
The extra complexity only really affects the new modules that would handle
RAID.

> Don't take my arguments the wrong way. I believe that DM is useful
> for what it was designed for: LVM. It does not, however, provide the
> machinery required for it to replace a generic RAID stack. Could
> you merge a RAID stack into DM. Sure. Its only software. But for
> it to be robust, the same types of operations MD/EMD perform in kernel
> space will have to be done there too.
>
> The simplicity of DM is part of why it is compelling. My belief is that
> merging RAID into DM will compromise this simplicity and divert DM from
> what it was designed to do - provide LVM transforms.

I disagree. The simplicity of the core DM driver really isn't at stake
here. We're only talking about adding a few relatively complex target
modules. And with DM you get the benefit of a very simple user/kernel
interface.

> As for RAID discovery, this is the trivial portion of RAID. For an extra
> 10% or less of code in a meta-data module, you get RAID discovery. You
> also get a single point of access to the meta-data, avoid duplicated code,
> and complex kernel/user interfaces. There seems to be a consistent feeling
> that it is worth compromising all of these benefits just to push this 10%
> of the meta-data handling code out of the kernel (and inflate it by 5 or
> 6 X duplicating code already in the kernel). Where are the benefits of
> this userland approach?

I've got to admit, this whole discussion is very ironic. Two years ago I
was exactly where you are today, pushing for in-kernel discovery, a variety
of metadata modules, internal opaque device stacking, etc, etc.
I can only imagine that hch is laughing his ass off now that I'm the one arguing for moving all this stuff to user-space. I don't honestly expect to suddenly change your mind on all these issues. A lot of work has obviously gone into EMD, and I definitely know how hard it can be when the community isn't greeting your suggestions with open arms. And I'm certainly not saying the EMD method isn't a potentially viable approach. But it doesn't seem to be the approach the community is looking for. We faced the same resistance two years ago. It took months of arguing with the community and arguing amongst ourselves before we finally decided to move EVMS to user-space and use MD and DM. It was a decision that meant essentially throwing away an enormous amount of work from several people. It was an incredibly hard choice, but I really believe now that it was the right decision. It was the direction the community wanted to move in, and the only way for our project to truely survive was to move with them. So feel free to continue to develop and promote EMD. I'm not trying to stop you and I don't mind having competition for finding the best way to do RAID in Linux. But I can tell you from experience that EMD is going to face a good bit of opposition based on its current design and you might want to take that into consideration. I am interested in discussing if and how RAID could be supported under Device-Mapper (or some other "merging" of these two drivers). Jeff and Lars have shown some interest, and I certainly hope we can convince Neil and Joe that this is a good direction. Maybe it can be done and maybe it can't. I personally think it can be, and I'd at least like to have that discussion and find out. -- Kevin Corry kevcorry@us.ibm.com http://evms.sourceforge.net/ ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dm-devel] Re: "Enhanced" MD code avaible for review 2004-03-27 15:39 ` Kevin Corry @ 2004-03-28 9:11 ` christophe varoqui 2004-03-30 17:03 ` Justin T. Gibbs 1 sibling, 0 replies; 38+ messages in thread From: christophe varoqui @ 2004-03-28 9:11 UTC (permalink / raw) To: device-mapper development Cc: linux-kernel, Justin T. Gibbs, linux-raid, Jeff Garzik, Neil Brown Justin, I direct you to http://christophe.varoqui.free.fr/ for a well documented example of coordination between the device-mapper and the userspace multipath tools. I hope you'll see how robust and elegant the solution can be. regards, cvaroqui ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-27 15:39 ` Kevin Corry 2004-03-28 9:11 ` [dm-devel] " christophe varoqui @ 2004-03-30 17:03 ` Justin T. Gibbs 2004-03-30 17:15 ` Jeff Garzik 1 sibling, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-30 17:03 UTC (permalink / raw) To: Kevin Corry, linux-kernel; +Cc: Jeff Garzik, Neil Brown, linux-raid, dm-devel > Well, there's certainly no guarantee that the "industry" will get it right. In > this case, it seems that they didn't. But even given that we don't have ideal > metadata formats, it's still possible to do discovery and a number of other > management tasks from user-space. I have never proposed that management activities be performed solely within the kernel. My position has been that meta-data parsing and updating has to be core-resident for any solution that handles advanced RAID functionality and that spliting out any portion of those roles to userland just complicates the solution. >> it is perfectly suited to some types of logical volume management >> applications. But that is as far as it goes. It does not have any >> support for doing "sync/resync/scrub" type operations or any generic >> support for doing anything with meta-data. > > The core DM driver would not and should not be handling these operations. > These are handled in modules specific to one type of mapping. There's no > need for the DM core to know anything about any metadata. If one particular > module (e.g. dm-mirror) needs to support one or more metadata formats, it's > free to do so. That's unfortunate considering that the meta-data formats we are talking about already have the capability of expressing RAID 1(E),4,5,6. There has to be a common meta-data framework in order to avoid this duplication. >> In all of the examples you >> have presented so far, you have not explained how this part of the equation >> is handled. ... > Before the new disk is added to the raid1, user-space is responsible for > writing an initial state to that disk, effectively marking it as completely > dirty and unsynced. When the new table is loaded, part of the "resume" is for > the module to read any metadata and do any initial setup that's necessary. In > this particular example, it means the new disk would start with all of its > "regions" marked "dirty", and all the regions would need to be synced from > corresponding "clean" regions on another disk in the set. > > If the previously-existing disks were part-way through a sync when the table > was switched, their metadata would indicate where the current "sync mark" was > located. The module could then continue the sync from where it left off, > including the new disk that was just added. When the sync completed, it might > have to scan back to the beginning of the new disk to see if had any remaining > dirty regions that needed to be synced before that disk was completely clean. > > And of course the I/O-mapping path just has to be smart enough to know which > regions are dirty and avoid sending live I/O to those. > > (And I'm sure Joe or Alasdair could provide a better in-depth explanation of > the current dm-mirror module than I'm trying to. This is obviously a very > high-level overview.) So all of this complexity is still in the kernel. The only difference is that the meta-data can *also* be manipulated from userspace. 
In order for this to be safe, the mirror must be suspended (meta-data becomes stable), the meta-data must be re-read by the userland program, the meta-data must be updated, the mapping must be updated, the mirror must be resumed, and the mirror must revalidate all meta-data. How do you avoid deadlock in this process? Does the userland daemon, which must be core resident in this case, pre-allocate buffers for reading and writing the meta-data? The dm-raid1 module also appears to intrinsicly trust its mapping and the contents of its meta-data (simple magic number check). It seems to me that the kernel should validate all of its inputs regardless of whether the ioctls that are used to present them are only supposed to be used by a "trusted daemon". All of this adds up to more complexity. Your argument seems to be that, since DM avoids this complexity in its core, this is a better solution, but I am more interested in the least complex, most easily maintained total solution. >> The simplicity of DM is part of why it is compelling. My belief is that >> merging RAID into DM will compromise this simplicity and divert DM from >> what it was designed to do - provide LVM transforms. > > I disagree. The simplicity of the core DM driver really isn't at stake here. > We're only talking about adding a few relatively complex target modules. And > with DM you get the benefit of a very simple user/kernel interface. The simplicity of the user/kernel interface is not what is at stake here. With EMD, you can perform all of the same operations talked about above, in just as few ioctl calls. The only difference is that the kernel and only the kernel, reads and modifies the metadata. There are actually fewer steps for the userland application than before. This becomes even more evident as more meta-data modules are added. > I don't honestly expect to suddenly change your mind on all these issues. > A lot of work has obviously gone into EMD, and I definitely know how hard it > can be when the community isn't greeting your suggestions with open arms. I honestly don't care if the final solution is EMD, DM, or XYZ so long as that solution is correct, supportable, and covers all of the scenarios required for robust RAID support. That is the crux of the argument, not "please love my code". -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 17:03 ` Justin T. Gibbs @ 2004-03-30 17:15 ` Jeff Garzik 2004-03-30 17:35 ` Justin T. Gibbs 0 siblings, 1 reply; 38+ messages in thread From: Jeff Garzik @ 2004-03-30 17:15 UTC (permalink / raw) To: Justin T. Gibbs Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel Justin T. Gibbs wrote: > The dm-raid1 module also appears to intrinsicly trust its mapping and the > contents of its meta-data (simple magic number check). It seems to me that > the kernel should validate all of its inputs regardless of whether the > ioctls that are used to present them are only supposed to be used by a > "trusted daemon". The kernel should not be validating -trusted- userland inputs. Root is allowed to scrag the disk, violate limits, and/or crash his own machine. A simple example is requiring userland, when submitting ATA taskfiles via an ioctl, to specify the data phase (pio read, dma write, no-data, etc.). If the data phase is specified incorrectly, you kill the OS driver's ATA host state machine, and the results are very unpredictable. Since this is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get the required details right (just like following a spec). > I honestly don't care if the final solution is EMD, DM, or XYZ so long > as that solution is correct, supportable, and covers all of the scenarios > required for robust RAID support. That is the crux of the argument, not > "please love my code". hehe. I think we all agree here... Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review
2004-03-30 17:15 ` Jeff Garzik
@ 2004-03-30 17:35 ` Justin T. Gibbs
2004-03-30 17:46 ` Jeff Garzik
2004-03-30 18:11 ` Bartlomiej Zolnierkiewicz
0 siblings, 2 replies; 38+ messages in thread
From: Justin T. Gibbs @ 2004-03-30 17:35 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel

> The kernel should not be validating -trusted- userland inputs. Root is
> allowed to scrag the disk, violate limits, and/or crash his own machine.
>
> A simple example is requiring userland, when submitting ATA taskfiles via
> an ioctl, to specify the data phase (pio read, dma write, no-data, etc.).
> If the data phase is specified incorrectly, you kill the OS driver's ATA
> host state machine, and the results are very unpredictable. Since this
> is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get the
> required details right (just like following a spec).

That's unfortunate for those using ATA. A command submitted from userland
to the SCSI drivers I've written that causes a protocol violation will
be detected, result in appropriate recovery, and a nice diagnostic that
can be used to diagnose the problem. Part of this is because I cannot know
if the protocol violation stems from a target defect, the input from the
user or, for that matter, from the kernel. The main reason is for
robustness and ease of debugging. In the SCSI case, there is almost no
run-time cost, and the system will stop before data corruption occurs. In
the meta-data case we've been discussing in terms of EMD, there is no
runtime cost, the validation has to occur somewhere anyway, and in many
cases some validation is already required to avoid races with external
events. If the validation is done in the kernel, then you get the benefit
of nice diagnostics instead of strange crashes that are difficult to debug.

--
Justin

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 17:35 ` Justin T. Gibbs @ 2004-03-30 17:46 ` Jeff Garzik 2004-03-30 18:04 ` Justin T. Gibbs 2004-03-30 18:11 ` Bartlomiej Zolnierkiewicz 1 sibling, 1 reply; 38+ messages in thread From: Jeff Garzik @ 2004-03-30 17:46 UTC (permalink / raw) To: Justin T. Gibbs Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel Justin T. Gibbs wrote: >>The kernel should not be validating -trusted- userland inputs. Root is >>allowed to scrag the disk, violate limits, and/or crash his own machine. >> >>A simple example is requiring userland, when submitting ATA taskfiles via >>an ioctl, to specify the data phase (pio read, dma write, no-data, etc.). >>If the data phase is specified incorrectly, you kill the OS driver's ATA >>host wwtate machine, and the results are very unpredictable. Since this >>is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get the >>required details right (just like following a spec). > > > That's unfortunate for those using ATA. A command submitted from userland Required, since one cannot know the data phase of vendor-specific commands. > to the SCSI drivers I've written that causes a protocol violation will > be detected, result in appropriate recovery, and a nice diagnostic that > can be used to diagnose the problem. Part of this is because I cannot know > if the protocol violation stems from a target defect, the input from the > user or, for that matter, from the kernel. The main reason is for robustness Well, * the target is not _issuing_ commands, * any user issuing incorrect commands/cdbs is not your bug, * and kernel code issuing incorrect cmands/cdbs isn't your bug either Particularly, checking whether the kernel is doing something wrong, or wrong, just wastes cycles. That's not a scalable way to code... if every driver and Linux subsystem did that, things would be unbearable slow. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 17:46 ` Jeff Garzik @ 2004-03-30 18:04 ` Justin T. Gibbs 2004-03-30 21:47 ` Jeff Garzik 0 siblings, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-30 18:04 UTC (permalink / raw) To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel >> That's unfortunate for those using ATA. A command submitted from userland > > Required, since one cannot know the data phase of vendor-specific commands. So you are saying that this presents an unrecoverable situation? > Particularly, checking whether the kernel is doing something wrong, or wrong, > just wastes cycles. That's not a scalable way to code... if every driver > and Linux subsystem did that, things would be unbearable slow. Hmm. I've never had someone tell me that my SCSI drivers are slow. I don't think that your statement is true in the general case. My belief is that validation should occur where it is cheap and efficient to do so. More expensive checks should be pushed into diagnostic code that is disabled by default, but the code *should be there*. In any event, for RAID meta-data, we're talking about code that is *not* in the common or time critical path of the kernel. A few dozen lines of validation code there has almost no impact on the size of the kernel and yields huge benefits for debugging and maintaining the code. This is even more the case in Linux the end user is often your test lab. -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 18:04 ` Justin T. Gibbs @ 2004-03-30 21:47 ` Jeff Garzik 2004-03-30 22:12 ` Justin T. Gibbs 0 siblings, 1 reply; 38+ messages in thread From: Jeff Garzik @ 2004-03-30 21:47 UTC (permalink / raw) To: Justin T. Gibbs Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel Justin T. Gibbs wrote: >>>That's unfortunate for those using ATA. A command submitted from userland >> >>Required, since one cannot know the data phase of vendor-specific commands. > > > So you are saying that this presents an unrecoverable situation? No, I'm saying that the data phase need not have a bunch of in-kernel checks, it should be generated correctly from the source. >>Particularly, checking whether the kernel is doing something wrong, or wrong, >>just wastes cycles. That's not a scalable way to code... if every driver >>and Linux subsystem did that, things would be unbearable slow. > > > Hmm. I've never had someone tell me that my SCSI drivers are slow. This would be noticed in the CPU utilization area. Your drivers are probably a long way from being CPU-bound. > I don't think that your statement is true in the general case. My > belief is that validation should occur where it is cheap and efficient > to do so. More expensive checks should be pushed into diagnostic code > that is disabled by default, but the code *should be there*. In any event, > for RAID meta-data, we're talking about code that is *not* in the common > or time critical path of the kernel. A few dozen lines of validation code > there has almost no impact on the size of the kernel and yields huge > benefits for debugging and maintaining the code. This is even more > the case in Linux the end user is often your test lab. It doesn't scale terribly well, because the checks themselves become a source of bugs. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 21:47 ` Jeff Garzik @ 2004-03-30 22:12 ` Justin T. Gibbs 2004-03-30 22:34 ` Jeff Garzik 0 siblings, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-30 22:12 UTC (permalink / raw) To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel >> So you are saying that this presents an unrecoverable situation? > > No, I'm saying that the data phase need not have a bunch of in-kernel > checks, it should be generated correctly from the source. The SCSI drivers validate the controller's data phase based on the expected phase presented to them from an upper layer. I never talked about adding checks that make little sense or are overly expensive. You seem to equate validation with huge expense. That is just not the general case. >> Hmm. I've never had someone tell me that my SCSI drivers are slow. > > This would be noticed in the CPU utilization area. Your drivers are > probably a long way from being CPU-bound. I very much doubt that. There are perhaps four or five tests in the I/O path where some value already in a cache line that has to be accessed anyway is compared against a constant. We're talking about something down in the noise of any type of profiling you could perform. As I said, validation makes sense where there is basically no-cost to do it. >> I don't think that your statement is true in the general case. My >> belief is that validation should occur where it is cheap and efficient >> to do so. More expensive checks should be pushed into diagnostic code >> that is disabled by default, but the code *should be there*. In any event, >> for RAID meta-data, we're talking about code that is *not* in the common >> or time critical path of the kernel. A few dozen lines of validation code >> there has almost no impact on the size of the kernel and yields huge >> benefits for debugging and maintaining the code. This is even more >> the case in Linux the end user is often your test lab. > > It doesn't scale terribly well, because the checks themselves become a > source of bugs. So now the complaint is that validation code is somehow harder to write and maintain than the rest of the code? -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review
2004-03-30 22:12 ` Justin T. Gibbs
@ 2004-03-30 22:34 ` Jeff Garzik
0 siblings, 0 replies; 38+ messages in thread
From: Jeff Garzik @ 2004-03-30 22:34 UTC (permalink / raw)
To: Justin T. Gibbs
Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel

Justin T. Gibbs wrote:
>>>So you are saying that this presents an unrecoverable situation?
>>
>>No, I'm saying that the data phase need not have a bunch of in-kernel
>>checks, it should be generated correctly from the source.
>
> The SCSI drivers validate the controller's data phase based on the
> expected phase presented to them from an upper layer. I never talked
> about adding checks that make little sense or are overly expensive. You
> seem to equate validation with huge expense. That is just not the
> general case.
>
>>>Hmm. I've never had someone tell me that my SCSI drivers are slow.
>>
>>This would be noticed in the CPU utilization area. Your drivers are
>>probably a long way from being CPU-bound.
>
> I very much doubt that. There are perhaps four or five tests in the
> I/O path where some value already in a cache line that has to be accessed
> anyway is compared against a constant. We're talking about something
> down in the noise of any type of profiling you could perform. As I said,
> validation makes sense where there is basically no-cost to do it.
>
>>>I don't think that your statement is true in the general case. My
>>>belief is that validation should occur where it is cheap and efficient
>>>to do so. More expensive checks should be pushed into diagnostic code
>>>that is disabled by default, but the code *should be there*. In any event,
>>>for RAID meta-data, we're talking about code that is *not* in the common
>>>or time critical path of the kernel. A few dozen lines of validation code
>>>there has almost no impact on the size of the kernel and yields huge
>>>benefits for debugging and maintaining the code. This is even more
>>>the case in Linux the end user is often your test lab.
>>
>>It doesn't scale terribly well, because the checks themselves become a
>>source of bugs.
>
> So now the complaint is that validation code is somehow harder to write
> and maintain than the rest of the code?

Actually, yes. Validation of random user input has always been a source
of bugs (usually in edge cases), in Linux and in other operating systems.
It is often the area where security bugs are found.

Basically you want to avoid adding checks for conditions that don't occur
in properly written software, and make sure that the kernel always
generates correct requests. Obviously that excludes anything on the
target side, but other than that... in userland, a privileged user is
free to do anything they wish, including violate protocols, cook their
disk, etc.

	Jeff

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 17:35 ` Justin T. Gibbs 2004-03-30 17:46 ` Jeff Garzik @ 2004-03-30 18:11 ` Bartlomiej Zolnierkiewicz 1 sibling, 0 replies; 38+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-30 18:11 UTC (permalink / raw) To: Justin T. Gibbs, Jeff Garzik Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel On Tuesday 30 of March 2004 19:35, Justin T. Gibbs wrote: > > The kernel should not be validating -trusted- userland inputs. Root is > > allowed to scrag the disk, violate limits, and/or crash his own machine. > > > > A simple example is requiring userland, when submitting ATA taskfiles via > > an ioctl, to specify the data phase (pio read, dma write, no-data, etc.). > > If the data phase is specified incorrectly, you kill the OS driver's ATA > > host wwtate machine, and the results are very unpredictable. Since this > > is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get > > the required details right (just like following a spec). > > That's unfortunate for those using ATA. A command submitted from userland > to the SCSI drivers I've written that causes a protocol violation will > be detected, result in appropriate recovery, and a nice diagnostic that > can be used to diagnose the problem. Part of this is because I cannot know > if the protocol violation stems from a target defect, the input from the > user or, for that matter, from the kernel. The main reason is for > robustness and ease of debugging. In SCSI case, there is almost no > run-time cost, and the system will stop before data corruption occurs. In In ATA case detection of protocol violation is not possible w/o checking every possible command opcode. Even if implemented (notice that checking commands coming from kernel is out of question - for performance reasons) this breaks for future and vendor specific commands. > the meta-data case we've been discussing in terms of EMD, there is no > runtime cost, the validation has to occur somewhere anyway, and in many > cases some validation is already required to avoid races with external > events. If the validation is done in the kernel, then you get the benefit > of nice diagnostics instead of strange crashes that are difficult to debug. Unless code that crashes is the one doing validation. ;-) Bartlomiej ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review
2004-03-25 18:00 ` Kevin Corry
2004-03-25 18:42 ` Jeff Garzik
@ 2004-03-25 22:59 ` Justin T. Gibbs
2004-03-25 23:44 ` Lars Marowsky-Bree
1 sibling, 1 reply; 38+ messages in thread
From: Justin T. Gibbs @ 2004-03-25 22:59 UTC (permalink / raw)
To: Kevin Corry, linux-kernel; +Cc: Jeff Garzik, Neil Brown, linux-raid

>> Independent DM efforts have already started supporting MD raid0/1
>> metadata from what I understand, though these efforts don't seem to post
>> to linux-kernel or linux-raid much at all. :/
>
> I post on lkml.....occasionally. :)

...

> This decision was not based on any real dislike of the MD driver, but rather
> for the benefits that are gained by using Device-Mapper. In particular,
> Device-Mapper provides the ability to change out the device mapping on the
> fly, by temporarily suspending I/O, changing the table, and resuming the I/O
> I'm sure many of you know this already. But I'm not sure everyone fully
> understands how powerful a feature this is. For instance, it means EVMS can
> now expand RAID-linear devices online. While that particular example may not
> sound all that exciting, if things like RAID-1 and RAID-5 were "ported" to
> Device-Mapper, this feature would then allow you to do stuff like add new
> "active" members to a RAID-1 online (think changing from 2-way mirror to
> 3-way mirror). It would be possible to convert from RAID-0 to RAID-4 online
> simply by adding a new disk (assuming other limitations, e.g. a single
> stripe-zone). Unfortunately, these are things the MD driver can't do online,
> because you need to completely stop the MD device before making such changes
> (to prevent the kernel and user-space from trampling on the same metadata),
> and MD won't stop the device if it's open (i.e. if it's mounted or if you
> have other device (LVM) built on top of MD). Often times this means you need
> to boot to a rescue-CD to make these types of configuration changes.

We should be clear about your argument here. It is not that DM makes
generic morphing easy and possible, it is that with DM the most basic
types of morphing (no data striping or de-striping) are easily
accomplished. You cite two examples:

1) Adding another member to a RAID-1. While MD may not allow this to
   occur while the array is operational, EMD does. This is possible
   because there is only one entity controlling the meta-data.

2) Converting a RAID0 to a RAID4 while possible with DM is not
   particularly interesting from an end user perspective.

The fact of the matter is that neither EMD nor DM provide a generic
morphing capability. If this is desirable, we can discuss how it could
be achieved, but my initial belief is that attempting any type of
complicated morphing from userland would be slow, prone to deadlocks,
and thus difficult to achieve in a fashion that guaranteed no loss of
data in the face of unexpected system restarts.

--
Justin

^ permalink raw reply [flat|nested] 38+ messages in thread
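[The online RAID-linear expansion mentioned in the quoted text reduces to
the same suspend/reload/resume pattern; a rough sketch with made-up device
names and sizes, not from the thread.]

  # Assumed: "vol0" is an existing 4 GiB (8388608-sector) linear mapping
  # on /dev/sdb1.  Append another 4 GiB from /dev/sdc1 by staging a
  # two-segment table and switching to it.
  cat > /tmp/vol0.table <<'EOF'
  0 8388608 linear /dev/sdb1 0
  8388608 8388608 linear /dev/sdc1 0
  EOF

  dmsetup suspend vol0
  dmsetup reload vol0 /tmp/vol0.table   # stage the new table
  dmsetup resume vol0                   # switch mappings atomically
  # Any filesystem on vol0 still has to be grown separately afterwards.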
* Re: "Enhanced" MD code avaible for review 2004-03-25 22:59 ` Justin T. Gibbs @ 2004-03-25 23:44 ` Lars Marowsky-Bree 2004-03-26 0:03 ` Justin T. Gibbs 0 siblings, 1 reply; 38+ messages in thread From: Lars Marowsky-Bree @ 2004-03-25 23:44 UTC (permalink / raw) To: Justin T. Gibbs, Kevin Corry, linux-kernel Cc: Jeff Garzik, Neil Brown, linux-raid On 2004-03-25T15:59:00, "Justin T. Gibbs" <gibbs@scsiguy.com> said: > The fact of the matter is that neither EMD nor DM provide a generic > morphing capability. If this is desirable, we can discuss how it could > be achieved, but my initial belief is that attempting any type of > complicated morphing from userland would be slow, prone to deadlocks, > and thus difficult to achieve in a fashion that guaranteed no loss of > data in the face of unexpected system restarts. Uhm. DM sort of does (at least where the morphing amounts to resyncing a part of the stripe, ie adding a new mirror, RAID1->4, RAID5->6 etc). Freeze, load new mapping, continue. I agree that more complex morphings (RAID1->RAID5 or vice-versa in particular) are more difficult to get right, but are not that often needed online - or if they are, typically such scenarios will have enough temporary storage to create the new target, RAID1 over, disconnect the old part and free it, which will work just fine with DM. Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs | try again. fail again. fail better. Research & Development, SUSE LINUX AG \ -- Samuel Beckett - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 23:44 ` Lars Marowsky-Bree @ 2004-03-26 0:03 ` Justin T. Gibbs 0 siblings, 0 replies; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-26 0:03 UTC (permalink / raw) To: Lars Marowsky-Bree, Kevin Corry, linux-kernel Cc: Jeff Garzik, Neil Brown, linux-raid > Uhm. DM sort of does (at least where the morphing amounts to resyncing a > part of the stripe, ie adding a new mirror, RAID1->4, RAID5->6 etc). > Freeze, load new mapping, continue. The point is that these trivial "morphings" can be achieved with limited effort regardless of whether you do it via EMD or DM. Implementing this in EMD could be achieved with perhaps 8 hours work with no significant increase in code size or complexity. This is part of why I find them "uninteresting". If we really want to talk about generic morphing, I think you'll find that DM is no better suited to this task than MD or its derivatives. > I agree that more complex morphings (RAID1->RAID5 or vice-versa in > particular) are more difficult to get right, but are not that often > needed online - or if they are, typically such scenarios will have > enough temporary storage to create the new target, RAID1 over, > disconnect the old part and free it, which will work just fine with DM. The most common requests that we hear from customers are: o single -> R1 Equally possible with MD or DM assuming your singles are accessed via a volume manager. Without that support the user will have to dismount and remount storage. o R1 -> R10 This should require just double the number of active members. This is not possible today with either DM or MD. Only "migration" is possible. o R1 -> R5 o R5 -> R1 These typically occur when data access patterns change for the customer. Again not possible with DM or MD today. All of these are important to some subset of customers and are, to my mind, required if you want to claim even basic morphing capability. If you are allowing the "cop-out" of using a volume manager to substitute data-migration for true morphing, then MD is almost as well suited to that task as DM. -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
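[For the single -> R1 case above, the "accessed via a volume manager"
caveat is the whole trick: a device-mapper or LVM node must already sit
on top of the single disk so that its table can be swapped for a mirror.
A hedged sketch, not from the thread, reusing the illustrative (and
possibly version-dependent) mirror-table syntax shown earlier.]

  # Assumed: the filesystem sits on DM device "vol0", currently a plain
  # linear map of /dev/sdb1; /dev/sdc1 is the new, empty second disk.
  dmsetup table vol0               # inspect the current one-segment table
  dmsetup suspend vol0
  dmsetup reload vol0 --table \
    "0 8388608 mirror core 1 1024 2 /dev/sdb1 0 /dev/sdc1 0"
  dmsetup resume vol0              # /dev/sdc1 must now be synced from sdb1
  # Had the filesystem been mounted on /dev/sdb1 directly, the table swap
  # would not be possible without dismounting and remounting the storage.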
end of thread, other threads:[~2004-03-31 17:07 UTC | newest] Thread overview: 38+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2004-03-19 20:19 "Enhanced" MD code avaible for review Justin T. Gibbs 2004-03-23 5:05 ` Neil Brown 2004-03-23 6:23 ` Justin T. Gibbs 2004-03-24 2:26 ` Neil Brown 2004-03-24 19:09 ` Matt Domsch 2004-03-25 2:21 ` Jeff Garzik 2004-03-25 18:00 ` Kevin Corry 2004-03-25 18:42 ` Jeff Garzik 2004-03-25 18:48 ` Jeff Garzik 2004-03-25 23:46 ` Justin T. Gibbs 2004-03-26 0:01 ` Jeff Garzik 2004-03-26 0:10 ` Justin T. Gibbs 2004-03-26 0:14 ` Jeff Garzik 2004-03-25 22:04 ` Lars Marowsky-Bree 2004-03-26 19:19 ` Kevin Corry 2004-03-31 17:07 ` Randy.Dunlap 2004-03-25 23:35 ` Justin T. Gibbs 2004-03-26 0:13 ` Jeff Garzik 2004-03-26 17:43 ` Justin T. Gibbs 2004-03-28 0:06 ` Lincoln Dale 2004-03-30 17:54 ` Justin T. Gibbs 2004-03-28 0:30 ` Jeff Garzik 2004-03-26 19:15 ` Kevin Corry 2004-03-26 20:45 ` Justin T. Gibbs 2004-03-27 15:39 ` Kevin Corry 2004-03-28 9:11 ` [dm-devel] " christophe varoqui 2004-03-30 17:03 ` Justin T. Gibbs 2004-03-30 17:15 ` Jeff Garzik 2004-03-30 17:35 ` Justin T. Gibbs 2004-03-30 17:46 ` Jeff Garzik 2004-03-30 18:04 ` Justin T. Gibbs 2004-03-30 21:47 ` Jeff Garzik 2004-03-30 22:12 ` Justin T. Gibbs 2004-03-30 22:34 ` Jeff Garzik 2004-03-30 18:11 ` Bartlomiej Zolnierkiewicz 2004-03-25 22:59 ` Justin T. Gibbs 2004-03-25 23:44 ` Lars Marowsky-Bree 2004-03-26 0:03 ` Justin T. Gibbs