* Re: "Enhanced" MD code avaible for review [not found] ` <1AOTW-4Vx-5@gated-at.bofh.it> @ 2004-03-18 1:33 ` Andi Kleen 2004-03-18 2:00 ` Jeff Garzik 0 siblings, 1 reply; 56+ messages in thread From: Andi Kleen @ 2004-03-18 1:33 UTC (permalink / raw) To: Jeff Garzik; +Cc: linux-raid, justin_gibbs, Linux Kernel Jeff Garzik <jgarzik@pobox.com> writes: > > ioctl's are a pain for 32->64-bit translation layers. Using a > read/write interface allows one to create an interface that requires > no translation layer -- a big deal for AMD64 and IA32e processors > moving forward -- and it also gives one a lot more control over the > interface. Sorry, Jeff, but that's just not true. While ioctls need an additional entry in the conversion table, they can at least easily get an translation handler if needed. When they are correctly designed you just need a single line to enable pass through the emulation. If you don't want to add that line to the generic compat_ioctl.h file you can also do it in your driver. read/write has the big disadvantage that if someone gets the emulation wrong (and that happens regularly) it is near impossible to add an emulation handler then, because there is no good way to hook into the read/write paths. There may be valid reasons to go for read/write, but 32bit emulation is not one of them. In fact from the emulation perspective read/write should be avoided. -Andi ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-18 1:33 ` "Enhanced" MD code avaible for review Andi Kleen @ 2004-03-18 2:00 ` Jeff Garzik 2004-03-20 9:58 ` Jamie Lokier 0 siblings, 1 reply; 56+ messages in thread From: Jeff Garzik @ 2004-03-18 2:00 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-raid, justin_gibbs, Linux Kernel Andi Kleen wrote: > Sorry, Jeff, but that's just not true. While ioctls need an additional > entry in the conversion table, they can at least easily get an > translation handler if needed. When they are correctly designed you > just need a single line to enable pass through the emulation. > If you don't want to add that line to the generic compat_ioctl.h > file you can also do it in your driver. > > read/write has the big disadvantage that if someone gets the emulation > wrong (and that happens regularly) it is near impossible to add an > emulation handler then, because there is no good way to hook > into the read/write paths. > > There may be valid reasons to go for read/write, but 32bit emulation > is not one of them. In fact from the emulation perspective read/write > should be avoided. I'll probably have to illustrate with code, but basically, read/write can be completely ignorant of 32/64-bit architecture, endianness, it can even be network-transparent. ioctls just can't do that. Jeff ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-18 2:00 ` Jeff Garzik @ 2004-03-20 9:58 ` Jamie Lokier 0 siblings, 0 replies; 56+ messages in thread From: Jamie Lokier @ 2004-03-20 9:58 UTC (permalink / raw) To: Jeff Garzik; +Cc: Andi Kleen, linux-raid, justin_gibbs, Linux Kernel Jeff Garzik wrote: > I'll probably have to illustrate with code, but basically, read/write > can be completely ignorant of 32/64-bit architecture, endianness, it can > even be network-transparent. ioctls just can't do that. Apart from the network transparency, yes they can. Ioctl is no different from read/write/read-modify-write except the additional command argument. You can write architecture-specific ioctls which take and return structs -- and you can do the same with read/write. This is what Andi is thinking of as dangerous: the read/write case is then much harder to emulate. Or, you can write architecture-independent read/write, which use fixed formats, which you seem to have in mind. That works fine with ioctls too. It isn't commonly done, because people prefer the convenience of a struct. But it does work. It's slightly easier in the driver to implement commands this way using an ioctl, because you don't have to check the read/write length. It's about the same to use from userspace: both read/write and ioctl methods using an architecture-independent data format require the program to lay out the command bytes and then issue one system call. -- Jamie ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review
@ 2004-03-19 20:19 Justin T. Gibbs
2004-03-23 5:05 ` Neil Brown
0 siblings, 1 reply; 56+ messages in thread
From: Justin T. Gibbs @ 2004-03-19 20:19 UTC (permalink / raw)
To: linux-raid; +Cc: linux-kernel
[ CC trimmed since all those on the CC line appear to be on the lists ... ]
Let's take a step back and focus on a few of the points on which we can
hopefully all agree:
o Any successful solution will have to have "meta-data modules" for
active arrays "core resident" in order to be robust. This
requirement stems from the need to avoid deadlock during error
recovery scenarios that must block "normal I/O" to the array while
meta-data operations take place.
o It is desirable for arrays to auto-assemble based on recorded
meta-data. This includes the ability to have a user hot-insert
a "cold spare", have the system recognize it as a spare (based
on the meta-data resident on it) and activate it if necessary to
restore a degraded array.
o Child devices of an array should only be accessible through the
array while the array is in a configured state (bd_claim'ed).
This avoids situations where a user can subvert the integrity of
the array by performing "rogue I/O" to an array member.
Concentrating on just these three, we come to the conclusion that
whether the solution comes via "early user fs" or kernel modules,
the resident size of the solution *will* include the cost for
meta-data support. In either case, the user is able to tailor their
system to include only the support necessary for their individual
system to operate.
If we want to argue the merits of either approach based on just the
sheer size of resident code, I have little doubt that the kernel
module approach will prove smaller:
o No need for "mdadm" or some other daemon to be locked resident in
memory. This alone saves you having a locked copy of klibc or
any other user libraries core resident. The kernel modules
leverage the kernel APIs that already have to be core resident
to satisfy the needs of other parts of the kernel which also
helps in reducing its size.
o Initial RAM disk data can be discarded after modules are loaded at
boot time.
Putting the size argument aside for a moment, let's explore how a
userland solution could satisfy just the above three requirements.
How is meta-data updated on child members of an array while that
array is on-line? Remember that these operations occur with some
frequency. MD includes "safe-mode" support where redundant arrays
are marked clean any time writes cease for a predetermined, fairly
short, amount of time. The userland app cannot access the component
devices directly since they are bd_claim'ed. Even if that mechanism
is somehow subverted, how do we guarantee that these meta-data
writes do not cause a deadlock? In the case of a transition from
Read-only to Write mode, all writes are blocked to the array (this
must be the case for "Dirty" state to be accurate). It seems to
me that you must then provide extra code to not only pre-allocate
buffers for the userland app to do its work, but also provide a
"back-door" interface for these operations to take place.
The argument has also been made that shifting some of this code out
to a userland app "simplifies" the solution and perhaps even makes
it easier to develop. Comparing the two approaches we have:
UserFS:
o Kernel Driver + "enhanced interface to userland daemon"
o Userland Daemon (core resident)
o Userland Meta-Data modules
o Userland Management tool
- This tool needs to interface to the daemon and
perhaps also the kernel driver.
Kernel:
o Kernel RAID Transform Drivers
o Kernel Meta-Data modules
o Simple Userland Management
tool with no meta-data knowledge
So two questions arise from this analysis:
1) Are meta-data modules easier to code up or more robust as user
or kernel modules? I believe that doing these outside the kernel
will make them larger and more complex while also losing the
ability to have meta-data modules weigh in on rapidly occurring
events without incurring performance tradeoffs. Regardless of
where they reside, these modules must be robust. A kernel Oops
or a segfault in the daemon is unacceptable to the end user.
Saying that a segfault is less harmful in some way than an Oops
when we're talking about the users data completely misses the
point of why people use RAID.
2) What added complexity is incurred by supporting a core-resident
daemon as well as management interfaces to the daemon
and potentially the kernel module? I have not fully thought
through the corner cases such an approach would expose, so I
cannot quantify this cost. There are certainly more components
to get right and keep synchronized.
In the end, I find it hard to justify inventing all of the userland
machinery necessary to make this work just to avoid roughly 2K
lines of code per metadata module from being part of the kernel.
The ASR module for example, which is only required by those that
need support for this meta-data type, is only 19K with all of its
debugging printks and code enabled, unstripped. Are there benefits
to the userland approach that I'm missing?
--
Justin
^ permalink raw reply [flat|nested] 56+ messages in thread

* Re: "Enhanced" MD code avaible for review
2004-03-19 20:19 Justin T. Gibbs
@ 2004-03-23 5:05 ` Neil Brown
2004-03-23 6:23 ` Justin T. Gibbs
0 siblings, 1 reply; 56+ messages in thread
From: Neil Brown @ 2004-03-23 5:05 UTC (permalink / raw)
To: Justin T. Gibbs; +Cc: linux-raid, linux-kernel

On Friday March 19, gibbs@scsiguy.com wrote:
> [ CC trimmed since all those on the CC line appear to be on the lists ... ]
>
> Let's take a step back and focus on a few of the points on which we can
> hopefully all agree:
>
> o Any successful solution will have to have "meta-data modules" for
>   active arrays "core resident" in order to be robust.  This
>   requirement stems from the need to avoid deadlock during error
>   recovery scenarios that must block "normal I/O" to the array while
>   meta-data operations take place.

I agree.
'Linear' and 'raid0' arrays don't really need metadata support in the
kernel as their metadata is essentially read-only.
There are interesting applications for raid1 without metadata, but I
think that for all raid personalities where metadata might need to be
updated in an error condition to preserve data integrity, the kernel
should know enough about the metadata to perform that update.

It would be nice to keep the in-kernel knowledge to a minimum, though
some metadata formats probably make that hard.

>
> o It is desirable for arrays to auto-assemble based on recorded
>   meta-data.  This includes the ability to have a user hot-insert
>   a "cold spare", have the system recognize it as a spare (based
>   on the meta-data resident on it) and activate it if necessary to
>   restore a degraded array.

Certainly.  It doesn't follow that the auto-assembly has to happen
within the kernel.  Having it all done in user-space makes it much
easier to control/configure.

I think the best way to describe my attitude to auto-assembly is that
it could be needs-driven rather than availability-driven.

needs-driven means: if the user asks to access an array that doesn't
exist, then try to find the bits and assemble it.
availability-driven means: find all the devices that could be part of
an array, and combine as many of them as possible together into
arrays.

Currently filesystems are needs-driven.  At boot time, only the root
filesystem, which has been identified somehow, gets mounted.
Then the init scripts mount any others that are needed.
We don't have any hunting around for filesystem superblocks and
mounting the filesystems just in case they are needed.

Currently partitions are (sufficiently) needs-driven.  It is true that
any partitionable device has its partitions presented.  However the
existence of partitions does not affect access to the whole device at
all.  Only once the partitions are claimed is the whole-device
blocked.

Providing that auto-assembly of arrays works the same way (needs
driven), I am happy for arrays to auto-assemble.
I happen to think this is most easily done in user-space.

With DDF format metadata, there is a concept of 'imported' arrays,
which basically means arrays from some other controller that have been
attached to the current controller.

Part of my desire for needs-driven assembly is that I don't want to
inadvertently assemble 'imported' arrays.
A DDF controller has NVRAM or a hardcoded serial number to help avoid
this.  A generic Linux machine doesn't.
I could possibly be happy with auto-assembly where a kernel parameter
of DDF=xx.yy.zz was taken to mean that we "need" to assemble all DDF
arrays that have a controller-id (or whatever it is) of xx.yy.zz.

This is probably simple enough to live entirely in the kernel.

>
> o Child devices of an array should only be accessible through the
>   array while the array is in a configured state (bd_claim'ed).
>   This avoids situations where a user can subvert the integrity of
>   the array by performing "rogue I/O" to an array member.

bd_claim doesn't and (I believe) shouldn't stop access from user-space.
It does stop a number of sorts of access that would expect exclusive
access.

But back to your original post: I suspect there is lots of valuable
stuff in your emd patch, but as you have probably gathered, big
patches are not the way we work around here, and with good reason.

If you would like to identify isolated pieces of functionality, create
patches to implement them, and submit them for review I will be quite
happy to review them and, when appropriate, forward them to
Andrew/Linus.
I suggest you start with less controversial changes and work your way
forward.

NeilBrown

^ permalink raw reply [flat|nested] 56+ messages in thread
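A sketch of the boot parameter Neil suggests above; the parameter name follows his example, everything else is invented for illustration. The kernel records which DDF controller-id is "needed" and only auto-assembles arrays carrying it:

    #include <linux/init.h>
    #include <linux/string.h>

    static char wanted_ddf_id[64];      /* "xx.yy.zz" from the command line */

    static int __init ddf_setup(char *str)
    {
        strlcpy(wanted_ddf_id, str, sizeof(wanted_ddf_id));
        return 1;
    }
    __setup("DDF=", ddf_setup);

    /* ...later, during discovery (invented helper), only arrays whose
     * controller-id matches wanted_ddf_id would be assembled. */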
* Re: "Enhanced" MD code avaible for review 2004-03-23 5:05 ` Neil Brown @ 2004-03-23 6:23 ` Justin T. Gibbs 2004-03-24 2:26 ` Neil Brown 0 siblings, 1 reply; 56+ messages in thread From: Justin T. Gibbs @ 2004-03-23 6:23 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid, linux-kernel >> o Any successful solution will have to have "meta-data modules" for >> active arrays "core resident" in order to be robust. This ... > I agree. > 'Linear' and 'raid0' arrays don't really need metadata support in the > kernel as their metadata is essentially read-only. > There are interesting applications for raid1 without metadata, but I > think that for all raid personalities where metadata might need to be > updated in an error condition to preserve data integrity, the kernel > should know enough about the metadata to perform that update. > > It would be nice to keep the in-kernel knowledge to a minimum, though > some metadata formats probably make that hard. Can you further explain why you want to limit the kernel's knowledge and where you would separate the roles between kernel and userland? In reviewing one of our typical metadata modules, perhaps 80% of the code is generic meta-data record parsing and state conversion logic that would have to be retained in the kernel to perform "minimal meta-data updates". Some high portion of this 80% (less the portion that builds the in-kernel data structures to manipulate and update meta-data) would also need to be replicated into a user-land utility for any type of separation of labor to be possible. The remaining 20% of the kernel code deals with validation of user meta-data creation requests. This code is relatively small since it leverages all of the other routines that are already required for the operational requirements of the module. Splitting the roles bring up some important issues: 1) Code duplication. Depending on the complexity of the meta-data format being supported, the amount of code duplication between userland and kernel modules may be quite large. Any time code is duplicated, the solution is prone to getting out of sync - bugs are fixed in one copy of the code but not another. 2) Solution Complexity Two entities understand how to read and manipulate the meta-data. Policies and APIs must be created to ensure that only one entity is performing operations on the meta-data at a time. This is true even if one entity is primarily a read-only "client". For example, a meta-data module may defer meta-data updates in some instances (e.g. rebuild checkpointing) until the meta-data is closed (writing the checkpoint sooner doesn't make sense considering that you should restart your scrub, rebuild or verify if the system is not safely shutdown). How does the userland client get the most up-to-date information? This is just one of the problems in this area. 3) Size Due to code duplication, the total solution will be larger in code size. What benefits of operating in userland outweigh these issues? >> o It is desirable for arrays to auto-assemble based on recorded >> meta-data. This includes the ability to have a user hot-insert >> a "cold spare", have the system recognize it as a spare (based >> on the meta-data resident on it) and activate it if necessary to >> restore a degraded array. > > Certainly. It doesn't follow that the auto-assembly has to happen > within the kernel. Having it all done in user-space makes it much > easier to control/configure. 
> > I think the best way to describe my attitude to auto-assembly is that > it could be needs-driven rather than availability-driven. > > needs-driven means: if the user asks to access an array that doesn't > exist, then try to find the bits and assemble it. > availability driven means: find all the devices that could be part of > an array, and combine as many of them as possible together into > arrays. > > Currently filesystems are needs-driven. At boot time, only to root > filesystem, which has been identified somehow, gets mounted. > Then the init scripts mount any others that are needed. > We don't have any hunting around for filesystem superblocks and > mounting the filesystems just in case they are needed. Are filesystems the correct analogy? Consider that a user's attempt to mount a filesystem by label requires that all of the "block devices" that might contain that filesystem be enumerated automatically by the system. In this respect, the system is treating an MD device in exactly the same way as a SCSI or IDE disk. The array must be exported to the system on an "availability basis" in order for the "needs-driven" features of the system to behave as expected. > Currently partitions are (sufficiently) needs-driven. It is true that > any partitionable devices has it's partitions presented. However the > existence of partitions does not affect access to the whole device at > all. Only once the partitions are claimed is the whole-device > blocked. This seems a slight digression from your earlier argument. Is your concern that the arrays are auto-enumerated, or that the act of enumerating them prevents the component devices from being accessed (due to bd_clam)? > Providing that auto-assembly of arrays works the same way (needs > driven), I am happy for arrays to auto-assemble. > I happen to think this most easily done in user-space. I don't know how to reconcile a needs based approach with system features that require arrays to be exported as soon as they are detected. > With DDF format metadata, there is a concept of 'imported' arrays, > which basically means arrays from some other controller that have been > attached to the current controller. > > Part of my desire for needs-driven assembly is that I don't want to > inadvertently assemble 'imported' arrays. > A DDF controller has NVRAM or a hardcoded serial number to help avoid > this. A generic Linux machine doesn't. > > I could possibly be happy with auto-assembly where a kernel parameter > of DDF=xx.yy.zz was taken to mean that we "need" to assemble all DDF > arrays that have a controler-id (or whatever it is) of xx.yy.zz. > > This is probably simple enough to live entirely in the kernel. The concept of "importing" an array doesn't really make sense in the case of MD's DDF. To fully take advantage of features like a controller BIOS's ability to natively boot an array, the disks for that domain must remain in that controller's domain. Determining the domain to assign to new arrays will require input from the user since there is limited topology information available to MD. The user will also have the ability to assign newly created arrays to the "MD Domain" which is not tied to any particular controller domain. ... > But back to your original post: I suspect there is lots of valuable > stuff in your emd patch, but as you have probably gathered, big > patches are not the way we work around here, and with good reason. 
> > If you would like to identify isolated pieces of functionality, create > patches to implement them, and submit them for review I will be quite > happy to review them and, when appropriate, forward them to > Andrew/Linus. > I suggest you start with less controversial changes and work your way > forward. One suggestion that was recently raised was to present these changes in the form of an alternate "EMD" driver to avoid any potential breakage of the existing MD. Do you have any opinion on this? -- Justin ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-23 6:23 ` Justin T. Gibbs @ 2004-03-24 2:26 ` Neil Brown 2004-03-24 19:09 ` Matt Domsch 2004-03-25 2:21 ` Jeff Garzik 0 siblings, 2 replies; 56+ messages in thread From: Neil Brown @ 2004-03-24 2:26 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-raid, linux-kernel On Monday March 22, gibbs@scsiguy.com wrote: > >> o Any successful solution will have to have "meta-data modules" for > >> active arrays "core resident" in order to be robust. This > > ... > > > I agree. > > 'Linear' and 'raid0' arrays don't really need metadata support in the > > kernel as their metadata is essentially read-only. > > There are interesting applications for raid1 without metadata, but I > > think that for all raid personalities where metadata might need to be > > updated in an error condition to preserve data integrity, the kernel > > should know enough about the metadata to perform that update. > > > > It would be nice to keep the in-kernel knowledge to a minimum, though > > some metadata formats probably make that hard. > > Can you further explain why you want to limit the kernel's knowledge > and where you would separate the roles between kernel and userland? General caution. It is generally harder the change mistakes in the kernel than it is to change mistakes in userspace, and similarly it is easer to add functionality and configurability in userspace. A design that puts the control in userspace is therefore preferred. A design that ties you to working through a narrow user-kernel interface is disliked. A design that gives easy control to user-space, and allows the kernel to do simple things simply is probably best. I'm not particularly concerned with code size and code duplication. A clean, expressive design is paramount. > 2) Solution Complexity > > Two entities understand how to read and manipulate the meta-data. > Policies and APIs must be created to ensure that only one entity > is performing operations on the meta-data at a time. This is true > even if one entity is primarily a read-only "client". For example, > a meta-data module may defer meta-data updates in some instances > (e.g. rebuild checkpointing) until the meta-data is closed (writing > the checkpoint sooner doesn't make sense considering that you should > restart your scrub, rebuild or verify if the system is not safely > shutdown). How does the userland client get the most up-to-date > information? This is just one of the problems in this area. If the kernel and userspace both need to know about metadata, then the design must make clear how they communicate. > > > Currently partitions are (sufficiently) needs-driven. It is true that > > any partitionable devices has it's partitions presented. However the > > existence of partitions does not affect access to the whole device at > > all. Only once the partitions are claimed is the whole-device > > blocked. > > This seems a slight digression from your earlier argument. Is your > concern that the arrays are auto-enumerated, or that the act of enumerating > them prevents the component devices from being accessed (due to > bd_clam)? Primarily the latter. But also that the act of enumerating them may cause an update to an underlying devices (e.g. metadata update or resync). That is what I am particularly uncomfortable about. > > > Providing that auto-assembly of arrays works the same way (needs > > driven), I am happy for arrays to auto-assemble. > > I happen to think this most easily done in user-space. 
> > I don't know how to reconcile a needs based approach with system > features that require arrays to be exported as soon as they are > detected. > Maybe if arrays were auto-assembled in a read-only mode that guaranteed not to write to the devices *at*all* and did not bd_claim them. When they are needed (either though some explicit set-writable command or through an implicit first-write) then the underlying components are bd_claimed. If that succeeds, the array becomes "live". If it fails, it stays read-only. > > > But back to your original post: I suspect there is lots of valuable > > stuff in your emd patch, but as you have probably gathered, big > > patches are not the way we work around here, and with good reason. > > > > If you would like to identify isolated pieces of functionality, create > > patches to implement them, and submit them for review I will be quite > > happy to review them and, when appropriate, forward them to > > Andrew/Linus. > > I suggest you start with less controversial changes and work your way > > forward. > > One suggestion that was recently raised was to present these changes > in the form of an alternate "EMD" driver to avoid any potential > breakage of the existing MD. Do you have any opinion on this? Choice is good. Competition is good. I would not try to interfere with you creating a new "emd" driver that didn't interfere with "md". What Linus would think of it I really don't know. It is certainly not impossible that he would accept it. However I'm not sure that having three separate device-array systems (dm, md, emd) is actually a good idea. It would probably be really good to unite md and dm somehow, but no-one seems really keen on actually doing the work. I seriously think the best long-term approach for your emd work is to get it integrated into md. I do listen to reason and I am not completely head-strong, but I do have opinions, and you would need to put in the effort to convincing me. NeilBrown ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-24 2:26 ` Neil Brown @ 2004-03-24 19:09 ` Matt Domsch 2004-03-25 2:21 ` Jeff Garzik 1 sibling, 0 replies; 56+ messages in thread From: Matt Domsch @ 2004-03-24 19:09 UTC (permalink / raw) To: Neil Brown; +Cc: Justin T. Gibbs, linux-raid, linux-kernel On Wed, Mar 24, 2004 at 01:26:47PM +1100, Neil Brown wrote: > On Monday March 22, gibbs@scsiguy.com wrote: > > One suggestion that was recently raised was to present these changes > > in the form of an alternate "EMD" driver to avoid any potential > > breakage of the existing MD. Do you have any opinion on this? > > I seriously think the best long-term approach for your emd work is to > get it integrated into md. I do listen to reason and I am not > completely head-strong, but I do have opinions, and you would need to > put in the effort to convincing me. I completely agree that long-term, md and emd need to be the same. However, watching the pain that the IDE changes took in early 2.5, I'd like to see emd be merged alongside md for the short-term while the kinks get worked out, keeping in mind the desire to merge them together again soon as that happens. Thanks, Matt -- Matt Domsch Sr. Software Engineer, Lead Engineer Dell Linux Solutions linux.dell.com & www.dell.com/linux Linux on Dell mailing lists @ http://lists.us.dell.com ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-24 2:26 ` Neil Brown 2004-03-24 19:09 ` Matt Domsch @ 2004-03-25 2:21 ` Jeff Garzik 2004-03-25 18:00 ` Kevin Corry 1 sibling, 1 reply; 56+ messages in thread From: Jeff Garzik @ 2004-03-25 2:21 UTC (permalink / raw) To: Neil Brown; +Cc: Justin T. Gibbs, linux-raid, linux-kernel Neil Brown wrote: > Choice is good. Competition is good. I would not try to interfere > with you creating a new "emd" driver that didn't interfere with "md". > What Linus would think of it I really don't know. It is certainly not > impossible that he would accept it. Agreed. Independent DM efforts have already started supporting MD raid0/1 metadata from what I understand, though these efforts don't seem to post to linux-kernel or linux-raid much at all. :/ > However I'm not sure that having three separate device-array systems > (dm, md, emd) is actually a good idea. It would probably be really > good to unite md and dm somehow, but no-one seems really keen on > actually doing the work. I would be disappointed if all the work that has gone into the MD driver is simply obsoleted by new DM targets. Particularly RAID 1/5/6. You pretty much echoed my sentiments exactly... ideally md and dm can be bound much more tightly to each other. For example, convert md's raid[0156].c into device mapper targets... but indeed, nobody has stepped up to do that so far. Jeff ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 2:21 ` Jeff Garzik @ 2004-03-25 18:00 ` Kevin Corry 2004-03-25 18:42 ` Jeff Garzik 2004-03-25 22:59 ` Justin T. Gibbs 0 siblings, 2 replies; 56+ messages in thread From: Kevin Corry @ 2004-03-25 18:00 UTC (permalink / raw) To: linux-kernel; +Cc: Jeff Garzik, Neil Brown, Justin T. Gibbs, linux-raid On Wednesday 24 March 2004 8:21 pm, Jeff Garzik wrote: > Neil Brown wrote: > > Choice is good. Competition is good. I would not try to interfere > > with you creating a new "emd" driver that didn't interfere with "md". > > What Linus would think of it I really don't know. It is certainly not > > impossible that he would accept it. > > Agreed. > > Independent DM efforts have already started supporting MD raid0/1 > metadata from what I understand, though these efforts don't seem to post > to linux-kernel or linux-raid much at all. :/ I post on lkml.....occasionally. :) I'm guessing you're referring to EVMS in that comment, since we have done *part* of what you just described. EVMS has always had a plugin to recognize MD devices, and has been using the MD driver for quite some time (along with using Device-Mapper for non-MD stuff). However, as of our most recent release (earlier this month), we switched to using Device-Mapper for MD RAID-linear and RAID-0 devices. Device-Mapper has always had a "linear" and a "striped" module (both required to support LVM volumes), and it was a rather trivial exercise to switch to activating these RAID devices using DM instead of MD. This decision was not based on any real dislike of the MD driver, but rather for the benefits that are gained by using Device-Mapper. In particular, Device-Mapper provides the ability to change out the device mapping on the fly, by temporarily suspending I/O, changing the table, and resuming the I/O I'm sure many of you know this already. But I'm not sure everyone fully understands how powerful a feature this is. For instance, it means EVMS can now expand RAID-linear devices online. While that particular example may not sound all that exciting, if things like RAID-1 and RAID-5 were "ported" to Device-Mapper, this feature would then allow you to do stuff like add new "active" members to a RAID-1 online (think changing from 2-way mirror to 3-way mirror). It would be possible to convert from RAID-0 to RAID-4 online simply by adding a new disk (assuming other limitations, e.g. a single stripe-zone). Unfortunately, these are things the MD driver can't do online, because you need to completely stop the MD device before making such changes (to prevent the kernel and user-space from trampling on the same metadata), and MD won't stop the device if it's open (i.e. if it's mounted or if you have other device (LVM) built on top of MD). Often times this means you need to boot to a rescue-CD to make these types of configuration changes. As for not posting this information on lkml and/or linux-raid, I do apologize if this is something you would like to have been informed of. Most of the recent mentions of EVMS on this list seem to fall on deaf ears, so I've taken that to mean the folks on the list aren't terribly interested in EVMS developments. And since EVMS is a completely user-space tool and this decision didn't affect any kernel components, I didn't think it was really relevent to mention here. We usually discuss such things on evms-devel@lists.sf.net or dm-devel@redhat.com, but I'll be happy to cross-post to lkml more often if it's something that might be pertinent. 
> > However I'm not sure that having three separate device-array systems > > (dm, md, emd) is actually a good idea. It would probably be really > > good to unite md and dm somehow, but no-one seems really keen on > > actually doing the work. > > I would be disappointed if all the work that has gone into the MD driver > is simply obsoleted by new DM targets. Particularly RAID 1/5/6. > > You pretty much echoed my sentiments exactly... ideally md and dm can > be bound much more tightly to each other. For example, convert md's > raid[0156].c into device mapper targets... but indeed, nobody has > stepped up to do that so far. We're obviously pretty keen on seeing MD and Device-Mapper "merge" at some point in the future, primarily for some of the reasons I mentioned above. Obviously linear.c and raid0.c don't really need to be ported. DM provides equivalent functionality, the discovery/activation can be driven from user-space, and no in-kernel status updating is necessary (unlike RAID-1 and -5). And we've talked for a long time about wanting to port RAID-1 and RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't started on any such work, or even had any significant discussions about *how* to do it. I can't imagine we would try this without at least involving Neil and other folks from linux-raid, since it would be nice to actually reuse as much of the existing MD code as possible (especially for RAID-5 and -6). I have no desire to try to rewrite those from scratch. Device-Mapper does currently contain a mirroring module (still just in Joe's -udm tree), which has primarily been used to provide online-move functionality in LVM2 and EVMS. They've recently added support for persistent logs, so it's possible for a mirror to survive a reboot. Of course, MD RAID-1 has some additional requirements for updating status in its superblock at runtime. I'd hope that in porting RAID-1 to DM, the core of the DM mirroring module could still be used, with the possibility of either adding MD-RAID-1-specific information to the persistent-log module, or simply as an additional log type. So, if this is the direction everyone else would like to see MD and DM take, we'd be happy to help out. -- Kevin Corry kevcorry@us.ibm.com http://evms.sourceforge.net/ ^ permalink raw reply [flat|nested] 56+ messages in thread
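For reference, the suspend/reload/resume cycle Kevin describes above looks roughly like the following through libdevmapper (the library LVM2 and EVMS use to drive Device-Mapper); error handling is omitted, and the device name and linear-target parameters are invented for illustration:

    #include <libdevmapper.h>

    /* Replace the live table of "array0" with a two-segment linear map
     * (sizes are in 512-byte sectors). */
    static void replace_table(void)
    {
        struct dm_task *dmt;

        /* Load the new mapping into the device's inactive table slot. */
        dmt = dm_task_create(DM_DEVICE_RELOAD);
        dm_task_set_name(dmt, "array0");
        dm_task_add_target(dmt, 0,       2097152, "linear", "/dev/sda 0");
        dm_task_add_target(dmt, 2097152, 2097152, "linear", "/dev/sdb 0");
        dm_task_run(dmt);
        dm_task_destroy(dmt);

        /* Quiesce I/O on the currently live table. */
        dmt = dm_task_create(DM_DEVICE_SUSPEND);
        dm_task_set_name(dmt, "array0");
        dm_task_run(dmt);
        dm_task_destroy(dmt);

        /* Resume: the inactive table becomes live and I/O continues. */
        dmt = dm_task_create(DM_DEVICE_RESUME);
        dm_task_set_name(dmt, "array0");
        dm_task_run(dmt);
        dm_task_destroy(dmt);
    }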
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:00 ` Kevin Corry @ 2004-03-25 18:42 ` Jeff Garzik 2004-03-25 18:48 ` Jeff Garzik ` (3 more replies) 2004-03-25 22:59 ` Justin T. Gibbs 1 sibling, 4 replies; 56+ messages in thread From: Jeff Garzik @ 2004-03-25 18:42 UTC (permalink / raw) To: Kevin Corry; +Cc: linux-kernel, Neil Brown, Justin T. Gibbs, linux-raid Kevin Corry wrote: > I'm guessing you're referring to EVMS in that comment, since we have done > *part* of what you just described. EVMS has always had a plugin to recognize > MD devices, and has been using the MD driver for quite some time (along with > using Device-Mapper for non-MD stuff). However, as of our most recent release > (earlier this month), we switched to using Device-Mapper for MD RAID-linear > and RAID-0 devices. Device-Mapper has always had a "linear" and a "striped" > module (both required to support LVM volumes), and it was a rather trivial > exercise to switch to activating these RAID devices using DM instead of MD. nod > This decision was not based on any real dislike of the MD driver, but rather > for the benefits that are gained by using Device-Mapper. In particular, > Device-Mapper provides the ability to change out the device mapping on the > fly, by temporarily suspending I/O, changing the table, and resuming the I/O > I'm sure many of you know this already. But I'm not sure everyone fully > understands how powerful a feature this is. For instance, it means EVMS can > now expand RAID-linear devices online. While that particular example may not [...] Sounds interesting but is mainly an implementation detail for the purposes of this discussion... Some of this emd may want to use, for example. > As for not posting this information on lkml and/or linux-raid, I do apologize > if this is something you would like to have been informed of. Most of the > recent mentions of EVMS on this list seem to fall on deaf ears, so I've taken > that to mean the folks on the list aren't terribly interested in EVMS > developments. And since EVMS is a completely user-space tool and this > decision didn't affect any kernel components, I didn't think it was really > relevent to mention here. We usually discuss such things on > evms-devel@lists.sf.net or dm-devel@redhat.com, but I'll be happy to > cross-post to lkml more often if it's something that might be pertinent. Understandable... for the stuff that impacts MD some mention of the work, on occasion, to linux-raid and/or linux-kernel would be useful. I'm mainly looking at it from a standpoint of making sure that all the various RAID efforts are not independent of each other. > We're obviously pretty keen on seeing MD and Device-Mapper "merge" at some > point in the future, primarily for some of the reasons I mentioned above. > Obviously linear.c and raid0.c don't really need to be ported. DM provides > equivalent functionality, the discovery/activation can be driven from > user-space, and no in-kernel status updating is necessary (unlike RAID-1 and > -5). And we've talked for a long time about wanting to port RAID-1 and RAID-5 > (and now RAID-6) to Device-Mapper targets, but we haven't started on any such > work, or even had any significant discussions about *how* to do it. I can't let's have that discussion :) > imagine we would try this without at least involving Neil and other folks > from linux-raid, since it would be nice to actually reuse as much of the > existing MD code as possible (especially for RAID-5 and -6). 
I have no desire > to try to rewrite those from scratch. <cheers> > Device-Mapper does currently contain a mirroring module (still just in Joe's > -udm tree), which has primarily been used to provide online-move > functionality in LVM2 and EVMS. They've recently added support for persistent > logs, so it's possible for a mirror to survive a reboot. Of course, MD RAID-1 > has some additional requirements for updating status in its superblock at > runtime. I'd hope that in porting RAID-1 to DM, the core of the DM mirroring > module could still be used, with the possibility of either adding > MD-RAID-1-specific information to the persistent-log module, or simply as an > additional log type. WRT specific implementation, I would hope for the reverse -- that the existing, known, well-tested MD raid1 code would be used. But perhaps that's a naive impression... Folks with more knowledge of the implementation can make that call better than I. I'd like to focus on the "additional requirements" you mention, as I think that is a key area for consideration. There is a certain amount of metadata that -must- be updated at runtime, as you recognize. Over and above what MD already cares about, DDF and its cousins introduce more items along those lines: event logs, bad sector logs, controller-level metadata... these are some of the areas I think Justin/Scott are concerned about. My take on things... the configuration of RAID arrays got a lot more complex with DDF and "host RAID" in general. Association of RAID arrays based on specific hardware controllers. Silently building RAID0+1 stacked arrays out of non-RAID block devices the kernel presents. Failing over when one of the drives the kernel presents does not respond. All that just screams "do it in userland". OTOH, once the devices are up and running, kernel needs update some of that configuration itself. Hot spare lists are an easy example, but any time the state of the overall RAID array changes, some host RAID formats, more closely tied to hardware than MD, may require configuration metadata changes when some hardware condition(s) change. I respectfully disagree with the EMD folks that a userland approach is impossible, given all the failure scenarios. In a userland approach, there -will- be some duplicated metadata-management code between userland and the kernel. But for configuration _above_ the single-raid-array level, I think that's best left to userspace. There will certainly be a bit of intra-raid-array management code in the kernel, including configuration updating. I agree to its necessity... but that doesn't mean that -all- configuration/autorun stuff needs to be in the kernel. Jeff ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:42 ` Jeff Garzik @ 2004-03-25 18:48 ` Jeff Garzik 2004-03-25 23:46 ` Justin T. Gibbs 2004-03-25 22:04 ` Lars Marowsky-Bree ` (2 subsequent siblings) 3 siblings, 1 reply; 56+ messages in thread From: Jeff Garzik @ 2004-03-25 18:48 UTC (permalink / raw) To: linux-kernel; +Cc: Kevin Corry, Neil Brown, Justin T. Gibbs, linux-raid Jeff Garzik wrote: > My take on things... the configuration of RAID arrays got a lot more > complex with DDF and "host RAID" in general. Association of RAID arrays > based on specific hardware controllers. Silently building RAID0+1 > stacked arrays out of non-RAID block devices the kernel presents. > Failing over when one of the drives the kernel presents does not respond. > > All that just screams "do it in userland". Just so there is no confusion... the "failing over...in userland" thing I mention is _only_ during discovery of the root disk. Similar code would need to go into the bootloader, for controllers that do not present the entire RAID array as a faked BIOS INT drive. Jeff ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:48 ` Jeff Garzik @ 2004-03-25 23:46 ` Justin T. Gibbs 2004-03-26 0:01 ` Jeff Garzik 0 siblings, 1 reply; 56+ messages in thread From: Justin T. Gibbs @ 2004-03-25 23:46 UTC (permalink / raw) To: Jeff Garzik, linux-kernel; +Cc: Kevin Corry, Neil Brown, linux-raid > Jeff Garzik wrote: > > Just so there is no confusion... the "failing over...in userland" thing I > mention is _only_ during discovery of the root disk. None of the solutions being talked about perform "failing over" in userland. The RAID transforms which perform this operation are kernel resident in DM, MD, and EMD. Perhaps you are talking about spare activation and rebuild? > Similar code would need to go into the bootloader, for controllers that do > not present the entire RAID array as a faked BIOS INT drive. None of the solutions presented here are attempting to make RAID transforms operate from the boot loader environment without BIOS support. I see this as a completely tangental problem to what is being discussed. -- Justin ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 23:46 ` Justin T. Gibbs @ 2004-03-26 0:01 ` Jeff Garzik 2004-03-26 0:10 ` Justin T. Gibbs 0 siblings, 1 reply; 56+ messages in thread From: Jeff Garzik @ 2004-03-26 0:01 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-kernel, Kevin Corry, Neil Brown, linux-raid Justin T. Gibbs wrote: >>Jeff Garzik wrote: >> >>Just so there is no confusion... the "failing over...in userland" thing I >>mention is _only_ during discovery of the root disk. > > > None of the solutions being talked about perform "failing over" in > userland. The RAID transforms which perform this operation are kernel > resident in DM, MD, and EMD. Perhaps you are talking about spare > activation and rebuild? This is precisely why I sent the second email, and made the qualification I did :) For a "do it in userland" solution, an initrd or initramfs piece examines the system configuration, and assembles physical disks into RAID arrays based on the information it finds. I was mainly implying that an initrd solution would have to provide some primitive failover initially, before the kernel is bootstrapped... much like a bootloader that supports booting off a RAID1 array would need to do. Jeff ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 0:01 ` Jeff Garzik @ 2004-03-26 0:10 ` Justin T. Gibbs 2004-03-26 0:14 ` Jeff Garzik 0 siblings, 1 reply; 56+ messages in thread From: Justin T. Gibbs @ 2004-03-26 0:10 UTC (permalink / raw) To: Jeff Garzik; +Cc: linux-kernel, Kevin Corry, Neil Brown, linux-raid >> None of the solutions being talked about perform "failing over" in >> userland. The RAID transforms which perform this operation are kernel >> resident in DM, MD, and EMD. Perhaps you are talking about spare >> activation and rebuild? > > This is precisely why I sent the second email, and made the qualification > I did :) > > For a "do it in userland" solution, an initrd or initramfs piece examines > the system configuration, and assembles physical disks into RAID arrays > based on the information it finds. I was mainly implying that an initrd > solution would have to provide some primitive failover initially, before > the kernel is bootstrapped... much like a bootloader that supports booting > off a RAID1 array would need to do. "Failover" (i.e. redirecting a read to a viable member) will not occur via userland at all. The initrd solution just has to present all available members to the kernel interface performing the RAID transform. There is no need for "special failover handling" during bootstrap in either case. -- Justin ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 0:10 ` Justin T. Gibbs @ 2004-03-26 0:14 ` Jeff Garzik 0 siblings, 0 replies; 56+ messages in thread From: Jeff Garzik @ 2004-03-26 0:14 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-kernel, Kevin Corry, Neil Brown, linux-raid Justin T. Gibbs wrote: >>>None of the solutions being talked about perform "failing over" in >>>userland. The RAID transforms which perform this operation are kernel >>>resident in DM, MD, and EMD. Perhaps you are talking about spare >>>activation and rebuild? >> >>This is precisely why I sent the second email, and made the qualification >>I did :) >> >>For a "do it in userland" solution, an initrd or initramfs piece examines >>the system configuration, and assembles physical disks into RAID arrays >>based on the information it finds. I was mainly implying that an initrd >>solution would have to provide some primitive failover initially, before >>the kernel is bootstrapped... much like a bootloader that supports booting >>off a RAID1 array would need to do. > > > "Failover" (i.e. redirecting a read to a viable member) will not occur > via userland at all. The initrd solution just has to present all available > members to the kernel interface performing the RAID transform. There > is no need for "special failover handling" during bootstrap in either > case. hmmm, yeah, agreed. Jeff ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:42 ` Jeff Garzik 2004-03-25 18:48 ` Jeff Garzik @ 2004-03-25 22:04 ` Lars Marowsky-Bree 2004-03-26 19:19 ` Kevin Corry 2004-03-25 23:35 ` Justin T. Gibbs 2004-03-26 19:15 ` Kevin Corry 3 siblings, 1 reply; 56+ messages in thread From: Lars Marowsky-Bree @ 2004-03-25 22:04 UTC (permalink / raw) To: Jeff Garzik, Kevin Corry Cc: linux-kernel, Neil Brown, Justin T. Gibbs, linux-raid On 2004-03-25T13:42:12, Jeff Garzik <jgarzik@pobox.com> said: > >and -5). And we've talked for a long time about wanting to port RAID-1 and > >RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't started > >on any such work, or even had any significant discussions about *how* to > >do it. I can't > let's have that discussion :) Nice 2.7 material, and parts I've always wanted to work on. (Including making the entire partition scanning user-space on top of DM too.) KS material? > My take on things... the configuration of RAID arrays got a lot more > complex with DDF and "host RAID" in general. And then add all the other stuff, like scenarios where half of your RAID is "somewhere" on the network via nbd, iSCSI or whatever and all the other possible stackings... Definetely user-space material, and partly because it /needs/ to have the input from the volume managers to do the sane things. The point about this implying that the superblock parsing/updating logic needs to be duplicated between userspace and kernel land is valid too though, and I'm keen on resolving this in a way which doesn't suck... Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs | try again. fail again. fail better. Research & Development, SUSE LINUX AG \ -- Samuel Beckett - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 22:04 ` Lars Marowsky-Bree @ 2004-03-26 19:19 ` Kevin Corry 2004-03-31 17:07 ` Randy.Dunlap 0 siblings, 1 reply; 56+ messages in thread From: Kevin Corry @ 2004-03-26 19:19 UTC (permalink / raw) To: linux-kernel Cc: Lars Marowsky-Bree, Jeff Garzik, Neil Brown, Justin T. Gibbs, linux-raid On Thursday 25 March 2004 4:04 pm, Lars Marowsky-Bree wrote: > On 2004-03-25T13:42:12, > > Jeff Garzik <jgarzik@pobox.com> said: > > >and -5). And we've talked for a long time about wanting to port RAID-1 > > > and RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't > > > started on any such work, or even had any significant discussions about > > > *how* to do it. I can't > > > > let's have that discussion :) > > Nice 2.7 material, and parts I've always wanted to work on. (Including > making the entire partition scanning user-space on top of DM too.) Couldn't agree more. Whether using EVMS or kpartx or some other tool, I think we've already proved this is possible. We really only need to work on making early-userspace a little easier to use. > KS material? Sounds good to me. -- Kevin Corry kevcorry@us.ibm.com http://evms.sourceforge.net/ ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 19:19 ` Kevin Corry @ 2004-03-31 17:07 ` Randy.Dunlap 0 siblings, 0 replies; 56+ messages in thread From: Randy.Dunlap @ 2004-03-31 17:07 UTC (permalink / raw) To: Kevin Corry; +Cc: linux-kernel, lmb, jgarzik, neilb, gibbs, linux-raid On Fri, 26 Mar 2004 13:19:28 -0600 Kevin Corry wrote: | On Thursday 25 March 2004 4:04 pm, Lars Marowsky-Bree wrote: | > On 2004-03-25T13:42:12, | > | > Jeff Garzik <jgarzik@pobox.com> said: | > > >and -5). And we've talked for a long time about wanting to port RAID-1 | > > > and RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't | > > > started on any such work, or even had any significant discussions about | > > > *how* to do it. I can't | > > | > > let's have that discussion :) | > | > Nice 2.7 material, and parts I've always wanted to work on. (Including | > making the entire partition scanning user-space on top of DM too.) | | Couldn't agree more. Whether using EVMS or kpartx or some other tool, I think | we've already proved this is possible. We really only need to work on making | early-userspace a little easier to use. | | > KS material? | | Sounds good to me. Ditto. I didn't see much conclusion to this thread, other than Neil's good suggestions. (maybe on some other list that I don't read?) I wouldn't want this or any other projects to have to wait for the kernel summit. Email has worked well for many years...let's try to keep it working. :) -- ~Randy "You can't do anything without having to do something else first." -- Belefant's Law ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:42 ` Jeff Garzik 2004-03-25 18:48 ` Jeff Garzik 2004-03-25 22:04 ` Lars Marowsky-Bree @ 2004-03-25 23:35 ` Justin T. Gibbs 2004-03-26 0:13 ` Jeff Garzik 2004-03-26 19:15 ` Kevin Corry 3 siblings, 1 reply; 56+ messages in thread From: Justin T. Gibbs @ 2004-03-25 23:35 UTC (permalink / raw) To: Jeff Garzik, Kevin Corry; +Cc: linux-kernel, Neil Brown, linux-raid > I respectfully disagree with the EMD folks that a userland approach is > impossible, given all the failure scenarios. I've never said that it was impossible, just unwise. I believe that a userland approach offers no benefit over allowing the kernel to perform all meta-data operations. The end result of such an approach (given feature and robustness parity with the EMD solution) is a larger resident side, code duplication, and more complicated configuration/management interfaces. -- Justin ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 23:35 ` Justin T. Gibbs @ 2004-03-26 0:13 ` Jeff Garzik 2004-03-26 17:43 ` Justin T. Gibbs 0 siblings, 1 reply; 56+ messages in thread From: Jeff Garzik @ 2004-03-26 0:13 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid Justin T. Gibbs wrote: >>I respectfully disagree with the EMD folks that a userland approach is >>impossible, given all the failure scenarios. > > > I've never said that it was impossible, just unwise. I believe > that a userland approach offers no benefit over allowing the kernel > to perform all meta-data operations. The end result of such an > approach (given feature and robustness parity with the EMD solution) > is a larger resident side, code duplication, and more complicated > configuration/management interfaces. There is some code duplication, yes. But the right userspace solution does not have a larger RSS, and has _less_ complicated management interfaces. A key benefit of "do it in userland" is a clear gain in flexibility, simplicity, and debuggability (if that's a word). But it's hard. It requires some deep thinking. It's a whole lot easier to do everything in the kernel -- but that doesn't offer you the protections of userland, particularly separate address spaces from the kernel, and having to try harder to crash the kernel. :) Jeff ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 0:13 ` Jeff Garzik @ 2004-03-26 17:43 ` Justin T. Gibbs 2004-03-28 0:06 ` Lincoln Dale 2004-03-28 0:30 ` Jeff Garzik 0 siblings, 2 replies; 56+ messages in thread From: Justin T. Gibbs @ 2004-03-26 17:43 UTC (permalink / raw) To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid >>> I respectfully disagree with the EMD folks that a userland approach is >>> impossible, given all the failure scenarios. >> >> >> I've never said that it was impossible, just unwise. I believe >> that a userland approach offers no benefit over allowing the kernel >> to perform all meta-data operations. The end result of such an >> approach (given feature and robustness parity with the EMD solution) >> is a larger resident side, code duplication, and more complicated >> configuration/management interfaces. > > There is some code duplication, yes. But the right userspace solution > does not have a larger RSS, and has _less_ complicated management > interfaces. > > A key benefit of "do it in userland" is a clear gain in flexibility, > simplicity, and debuggability (if that's a word). This is just as much hand waving as, 'All that just screams "do it in userland".' <sigh> I posted a rather detailed, technical, analysis of what I believe would be required to make this work correctly using a userland approach. The only response I've received is from Neil Brown. Please, point out, in a technical fashion, how you would address the feature set being proposed: o Rebuilds o Auto-array enumeration o Meta-data updates for topology changes (failed members, spare activation) o Meta-data updates for "safe mode" o Array creation/deletion o "Hot member addition" Only then can a true comparative analysis of which solution is "less complex", "more maintainable", and "smaller" be performed. > But it's hard. It requires some deep thinking. It's a whole lot easier > to do everything in the kernel -- but that doesn't offer you the > protections of userland, particularly separate address spaces from the > kernel, and having to try harder to crash the kernel. :) A crash in any component of a RAID solution that prevents automatic failover and rebuilds without customer intervention is unacceptable. Whether it crashes your kernel or not is really not that important other than the customer will probably notice that their data is no longer protected *sooner* if the system crashes. In other-words, the solution must be *correct* regardless of where it resides. Saying that doing a portion of it in userland allows it to safely be buggier seems a very strange argument. -- Justin ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 17:43 ` Justin T. Gibbs @ 2004-03-28 0:06 ` Lincoln Dale 2004-03-30 17:54 ` Justin T. Gibbs 2004-03-28 0:30 ` Jeff Garzik 1 sibling, 1 reply; 56+ messages in thread From: Lincoln Dale @ 2004-03-28 0:06 UTC (permalink / raw) To: Justin T. Gibbs Cc: Jeff Garzik, Kevin Corry, linux-kernel, Neil Brown, linux-raid At 03:43 AM 27/03/2004, Justin T. Gibbs wrote: >I posted a rather detailed, technical, analysis of what I believe would >be required to make this work correctly using a userland approach. The >only response I've received is from Neil Brown. Please, point out, in >a technical fashion, how you would address the feature set being proposed: i'll have a go. your position is one of "put it all in the kernel". Jeff, Neil, Kevin et al is one of "it can live in userspace". to that end, i agree with the userspace approach. the way i personally believe that it SHOULD happen is that you tie your metadata format (and RAID format, if its different to others) into DM. you boot up using an initrd where you can start some form of userspace management daemon from initrd. you can have your binary (userspace) tools started from initrd which can populate the tables for all disks/filesystems, including pivoting to a new root filesystem if need-be. the only thing your BIOS/int13h redirection needs to do is be able to provide sufficient information to be capable of loading the kernel and the initial ramdisk. perhaps that means that you guys could provide enhancements to grub/lilo if they are insufficient for things like finding a secondary copy of initrd/vmlinuz. (if such issues exist, wouldn't it be better to do things the "open source way" and help improve the overall tools, if the end goal ends up being the same: enabling YOUR system to work better?) moving forward, perhaps initrd will be deprecated in favour of initramfs - but until then, there isn't any downside to this approach that i can see. with all this in mind, and the basic premise being that as a minimum, the kernel has booted, and initrd is working then answering your other points: > o Rebuilds userspace is running. rebuilds are simply a process of your userspace tools recognising that there are disk groups in a inconsistent state, and don't bring them online, but rather, do whatever is necessary to rebuild them. nothing says that you cannot have a KERNEL-space 'helper' to help do the rebuild.. > o Auto-array enumeration your userspace tool can receive notification (via udev/hotplug) when new disks/devices appear. from there, your userspace tool can read whatever metadata exists on the disk, and use that to enumerate whatever block devices exist. perhaps DM needs some hooks to be able to do this - but i believe that the DM v4 ioctls cover this already. > o Meta-data updates for topology changes (failed members, spare activation) a failed member may be as a result of a disk being pulled out. for such an event, udev/hotplug should tell your userspace daemon. a failed member may be as a result of lots of I/O errors. perhaps there is work needed in the linux block layer to indicate some form of hotplug event such as 'excessive errors', perhaps its something needed in the DM layer. in either case, it isn't out of the question that userspace can be notified. for a "spare activation", once again, that can be done entirely from userspace. > o Meta-data updates for "safe mode" seems implementation specific to me. 
> o Array creation/deletion the short answer here is "how does one create or remove DM/LVM/MD partitions today?" it certainly isn't in the kernel ... > o "Hot member addition" this should also be possible today. i haven't looked too closely at whether there are sufficient interfaces for quiescence of I/O or not - but once again, if not, why not implement something that can be used for all? >Only then can a true comparative analysis of which solution is "less >complex", "more maintainable", and "smaller" be performed. there may be fewer lines of code involved in "entirely in kernel" for YOUR hardware -- but what about when 4 other storage vendors come out with such a card? what if someone wants to use your card in conjunction with the storage being multipathed or replicated automatically? what about when someone wants to create snapshots for backups? all that functionality has to then go into your EMD driver. Adaptec may decide all that is too hard -- at which point, your product may become obsolete as the storage paradigms have moved beyond what your EMD driver is capable of. if you could tie it into DM -- which i believe to be the de facto path forward for lots of this cool functionality -- you gain this kind of functionality gratis -- or at least with minimal effort to integrate. better yet, Linux as a whole benefits from your involvement -- your time/effort isn't put into something specific to your hardware -- but rather your time/effort is put into something that can be used by all. this conversation really sounds like the same one you had with James about the SCSI Mid layer and why you just have to bypass items there and do your own proprietary things. in summary, i don't believe you should be focussing on a short-term view of "but it's more lines of code", but rather a more big-picture view of "overall, there will be FEWER lines of code" and "it will fit better into the overall device-mapper/block-remapper functionality" within the kernel. cheers, lincoln. ^ permalink raw reply [flat|nested] 56+ messages in thread
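A rough sketch of the notification path Lincoln describes earlier in this message: in the 2.6-era hotplug scheme the kernel runs /sbin/hotplug (or, later, a udev rule) with the subsystem name in argv[1] and ACTION/DEVPATH in the environment, so a tiny agent can forward block "add"/"remove" events to a management daemon, which can then read the new disk's metadata and decide whether an array can be assembled. The daemon and its socket path are invented for this illustration.

```c
/*
 * Hypothetical hotplug agent: forwards block-device add/remove events to a
 * userspace RAID management daemon.  The daemon and its socket path
 * (/var/run/raidmgrd.sock) are made up for this example.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(int argc, char **argv)
{
	const char *action = getenv("ACTION");    /* "add", "remove", ...     */
	const char *devpath = getenv("DEVPATH");  /* sysfs path of the device */
	struct sockaddr_un addr;
	char msg[512];
	int fd;

	/* Only block-subsystem events are interesting here. */
	if (argc < 2 || strcmp(argv[1], "block") != 0 || !action || !devpath)
		return 0;

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return 1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strncpy(addr.sun_path, "/var/run/raidmgrd.sock", sizeof(addr.sun_path) - 1);

	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
		/* The daemon re-reads metadata and assembles arrays as needed. */
		snprintf(msg, sizeof(msg), "%s %s\n", action, devpath);
		write(fd, msg, strlen(msg));
	}
	close(fd);
	return 0;
}
```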
* Re: "Enhanced" MD code avaible for review 2004-03-28 0:06 ` Lincoln Dale @ 2004-03-30 17:54 ` Justin T. Gibbs 0 siblings, 0 replies; 56+ messages in thread From: Justin T. Gibbs @ 2004-03-30 17:54 UTC (permalink / raw) To: Lincoln Dale Cc: Jeff Garzik, Kevin Corry, linux-kernel, Neil Brown, linux-raid > At 03:43 AM 27/03/2004, Justin T. Gibbs wrote: >> I posted a rather detailed, technical, analysis of what I believe would >> be required to make this work correctly using a userland approach. The >> only response I've received is from Neil Brown. Please, point out, in >> a technical fashion, how you would address the feature set being proposed: > > i'll have a go. > > your position is one of "put it all in the kernel". > Jeff, Neil, Kevin et al is one of "it can live in userspace". Please don't misrepresent or over simplify my statements. What I have said is that meta-data reading and writing should occur in only one place. Since, as has already been acknowledged by many, meta-data updates are required in the kernel, that means this support should be handled in the kernel. Any other solution adds complexity and size to the solution. > to that end, i agree with the userspace approach. > the way i personally believe that it SHOULD happen is that you tie > your metadata format (and RAID format, if its different to others) into DM. Saying how you think something should happen without any technical argument for it, doesn't help me to understand the benefits of your approach. ... > perhaps that means that you guys could provide enhancements to grub/lilo > if they are insufficient for things like finding a secondary copy of > initrd/vmlinuz. (if such issues exist, wouldn't it be better to do things > the "open source way" and help improve the overall tools, if the end goal > ends up being the same: enabling YOUR system to work better?) I don't understand your argument. We have improved an already existing opensource driver to provide this functionality. This is not the OpenSource way? > then answering your other points: Again, you have presented strategies that may or may not work, but no technical arguments for their superiority over placing meta-data in the kernel. > there may be less lines of code involved in "entirely in kernel" for YOUR > hardware -- but what about when 4 other storage vendors come out with such > a card? There will be less lines of code total for any vendor that decides to add a new meta-data type. All the vendor has to do is provide a meta-data module. There are no changes to the userland utilities (they know nothing about specific meta-data formats), to the RAID transform modules, or to the core of EMD. If this were not the case, there would be little point to the EMD work. > what if someone wants to use your card in conjunction with the storage > being multipathed or replicated automatically? > what about when someone wants to create snapshots for backups? > > all that functionality has to then go into your EMD driver. No. DM already works on any block device exported to the kernel. EMD exports its devices as block devices. Thus, all of the DM functionality you are talking about is also available for EMD. -- Justin ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 17:43 ` Justin T. Gibbs 2004-03-28 0:06 ` Lincoln Dale @ 2004-03-28 0:30 ` Jeff Garzik 1 sibling, 0 replies; 56+ messages in thread From: Jeff Garzik @ 2004-03-28 0:30 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid Justin T. Gibbs wrote: > o Rebuilds > 90% kernel, AFAICS, otherwise you have races with requests that the driver is actively satisfying > o Auto-array enumeration userspace > o Meta-data updates for "safe mode" unsure of the definition of safe mode > o Array creation/deletion of entire arrays? can mostly be done in userspace, but deletion also needs to update controller-wide metadata, which might be stored on active arrays. > o "Hot member addition" userspace prepares, kernel completes [moved this down in your list] > o Meta-data updates for topology changes (failed members, spare activation) [warning: this is a tangent from the userspace sub-thread/topic] the kernel, of course, must manage topology, otherwise things Don't Get Done, and requests don't do where they should. :) Part of the value of device mapper is that it provides container objects for multi-disk groups, and a common method of messing around with those container objects. You clearly recognized the same need in emd... but I don't think we want two different pieces of code doing the same basic thing. I do think that metadata management needs to be fairly cleanly separately (I like what emd did, there) such that a user needs three in-kernel pieces: * device mapper * generic raid1 engine * personality module "personality" would be where the specifics of the metadata management lived, and it would be responsible for handling the specifics of non-hot-path events that nonetheless still need to be in the kernel. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:42 ` Jeff Garzik ` (2 preceding siblings ...) 2004-03-25 23:35 ` Justin T. Gibbs @ 2004-03-26 19:15 ` Kevin Corry 2004-03-26 20:45 ` Justin T. Gibbs 3 siblings, 1 reply; 56+ messages in thread From: Kevin Corry @ 2004-03-26 19:15 UTC (permalink / raw) To: linux-kernel; +Cc: Jeff Garzik, Neil Brown, Justin T. Gibbs, linux-raid On Thursday 25 March 2004 12:42 pm, Jeff Garzik wrote: > > We're obviously pretty keen on seeing MD and Device-Mapper "merge" at > > some point in the future, primarily for some of the reasons I mentioned > > above. Obviously linear.c and raid0.c don't really need to be ported. DM > > provides equivalent functionality, the discovery/activation can be driven > > from user-space, and no in-kernel status updating is necessary (unlike > > RAID-1 and -5). And we've talked for a long time about wanting to port > > RAID-1 and RAID-5 (and now RAID-6) to Device-Mapper targets, but we > > haven't started on any such work, or even had any significant discussions > > about *how* to do it. I can't > > let's have that discussion :) Great! Where do we begin? :) > I'd like to focus on the "additional requirements" you mention, as I > think that is a key area for consideration. > > There is a certain amount of metadata that -must- be updated at runtime, > as you recognize. Over and above what MD already cares about, DDF and > its cousins introduce more items along those lines: event logs, bad > sector logs, controller-level metadata... these are some of the areas I > think Justin/Scott are concerned about. I'm sure these things could be accomodated within DM. Nothing in DM prevents having some sort of in-kernel metadata knowledge. In fact, other DM modules already do - dm-snapshot and the above mentioned dm-mirror both need to do some amount of in-kernel status updating. But I see this as completely separate from in-kernel device discovery (which we seem to agree is the wrong direction). And IMO, well designed metadata will make this "split" very obvious, so it's clear which parts of the metadata the kernel can use for status, and which parts are purely for identification (which the kernel thus ought to be able to ignore). The main point I'm trying to get across here is that DM provides a simple yet extensible kernel framework for a variety of storage management tasks, including a lot more than just RAID. I think it would be a huge benefit for the RAID drivers to make use of this framework to provide functionality beyond what is currently available. > My take on things... the configuration of RAID arrays got a lot more > complex with DDF and "host RAID" in general. Association of RAID arrays > based on specific hardware controllers. Silently building RAID0+1 > stacked arrays out of non-RAID block devices the kernel presents. By this I assume you mean RAID devices that don't contain any type of on-disk metadata (e.g. MD superblocks). I don't see this as a huge hurdle. As long as the device drivers (SCIS, IDE, etc) export the necessary identification info through sysfs, user-space tools can contain the policies necessary to allow them to detect which disks belong together in a RAID device, and then tell the kernel to activate said RAID device. This sounds a lot like how Christophe Varoqui has been doing things in his new multipath tools. > Failing over when one of the drives the kernel presents does not respond. > > All that just screams "do it in userland". 
> > OTOH, once the devices are up and running, kernel needs update some of > that configuration itself. Hot spare lists are an easy example, but any > time the state of the overall RAID array changes, some host RAID > formats, more closely tied to hardware than MD, may require > configuration metadata changes when some hardware condition(s) change. Certainly. Of course, I see things like adding and removing hot-spares and removing stale/faulty disks as something that can be driven from user-space. For example, for adding a new hot-spare, with DM it's as simple as loading a new mapping that contains the new disk, then telling DM to switch the device mapping (which implies a suspend/resume of I/O). And if necessary, such a user-space tool can be activated by hotplug events triggered by the insertion of a new disk into the system, making the process effectively transparent to the user. -- Kevin Corry kevcorry@us.ibm.com http://evms.sourceforge.net/ ^ permalink raw reply [flat|nested] 56+ messages in thread
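A minimal sketch of the "load a new mapping, then switch" flow Kevin describes, written against libdevmapper. The device name, sizes, and the dm-mirror table parameters below are assumptions for illustration only and would need to be checked against the mirror target version in the running kernel.

```c
/*
 * Sketch only: swap in a new 3-way mirror table for an existing mapped
 * device using libdevmapper.  "vol0", the size, and the table string are
 * invented; the dm-mirror argument format must match the running kernel.
 */
#include <stdint.h>
#include <libdevmapper.h>

static int run_task(int type, const char *dev, const char *ttype,
		    uint64_t len, const char *params)
{
	struct dm_task *dmt = dm_task_create(type);
	int ok = 0;

	if (!dmt)
		return 0;
	if (dm_task_set_name(dmt, dev) &&
	    (!ttype || dm_task_add_target(dmt, 0, len, ttype, params)))
		ok = dm_task_run(dmt);
	dm_task_destroy(dmt);
	return ok;
}

int main(void)
{
	const char *dev = "vol0";	/* hypothetical mapped device        */
	uint64_t sectors = 2097152;	/* 1 GiB mapping, as an example      */
	/* Assumed dm-mirror syntax: core log (region size 1024), 3 legs.  */
	const char *table = "core 1 1024 3 "
			    "/dev/sda1 0 /dev/sdb1 0 /dev/sdc1 0";

	if (!run_task(DM_DEVICE_RELOAD, dev, "mirror", sectors, table))
		return 1;		/* stage the new (inactive) table    */
	if (!run_task(DM_DEVICE_SUSPEND, dev, NULL, 0, NULL))
		return 1;		/* quiesce in-flight I/O             */
	if (!run_task(DM_DEVICE_RESUME, dev, NULL, 0, NULL))
		return 1;		/* swap tables and restart I/O       */
	return 0;
}
```

These three steps are essentially what running dmsetup reload, dmsetup suspend, and dmsetup resume by hand would do.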
* Re: "Enhanced" MD code avaible for review 2004-03-26 19:15 ` Kevin Corry @ 2004-03-26 20:45 ` Justin T. Gibbs 2004-03-27 15:39 ` Kevin Corry 0 siblings, 1 reply; 56+ messages in thread From: Justin T. Gibbs @ 2004-03-26 20:45 UTC (permalink / raw) To: Kevin Corry, linux-kernel; +Cc: Jeff Garzik, Neil Brown, linux-raid >> There is a certain amount of metadata that -must- be updated at runtime, >> as you recognize. Over and above what MD already cares about, DDF and >> its cousins introduce more items along those lines: event logs, bad >> sector logs, controller-level metadata... these are some of the areas I >> think Justin/Scott are concerned about. > > I'm sure these things could be accommodated within DM. Nothing in DM prevents > having some sort of in-kernel metadata knowledge. In fact, other DM modules > already do - dm-snapshot and the above mentioned dm-mirror both need to do > some amount of in-kernel status updating. But I see this as completely > separate from in-kernel device discovery (which we seem to agree is the wrong > direction). And IMO, well designed metadata will make this "split" very > obvious, so it's clear which parts of the metadata the kernel can use for > status, and which parts are purely for identification (which the kernel thus > ought to be able to ignore). We don't have control over the meta-data formats being used by the industry. Coming up with a solution that only works for "Linux Engineered Meta-data formats" removes any possibility of supporting things like DDF, Adaptec ASR, and a host of other meta-data formats that can be plugged into things like EMD. In the two cases we are supporting today with EMD, the records required for doing discovery reside in the same sectors as those that need to be updated at runtime from some "in-core" context. > The main point I'm trying to get across here is that DM provides a simple yet > extensible kernel framework for a variety of storage management tasks, > including a lot more than just RAID. I think it would be a huge benefit for > the RAID drivers to make use of this framework to provide functionality > beyond what is currently available. DM is a transform layer that has the ability to pause I/O while that transform is updated from userland. That's all it provides. As such, it is perfectly suited to some types of logical volume management applications. But that is as far as it goes. It does not have any support for doing "sync/resync/scrub" type operations or any generic support for doing anything with meta-data. In all of the examples you have presented so far, you have not explained how this part of the equation is handled. Sure, adding a member to a RAID1 is trivial. Just pause the I/O, update the transform, and let it go. Unfortunately, that new member is not in sync with the rest. The transform must be aware of this and only trust the member below the sync mark. How is this information communicated to the transform? Who updates the sync mark? Who copies the data to the new member while guaranteeing that an in-flight write does not occur to the area being synced? If you intend to add all of this to DM, then it is no longer any "simpler" or more extensible than EMD. Don't take my arguments the wrong way. I believe that DM is useful for what it was designed for: LVM. It does not, however, provide the machinery required for it to replace a generic RAID stack. Could you merge a RAID stack into DM. Sure. Its only software. 
But for it to be robust, the same types of operations MD/EMD perform in kernel space will have to be done there too. The simplicity of DM is part of why it is compelling. My belief is that merging RAID into DM will compromise this simplicity and divert DM from what it was designed to do - provide LVM transforms. As for RAID discovery, this is the trivial portion of RAID. For an extra 10% or less of code in a meta-data module, you get RAID discovery. You also get a single point of access to the meta-data, avoid duplicated code, and complex kernel/user interfaces. There seems to be a consistent feeling that it is worth compromising all of these benefits just to push this 10% of the meta-data handling code out of the kernel (and inflate it by 5 or 6 X duplicating code already in the kernel). Where are the benefits of this userland approach? -- Justin ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 20:45 ` Justin T. Gibbs @ 2004-03-27 15:39 ` Kevin Corry 2004-03-30 17:03 ` Justin T. Gibbs 0 siblings, 1 reply; 56+ messages in thread From: Kevin Corry @ 2004-03-27 15:39 UTC (permalink / raw) To: linux-kernel, Justin T. Gibbs Cc: Jeff Garzik, Neil Brown, linux-raid, dm-devel On Friday 26 March 2004 2:45 pm, Justin T. Gibbs wrote: > We don't have control over the meta-data formats being used by the > industry. Coming up with a solution that only works for "Linux Engineered > Meta-data formats" removes any possibility of supporting things like DDF, > Adaptec ASR, and a host of other meta-data formats that can be plugged into > things like EMD. In the two cases we are supporting today with EMD, the > records required for doing discovery reside in the same sectors as those > that need to be updated at runtime from some "in-core" context. Well, there's certainly no guarantee that the "industry" will get it right. In this case, it seems that they didn't. But even given that we don't have ideal metadata formats, it's still possible to do discovery and a number of other management tasks from user-space. > > The main point I'm trying to get across here is that DM provides a simple > > yet extensible kernel framework for a variety of storage management > > tasks, including a lot more than just RAID. I think it would be a huge > > benefit for the RAID drivers to make use of this framework to provide > > functionality beyond what is currently available. > > DM is a transform layer that has the ability to pause I/O while that > transform is updated from userland. That's all it provides. I think the DM developers would disagree with you on this point. > As such, > it is perfectly suited to some types of logical volume management > applications. But that is as far as it goes. It does not have any > support for doing "sync/resync/scrub" type operations or any generic > support for doing anything with meta-data. The core DM driver would not and should not be handling these operations. These are handled in modules specific to one type of mapping. There's no need for the DM core to know anything about any metadata. If one particular module (e.g. dm-mirror) needs to support one or more metadata formats, it's free to do so. On the other hand, DM *does* provide services that make "sync/resync" a great deal simpler for such a module. It provides simple services for performing synchronous or asynchronous I/O to pages or vm areas. It provides a service for performing copies from one block-device area to another. The dm-mirror module uses these for this very purpose. If we need additional "libraries" for common RAID tasks (e.g. parity calculations) we can certainly add them. > In all of the examples you > have presented so far, you have not explained how this part of the equation > is handled. Sure, adding a member to a RAID1 is trivial. Just pause the > I/O, update the transform, and let it go. Unfortunately, that new member > is not in sync with the rest. The transform must be aware of this and only > trust the member below the sync mark. How is this information communicated > to the transform? Who updates the sync mark? Who copies the data to the > new member while guaranteeing that an in-flight write does not occur to the > area being synced? Before the new disk is added to the raid1, user-space is responsible for writing an initial state to that disk, effectively marking it as completely dirty and unsynced. 
When the new table is loaded, part of the "resume" is for the module to read any metadata and do any initial setup that's necessary. In this particular example, it means the new disk would start with all of its "regions" marked "dirty", and all the regions would need to be synced from corresponding "clean" regions on another disk in the set. If the previously-existing disks were part-way through a sync when the table was switched, their metadata would indicate where the current "sync mark" was located. The module could then continue the sync from where it left off, including the new disk that was just added. When the sync completed, it might have to scan back to the beginning of the new disk to see if it had any remaining dirty regions that needed to be synced before that disk was completely clean. And of course the I/O-mapping path just has to be smart enough to know which regions are dirty and avoid sending live I/O to those. (And I'm sure Joe or Alasdair could provide a better in-depth explanation of the current dm-mirror module than I'm trying to. This is obviously a very high-level overview.) This process is somewhat similar to how dm-snapshot works. If it reads an empty header structure, it assumes it's a new snapshot, and starts with an empty hash table. If it reads a previously existing header, it continues to read the on-disk COW tables and constructs the necessary in-memory hash-table to represent that initial state. > If you intend to add all of this to DM, then it is no > longer any "simpler" or more extensible than EMD. Sure it is. Because very little (if any) of this needs to affect the core DM driver, that core remains as simple and extensible as it currently is. The extra complexity only really affects the new modules that would handle RAID. > Don't take my arguments the wrong way. I believe that DM is useful > for what it was designed for: LVM. It does not, however, provide the > machinery required for it to replace a generic RAID stack. Could > you merge a RAID stack into DM. Sure. Its only software. But for > it to be robust, the same types of operations MD/EMD perform in kernel > space will have to be done there too. > > The simplicity of DM is part of why it is compelling. My belief is that > merging RAID into DM will compromise this simplicity and divert DM from > what it was designed to do - provide LVM transforms. I disagree. The simplicity of the core DM driver really isn't at stake here. We're only talking about adding a few relatively complex target modules. And with DM you get the benefit of a very simple user/kernel interface. > As for RAID discovery, this is the trivial portion of RAID. For an extra > 10% or less of code in a meta-data module, you get RAID discovery. You > also get a single point of access to the meta-data, avoid duplicated code, > and complex kernel/user interfaces. There seems to be a consistent feeling > that it is worth compromising all of these benefits just to push this 10% > of the meta-data handling code out of the kernel (and inflate it by 5 or > 6 X duplicating code already in the kernel). Where are the benefits of > this userland approach? I've got to admit, this whole discussion is very ironic. Two years ago I was exactly where you are today, pushing for in-kernel discovery, a variety of metadata modules, internal opaque device stacking, etc, etc. I can only imagine that hch is laughing his ass off now that I'm the one arguing for moving all this stuff to user-space.
I don't honestly expect to suddenly change your mind on all these issues. A lot of work has obviously gone into EMD, and I definitely know how hard it can be when the community isn't greeting your suggestions with open arms. And I'm certainly not saying the EMD method isn't a potentially viable approach. But it doesn't seem to be the approach the community is looking for. We faced the same resistance two years ago. It took months of arguing with the community and arguing amongst ourselves before we finally decided to move EVMS to user-space and use MD and DM. It was a decision that meant essentially throwing away an enormous amount of work from several people. It was an incredibly hard choice, but I really believe now that it was the right decision. It was the direction the community wanted to move in, and the only way for our project to truly survive was to move with them. So feel free to continue to develop and promote EMD. I'm not trying to stop you and I don't mind having competition for finding the best way to do RAID in Linux. But I can tell you from experience that EMD is going to face a good bit of opposition based on its current design and you might want to take that into consideration. I am interested in discussing if and how RAID could be supported under Device-Mapper (or some other "merging" of these two drivers). Jeff and Lars have shown some interest, and I certainly hope we can convince Neil and Joe that this is a good direction. Maybe it can be done and maybe it can't. I personally think it can be, and I'd at least like to have that discussion and find out. -- Kevin Corry kevcorry@us.ibm.com http://evms.sourceforge.net/ ^ permalink raw reply [flat|nested] 56+ messages in thread
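A toy illustration of the region bookkeeping Kevin describes above: the mirror is divided into fixed-size regions, a freshly added leg starts with every region dirty, a resync walker claims dirty regions one at a time, and the write path refuses to race with a region being recovered. This is not the dm-mirror implementation; the names and the linear scan are simplifications.

```c
/*
 * Toy bookkeeping, not dm-mirror: fixed-size regions, a dirty map, a resync
 * walker, and a check the write path can use to avoid touching a region
 * that is currently being recovered.
 */
#include <stdint.h>
#include <string.h>

enum region_state { RH_CLEAN, RH_DIRTY, RH_RECOVERING };

struct region_map {
	uint64_t region_size;	/* sectors per region          */
	uint64_t nr_regions;
	uint8_t *state;		/* one region_state per region */
};

static uint64_t region_of(const struct region_map *rm, uint64_t sector)
{
	return sector / rm->region_size;
}

/* A freshly added mirror leg: everything must be copied from a clean leg. */
static void mark_all_dirty(struct region_map *rm)
{
	memset(rm->state, RH_DIRTY, rm->nr_regions);
}

/* Resync thread: claim the next dirty region; -1 means the leg is in sync. */
static int64_t claim_next_dirty(struct region_map *rm)
{
	uint64_t i;

	for (i = 0; i < rm->nr_regions; i++) {
		if (rm->state[i] == RH_DIRTY) {
			rm->state[i] = RH_RECOVERING;
			return (int64_t)i;
		}
	}
	return -1;
}

/* Write path: a live write must not race with a region being recovered. */
static int write_must_wait(const struct region_map *rm, uint64_t sector)
{
	return rm->state[region_of(rm, sector)] == RH_RECOVERING;
}
```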
* Re: "Enhanced" MD code avaible for review 2004-03-27 15:39 ` Kevin Corry @ 2004-03-30 17:03 ` Justin T. Gibbs 2004-03-30 17:15 ` Jeff Garzik 0 siblings, 1 reply; 56+ messages in thread From: Justin T. Gibbs @ 2004-03-30 17:03 UTC (permalink / raw) To: Kevin Corry, linux-kernel; +Cc: Jeff Garzik, Neil Brown, linux-raid, dm-devel > Well, there's certainly no guarantee that the "industry" will get it right. In > this case, it seems that they didn't. But even given that we don't have ideal > metadata formats, it's still possible to do discovery and a number of other > management tasks from user-space. I have never proposed that management activities be performed solely within the kernel. My position has been that meta-data parsing and updating has to be core-resident for any solution that handles advanced RAID functionality and that spliting out any portion of those roles to userland just complicates the solution. >> it is perfectly suited to some types of logical volume management >> applications. But that is as far as it goes. It does not have any >> support for doing "sync/resync/scrub" type operations or any generic >> support for doing anything with meta-data. > > The core DM driver would not and should not be handling these operations. > These are handled in modules specific to one type of mapping. There's no > need for the DM core to know anything about any metadata. If one particular > module (e.g. dm-mirror) needs to support one or more metadata formats, it's > free to do so. That's unfortunate considering that the meta-data formats we are talking about already have the capability of expressing RAID 1(E),4,5,6. There has to be a common meta-data framework in order to avoid this duplication. >> In all of the examples you >> have presented so far, you have not explained how this part of the equation >> is handled. ... > Before the new disk is added to the raid1, user-space is responsible for > writing an initial state to that disk, effectively marking it as completely > dirty and unsynced. When the new table is loaded, part of the "resume" is for > the module to read any metadata and do any initial setup that's necessary. In > this particular example, it means the new disk would start with all of its > "regions" marked "dirty", and all the regions would need to be synced from > corresponding "clean" regions on another disk in the set. > > If the previously-existing disks were part-way through a sync when the table > was switched, their metadata would indicate where the current "sync mark" was > located. The module could then continue the sync from where it left off, > including the new disk that was just added. When the sync completed, it might > have to scan back to the beginning of the new disk to see if had any remaining > dirty regions that needed to be synced before that disk was completely clean. > > And of course the I/O-mapping path just has to be smart enough to know which > regions are dirty and avoid sending live I/O to those. > > (And I'm sure Joe or Alasdair could provide a better in-depth explanation of > the current dm-mirror module than I'm trying to. This is obviously a very > high-level overview.) So all of this complexity is still in the kernel. The only difference is that the meta-data can *also* be manipulated from userspace. 
In order for this to be safe, the mirror must be suspended (meta-data becomes stable), the meta-data must be re-read by the userland program, the meta-data must be updated, the mapping must be updated, the mirror must be resumed, and the mirror must revalidate all meta-data. How do you avoid deadlock in this process? Does the userland daemon, which must be core resident in this case, pre-allocate buffers for reading and writing the meta-data? The dm-raid1 module also appears to intrinsically trust its mapping and the contents of its meta-data (simple magic number check). It seems to me that the kernel should validate all of its inputs regardless of whether the ioctls that are used to present them are only supposed to be used by a "trusted daemon". All of this adds up to more complexity. Your argument seems to be that, since DM avoids this complexity in its core, this is a better solution, but I am more interested in the least complex, most easily maintained total solution. >> The simplicity of DM is part of why it is compelling. My belief is that >> merging RAID into DM will compromise this simplicity and divert DM from >> what it was designed to do - provide LVM transforms. > > I disagree. The simplicity of the core DM driver really isn't at stake here. > We're only talking about adding a few relatively complex target modules. And > with DM you get the benefit of a very simple user/kernel interface. The simplicity of the user/kernel interface is not what is at stake here. With EMD, you can perform all of the same operations talked about above, in just as few ioctl calls. The only difference is that the kernel, and only the kernel, reads and modifies the metadata. There are actually fewer steps for the userland application than before. This becomes even more evident as more meta-data modules are added. > I don't honestly expect to suddenly change your mind on all these issues. > A lot of work has obviously gone into EMD, and I definitely know how hard it > can be when the community isn't greeting your suggestions with open arms. I honestly don't care if the final solution is EMD, DM, or XYZ so long as that solution is correct, supportable, and covers all of the scenarios required for robust RAID support. That is the crux of the argument, not "please love my code". -- Justin ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 17:03 ` Justin T. Gibbs @ 2004-03-30 17:15 ` Jeff Garzik 2004-03-30 17:35 ` Justin T. Gibbs 0 siblings, 1 reply; 56+ messages in thread From: Jeff Garzik @ 2004-03-30 17:15 UTC (permalink / raw) To: Justin T. Gibbs Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel Justin T. Gibbs wrote: > The dm-raid1 module also appears to intrinsicly trust its mapping and the > contents of its meta-data (simple magic number check). It seems to me that > the kernel should validate all of its inputs regardless of whether the > ioctls that are used to present them are only supposed to be used by a > "trusted daemon". The kernel should not be validating -trusted- userland inputs. Root is allowed to scrag the disk, violate limits, and/or crash his own machine. A simple example is requiring userland, when submitting ATA taskfiles via an ioctl, to specify the data phase (pio read, dma write, no-data, etc.). If the data phase is specified incorrectly, you kill the OS driver's ATA host state machine, and the results are very unpredictable. Since this is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get the required details right (just like following a spec). > I honestly don't care if the final solution is EMD, DM, or XYZ so long > as that solution is correct, supportable, and covers all of the scenarios > required for robust RAID support. That is the crux of the argument, not > "please love my code". hehe. I think we all agree here... Jeff ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 17:15 ` Jeff Garzik @ 2004-03-30 17:35 ` Justin T. Gibbs 2004-03-30 17:46 ` Jeff Garzik 2004-03-30 18:11 ` Bartlomiej Zolnierkiewicz 0 siblings, 2 replies; 56+ messages in thread From: Justin T. Gibbs @ 2004-03-30 17:35 UTC (permalink / raw) To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel > The kernel should not be validating -trusted- userland inputs. Root is > allowed to scrag the disk, violate limits, and/or crash his own machine. > > A simple example is requiring userland, when submitting ATA taskfiles via > an ioctl, to specify the data phase (pio read, dma write, no-data, etc.). > If the data phase is specified incorrectly, you kill the OS driver's ATA > host wwtate machine, and the results are very unpredictable. Since this > is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get the > required details right (just like following a spec). That's unfortunate for those using ATA. A command submitted from userland to the SCSI drivers I've written that causes a protocol violation will be detected, result in appropriate recovery, and a nice diagnostic that can be used to diagnose the problem. Part of this is because I cannot know if the protocol violation stems from a target defect, the input from the user or, for that matter, from the kernel. The main reason is for robustness and ease of debugging. In SCSI case, there is almost no run-time cost, and the system will stop before data corruption occurs. In the meta-data case we've been discussing in terms of EMD, there is no runtime cost, the validation has to occur somewhere anyway, and in many cases some validation is already required to avoid races with external events. If the validation is done in the kernel, then you get the benefit of nice diagnostics instead of strange crashes that are difficult to debug. -- Justin ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 17:35 ` Justin T. Gibbs @ 2004-03-30 17:46 ` Jeff Garzik 2004-03-30 18:04 ` Justin T. Gibbs 2004-03-30 18:11 ` Bartlomiej Zolnierkiewicz 1 sibling, 1 reply; 56+ messages in thread From: Jeff Garzik @ 2004-03-30 17:46 UTC (permalink / raw) To: Justin T. Gibbs Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel Justin T. Gibbs wrote: >>The kernel should not be validating -trusted- userland inputs. Root is >>allowed to scrag the disk, violate limits, and/or crash his own machine. >> >>A simple example is requiring userland, when submitting ATA taskfiles via >>an ioctl, to specify the data phase (pio read, dma write, no-data, etc.). >>If the data phase is specified incorrectly, you kill the OS driver's ATA >>host wwtate machine, and the results are very unpredictable. Since this >>is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get the >>required details right (just like following a spec). > > > That's unfortunate for those using ATA. A command submitted from userland Required, since one cannot know the data phase of vendor-specific commands. > to the SCSI drivers I've written that causes a protocol violation will > be detected, result in appropriate recovery, and a nice diagnostic that > can be used to diagnose the problem. Part of this is because I cannot know > if the protocol violation stems from a target defect, the input from the > user or, for that matter, from the kernel. The main reason is for robustness Well, * the target is not _issuing_ commands, * any user issuing incorrect commands/cdbs is not your bug, * and kernel code issuing incorrect cmands/cdbs isn't your bug either Particularly, checking whether the kernel is doing something wrong, or wrong, just wastes cycles. That's not a scalable way to code... if every driver and Linux subsystem did that, things would be unbearable slow. Jeff ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 17:46 ` Jeff Garzik @ 2004-03-30 18:04 ` Justin T. Gibbs 2004-03-30 21:47 ` Jeff Garzik 0 siblings, 1 reply; 56+ messages in thread From: Justin T. Gibbs @ 2004-03-30 18:04 UTC (permalink / raw) To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel >> That's unfortunate for those using ATA. A command submitted from userland > > Required, since one cannot know the data phase of vendor-specific commands. So you are saying that this presents an unrecoverable situation? > Particularly, checking whether the kernel is doing something wrong, or wrong, > just wastes cycles. That's not a scalable way to code... if every driver > and Linux subsystem did that, things would be unbearable slow. Hmm. I've never had someone tell me that my SCSI drivers are slow. I don't think that your statement is true in the general case. My belief is that validation should occur where it is cheap and efficient to do so. More expensive checks should be pushed into diagnostic code that is disabled by default, but the code *should be there*. In any event, for RAID meta-data, we're talking about code that is *not* in the common or time critical path of the kernel. A few dozen lines of validation code there has almost no impact on the size of the kernel and yields huge benefits for debugging and maintaining the code. This is even more the case in Linux the end user is often your test lab. -- Justin ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 18:04 ` Justin T. Gibbs @ 2004-03-30 21:47 ` Jeff Garzik 2004-03-30 22:12 ` Justin T. Gibbs 0 siblings, 1 reply; 56+ messages in thread From: Jeff Garzik @ 2004-03-30 21:47 UTC (permalink / raw) To: Justin T. Gibbs Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel Justin T. Gibbs wrote: >>>That's unfortunate for those using ATA. A command submitted from userland >> >>Required, since one cannot know the data phase of vendor-specific commands. > > > So you are saying that this presents an unrecoverable situation? No, I'm saying that the data phase need not have a bunch of in-kernel checks, it should be generated correctly from the source. >>Particularly, checking whether the kernel is doing something wrong, or wrong, >>just wastes cycles. That's not a scalable way to code... if every driver >>and Linux subsystem did that, things would be unbearable slow. > > > Hmm. I've never had someone tell me that my SCSI drivers are slow. This would be noticed in the CPU utilization area. Your drivers are probably a long way from being CPU-bound. > I don't think that your statement is true in the general case. My > belief is that validation should occur where it is cheap and efficient > to do so. More expensive checks should be pushed into diagnostic code > that is disabled by default, but the code *should be there*. In any event, > for RAID meta-data, we're talking about code that is *not* in the common > or time critical path of the kernel. A few dozen lines of validation code > there has almost no impact on the size of the kernel and yields huge > benefits for debugging and maintaining the code. This is even more > the case in Linux the end user is often your test lab. It doesn't scale terribly well, because the checks themselves become a source of bugs. Jeff ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 21:47 ` Jeff Garzik @ 2004-03-30 22:12 ` Justin T. Gibbs 2004-03-30 22:34 ` Jeff Garzik 0 siblings, 1 reply; 56+ messages in thread From: Justin T. Gibbs @ 2004-03-30 22:12 UTC (permalink / raw) To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel >> So you are saying that this presents an unrecoverable situation? > > No, I'm saying that the data phase need not have a bunch of in-kernel > checks, it should be generated correctly from the source. The SCSI drivers validate the controller's data phase based on the expected phase presented to them from an upper layer. I never talked about adding checks that make little sense or are overly expensive. You seem to equate validation with huge expense. That is just not the general case. >> Hmm. I've never had someone tell me that my SCSI drivers are slow. > > This would be noticed in the CPU utilization area. Your drivers are > probably a long way from being CPU-bound. I very much doubt that. There are perhaps four or five tests in the I/O path where some value already in a cache line that has to be accessed anyway is compared against a constant. We're talking about something down in the noise of any type of profiling you could perform. As I said, validation makes sense where there is basically no-cost to do it. >> I don't think that your statement is true in the general case. My >> belief is that validation should occur where it is cheap and efficient >> to do so. More expensive checks should be pushed into diagnostic code >> that is disabled by default, but the code *should be there*. In any event, >> for RAID meta-data, we're talking about code that is *not* in the common >> or time critical path of the kernel. A few dozen lines of validation code >> there has almost no impact on the size of the kernel and yields huge >> benefits for debugging and maintaining the code. This is even more >> the case in Linux the end user is often your test lab. > > It doesn't scale terribly well, because the checks themselves become a > source of bugs. So now the complaint is that validation code is somehow harder to write and maintain than the rest of the code? -- Justin ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 22:12 ` Justin T. Gibbs @ 2004-03-30 22:34 ` Jeff Garzik 0 siblings, 0 replies; 56+ messages in thread From: Jeff Garzik @ 2004-03-30 22:34 UTC (permalink / raw) To: Justin T. Gibbs Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel Justin T. Gibbs wrote: >>>So you are saying that this presents an unrecoverable situation? >> >>No, I'm saying that the data phase need not have a bunch of in-kernel >>checks, it should be generated correctly from the source. > > > The SCSI drivers validate the controller's data phase based on the > expected phase presented to them from an upper layer. I never talked > about adding checks that make little sense or are overly expensive. You > seem to equate validation with huge expense. That is just not the > general case. > > >>>Hmm. I've never had someone tell me that my SCSI drivers are slow. >> >>This would be noticed in the CPU utilization area. Your drivers are >>probably a long way from being CPU-bound. > > > I very much doubt that. There are perhaps four or five tests in the > I/O path where some value already in a cache line that has to be accessed > anyway is compared against a constant. We're talking about something > down in the noise of any type of profiling you could perform. As I said, > validation makes sense where there is basically no-cost to do it. > > >>>I don't think that your statement is true in the general case. My >>>belief is that validation should occur where it is cheap and efficient >>>to do so. More expensive checks should be pushed into diagnostic code >>>that is disabled by default, but the code *should be there*. In any event, >>>for RAID meta-data, we're talking about code that is *not* in the common >>>or time critical path of the kernel. A few dozen lines of validation code >>>there has almost no impact on the size of the kernel and yields huge >>>benefits for debugging and maintaining the code. This is even more >>>the case in Linux the end user is often your test lab. >> >>It doesn't scale terribly well, because the checks themselves become a >>source of bugs. > > > So now the complaint is that validation code is somehow harder to write > and maintain than the rest of the code? Actually, yes. Validation of random user input has always been a source of bugs (usually in edge cases), in Linux and in other operating systems. It is often the area where security bugs are found. Basically you want to avoid add checks for conditions that don't occur in properly written software, and make sure that the kernel always generates correct requests. Obviously that excludes anything on the target side, but other than that... in userland, a priveleged user is free to do anything they wish, including violate protocols, cook their disk, etc. Jeff ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 17:35 ` Justin T. Gibbs 2004-03-30 17:46 ` Jeff Garzik @ 2004-03-30 18:11 ` Bartlomiej Zolnierkiewicz 1 sibling, 0 replies; 56+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-30 18:11 UTC (permalink / raw) To: Justin T. Gibbs, Jeff Garzik Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel On Tuesday 30 of March 2004 19:35, Justin T. Gibbs wrote: > > The kernel should not be validating -trusted- userland inputs. Root is > > allowed to scrag the disk, violate limits, and/or crash his own machine. > > > > A simple example is requiring userland, when submitting ATA taskfiles via > > an ioctl, to specify the data phase (pio read, dma write, no-data, etc.). > > If the data phase is specified incorrectly, you kill the OS driver's ATA > > host wwtate machine, and the results are very unpredictable. Since this > > is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get > > the required details right (just like following a spec). > > That's unfortunate for those using ATA. A command submitted from userland > to the SCSI drivers I've written that causes a protocol violation will > be detected, result in appropriate recovery, and a nice diagnostic that > can be used to diagnose the problem. Part of this is because I cannot know > if the protocol violation stems from a target defect, the input from the > user or, for that matter, from the kernel. The main reason is for > robustness and ease of debugging. In SCSI case, there is almost no > run-time cost, and the system will stop before data corruption occurs. In In ATA case detection of protocol violation is not possible w/o checking every possible command opcode. Even if implemented (notice that checking commands coming from kernel is out of question - for performance reasons) this breaks for future and vendor specific commands. > the meta-data case we've been discussing in terms of EMD, there is no > runtime cost, the validation has to occur somewhere anyway, and in many > cases some validation is already required to avoid races with external > events. If the validation is done in the kernel, then you get the benefit > of nice diagnostics instead of strange crashes that are difficult to debug. Unless code that crashes is the one doing validation. ;-) Bartlomiej ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:00 ` Kevin Corry 2004-03-25 18:42 ` Jeff Garzik @ 2004-03-25 22:59 ` Justin T. Gibbs 2004-03-25 23:44 ` Lars Marowsky-Bree 1 sibling, 1 reply; 56+ messages in thread From: Justin T. Gibbs @ 2004-03-25 22:59 UTC (permalink / raw) To: Kevin Corry, linux-kernel; +Cc: Jeff Garzik, Neil Brown, linux-raid >> Independent DM efforts have already started supporting MD raid0/1 >> metadata from what I understand, though these efforts don't seem to post >> to linux-kernel or linux-raid much at all. :/ > > I post on lkml.....occasionally. :) ... > This decision was not based on any real dislike of the MD driver, but rather > for the benefits that are gained by using Device-Mapper. In particular, > Device-Mapper provides the ability to change out the device mapping on the > fly, by temporarily suspending I/O, changing the table, and resuming the I/O > I'm sure many of you know this already. But I'm not sure everyone fully > understands how powerful a feature this is. For instance, it means EVMS can > now expand RAID-linear devices online. While that particular example may not > sound all that exciting, if things like RAID-1 and RAID-5 were "ported" to > Device-Mapper, this feature would then allow you to do stuff like add new > "active" members to a RAID-1 online (think changing from 2-way mirror to > 3-way mirror). It would be possible to convert from RAID-0 to RAID-4 online > simply by adding a new disk (assuming other limitations, e.g. a single > stripe-zone). Unfortunately, these are things the MD driver can't do online, > because you need to completely stop the MD device before making such changes > (to prevent the kernel and user-space from trampling on the same metadata), > and MD won't stop the device if it's open (i.e. if it's mounted or if you > have other device (LVM) built on top of MD). Often times this means you need > to boot to a rescue-CD to make these types of configuration changes. We should be clear about your argument here. It is not that DM makes generic morphing easy and possible, it is that with DM the most basic types of morphing (no data striping or de-striping) is easily accomplished. You sight two examples: 1) Adding another member to a RAID-1. While MD may not allow this to occur while the array is operational, EMD does. This is possible because there is only one entity controlling the meta-data. 2) Converting a RAID0 to a RAID4 while possible with DM is not particularly interesting from an end user perspective. The fact of the matter is that neither EMD nor DM provide a generic morphing capability. If this is desirable, we can discuss how it could be achieved, but my initial belief is that attempting any type of complicated morphing from userland would be slow, prone to deadlocks, and thus difficult to achieve in a fashion that guaranteed no loss of data in the face of unexpected system restarts. -- Justin ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 22:59 ` Justin T. Gibbs @ 2004-03-25 23:44 ` Lars Marowsky-Bree 2004-03-26 0:03 ` Justin T. Gibbs 0 siblings, 1 reply; 56+ messages in thread From: Lars Marowsky-Bree @ 2004-03-25 23:44 UTC (permalink / raw) To: Justin T. Gibbs, Kevin Corry, linux-kernel Cc: Jeff Garzik, Neil Brown, linux-raid On 2004-03-25T15:59:00, "Justin T. Gibbs" <gibbs@scsiguy.com> said: > The fact of the matter is that neither EMD nor DM provide a generic > morphing capability. If this is desirable, we can discuss how it could > be achieved, but my initial belief is that attempting any type of > complicated morphing from userland would be slow, prone to deadlocks, > and thus difficult to achieve in a fashion that guaranteed no loss of > data in the face of unexpected system restarts. Uhm. DM sort of does (at least where the morphing amounts to resyncing a part of the stripe, ie adding a new mirror, RAID1->4, RAID5->6 etc). Freeze, load new mapping, continue. I agree that more complex morphings (RAID1->RAID5 or vice-versa in particular) are more difficult to get right, but are not that often needed online - or if they are, typically such scenarios will have enough temporary storage to create the new target, RAID1 over, disconnect the old part and free it, which will work just fine with DM. Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs | try again. fail again. fail better. Research & Development, SUSE LINUX AG \ -- Samuel Beckett - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 23:44 ` Lars Marowsky-Bree @ 2004-03-26 0:03 ` Justin T. Gibbs 0 siblings, 0 replies; 56+ messages in thread From: Justin T. Gibbs @ 2004-03-26 0:03 UTC (permalink / raw) To: Lars Marowsky-Bree, Kevin Corry, linux-kernel Cc: Jeff Garzik, Neil Brown, linux-raid > Uhm. DM sort of does (at least where the morphing amounts to resyncing a > part of the stripe, ie adding a new mirror, RAID1->4, RAID5->6 etc). > Freeze, load new mapping, continue. The point is that these trivial "morphings" can be achieved with limited effort regardless of whether you do it via EMD or DM. Implementing this in EMD could be achieved with perhaps 8 hours work with no significant increase in code size or complexity. This is part of why I find them "uninteresting". If we really want to talk about generic morphing, I think you'll find that DM is no better suited to this task than MD or its derivatives. > I agree that more complex morphings (RAID1->RAID5 or vice-versa in > particular) are more difficult to get right, but are not that often > needed online - or if they are, typically such scenarios will have > enough temporary storage to create the new target, RAID1 over, > disconnect the old part and free it, which will work just fine with DM. The most common requests that we hear from customers are: o single -> R1 Equally possible with MD or DM assuming your singles are accessed via a volume manager. Without that support the user will have to dismount and remount storage. o R1 -> R10 This should require just double the number of active members. This is not possible today with either DM or MD. Only "migration" is possible. o R1 -> R5 o R5 -> R1 These typically occur when data access patterns change for the customer. Again not possible with DM or MD today. All of these are important to some subset of customers and are, to my mind, required if you want to claim even basic morphing capability. If you are allowing the "cop-out" of using a volume manager to substitute data-migration for true morphing, then MD is almost as well suited to that task as DM. -- Justin ^ permalink raw reply [flat|nested] 56+ messages in thread
* "Enhanced" MD code avaible for review @ 2004-03-17 18:14 Justin T. Gibbs 2004-03-17 19:18 ` Jeff Garzik 0 siblings, 1 reply; 56+ messages in thread From: Justin T. Gibbs @ 2004-03-17 18:14 UTC (permalink / raw) To: linux-raid; +Cc: justin_gibbs [ I tried sending this last night from my Adaptec email address and have yet to see it on the list. Sorry if this is dup for any of you. ] For the past few months, Adaptec Inc, has been working to enhance MD. The goals of this project are: o Allow fully pluggable meta-data modules o Add support for Adaptec ASR (aka HostRAID) and DDF (Disk Data Format) meta-data types. Both of these formats are understood natively by certain vendor BIOSes meaning that arrays can be booted from transparently. o Improve the ability of MD to auto-configure arrays. o Support multi-level arrays transparently yet allow proper event notification across levels when the topology is known to MD. o Create a more generic "work item" framework which is used to support array initialization, rebuild, and verify operations as well as miscellaneous tasks that a meta-data or RAID personality may need to perform from a thread context (e.g. spare activation where meta-data records may need to be sequenced carefully). o Modify the MD ioctl interface to allow the creation of management utilities that are meta-data format agnostic. A snapshot of this work is now available here: http://people.freebsd.org/~gibbs/linux/SRC/emd-0.7.0-tar.gz This snapshot includes support for RAID0, RAID1, and the Adaptec ASR and DDF meta-data formats. Additional RAID personalities and support for the Super90 and Super 1 meta-data formats will be added in the coming weeks, the end goal being to provide a superset of the functionality in the current MD. A patch to fs/partitions/check.c is also required for this release to function correctly: http://people.freebsd.org/~gibbs/linux/SRC/md_announce_whole_device.diff As the file name implies, this patch exposes not only partitions on devices, but all "base" block devices to MD. This is required to support meta-data formats like ASR and DDF that typically operate on the whole device. Nothing in the implementation prevents any meta-data format from being used on a partition, but BIOS boot support is only available in the non-partitioned mode. Since the current MD notification scheme does not allow MD to receive notifications unless it is statically compiled into the kernel, we would like to work with the community to develop a more generic notification scheme to which modules, such as MD, can dynamically register. Until that occurs, these EMD snapshots will require at least md.c to be a static component of the kernel. For those wanting to test out this snapshot with an Adaptec HostRAID U320 SCSI controller, you will need to update your kernel to use version 2.0.8 of the aic79xx driver. This driver defaults to attaching to 790X controllers operating in HostRAID mode in addition to those in direct SCSI mode. This feature can be disabled using a module or kernel command option. Driver source and BK send patches for this driver can be found here: http://people.freebsd.org/~gibbs/linux/SRC/aic79xx-linux-2.6-20040316-tar.gz http://people.freebsd.org/~gibbs/linux/SRC/aic79xx-linux-2.6-20040316.bksend.gz Architectural Notes =================== The major areas of change in "EMD" can be categorized into: 1) "Object Oriented" Data structure changes These changes are the basis for allowing RAID personalities to transparently operate on "disks" or "arrays" as member objects. 
While it has always been possible to create multi-level arrays in MD using block layer stacking, our approach allows MD to also stack internally. Once a given RAID or meta-data personality is converted to the new structures, this "feature" comes at no cost. The benefit to stacking internally, which requires a meta-data format that supports this, is that array state can propagate up and down the topology without the loss of information inherent in using the block layer to traverse levels of an array. 2) Opcode based interfaces. Rather than add additional method vectors to either the RAID personality or meta-data personality objects, the new code uses only a few methods that are parameterized. This has allowed us to create a fairly rich interface between the core and the personalities without overly bloating personality "classes". 3) WorkItems Workitems provide a generic framework for queuing work to a thread context. Workitems include a "control" method as well as a "handler" method. This separation allows, for example, a RAID personality to use the generic sync handler while trapping the "open", "close", and "free" of any sync workitems. Since both handlers can be tailored to the individual workitem that is queued, this removes the need to overload one or more interfaces in the personalities. It also means that any code in MD can make use of this framework - it is not tied to particular objects or modules in the system. 4) "Syncable Volume" Support All of the transaction accounting necessary to support redundant arrays has been abstracted out into a few inline functions. With the inclusion of a "sync support" structure in a RAID personality's private data structure area and the use of these functions, the generic sync framework is fully available. The sync algorithm is also now more like that in 2.4.X - with some updates to improve performance. Two contiguous sync ranges are employed so that sync I/O can be pending while the lock range is extended and new sync I/O is stalled waiting for normal I/O writes that might conflict with the new range complete. The syncer updates its stats more frequently than in the past so that it can more quickly react to changes in the normal I/O load. Syncer backoff is also disabled anytime there is pending I/O blocked on the syncer's locked region. RAID personalities have full control over the size of the sync windows used so that they can be optimized based on RAID layout policy. 5) IOCTL Interface "EMD" now performs all of its configuration via an "mdctl" character device. Since one of our goals is to remove any knowledge of meta-data type in the user control programs, initial meta-data stamping and configuration validation occurs in the kernel. In general, the meta-data modules already need this validation code in order to support auto-configuration, so adding this capability adds little to the overall size of EMD. It does, however, require a few additional ioctls to support things like querying the maximum "coerced" size of a disk targeted for a new array, or enumerating the names of installed meta-data modules, etc. This area of EMD is still in very active development and we expect to provide a drop of an "emdadm" utility later this week. 6) Meta-data and Topology State To support pluggable meta-data modules which may have diverse policies, all embedded knowledge of the MD SuperBlock formats has been removed. In general, the meta-data modules "bid" on incoming devices that they can manage. 
The high bidder is then asked to configure the disk into a reasonable topology that can be managed by a RAID personality and the MD core. The bidding process allows a more "native" meta-data module to outbid a module that can handle the same format in "compatibility" mode. It also allows the user to load a meta-data module update during install scenarios even if an older module is compiled statically into the kernel. Once the topology is created, all information needed for normal operation is available to the MD core and/or RAID personalities via direct variable access (at times protected by locks or atomic ops of course). Array or member state changes occur via calling into the meta-data personality associated with that object. The meta-data personality is then responsible for changing the state visible to the rest of the code and notifying interested parties. This async design means that a RAID module noticing an I/O failure on one member and posting that event to one meta-data module, may cause a chain of notifications all the way to the top-level array object owned by another RAID/meta-data personality. The entire topology is reference counted such that objects will only disappear from the topology once they have transitioned to the FAILED state and all I/O (each I/O holds a reference) ceases. 7) Correction of RAID0 Transform The RAID0 transform's "merge function" assumes that the incoming bio's starting sector is the same as what will be presented to its make_request function. In the case of a partitioned MD device, the starting sector is shifted by the partition offset for the target offset. Unfortunately, the merge functions are not notified of the partition transform, so RAID0 would often reject requests that span "chunk" boundaries once shifted. The fix employed here is to determine if a partition transform will occur and take this into account in the merge function. Adaptec is currently validating EMD through formal testing while continuing the build-out of new features. Our hope is to gather feedback from the Linux community and adjust our approach to satisfy the community's requirements. We look forward to your comments, suggestions, and review of this project. -- Justin ^ permalink raw reply [flat|nested] 56+ messages in thread
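For item 7, the shape of the problem and of the fix can be sketched as follows. This is a simplified illustration written against the 2.6 structures (get_start_sect(), request_queue_t and the bio fields are real kernel interfaces of that era; the function body is not the actual EMD patch):

#include <linux/bio.h>
#include <linux/genhd.h>
#include <linux/raid/md.h>

/* Sketch of a RAID0 merge_bvec check.  The sector seen here is relative
 * to the partition, while make_request will see it shifted by the
 * partition start, so that offset must be folded into the chunk math. */
static int example_raid0_mergeable_bvec(request_queue_t *q, struct bio *bio,
					struct bio_vec *biovec)
{
	mddev_t *mddev = q->queuedata;
	sector_t sector = bio->bi_sector + get_start_sect(bio->bi_bdev);
	unsigned int chunk_sectors = mddev->chunk_size >> 9;
	unsigned int bio_sectors = bio->bi_size >> 9;
	unsigned int offset = sector & (chunk_sectors - 1);
	int max;

	if (offset + bio_sectors > chunk_sectors)
		max = 0;		/* request already spans a chunk boundary */
	else
		max = (chunk_sectors - offset - bio_sectors) << 9;

	/* An empty bio must always be allowed to take its first vector. */
	if (max <= biovec->bv_len && bio_sectors == 0)
		return biovec->bv_len;
	return max;
}

Without the get_start_sect() term, the boundary test is computed against the unshifted sector and requests that would be perfectly aligned after the partition transform get rejected, which is the behaviour described in item 7.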
* Re: "Enhanced" MD code avaible for review 2004-03-17 18:14 Justin T. Gibbs @ 2004-03-17 19:18 ` Jeff Garzik 2004-03-17 19:32 ` Christoph Hellwig 2004-03-17 21:18 ` Scott Long 0 siblings, 2 replies; 56+ messages in thread From: Jeff Garzik @ 2004-03-17 19:18 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-raid, justin_gibbs, Linux Kernel Justin T. Gibbs wrote: > [ I tried sending this last night from my Adaptec email address and have > yet to see it on the list. Sorry if this is dup for any of you. ] Included linux-kernel in the CC (and also bounced this post there). > For the past few months, Adaptec Inc, has been working to enhance MD. The FAQ from several corners is going to be "why not DM?", so I would humbly request that you (or Scott Long) re-post some of that rationale here... > The goals of this project are: > > o Allow fully pluggable meta-data modules yep, needed > o Add support for Adaptec ASR (aka HostRAID) and DDF > (Disk Data Format) meta-data types. Both of these > formats are understood natively by certain vendor > BIOSes meaning that arrays can be booted from transparently. yep, needed For those who don't know, DDF is particularly interesting. A storage industry association, "SNIA", has gotten most of the software and hardware RAID folks to agree on a common, vendor-neutral on-disk format. Pretty historic, IMO :) Since this will be appearing on most of the future RAID hardware, Linux users will be left out in a big way if this isn't supported. EARLY DRAFT spec for DDF was posted on snia.org at http://www.snia.org/tech_activities/ddftwg/DDFTrial-UseDraft_0_45.pdf > o Improve the ability of MD to auto-configure arrays. hmmmm. Maybe in my language this means "improve ability for low-level drivers to communicate RAID support to upper layers"? > o Support multi-level arrays transparently yet allow > proper event notification across levels when the > topology is known to MD. I'll need to see the code to understand what this means, much less whether it is needed ;-) > o Create a more generic "work item" framework which is > used to support array initialization, rebuild, and > verify operations as well as miscellaneous tasks that > a meta-data or RAID personality may need to perform > from a thread context (e.g. spare activation where > meta-data records may need to be sequenced carefully). This is interesting. (guessing) sort of like a pluggable finite state machine? > o Modify the MD ioctl interface to allow the creation > of management utilities that are meta-data format > agnostic. I'm thinking that for 2.6, it is much better to use a more tightly defined interface via a Linux character driver. Userland write(2)'s packets of data (h/w raid commands or software raid configuration commands), and read(2)'s the responses. ioctl's are a pain for 32->64-bit translation layers. Using a read/write interface allows one to create an interface that requires no translation layer -- a big deal for AMD64 and IA32e processors moving forward -- and it also gives one a lot more control over the interface. See, we need what I described _anyway_, as a chrdev-based interface to sending and receiving ATA taskfiles or SCSI cdb's. It would be IMO simple to extend this to a looks-a-lot-like-ioctl raid_op interface. > A snapshot of this work is now available here: > > http://people.freebsd.org/~gibbs/linux/SRC/emd-0.7.0-tar.gz Your email didn't say... this appears to be for 2.6, correct? > This snapshot includes support for RAID0, RAID1, and the Adaptec > ASR and DDF meta-data formats. 
Additional RAID personalities and > support for the Super90 and Super 1 meta-data formats will be added > in the coming weeks, the end goal being to provide a superset of > the functionality in the current MD. groovy > Since the current MD notification scheme does not allow MD to receive > notifications unless it is statically compiled into the kernel, we > would like to work with the community to develop a more generic > notification scheme to which modules, such as MD, can dynamically > register. Until that occurs, these EMD snapshots will require at > least md.c to be a static component of the kernel. You would just need a small stub that holds a notifier pointer, yes? > Architectural Notes > =================== > The major areas of change in "EMD" can be categorized into: > > 1) "Object Oriented" Data structure changes > > These changes are the basis for allowing RAID personalities > to transparently operate on "disks" or "arrays" as member > objects. While it has always been possible to create > multi-level arrays in MD using block layer stacking, our > approach allows MD to also stack internally. Once a given > RAID or meta-data personality is converted to the new > structures, this "feature" comes at no cost. The benefit > to stacking internally, which requires a meta-data format > that supports this, is that array state can propagate up > and down the topology without the loss of information > inherent in using the block layer to traverse levels of an > array. I have a feeling that consensus will prefer that we fix the block layer, and then figure out the best way to support "automatic stacking" -- since DDF and presumeably other RAID formats will require automatic setup of raid0+1, etc. Are there RAID-specific issues here, that do not apply to e.g. multipathing, which I've heard needs more information at the block layer? > 2) Opcode based interfaces. > > Rather than add additional method vectors to either the > RAID personality or meta-data personality objects, the new > code uses only a few methods that are parameterized. This > has allowed us to create a fairly rich interface between > the core and the personalities without overly bloating > personality "classes". Modulo what I said above, about the chrdev userland interface, we want to avoid this. You're already going down the wrong road by creating more untyped interfaces... static int raid0_raidop(mdk_member_t *member, int op, void *arg) { switch (op) { case MDK_RAID_OP_MSTATE_CHANGED: The preferred model is to create a single marshalling module (a la net/core/ethtool.c) that converts the ioctls we must support into a fully typed function call interface (a la struct ethtool_ops). > 3) WorkItems > > Workitems provide a generic framework for queuing work to > a thread context. Workitems include a "control" method as > well as a "handler" method. This separation allows, for > example, a RAID personality to use the generic sync handler > while trapping the "open", "close", and "free" of any sync > workitems. Since both handlers can be tailored to the > individual workitem that is queued, this removes the need > to overload one or more interfaces in the personalities. > It also means that any code in MD can make use of this > framework - it is not tied to particular objects or modules > in the system. Makes sense, though I wonder if we'll want to make this more generic. hardware RAID drivers might want to use this sort of stuff internally? 
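Since the work-item framework comes up repeatedly in this thread, here is a rough sketch of the general shape such a framework can take: items queued from any context (including interrupt context) and executed by a kernel thread. The names and layout are invented for illustration and are not the actual EMD interface.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/sched.h>

struct emd_workitem {
	struct list_head links;
	void (*handler)(struct emd_workitem *wi);	/* runs in thread context */
	void (*control)(struct emd_workitem *wi, int event); /* open/close/free hooks */
	void *context;
};

static LIST_HEAD(emd_work_list);
static spinlock_t emd_work_lock = SPIN_LOCK_UNLOCKED;
static DECLARE_WAIT_QUEUE_HEAD(emd_work_wait);

/* Callable from interrupt context: link the item and wake the worker. */
static void emd_queue_work(struct emd_workitem *wi)
{
	unsigned long flags;

	spin_lock_irqsave(&emd_work_lock, flags);
	list_add_tail(&wi->links, &emd_work_list);
	spin_unlock_irqrestore(&emd_work_lock, flags);
	wake_up(&emd_work_wait);
}

/* Worker thread body: pop items and run their handlers in process context. */
static int emd_work_thread(void *unused)
{
	for (;;) {
		struct emd_workitem *wi = NULL;

		wait_event_interruptible(emd_work_wait,
					 !list_empty(&emd_work_list));
		spin_lock_irq(&emd_work_lock);
		if (!list_empty(&emd_work_list)) {
			wi = list_entry(emd_work_list.next,
					struct emd_workitem, links);
			list_del(&wi->links);
		}
		spin_unlock_irq(&emd_work_lock);
		if (wi)
			wi->handler(wi);
	}
	return 0;
}

The separation described above would hang off the control hook: a RAID personality can reuse a generic handler while trapping the "open", "close", and "free" events of its own items.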
> 4) "Syncable Volume" Support > > All of the transaction accounting necessary to support > redundant arrays has been abstracted out into a few inline > functions. With the inclusion of a "sync support" structure > in a RAID personality's private data structure area and the > use of these functions, the generic sync framework is fully > available. The sync algorithm is also now more like that > in 2.4.X - with some updates to improve performance. Two > contiguous sync ranges are employed so that sync I/O can > be pending while the lock range is extended and new sync > I/O is stalled waiting for normal I/O writes that might > conflict with the new range complete. The syncer updates > its stats more frequently than in the past so that it can > more quickly react to changes in the normal I/O load. Syncer > backoff is also disabled anytime there is pending I/O blocked > on the syncer's locked region. RAID personalities have > full control over the size of the sync windows used so that > they can be optimized based on RAID layout policy. interesting. makes sense on the surface, I'll have to think some more... > 5) IOCTL Interface > > "EMD" now performs all of its configuration via an "mdctl" > character device. Since one of our goals is to remove any > knowledge of meta-data type in the user control programs, > initial meta-data stamping and configuration validation > occurs in the kernel. In general, the meta-data modules > already need this validation code in order to support > auto-configuration, so adding this capability adds little > to the overall size of EMD. It does, however, require a > few additional ioctls to support things like querying the > maximum "coerced" size of a disk targeted for a new array, > or enumerating the names of installed meta-data modules, > etc. > > This area of EMD is still in very active development and we expect > to provide a drop of an "emdadm" utility later this week. I haven't evaluated yet the ioctl interface. I do understand the need to play alongside the existing md interface, but if there are huge numbers of additions, it would be preferred to just use the chrdev straightaway. Such a chrdev would be easily portable to 2.4.x kernels too :) > 7) Correction of RAID0 Transform > > The RAID0 transform's "merge function" assumes that the > incoming bio's starting sector is the same as what will be > presented to its make_request function. In the case of a > partitioned MD device, the starting sector is shifted by > the partition offset for the target offset. Unfortunately, > the merge functions are not notified of the partition > transform, so RAID0 would often reject requests that span > "chunk" boundaries once shifted. The fix employed here is > to determine if a partition transform will occur and take > this into account in the merge function. interesting > Adaptec is currently validating EMD through formal testing while > continuing the build-out of new features. Our hope is to gather > feedback from the Linux community and adjust our approach to satisfy > the community's requirements. We look forward to your comments, > suggestions, and review of this project. Thanks much for working with the Linux community. One overall comment on merging into 2.6: the patch will need to be broken up into pieces. It's OK if each piece is dependent on the prior one, and it's OK if there are 20, 30, even 100 pieces. It helps a lot for review to see the evolution, and it also helps flush out problems you might not have even noticed. e.g. 
- add concept of member, and related helper functions - use member functions/structs in raid drivers raid0.c, etc. - fix raid0 transform - add ioctls needed in order for DDF to be useful - add DDF format etc. ^ permalink raw reply [flat|nested] 56+ messages in thread
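A hedged sketch of the chrdev model Jeff describes above: userland writes a fixed-layout command packet and reads back a reply. Everything here (device node, structures, opcode numbers) is hypothetical; the point is that fixed-width, explicitly padded fields give identical layouts for 32-bit and 64-bit callers, so no compat translation layer is required.

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical command/reply packets for a /dev/mdctl-style interface. */
struct raid_cmd {
	uint32_t opcode;	/* e.g. "query array", "fail member" */
	uint32_t array_id;
	uint64_t param;		/* opcode-specific argument */
	uint8_t  payload[64];	/* opcode-specific data, zero-padded */
};

struct raid_reply {
	uint32_t status;
	uint32_t reserved;
	uint8_t  payload[64];
};

int main(void)
{
	struct raid_cmd cmd;
	struct raid_reply rep;
	int fd = open("/dev/mdctl", O_RDWR);	/* hypothetical node */

	if (fd < 0)
		return 1;
	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode = 1;				/* illustrative opcode number */
	if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd) ||
	    read(fd, &rep, sizeof(rep)) != sizeof(rep)) {
		close(fd);
		return 1;
	}
	close(fd);
	return rep.status != 0;
}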
* Re: "Enhanced" MD code avaible for review 2004-03-17 19:18 ` Jeff Garzik @ 2004-03-17 19:32 ` Christoph Hellwig 2004-03-17 20:02 ` Jeff Garzik 2004-03-17 21:18 ` Scott Long 1 sibling, 1 reply; 56+ messages in thread From: Christoph Hellwig @ 2004-03-17 19:32 UTC (permalink / raw) To: Jeff Garzik; +Cc: Justin T. Gibbs, linux-raid, Linux Kernel On Wed, Mar 17, 2004 at 02:18:25PM -0500, Jeff Garzik wrote: > > o Allow fully pluggable meta-data modules > > yep, needed Well, this is pretty much the EVMS route we all heavily argued against. Most of the metadata shouldn't be visible in the kernel at all. > > o Improve the ability of MD to auto-configure arrays. > > hmmmm. Maybe in my language this means "improve ability for low-level > drivers to communicate RAID support to upper layers"? I think he's talking about the deprecated raid autorun feature. Again something that is completely misplaced in the kernel. (ågain EVMS light) > > o Support multi-level arrays transparently yet allow > > proper event notification across levels when the > > topology is known to MD. > > I'll need to see the code to understand what this means, much less > whether it is needed ;-) I think he mean the broken inter-driver raid stacking mentioned below. Why do I have to thing of EVMS when for each feature?.. - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-17 19:32 ` Christoph Hellwig @ 2004-03-17 20:02 ` Jeff Garzik 0 siblings, 0 replies; 56+ messages in thread From: Jeff Garzik @ 2004-03-17 20:02 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Justin T. Gibbs, linux-raid, Linux Kernel Christoph Hellwig wrote: > On Wed, Mar 17, 2004 at 02:18:25PM -0500, Jeff Garzik wrote: > >>> o Allow fully pluggable meta-data modules >> >>yep, needed > > > Well, this is pretty much the EVMS route we all heavily argued against. > Most of the metadata shouldn't be visible in the kernel at all. _some_ metadata is required at runtime, and must be in the kernel. I agree that a lot of configuration doesn't necessarily need to be in the kernel. But stuff like bad sector and event logs, and other bits are still needed at runtime. >>> o Improve the ability of MD to auto-configure arrays. >> >>hmmmm. Maybe in my language this means "improve ability for low-level >>drivers to communicate RAID support to upper layers"? > > > I think he's talking about the deprecated raid autorun feature. Again > something that is completely misplaced in the kernel. (ågain EVMS light) Indeed, but I'll let him and the code illuminate the meaning :) Jeff - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-17 19:18 ` Jeff Garzik 2004-03-17 19:32 ` Christoph Hellwig @ 2004-03-17 21:18 ` Scott Long 2004-03-17 21:35 ` Jeff Garzik ` (2 more replies) 1 sibling, 3 replies; 56+ messages in thread From: Scott Long @ 2004-03-17 21:18 UTC (permalink / raw) To: Jeff Garzik; +Cc: Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel Jeff Garzik wrote: > Justin T. Gibbs wrote: > > [ I tried sending this last night from my Adaptec email address and have > > yet to see it on the list. Sorry if this is dup for any of you. ] > > Included linux-kernel in the CC (and also bounced this post there). > > > > For the past few months, Adaptec Inc, has been working to enhance MD. > > The FAQ from several corners is going to be "why not DM?", so I would > humbly request that you (or Scott Long) re-post some of that rationale > here... > > > > The goals of this project are: > > > > o Allow fully pluggable meta-data modules > > yep, needed > > > > o Add support for Adaptec ASR (aka HostRAID) and DDF > > (Disk Data Format) meta-data types. Both of these > > formats are understood natively by certain vendor > > BIOSes meaning that arrays can be booted from transparently. > > yep, needed > > For those who don't know, DDF is particularly interesting. A storage > industry association, "SNIA", has gotten most of the software and > hardware RAID folks to agree on a common, vendor-neutral on-disk format. > Pretty historic, IMO :) Since this will be appearing on most of the > future RAID hardware, Linux users will be left out in a big way if this > isn't supported. > > EARLY DRAFT spec for DDF was posted on snia.org at > http://www.snia.org/tech_activities/ddftwg/DDFTrial-UseDraft_0_45.pdf > > > > o Improve the ability of MD to auto-configure arrays. > > hmmmm. Maybe in my language this means "improve ability for low-level > drivers to communicate RAID support to upper layers"? > No, this is full auto-configuration support at boot-time, and when drives are hot-added. I think that you comment applies to the next item, and yes, you are correct. > > > o Support multi-level arrays transparently yet allow > > proper event notification across levels when the > > topology is known to MD. > > I'll need to see the code to understand what this means, much less > whether it is needed ;-) > > > > o Create a more generic "work item" framework which is > > used to support array initialization, rebuild, and > > verify operations as well as miscellaneous tasks that > > a meta-data or RAID personality may need to perform > > from a thread context (e.g. spare activation where > > meta-data records may need to be sequenced carefully). > > This is interesting. (guessing) sort of like a pluggable finite state > machine? > More or less, yes. We needed a way to bridge the gap from an error being reported in an interrupt context to being able to allocate memory and do blocking I/O from a thread context. The md_error() interface already existed to do this, but was way too primitive for our needs. It had no way to handle cascading or compound events. > > > o Modify the MD ioctl interface to allow the creation > > of management utilities that are meta-data format > > agnostic. > > I'm thinking that for 2.6, it is much better to use a more tightly > defined interface via a Linux character driver. Userland write(2)'s > packets of data (h/w raid commands or software raid configuration > commands), and read(2)'s the responses. > > ioctl's are a pain for 32->64-bit translation layers. 
Using a > read/write interface allows one to create an interface that requires no > translation layer -- a big deal for AMD64 and IA32e processors moving > forward -- and it also gives one a lot more control over the interface. > I'm not exactly sure what the difference is here. Both the ioctl and read/write paths copy data in and out of the kernel. The ioctl method is a little bit easier since you don't have to stream in a chunk of data before knowing what to do with it. ANd I also don't see how read/write protect you from endian and 64/32-bit issues better than ioctl. If you write your code cleanly and correctly, it's a moot point. > See, we need what I described _anyway_, as a chrdev-based interface to > sending and receiving ATA taskfiles or SCSI cdb's. > > It would be IMO simple to extend this to a looks-a-lot-like-ioctl > raid_op interface. > > > > A snapshot of this work is now available here: > > > > http://people.freebsd.org/~gibbs/linux/SRC/emd-0.7.0-tar.gz > > Your email didn't say... this appears to be for 2.6, correct? > > > > This snapshot includes support for RAID0, RAID1, and the Adaptec > > ASR and DDF meta-data formats. Additional RAID personalities and > > support for the Super90 and Super 1 meta-data formats will be added > > in the coming weeks, the end goal being to provide a superset of > > the functionality in the current MD. > > groovy > > > > Since the current MD notification scheme does not allow MD to receive > > notifications unless it is statically compiled into the kernel, we > > would like to work with the community to develop a more generic > > notification scheme to which modules, such as MD, can dynamically > > register. Until that occurs, these EMD snapshots will require at > > least md.c to be a static component of the kernel. > > You would just need a small stub that holds a notifier pointer, yes? > I think that we are flexible on this. We have an implementation from several years ago that records partition type information and passes it around in the notification message so that consumers can register for distinct types of disks/partitions/etc. Our needs aren't that complex, but we would be happy to share it anyways since it is useful. > > > Architectural Notes > > =================== > > The major areas of change in "EMD" can be categorized into: > > > > 1) "Object Oriented" Data structure changes > > > > These changes are the basis for allowing RAID personalities > > to transparently operate on "disks" or "arrays" as member > > objects. While it has always been possible to create > > multi-level arrays in MD using block layer stacking, our > > approach allows MD to also stack internally. Once a given > > RAID or meta-data personality is converted to the new > > structures, this "feature" comes at no cost. The benefit > > to stacking internally, which requires a meta-data format > > that supports this, is that array state can propagate up > > and down the topology without the loss of information > > inherent in using the block layer to traverse levels of an > > array. > > I have a feeling that consensus will prefer that we fix the block layer, > and then figure out the best way to support "automatic stacking" -- > since DDF and presumeably other RAID formats will require automatic > setup of raid0+1, etc. > > Are there RAID-specific issues here, that do not apply to e.g. > multipathing, which I've heard needs more information at the block layer? > No, the issue is, how do you propagate events through the block layer? 
EIO/EINVAL/etc error codes just don't cut it. Also, many metadata formats are unified, in that even though the arrays are stacked, the metadata sees the entire picture. Updates might need to touch every disk in the compound array, not just a certain sub-array. The stacking that we do internal to MD is still fairly clean and doesn't prevent one from stacking outside of MD. > > > 2) Opcode based interfaces. > > > > Rather than add additional method vectors to either the > > RAID personality or meta-data personality objects, the new > > code uses only a few methods that are parameterized. This > > has allowed us to create a fairly rich interface between > > the core and the personalities without overly bloating > > personality "classes". > > Modulo what I said above, about the chrdev userland interface, we want > to avoid this. You're already going down the wrong road by creating > more untyped interfaces... > > static int raid0_raidop(mdk_member_t *member, int op, void *arg) > { > switch (op) { > case MDK_RAID_OP_MSTATE_CHANGED: > > The preferred model is to create a single marshalling module (a la > net/core/ethtool.c) that converts the ioctls we must support into a > fully typed function call interface (a la struct ethtool_ops). > These OPS don't exist soley for the userland ap. They also exist for communicating between the raid transform and metadata modules. > > > 3) WorkItems > > > > Workitems provide a generic framework for queuing work to > > a thread context. Workitems include a "control" method as > > well as a "handler" method. This separation allows, for > > example, a RAID personality to use the generic sync handler > > while trapping the "open", "close", and "free" of any sync > > workitems. Since both handlers can be tailored to the > > individual workitem that is queued, this removes the need > > to overload one or more interfaces in the personalities. > > It also means that any code in MD can make use of this > > framework - it is not tied to particular objects or modules > > in the system. > > Makes sense, though I wonder if we'll want to make this more generic. > hardware RAID drivers might want to use this sort of stuff internally? > If you want to make it into a more generic kernel service, that fine. However, I'm not quite sure what kind of work items a hardware raid driver will need. The whole point there is to hide what's going on ;-) > > > 4) "Syncable Volume" Support > > > > All of the transaction accounting necessary to support > > redundant arrays has been abstracted out into a few inline > > functions. With the inclusion of a "sync support" structure > > in a RAID personality's private data structure area and the > > use of these functions, the generic sync framework is fully > > available. The sync algorithm is also now more like that > > in 2.4.X - with some updates to improve performance. Two > > contiguous sync ranges are employed so that sync I/O can > > be pending while the lock range is extended and new sync > > I/O is stalled waiting for normal I/O writes that might > > conflict with the new range complete. The syncer updates > > its stats more frequently than in the past so that it can > > more quickly react to changes in the normal I/O load. Syncer > > backoff is also disabled anytime there is pending I/O blocked > > on the syncer's locked region. RAID personalities have > > full control over the size of the sync windows used so that > > they can be optimized based on RAID layout policy. > > interesting. 
makes sense on the surface, I'll have to think some more... > > > > 5) IOCTL Interface > > > > "EMD" now performs all of its configuration via an "mdctl" > > character device. Since one of our goals is to remove any > > knowledge of meta-data type in the user control programs, > > initial meta-data stamping and configuration validation > > occurs in the kernel. In general, the meta-data modules > > already need this validation code in order to support > > auto-configuration, so adding this capability adds little > > to the overall size of EMD. It does, however, require a > > few additional ioctls to support things like querying the > > maximum "coerced" size of a disk targeted for a new array, > > or enumerating the names of installed meta-data modules, > > etc. > > > > This area of EMD is still in very active development and we expect > > to provide a drop of an "emdadm" utility later this week. > > I haven't evaluated yet the ioctl interface. I do understand the need > to play alongside the existing md interface, but if there are huge > numbers of additions, it would be preferred to just use the chrdev > straightaway. Such a chrdev would be easily portable to 2.4.x kernels > too :) > > > > 7) Correction of RAID0 Transform > > > > The RAID0 transform's "merge function" assumes that the > > incoming bio's starting sector is the same as what will be > > presented to its make_request function. In the case of a > > partitioned MD device, the starting sector is shifted by > > the partition offset for the target offset. Unfortunately, > > the merge functions are not notified of the partition > > transform, so RAID0 would often reject requests that span > > "chunk" boundaries once shifted. The fix employed here is > > to determine if a partition transform will occur and take > > this into account in the merge function. > > interesting > > > > Adaptec is currently validating EMD through formal testing while > > continuing the build-out of new features. Our hope is to gather > > feedback from the Linux community and adjust our approach to satisfy > > the community's requirements. We look forward to your comments, > > suggestions, and review of this project. > > Thanks much for working with the Linux community. > > One overall comment on merging into 2.6: the patch will need to be > broken up into pieces. It's OK if each piece is dependent on the prior > one, and it's OK if there are 20, 30, even 100 pieces. It helps a lot > for review to see the evolution, and it also helps flush out problems > you might not have even noticed. e.g. > - add concept of member, and related helper functions > - use member functions/structs in raid drivers raid0.c, etc. > - fix raid0 transform > - add ioctls needed in order for DDF to be useful > - add DDF format > etc. > We can provide our Perforce changelogs (just like we do for SCSI). Scott ^ permalink raw reply [flat|nested] 56+ messages in thread
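On the 32/64-bit point in this exchange, "cleanly and correctly" in practice means an ioctl argument built only from fixed-width, naturally aligned fields, so that the ia32 and 64-bit layouts are byte-for-byte identical and the compat layer can pass the command straight through. An illustrative (invented) example, not an existing md interface:

#include <linux/types.h>
#include <linux/ioctl.h>

/* Hypothetical ioctl argument: only __u32/__u64 members, no pointers or
 * longs, and explicit padding, so sizeof() and every field offset are
 * the same for 32-bit and 64-bit userland. */
struct emd_member_status {
	__u32	array_id;
	__u32	member_index;
	__u64	size_sectors;
	__u32	state;
	__u32	pad;		/* keep the structure a multiple of 8 bytes */
};

#define EMD_GET_MEMBER_STATUS	_IOWR('m', 0x42, struct emd_member_status)

Structures containing longs, pointers, or implicit padding are the ones that force a translation step, whichever transport (ioctl or read/write) carries them.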
* Re: "Enhanced" MD code avaible for review 2004-03-17 21:18 ` Scott Long @ 2004-03-17 21:35 ` Jeff Garzik 2004-03-17 21:45 ` Bartlomiej Zolnierkiewicz 2004-03-18 1:56 ` viro 2 siblings, 0 replies; 56+ messages in thread From: Jeff Garzik @ 2004-03-17 21:35 UTC (permalink / raw) To: Scott Long; +Cc: Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel Scott Long wrote: > Jeff Garzik wrote: >> Modulo what I said above, about the chrdev userland interface, we want >> to avoid this. You're already going down the wrong road by creating >> more untyped interfaces... >> >> static int raid0_raidop(mdk_member_t *member, int op, void *arg) >> { >> switch (op) { >> case MDK_RAID_OP_MSTATE_CHANGED: >> >> The preferred model is to create a single marshalling module (a la >> net/core/ethtool.c) that converts the ioctls we must support into a >> fully typed function call interface (a la struct ethtool_ops). >> > > These OPS don't exist soley for the userland ap. They also exist for > communicating between the raid transform and metadata modules. Nod -- kernel internal calls should _especially_ be type-explicit, not typeless ioctl-like APIs. >> One overall comment on merging into 2.6: the patch will need to be >> broken up into pieces. It's OK if each piece is dependent on the prior >> one, and it's OK if there are 20, 30, even 100 pieces. It helps a lot >> for review to see the evolution, and it also helps flush out problems >> you might not have even noticed. e.g. >> - add concept of member, and related helper functions >> - use member functions/structs in raid drivers raid0.c, etc. >> - fix raid0 transform >> - add ioctls needed in order for DDF to be useful >> - add DDF format >> etc. >> > > We can provide our Perforce changelogs (just like we do for SCSI). What I'm saying is, emd needs to be submitted to the kernel just like Neil Brown submits patches to Andrew, etc. This is how everybody else submits and maintains Linux kernel code. There needs to be N patches, one patch per email, that successively introduces new code, or modifies existing code. Absent of all other issues, one huge patch that completely updates md isn't going to be acceptable, no matter how nifty or well-tested it is... Jeff ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-17 21:18 ` Scott Long 2004-03-17 21:35 ` Jeff Garzik @ 2004-03-17 21:45 ` Bartlomiej Zolnierkiewicz 2004-03-18 0:23 ` Scott Long 2004-03-18 1:56 ` viro 2 siblings, 1 reply; 56+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-17 21:45 UTC (permalink / raw) To: Scott Long, Jeff Garzik Cc: Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel On Wednesday 17 of March 2004 22:18, Scott Long wrote: > Jeff Garzik wrote: > > Justin T. Gibbs wrote: > > > [ I tried sending this last night from my Adaptec email address and > > > have yet to see it on the list. Sorry if this is dup for any of you. > > > ] > > > > Included linux-kernel in the CC (and also bounced this post there). > > > > > For the past few months, Adaptec Inc, has been working to enhance MD. > > > > The FAQ from several corners is going to be "why not DM?", so I would > > humbly request that you (or Scott Long) re-post some of that rationale > > here... This is #1 question so... why not DM? 8) Regards, Bartlomiej ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-17 21:45 ` Bartlomiej Zolnierkiewicz @ 2004-03-18 0:23 ` Scott Long 2004-03-18 1:55 ` Bartlomiej Zolnierkiewicz ` (2 more replies) 0 siblings, 3 replies; 56+ messages in thread From: Scott Long @ 2004-03-18 0:23 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz Cc: Jeff Garzik, Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel Bartlomiej Zolnierkiewicz wrote: > On Wednesday 17 of March 2004 22:18, Scott Long wrote: > > Jeff Garzik wrote: > > > Justin T. Gibbs wrote: > > > > [ I tried sending this last night from my Adaptec email address and > > > > have yet to see it on the list. Sorry if this is dup for any of > you. > > > > ] > > > > > > Included linux-kernel in the CC (and also bounced this post there). > > > > > > > For the past few months, Adaptec Inc, has been working to > enhance MD. > > > > > > The FAQ from several corners is going to be "why not DM?", so I would > > > humbly request that you (or Scott Long) re-post some of that rationale > > > here... > > This is #1 question so... why not DM? 8) > > Regards, > Bartlomiej > The primary feature of any RAID implementation is reliability. Reliability is a surprisingly hard goal. Making sure that your data is available and trustworthy under real-world scenarios is a lot harder than it sounds. This has been a significant focus of ours on MD, and is the primary reason why we chose MD as the foundation of our work. Storage is the foundation of everything that you do with your computer. It needs to work regardless of what happened to your filesystem on the last crash, regardless of whether or not you have the latest initrd tools, regardless of what rpms you've kept up to date on, regardless if your userland works, regardless of what libc you are using this week, etc. With DM, what happens when your initrd gets accidentally corrupted? What happens when the kernel and userland pieces get out of sync? Maybe you are booting off of a single drive and only using DM arrays for secondary storage, but maybe you're not. If something goes wrong with DM, how do you boot? Secondly, our target here is to interoperate with hardware components that run outside the scope of Linux. The HostRAID or DDF BIOS is going to create an array using it's own format. It's not going to have any knowledge of DM config files, initrd, ramfs, etc. However, the end user is still going to expect to be able to seamlessly install onto that newly created array, maybe move that array to another system, whatever, and have it all Just Work. Has anyone heard of a hardware RAID card that requires you to run OS-specific commands in order to access the arrays on it? Of course not. The point here is to make software raid just as easy to the end user. The third, and arguably most important issue is the need for reliable error recovery. With the DM model, error recovery would be done in userland. Errors generated during I/O would be kicked to a userland app that would then drive the recovery-spare activation-rebuild sequence. That's fine, but what if something happens that prevents the userland tool from running? Maybe it was a daemon that became idle and got swapped out to disk, but now you can't swap it back in because your I/O is failing. Or maybe it needs to activate a helper module or read a config file, but again it can't because i/o is failing. What if it crashes. What if the source code gets out of sync with the kernel interface. What if you upgrade glibc and it stops working for whatever unknown reason. 
Some have suggested in the past that these userland tools get put into ramfs and locked into memory. If you do that, then it might as well be part of the kernel anyways. It's consuming the same memory, if not more, than the equivalent code in the kernel (likely a lot more since you'd have to static link it). And you still have the downsides of it possibly getting out of date with the kernel. So what are the upsides? MD is not terribly heavy-weight. As a monolithic module of DDF+ASR+R0+R1 it's about 65k in size. That's 1/2 the size of your average SCSI driver these days, and no one is advocating putting those into userland. It just doesn't make sense to sacrifice reliability for the phantom goal of 'reducing kernel bloat'. Scott ^ permalink raw reply [flat|nested] 56+ messages in thread
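For reference, the "locked into memory" suggestion Scott mentions is ordinarily done with mlockall(), which is a real API; the device node and event loop below are hypothetical, so this is only a sketch of the shape such a userland monitor takes, not a claim that it settles the reliability argument.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	char event[256];
	int fd;

	/* Pin all current and future pages so the monitor cannot be
	 * swapped out while the arrays it watches are degraded. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
		perror("mlockall");
		return 1;
	}

	fd = open("/dev/mdctl", O_RDONLY);	/* hypothetical event source */
	if (fd < 0)
		return 1;

	/* Illustrative loop: block for an event, then drive spare
	 * activation and rebuild policy from userland. */
	while (read(fd, event, sizeof(event)) > 0) {
		/* ...parse event, activate spare, kick off rebuild... */
	}
	close(fd);
	return 0;
}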
* Re: "Enhanced" MD code avaible for review 2004-03-18 0:23 ` Scott Long @ 2004-03-18 1:55 ` Bartlomiej Zolnierkiewicz 2004-03-18 6:38 ` Stefan Smietanowski 2004-03-20 13:07 ` Arjan van de Ven 2 siblings, 0 replies; 56+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-18 1:55 UTC (permalink / raw) To: Scott Long Cc: Jeff Garzik, Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel On Thursday 18 of March 2004 01:23, Scott Long wrote: > Bartlomiej Zolnierkiewicz wrote: > > On Wednesday 17 of March 2004 22:18, Scott Long wrote: > > > Jeff Garzik wrote: > > > > Justin T. Gibbs wrote: > > > > > [ I tried sending this last night from my Adaptec email address > > > > > and have yet to see it on the list. Sorry if this is dup for any > > > > > of > > > > you. > > > > > > > ] > > > > > > > > Included linux-kernel in the CC (and also bounced this post there). > > > > > > > > > For the past few months, Adaptec Inc, has been working to > > > > enhance MD. > > > > > > The FAQ from several corners is going to be "why not DM?", so I > > > > would humbly request that you (or Scott Long) re-post some of that > > > > rationale here... > > > > This is #1 question so... why not DM? 8) > > > > Regards, > > Bartlomiej > > The primary feature of any RAID implementation is reliability. > Reliability is a surprisingly hard goal. Making sure that your > data is available and trustworthy under real-world scenarios is > a lot harder than it sounds. This has been a significant focus > of ours on MD, and is the primary reason why we chose MD as the > foundation of our work. Okay. > Storage is the foundation of everything that you do with your > computer. It needs to work regardless of what happened to your > filesystem on the last crash, regardless of whether or not you > have the latest initrd tools, regardless of what rpms you've kept > up to date on, regardless if your userland works, regardless of > what libc you are using this week, etc. I'm thinking about initrd+klibc not rpms+libc, fs is a lower level than DM - fs crash is not a problem here. > With DM, what happens when your initrd gets accidentally corrupted? The same what happens when your kernel image gets corrupted, probability is similar. > What happens when the kernel and userland pieces get out of sync? The same what happens when your kernel driver gets out of sync. > Maybe you are booting off of a single drive and only using DM arrays > for secondary storage, but maybe you're not. If something goes wrong > with DM, how do you boot? The same what happens when "something" wrong goes with kernel. > Secondly, our target here is to interoperate with hardware components > that run outside the scope of Linux. The HostRAID or DDF BIOS is > going to create an array using it's own format. It's not going to > have any knowledge of DM config files, initrd, ramfs, etc. However, It doesn't need any knowledge of config files, initrd, ramfs etc. > the end user is still going to expect to be able to seamlessly install > onto that newly created array, maybe move that array to another system, > whatever, and have it all Just Work. Has anyone heard of a hardware > RAID card that requires you to run OS-specific commands in order to > access the arrays on it? Of course not. The point here is to make > software raid just as easy to the end user. It won't require user to run any commands. RAID card gets detected and initialized -> hotplug event happens -> user-land configuration tools executed etc. 
> The third, and arguably most important issue is the need for reliable > error recovery. With the DM model, error recovery would be done in > userland. Errors generated during I/O would be kicked to a userland > app that would then drive the recovery-spare activation-rebuild > sequence. That's fine, but what if something happens that prevents > the userland tool from running? Maybe it was a daemon that became > idle and got swapped out to disk, but now you can't swap it back in > because your I/O is failing. Or maybe it needs to activate a helper > module or read a config file, but again it can't because i/o is I see valid points here but ramfs can be used etc. > failing. What if it crashes. What if the source code gets out of sync > with the kernel interface. What if you upgrade glibc and it stops > working for whatever unknown reason. glibc is not needed/recommend here. > Some have suggested in the past that these userland tools get put into > ramfs and locked into memory. If you do that, then it might as well be > part of the kernel anyways. It's consuming the same memory, if not > more, than the equivalent code in the kernel (likely a lot more since > you'd have to static link it). And you still have the downsides of it > possibly getting out of date with the kernel. So what are the upsides? Faster/easier development - user-space apps don't OOPS. :-) Somebody else than kernel people have to update user-land. :-) > MD is not terribly heavy-weight. As a monolithic module of > DDF+ASR+R0+R1 it's about 65k in size. That's 1/2 the size of your > average SCSI driver these days, and no one is advocating putting those SCSI driver is a low-level stuff - it needs direct hardware access. Even 65k is still a bloat - think about vendor kernel including support for all possible RAID flavors. If they are modular - they require initrd so may as well be put to user-land. > into userland. It just doesn't make sense to sacrifice reliability > for the phantom goal of 'reducing kernel bloat'. ATARAID drivers are just moving in this direction... ASR+DDF will also follow this way... sooner or later... Regards, Bartlomiej ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-18 0:23 ` Scott Long 2004-03-18 1:55 ` Bartlomiej Zolnierkiewicz @ 2004-03-18 6:38 ` Stefan Smietanowski 2004-03-20 13:07 ` Arjan van de Ven 2 siblings, 0 replies; 56+ messages in thread From: Stefan Smietanowski @ 2004-03-18 6:38 UTC (permalink / raw) To: Scott Long Cc: Bartlomiej Zolnierkiewicz, Jeff Garzik, Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel Hi. <snip beginning of discsussion about DDF, etc> > With DM, what happens when your initrd gets accidentally corrupted? > What happens when the kernel and userland pieces get out of sync? > Maybe you are booting off of a single drive and only using DM arrays > for secondary storage, but maybe you're not. If something goes wrong > with DM, how do you boot? Tell me something... Do you guys release a driver for WinXP as an example? You don't have to answer that really as it's obvious that you do. Do you in the installation program recompile the windows kernel so that your driver is monolithic? The answer is most presumably no - that's not how it's done there. Ok. Your example states "what if initrd gets corrupted" and my example is "what if you driver file(s) get corrupted?" and my example is equally important to a module in linux as it is a driver in windows. Now, since you do supply a windows driver and that driver is NOT statically linked to the windows kernel why is it that you believe a meta driver (which MD really is in a sense) needs special treatment (static linking into the kernel) when for instance a driver for a piece of hardware doesn't? If you have disk corruption so far that your initrd is corrupted I would seriously suggest NOT booting that OS that's on that drive regardless of anything else and sticking it in another box OR booting from rescue media of some sort. // Stefan ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-18 0:23 ` Scott Long 2004-03-18 1:55 ` Bartlomiej Zolnierkiewicz 2004-03-18 6:38 ` Stefan Smietanowski @ 2004-03-20 13:07 ` Arjan van de Ven 2004-03-21 23:42 ` Scott Long 2 siblings, 1 reply; 56+ messages in thread From: Arjan van de Ven @ 2004-03-20 13:07 UTC (permalink / raw) To: Scott Long Cc: Bartlomiej Zolnierkiewicz, Jeff Garzik, Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel [-- Attachment #1: Type: text/plain, Size: 1606 bytes --] > With DM, what happens when your initrd gets accidentally corrupted? What happens if your vmlinuz accidentally gets corrupted? If your initrd is toast the module for your root fs doesn't load either. Duh. > What happens when the kernel and userland pieces get out of sync? > Maybe you are booting off of a single drive and only using DM arrays > for secondary storage, but maybe you're not. If something goes wrong > with DM, how do you boot? If you loose 10 disks out of your raid array, how do you boot ? > > Secondly, our target here is to interoperate with hardware components > that run outside the scope of Linux. The HostRAID or DDF BIOS is > going to create an array using it's own format. It's not going to > have any knowledge of DM config files, DM doesn't need/use config files. > initrd, ramfs, etc. However, > the end user is still going to expect to be able to seamlessly install > onto that newly created array, maybe move that array to another system, > whatever, and have it all Just Work. Has anyone heard of a hardware > RAID card that requires you to run OS-specific commands in order to > access the arrays on it? Of course not. The point here is to make > software raid just as easy to the end user. And that is an easy task for distribution makers (or actually the people who make the initrd creation software). I'm sorry, I'm not buying your arguments and consider 100% the wrong direction. I'm hoping that someone with a bit more time than me will write the DDF device mapper target so that I can use it for my kernels... ;) [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-20 13:07 ` Arjan van de Ven @ 2004-03-21 23:42 ` Scott Long 2004-03-22 9:05 ` Arjan van de Ven 0 siblings, 1 reply; 56+ messages in thread From: Scott Long @ 2004-03-21 23:42 UTC (permalink / raw) To: arjanv Cc: Bartlomiej Zolnierkiewicz, Jeff Garzik, Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel Arjan van de Ven wrote: >>With DM, what happens when your initrd gets accidentally corrupted? > > > What happens if your vmlinuz accidentally gets corrupted? If your initrd > is toast the module for your root fs doesn't load either. Duh. The point here is to minimize points of failure. > > >>What happens when the kernel and userland pieces get out of sync? >>Maybe you are booting off of a single drive and only using DM arrays >>for secondary storage, but maybe you're not. If something goes wrong >>with DM, how do you boot? > > > If you loose 10 disks out of your raid array, how do you boot ? That's a silly statement and has nothing to do with the argument. > > >>Secondly, our target here is to interoperate with hardware components >>that run outside the scope of Linux. The HostRAID or DDF BIOS is >>going to create an array using it's own format. It's not going to >>have any knowledge of DM config files, > > > DM doesn't need/use config files. > >>initrd, ramfs, etc. However, >>the end user is still going to expect to be able to seamlessly install >>onto that newly created array, maybe move that array to another system, >>whatever, and have it all Just Work. Has anyone heard of a hardware >>RAID card that requires you to run OS-specific commands in order to >>access the arrays on it? Of course not. The point here is to make >>software raid just as easy to the end user. > > > And that is an easy task for distribution makers (or actually the people > who make the initrd creation software). > > I'm sorry, I'm not buying your arguments and consider 100% the wrong > direction. I'm hoping that someone with a bit more time than me will > write the DDF device mapper target so that I can use it for my > kernels... ;) > Well, code speaks louder than words, as this group loves to say. I eagerly await your code. Barring that, I eagerly await a technical argument, rather than an emotional "you're wrong because I'm right" argument. Scott ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-21 23:42 ` Scott Long @ 2004-03-22 9:05 ` Arjan van de Ven 2004-03-22 21:59 ` Scott Long 0 siblings, 1 reply; 56+ messages in thread From: Arjan van de Ven @ 2004-03-22 9:05 UTC (permalink / raw) To: Scott Long Cc: Bartlomiej Zolnierkiewicz, Jeff Garzik, Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel [-- Attachment #1: Type: text/plain, Size: 452 bytes --] On Mon, 2004-03-22 at 00:42, Scott Long wrote: > Well, code speaks louder than words, as this group loves to say. I > eagerly await your code. Barring that, I eagerly await a technical > argument, rather than an emotional "you're wrong because I'm right" > argument. I think that all the arguments for using DM are techinical arguments not emotional ones. oh well.. you're free to write your code I'm free to not use it in my kernels ;) [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-22 9:05 ` Arjan van de Ven @ 2004-03-22 21:59 ` Scott Long 2004-03-23 6:48 ` Arjan van de Ven 0 siblings, 1 reply; 56+ messages in thread From: Scott Long @ 2004-03-22 21:59 UTC (permalink / raw) To: arjanv Cc: Bartlomiej Zolnierkiewicz, Jeff Garzik, Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel Arjan van de Ven wrote: > On Mon, 2004-03-22 at 00:42, Scott Long wrote: > > >>Well, code speaks louder than words, as this group loves to say. I >>eagerly await your code. Barring that, I eagerly await a technical >>argument, rather than an emotional "you're wrong because I'm right" >>argument. > > > I think that all the arguments for using DM are techinical arguments not > emotional ones. oh well.. you're free to write your code I'm free to not > use it in my kernels ;) Ok, the technical arguments I've heard in favor of the DM approach is that it reduces kernel bloat. That fair, and I certainly agree with not putting the kitchen sink into the kernel. Our position on EMD is that it's a special case because you want to reduce the number of failure modes, and that it doesn't contribute in a significant way to the kernel size. Your response to that our arguments don't matter since your mind is already made up. That's the barrier I'm trying to break through and have a techincal discussion on. Scott ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-22 21:59 ` Scott Long @ 2004-03-23 6:48 ` Arjan van de Ven 0 siblings, 0 replies; 56+ messages in thread From: Arjan van de Ven @ 2004-03-23 6:48 UTC (permalink / raw) To: Scott Long Cc: Bartlomiej Zolnierkiewicz, Jeff Garzik, Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel [-- Attachment #1: Type: text/plain, Size: 1073 bytes --] On Mon, Mar 22, 2004 at 02:59:29PM -0700, Scott Long wrote: > >I think that all the arguments for using DM are techinical arguments not > >emotional ones. oh well.. you're free to write your code I'm free to not > >use it in my kernels ;) > > Ok, the technical arguments I've heard in favor of the DM approach is > that it reduces kernel bloat. That fair, and I certainly agree with not > putting the kitchen sink into the kernel. Our position on EMD is that > it's a special case because you want to reduce the number of failure > modes, and that it doesn't contribute in a significant way to the kernel > size. There are serveral dozen such formats as DDF, should those be put in too? And then the next step is built in multipathing or stacking or .. or .... And pretty soon you're back at the EVMS 1.0 situation. I see the general kernel direction be to move such autodetection to early userland (there's a reason DM and not EVMS1.0 is in the kernel, afaics even the EVMS guys now agree that this was the right move); EMD is a step in the opposite direction. [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-17 21:18 ` Scott Long 2004-03-17 21:35 ` Jeff Garzik 2004-03-17 21:45 ` Bartlomiej Zolnierkiewicz @ 2004-03-18 1:56 ` viro 2 siblings, 0 replies; 56+ messages in thread From: viro @ 2004-03-18 1:56 UTC (permalink / raw) To: Scott Long Cc: Jeff Garzik, Justin T. Gibbs, linux-raid, Gibbs, Justin, Linux Kernel On Wed, Mar 17, 2004 at 02:18:01PM -0700, Scott Long wrote: > >One overall comment on merging into 2.6: the patch will need to be > >broken up into pieces. It's OK if each piece is dependent on the prior > >one, and it's OK if there are 20, 30, even 100 pieces. It helps a lot > >for review to see the evolution, and it also helps flush out problems > >you might not have even noticed. e.g. > > - add concept of member, and related helper functions > > - use member functions/structs in raid drivers raid0.c, etc. > > - fix raid0 transform > > - add ioctls needed in order for DDF to be useful > > - add DDF format > > etc. > > > > We can provide our Perforce changelogs (just like we do for SCSI). TA: "you must submit a solution, not just an answer" CALC101 student: "but I've checked the answer, it's OK" TA: "I'm sorry, it's not enough" <student hands a pile of paper covered with snippets of text and calculations> Student: "All right, here are all notes I've made while solving the problem. Happy now?" TA: <exasperated sigh> "Not really" ^ permalink raw reply [flat|nested] 56+ messages in thread
Thread overview: 56+ messages
[not found] <1AOTW-4Vx-7@gated-at.bofh.it>
[not found] ` <1AOTW-4Vx-5@gated-at.bofh.it>
2004-03-18 1:33 ` "Enhanced" MD code avaible for review Andi Kleen
2004-03-18 2:00 ` Jeff Garzik
2004-03-20 9:58 ` Jamie Lokier
2004-03-19 20:19 Justin T. Gibbs
2004-03-23 5:05 ` Neil Brown
2004-03-23 6:23 ` Justin T. Gibbs
2004-03-24 2:26 ` Neil Brown
2004-03-24 19:09 ` Matt Domsch
2004-03-25 2:21 ` Jeff Garzik
2004-03-25 18:00 ` Kevin Corry
2004-03-25 18:42 ` Jeff Garzik
2004-03-25 18:48 ` Jeff Garzik
2004-03-25 23:46 ` Justin T. Gibbs
2004-03-26 0:01 ` Jeff Garzik
2004-03-26 0:10 ` Justin T. Gibbs
2004-03-26 0:14 ` Jeff Garzik
2004-03-25 22:04 ` Lars Marowsky-Bree
2004-03-26 19:19 ` Kevin Corry
2004-03-31 17:07 ` Randy.Dunlap
2004-03-25 23:35 ` Justin T. Gibbs
2004-03-26 0:13 ` Jeff Garzik
2004-03-26 17:43 ` Justin T. Gibbs
2004-03-28 0:06 ` Lincoln Dale
2004-03-30 17:54 ` Justin T. Gibbs
2004-03-28 0:30 ` Jeff Garzik
2004-03-26 19:15 ` Kevin Corry
2004-03-26 20:45 ` Justin T. Gibbs
2004-03-27 15:39 ` Kevin Corry
2004-03-30 17:03 ` Justin T. Gibbs
2004-03-30 17:15 ` Jeff Garzik
2004-03-30 17:35 ` Justin T. Gibbs
2004-03-30 17:46 ` Jeff Garzik
2004-03-30 18:04 ` Justin T. Gibbs
2004-03-30 21:47 ` Jeff Garzik
2004-03-30 22:12 ` Justin T. Gibbs
2004-03-30 22:34 ` Jeff Garzik
2004-03-30 18:11 ` Bartlomiej Zolnierkiewicz
2004-03-25 22:59 ` Justin T. Gibbs
2004-03-25 23:44 ` Lars Marowsky-Bree
2004-03-26 0:03 ` Justin T. Gibbs
-- strict thread matches above, loose matches on Subject: below --
2004-03-17 18:14 Justin T. Gibbs
2004-03-17 19:18 ` Jeff Garzik
2004-03-17 19:32 ` Christoph Hellwig
2004-03-17 20:02 ` Jeff Garzik
2004-03-17 21:18 ` Scott Long
2004-03-17 21:35 ` Jeff Garzik
2004-03-17 21:45 ` Bartlomiej Zolnierkiewicz
2004-03-18 0:23 ` Scott Long
2004-03-18 1:55 ` Bartlomiej Zolnierkiewicz
2004-03-18 6:38 ` Stefan Smietanowski
2004-03-20 13:07 ` Arjan van de Ven
2004-03-21 23:42 ` Scott Long
2004-03-22 9:05 ` Arjan van de Ven
2004-03-22 21:59 ` Scott Long
2004-03-23 6:48 ` Arjan van de Ven
2004-03-18 1:56 ` viro