* Re: "Enhanced" MD code avaible for review
@ 2004-03-19 20:19 Justin T. Gibbs
2004-03-23 5:05 ` Neil Brown
0 siblings, 1 reply; 38+ messages in thread
From: Justin T. Gibbs @ 2004-03-19 20:19 UTC (permalink / raw)
To: linux-raid; +Cc: linux-kernel
[ CC trimmed since all those on the CC line appear to be on the lists ... ]
Let's take a step back and focus on a few points on which we can
hopefully all agree:
o Any successful solution will have to have "meta-data modules" for
active arrays "core resident" in order to be robust. This
requirement stems from the need to avoid deadlock during error
recovery scenarios that must block "normal I/O" to the array while
meta-data operations take place.
o It is desirable for arrays to auto-assemble based on recorded
meta-data. This includes the ability to have a user hot-insert
a "cold spare", have the system recognize it as a spare (based
on the meta-data resident on it) and activate it if necessary to
restore a degraded array.
o Child devices of an array should only be accessible through the
array while the array is in a configured state (bd_claim'ed).
This avoids situations where a user can subvert the integrity of
the array by performing "rogue I/O" to an array member. (A sketch
of how such claiming might look follows this list.)
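For concreteness, claiming a member is essentially the 2.6
bd_claim()/bd_release() pair applied to the component's block_device.
The sketch below is illustrative rather than MD/EMD source (the
my_array type is invented for the example), and note, as comes up
later in the thread, that bd_claim() only excludes other exclusive
claimants such as mounts or other array drivers; it does not block
raw user-space opens of the device node.

  /*
   * Illustrative sketch only (not MD/EMD source): claim a member
   * device for the array so that other exclusive claimants
   * (filesystems mounting it, other array drivers) are refused.
   * "struct my_array" is invented for the example.
   */
  #include <linux/fs.h>

  struct my_array;    /* the array's in-core object, used as the holder */

  static int array_claim_member(struct my_array *array,
                                struct block_device *bdev)
  {
          /* a second bd_claim() with a different holder returns -EBUSY */
          return bd_claim(bdev, array);
  }

  static void array_release_member(struct block_device *bdev)
  {
          bd_release(bdev);
  }
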
Concentrating on just these three, we come to the conclusion that
whether the solution comes via "early user fs" or kernel modules,
the resident size of the solution *will* include the cost for
meta-data support. In either case, the user is able to tailor their
system to include only the support necessary for their individual
system to operate.
If we want to argue the merits of either approach based on just the
sheer size of resident code, I have little doubt that the kernel
module approach will prove smaller:
o No need for "mdadm" or some other daemon to be locked resident in
memory. This alone saves you from keeping a locked copy of klibc or
any other user libraries core resident. The kernel modules
leverage kernel APIs that already have to be core resident to
satisfy other parts of the kernel, which also helps keep their
size down.
o Initial RAM disk data can be discarded after modules are loaded at
boot time.
Putting the size argument aside for a moment, let's explore how a
userland solution could satisfy just the above three requirements.
How is meta-data updated on child members of an array while that
array is on-line? Remember that these operations occur with some
frequency. MD includes "safe-mode" support where redundant arrays
are marked clean any time writes cease for a predetermined, fairly
short, amount of time. The userland app cannot access the component
devices directly since they are bd_claim'ed. Even if that mechanism
is somehow subverted, how do we guarantee that these meta-data
writes do not cause a deadlock? In the case of a transition from
Read-only to Write mode, all writes are blocked to the array (this
must be the case for "Dirty" state to be accurate). It seems to
me that you must then provide extra code to not only pre-allocate
buffers for the userland app to do its work, but also provide a
"back-door" interface for these operations to take place.
The argument has also been made that shifting some of this code out
to a userland app "simplifies" the solution and perhaps even makes
it easier to develop. Comparing the two approaches we have:
UserFS:
o Kernel Driver + "enhanced interface to userland daemon"
o Userland Daemon (core resident)
o Userland Meta-Data modules
o Userland Management tool
- This tool needs to interface to the daemon and
perhaps also the kernel driver.
Kernel:
o Kernel RAID Transform Drivers
o Kernel Meta-Data modules
o Simple Userland Management tool with no meta-data knowledge
So two questions arise from this analysis:
1) Are meta-data modules easier to code up or more robust as user
or kernel modules? I believe that doing these outside the kernel
will make them larger and more complex while also losing the
ability to have meta-data modules weigh in on rapidly occurring
events without incurring performance tradeoffs. Regardless of
where they reside, these modules must be robust. A kernel Oops
or a segfault in the daemon is unacceptable to the end user.
Saying that a segfault is somehow less harmful than an Oops
when we're talking about the user's data completely misses the
point of why people use RAID.
2) What added complexity is incurred by supporting both a core
resident daemon as well as management interfaces to the daemon
and potentially the kernel module? I have not fully thought
through the corner cases such an approach would expose, so I
cannot quantify this cost. There are certainly more components
to get right and keep synchronized.
In the end, I find it hard to justify inventing all of the userland
machinery necessary to make this work just to keep roughly 2K
lines of code per meta-data module out of the kernel.
The ASR module, for example, which is only required by those who
need support for that meta-data type, is only 19K unstripped with
all of its debugging printks and code enabled. Are there benefits
to the userland approach that I'm missing?
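To give a feel for what such a per-format module covers, a pluggable
meta-data "personality" could boil down to an operations table along
these lines. Every name here is invented for illustration and is not
EMD's actual interface:

  /*
   * Purely illustrative: what a pluggable meta-data personality might
   * register with the core array driver.  All names are invented for
   * the example; this is not EMD's real interface.
   */
  struct my_array;
  struct block_device;

  struct metadata_ops {
          const char *name;                          /* e.g. "ddf" or "asr" */
          int  (*probe)(struct block_device *bdev);  /* recognize on-disk records */
          int  (*assemble)(struct my_array *a);      /* build in-core topology */
          int  (*mark_dirty)(struct my_array *a);    /* before the first write */
          int  (*mark_clean)(struct my_array *a);    /* safe-mode idle update */
          int  (*member_failed)(struct my_array *a, int member);
          int  (*activate_spare)(struct my_array *a, int member);
          int  (*add_member)(struct my_array *a, struct block_device *bdev);
  };

  int register_metadata(struct metadata_ops *ops);   /* hypothetical core hooks */
  void unregister_metadata(struct metadata_ops *ops);
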
--
Justin
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: "Enhanced" MD code avaible for review 2004-03-19 20:19 "Enhanced" MD code avaible for review Justin T. Gibbs @ 2004-03-23 5:05 ` Neil Brown 2004-03-23 6:23 ` Justin T. Gibbs 0 siblings, 1 reply; 38+ messages in thread From: Neil Brown @ 2004-03-23 5:05 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-raid, linux-kernel On Friday March 19, gibbs@scsiguy.com wrote: > [ CC trimmed since all those on the CC line appear to be on the lists ... ] > > Lets take a step back and focus on a few of the points to which we can > hopefully all agree: > > o Any successful solution will have to have "meta-data modules" for > active arrays "core resident" in order to be robust. This > requirement stems from the need to avoid deadlock during error > recovery scenarios that must block "normal I/O" to the array while > meta-data operations take place. I agree. 'Linear' and 'raid0' arrays don't really need metadata support in the kernel as their metadata is essentially read-only. There are interesting applications for raid1 without metadata, but I think that for all raid personalities where metadata might need to be updated in an error condition to preserve data integrity, the kernel should know enough about the metadata to perform that update. It would be nice to keep the in-kernel knowledge to a minimum, though some metadata formats probably make that hard. > > o It is desirable for arrays to auto-assemble based on recorded > meta-data. This includes the ability to have a user hot-insert > a "cold spare", have the system recognize it as a spare (based > on the meta-data resident on it) and activate it if necessary to > restore a degraded array. Certainly. It doesn't follow that the auto-assembly has to happen within the kernel. Having it all done in user-space makes it much easier to control/configure. I think the best way to describe my attitude to auto-assembly is that it could be needs-driven rather than availability-driven. needs-driven means: if the user asks to access an array that doesn't exist, then try to find the bits and assemble it. availability driven means: find all the devices that could be part of an array, and combine as many of them as possible together into arrays. Currently filesystems are needs-driven. At boot time, only to root filesystem, which has been identified somehow, gets mounted. Then the init scripts mount any others that are needed. We don't have any hunting around for filesystem superblocks and mounting the filesystems just in case they are needed. Currently partitions are (sufficiently) needs-driven. It is true that any partitionable devices has it's partitions presented. However the existence of partitions does not affect access to the whole device at all. Only once the partitions are claimed is the whole-device blocked. Providing that auto-assembly of arrays works the same way (needs driven), I am happy for arrays to auto-assemble. I happen to think this most easily done in user-space. With DDF format metadata, there is a concept of 'imported' arrays, which basically means arrays from some other controller that have been attached to the current controller. Part of my desire for needs-driven assembly is that I don't want to inadvertently assemble 'imported' arrays. A DDF controller has NVRAM or a hardcoded serial number to help avoid this. A generic Linux machine doesn't. 
I could possibly be happy with auto-assembly where a kernel parameter of DDF=xx.yy.zz was taken to mean that we "need" to assemble all DDF arrays that have a controler-id (or whatever it is) of xx.yy.zz. This is probably simple enough to live entirely in the kernel. > > o Child devices of an array should only be accessible through the > array while the array is in a configured state (bd_claim'ed). > This avoids situations where a user can subvert the integrity of > the array by performing "rogue I/O" to an array member. bd_claim doesn't and (I believe) shouldn't stop access from user-space. It does stop a number of sorts of access that would expect exclusive access. But back to your original post: I suspect there is lots of valuable stuff in your emd patch, but as you have probably gathered, big patches are not the way we work around here, and with good reason. If you would like to identify isolated pieces of functionality, create patches to implement them, and submit them for review I will be quite happy to review them and, when appropriate, forward them to Andrew/Linus. I suggest you start with less controversial changes and work your way forward. NeilBrown ^ permalink raw reply [flat|nested] 38+ messages in thread
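For what it's worth, the boot-parameter idea Neil floats above really
is small in kernel terms. A hedged sketch (the parameter and variable
names are invented for the example, and the code is not taken from
any existing driver) might be no more than:

  /*
   * Sketch of the "DDF=xx.yy.zz" boot-parameter idea -- hypothetical
   * code.  It simply records the controller identity so that
   * auto-assembly can be limited to arrays whose metadata carries it.
   */
  #include <linux/init.h>
  #include <linux/string.h>

  static char wanted_ddf_id[64];   /* controller id we may auto-assemble */

  static int __init ddf_id_setup(char *str)
  {
          strlcpy(wanted_ddf_id, str, sizeof(wanted_ddf_id));
          return 1;
  }
  __setup("ddf=", ddf_id_setup);
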
* Re: "Enhanced" MD code avaible for review 2004-03-23 5:05 ` Neil Brown @ 2004-03-23 6:23 ` Justin T. Gibbs 2004-03-24 2:26 ` Neil Brown 0 siblings, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-23 6:23 UTC (permalink / raw) To: Neil Brown; +Cc: linux-raid, linux-kernel >> o Any successful solution will have to have "meta-data modules" for >> active arrays "core resident" in order to be robust. This ... > I agree. > 'Linear' and 'raid0' arrays don't really need metadata support in the > kernel as their metadata is essentially read-only. > There are interesting applications for raid1 without metadata, but I > think that for all raid personalities where metadata might need to be > updated in an error condition to preserve data integrity, the kernel > should know enough about the metadata to perform that update. > > It would be nice to keep the in-kernel knowledge to a minimum, though > some metadata formats probably make that hard. Can you further explain why you want to limit the kernel's knowledge and where you would separate the roles between kernel and userland? In reviewing one of our typical metadata modules, perhaps 80% of the code is generic meta-data record parsing and state conversion logic that would have to be retained in the kernel to perform "minimal meta-data updates". Some high portion of this 80% (less the portion that builds the in-kernel data structures to manipulate and update meta-data) would also need to be replicated into a user-land utility for any type of separation of labor to be possible. The remaining 20% of the kernel code deals with validation of user meta-data creation requests. This code is relatively small since it leverages all of the other routines that are already required for the operational requirements of the module. Splitting the roles bring up some important issues: 1) Code duplication. Depending on the complexity of the meta-data format being supported, the amount of code duplication between userland and kernel modules may be quite large. Any time code is duplicated, the solution is prone to getting out of sync - bugs are fixed in one copy of the code but not another. 2) Solution Complexity Two entities understand how to read and manipulate the meta-data. Policies and APIs must be created to ensure that only one entity is performing operations on the meta-data at a time. This is true even if one entity is primarily a read-only "client". For example, a meta-data module may defer meta-data updates in some instances (e.g. rebuild checkpointing) until the meta-data is closed (writing the checkpoint sooner doesn't make sense considering that you should restart your scrub, rebuild or verify if the system is not safely shutdown). How does the userland client get the most up-to-date information? This is just one of the problems in this area. 3) Size Due to code duplication, the total solution will be larger in code size. What benefits of operating in userland outweigh these issues? >> o It is desirable for arrays to auto-assemble based on recorded >> meta-data. This includes the ability to have a user hot-insert >> a "cold spare", have the system recognize it as a spare (based >> on the meta-data resident on it) and activate it if necessary to >> restore a degraded array. > > Certainly. It doesn't follow that the auto-assembly has to happen > within the kernel. Having it all done in user-space makes it much > easier to control/configure. 
> > I think the best way to describe my attitude to auto-assembly is that > it could be needs-driven rather than availability-driven. > > needs-driven means: if the user asks to access an array that doesn't > exist, then try to find the bits and assemble it. > availability driven means: find all the devices that could be part of > an array, and combine as many of them as possible together into > arrays. > > Currently filesystems are needs-driven. At boot time, only to root > filesystem, which has been identified somehow, gets mounted. > Then the init scripts mount any others that are needed. > We don't have any hunting around for filesystem superblocks and > mounting the filesystems just in case they are needed. Are filesystems the correct analogy? Consider that a user's attempt to mount a filesystem by label requires that all of the "block devices" that might contain that filesystem be enumerated automatically by the system. In this respect, the system is treating an MD device in exactly the same way as a SCSI or IDE disk. The array must be exported to the system on an "availability basis" in order for the "needs-driven" features of the system to behave as expected. > Currently partitions are (sufficiently) needs-driven. It is true that > any partitionable devices has it's partitions presented. However the > existence of partitions does not affect access to the whole device at > all. Only once the partitions are claimed is the whole-device > blocked. This seems a slight digression from your earlier argument. Is your concern that the arrays are auto-enumerated, or that the act of enumerating them prevents the component devices from being accessed (due to bd_clam)? > Providing that auto-assembly of arrays works the same way (needs > driven), I am happy for arrays to auto-assemble. > I happen to think this most easily done in user-space. I don't know how to reconcile a needs based approach with system features that require arrays to be exported as soon as they are detected. > With DDF format metadata, there is a concept of 'imported' arrays, > which basically means arrays from some other controller that have been > attached to the current controller. > > Part of my desire for needs-driven assembly is that I don't want to > inadvertently assemble 'imported' arrays. > A DDF controller has NVRAM or a hardcoded serial number to help avoid > this. A generic Linux machine doesn't. > > I could possibly be happy with auto-assembly where a kernel parameter > of DDF=xx.yy.zz was taken to mean that we "need" to assemble all DDF > arrays that have a controler-id (or whatever it is) of xx.yy.zz. > > This is probably simple enough to live entirely in the kernel. The concept of "importing" an array doesn't really make sense in the case of MD's DDF. To fully take advantage of features like a controller BIOS's ability to natively boot an array, the disks for that domain must remain in that controller's domain. Determining the domain to assign to new arrays will require input from the user since there is limited topology information available to MD. The user will also have the ability to assign newly created arrays to the "MD Domain" which is not tied to any particular controller domain. ... > But back to your original post: I suspect there is lots of valuable > stuff in your emd patch, but as you have probably gathered, big > patches are not the way we work around here, and with good reason. 
> > If you would like to identify isolated pieces of functionality, create > patches to implement them, and submit them for review I will be quite > happy to review them and, when appropriate, forward them to > Andrew/Linus. > I suggest you start with less controversial changes and work your way > forward. One suggestion that was recently raised was to present these changes in the form of an alternate "EMD" driver to avoid any potential breakage of the existing MD. Do you have any opinion on this? -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-23 6:23 ` Justin T. Gibbs @ 2004-03-24 2:26 ` Neil Brown 2004-03-24 19:09 ` Matt Domsch 2004-03-25 2:21 ` Jeff Garzik 0 siblings, 2 replies; 38+ messages in thread From: Neil Brown @ 2004-03-24 2:26 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-raid, linux-kernel On Monday March 22, gibbs@scsiguy.com wrote: > >> o Any successful solution will have to have "meta-data modules" for > >> active arrays "core resident" in order to be robust. This > > ... > > > I agree. > > 'Linear' and 'raid0' arrays don't really need metadata support in the > > kernel as their metadata is essentially read-only. > > There are interesting applications for raid1 without metadata, but I > > think that for all raid personalities where metadata might need to be > > updated in an error condition to preserve data integrity, the kernel > > should know enough about the metadata to perform that update. > > > > It would be nice to keep the in-kernel knowledge to a minimum, though > > some metadata formats probably make that hard. > > Can you further explain why you want to limit the kernel's knowledge > and where you would separate the roles between kernel and userland? General caution. It is generally harder the change mistakes in the kernel than it is to change mistakes in userspace, and similarly it is easer to add functionality and configurability in userspace. A design that puts the control in userspace is therefore preferred. A design that ties you to working through a narrow user-kernel interface is disliked. A design that gives easy control to user-space, and allows the kernel to do simple things simply is probably best. I'm not particularly concerned with code size and code duplication. A clean, expressive design is paramount. > 2) Solution Complexity > > Two entities understand how to read and manipulate the meta-data. > Policies and APIs must be created to ensure that only one entity > is performing operations on the meta-data at a time. This is true > even if one entity is primarily a read-only "client". For example, > a meta-data module may defer meta-data updates in some instances > (e.g. rebuild checkpointing) until the meta-data is closed (writing > the checkpoint sooner doesn't make sense considering that you should > restart your scrub, rebuild or verify if the system is not safely > shutdown). How does the userland client get the most up-to-date > information? This is just one of the problems in this area. If the kernel and userspace both need to know about metadata, then the design must make clear how they communicate. > > > Currently partitions are (sufficiently) needs-driven. It is true that > > any partitionable devices has it's partitions presented. However the > > existence of partitions does not affect access to the whole device at > > all. Only once the partitions are claimed is the whole-device > > blocked. > > This seems a slight digression from your earlier argument. Is your > concern that the arrays are auto-enumerated, or that the act of enumerating > them prevents the component devices from being accessed (due to > bd_clam)? Primarily the latter. But also that the act of enumerating them may cause an update to an underlying devices (e.g. metadata update or resync). That is what I am particularly uncomfortable about. > > > Providing that auto-assembly of arrays works the same way (needs > > driven), I am happy for arrays to auto-assemble. > > I happen to think this most easily done in user-space. 
> > I don't know how to reconcile a needs based approach with system > features that require arrays to be exported as soon as they are > detected. > Maybe if arrays were auto-assembled in a read-only mode that guaranteed not to write to the devices *at*all* and did not bd_claim them. When they are needed (either though some explicit set-writable command or through an implicit first-write) then the underlying components are bd_claimed. If that succeeds, the array becomes "live". If it fails, it stays read-only. > > > But back to your original post: I suspect there is lots of valuable > > stuff in your emd patch, but as you have probably gathered, big > > patches are not the way we work around here, and with good reason. > > > > If you would like to identify isolated pieces of functionality, create > > patches to implement them, and submit them for review I will be quite > > happy to review them and, when appropriate, forward them to > > Andrew/Linus. > > I suggest you start with less controversial changes and work your way > > forward. > > One suggestion that was recently raised was to present these changes > in the form of an alternate "EMD" driver to avoid any potential > breakage of the existing MD. Do you have any opinion on this? Choice is good. Competition is good. I would not try to interfere with you creating a new "emd" driver that didn't interfere with "md". What Linus would think of it I really don't know. It is certainly not impossible that he would accept it. However I'm not sure that having three separate device-array systems (dm, md, emd) is actually a good idea. It would probably be really good to unite md and dm somehow, but no-one seems really keen on actually doing the work. I seriously think the best long-term approach for your emd work is to get it integrated into md. I do listen to reason and I am not completely head-strong, but I do have opinions, and you would need to put in the effort to convincing me. NeilBrown ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-24 2:26 ` Neil Brown @ 2004-03-24 19:09 ` Matt Domsch 2004-03-25 2:21 ` Jeff Garzik 1 sibling, 0 replies; 38+ messages in thread From: Matt Domsch @ 2004-03-24 19:09 UTC (permalink / raw) To: Neil Brown; +Cc: Justin T. Gibbs, linux-raid, linux-kernel On Wed, Mar 24, 2004 at 01:26:47PM +1100, Neil Brown wrote: > On Monday March 22, gibbs@scsiguy.com wrote: > > One suggestion that was recently raised was to present these changes > > in the form of an alternate "EMD" driver to avoid any potential > > breakage of the existing MD. Do you have any opinion on this? > > I seriously think the best long-term approach for your emd work is to > get it integrated into md. I do listen to reason and I am not > completely head-strong, but I do have opinions, and you would need to > put in the effort to convincing me. I completely agree that long-term, md and emd need to be the same. However, watching the pain that the IDE changes took in early 2.5, I'd like to see emd be merged alongside md for the short-term while the kinks get worked out, keeping in mind the desire to merge them together again soon as that happens. Thanks, Matt -- Matt Domsch Sr. Software Engineer, Lead Engineer Dell Linux Solutions linux.dell.com & www.dell.com/linux Linux on Dell mailing lists @ http://lists.us.dell.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-24 2:26 ` Neil Brown 2004-03-24 19:09 ` Matt Domsch @ 2004-03-25 2:21 ` Jeff Garzik 2004-03-25 18:00 ` Kevin Corry 1 sibling, 1 reply; 38+ messages in thread From: Jeff Garzik @ 2004-03-25 2:21 UTC (permalink / raw) To: Neil Brown; +Cc: Justin T. Gibbs, linux-raid, linux-kernel Neil Brown wrote: > Choice is good. Competition is good. I would not try to interfere > with you creating a new "emd" driver that didn't interfere with "md". > What Linus would think of it I really don't know. It is certainly not > impossible that he would accept it. Agreed. Independent DM efforts have already started supporting MD raid0/1 metadata from what I understand, though these efforts don't seem to post to linux-kernel or linux-raid much at all. :/ > However I'm not sure that having three separate device-array systems > (dm, md, emd) is actually a good idea. It would probably be really > good to unite md and dm somehow, but no-one seems really keen on > actually doing the work. I would be disappointed if all the work that has gone into the MD driver is simply obsoleted by new DM targets. Particularly RAID 1/5/6. You pretty much echoed my sentiments exactly... ideally md and dm can be bound much more tightly to each other. For example, convert md's raid[0156].c into device mapper targets... but indeed, nobody has stepped up to do that so far. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 2:21 ` Jeff Garzik @ 2004-03-25 18:00 ` Kevin Corry 2004-03-25 18:42 ` Jeff Garzik 2004-03-25 22:59 ` Justin T. Gibbs 0 siblings, 2 replies; 38+ messages in thread From: Kevin Corry @ 2004-03-25 18:00 UTC (permalink / raw) To: linux-kernel; +Cc: Jeff Garzik, Neil Brown, Justin T. Gibbs, linux-raid On Wednesday 24 March 2004 8:21 pm, Jeff Garzik wrote: > Neil Brown wrote: > > Choice is good. Competition is good. I would not try to interfere > > with you creating a new "emd" driver that didn't interfere with "md". > > What Linus would think of it I really don't know. It is certainly not > > impossible that he would accept it. > > Agreed. > > Independent DM efforts have already started supporting MD raid0/1 > metadata from what I understand, though these efforts don't seem to post > to linux-kernel or linux-raid much at all. :/ I post on lkml.....occasionally. :) I'm guessing you're referring to EVMS in that comment, since we have done *part* of what you just described. EVMS has always had a plugin to recognize MD devices, and has been using the MD driver for quite some time (along with using Device-Mapper for non-MD stuff). However, as of our most recent release (earlier this month), we switched to using Device-Mapper for MD RAID-linear and RAID-0 devices. Device-Mapper has always had a "linear" and a "striped" module (both required to support LVM volumes), and it was a rather trivial exercise to switch to activating these RAID devices using DM instead of MD. This decision was not based on any real dislike of the MD driver, but rather for the benefits that are gained by using Device-Mapper. In particular, Device-Mapper provides the ability to change out the device mapping on the fly, by temporarily suspending I/O, changing the table, and resuming the I/O I'm sure many of you know this already. But I'm not sure everyone fully understands how powerful a feature this is. For instance, it means EVMS can now expand RAID-linear devices online. While that particular example may not sound all that exciting, if things like RAID-1 and RAID-5 were "ported" to Device-Mapper, this feature would then allow you to do stuff like add new "active" members to a RAID-1 online (think changing from 2-way mirror to 3-way mirror). It would be possible to convert from RAID-0 to RAID-4 online simply by adding a new disk (assuming other limitations, e.g. a single stripe-zone). Unfortunately, these are things the MD driver can't do online, because you need to completely stop the MD device before making such changes (to prevent the kernel and user-space from trampling on the same metadata), and MD won't stop the device if it's open (i.e. if it's mounted or if you have other device (LVM) built on top of MD). Often times this means you need to boot to a rescue-CD to make these types of configuration changes. As for not posting this information on lkml and/or linux-raid, I do apologize if this is something you would like to have been informed of. Most of the recent mentions of EVMS on this list seem to fall on deaf ears, so I've taken that to mean the folks on the list aren't terribly interested in EVMS developments. And since EVMS is a completely user-space tool and this decision didn't affect any kernel components, I didn't think it was really relevent to mention here. We usually discuss such things on evms-devel@lists.sf.net or dm-devel@redhat.com, but I'll be happy to cross-post to lkml more often if it's something that might be pertinent. 
> > However I'm not sure that having three separate device-array systems > > (dm, md, emd) is actually a good idea. It would probably be really > > good to unite md and dm somehow, but no-one seems really keen on > > actually doing the work. > > I would be disappointed if all the work that has gone into the MD driver > is simply obsoleted by new DM targets. Particularly RAID 1/5/6. > > You pretty much echoed my sentiments exactly... ideally md and dm can > be bound much more tightly to each other. For example, convert md's > raid[0156].c into device mapper targets... but indeed, nobody has > stepped up to do that so far. We're obviously pretty keen on seeing MD and Device-Mapper "merge" at some point in the future, primarily for some of the reasons I mentioned above. Obviously linear.c and raid0.c don't really need to be ported. DM provides equivalent functionality, the discovery/activation can be driven from user-space, and no in-kernel status updating is necessary (unlike RAID-1 and -5). And we've talked for a long time about wanting to port RAID-1 and RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't started on any such work, or even had any significant discussions about *how* to do it. I can't imagine we would try this without at least involving Neil and other folks from linux-raid, since it would be nice to actually reuse as much of the existing MD code as possible (especially for RAID-5 and -6). I have no desire to try to rewrite those from scratch. Device-Mapper does currently contain a mirroring module (still just in Joe's -udm tree), which has primarily been used to provide online-move functionality in LVM2 and EVMS. They've recently added support for persistent logs, so it's possible for a mirror to survive a reboot. Of course, MD RAID-1 has some additional requirements for updating status in its superblock at runtime. I'd hope that in porting RAID-1 to DM, the core of the DM mirroring module could still be used, with the possibility of either adding MD-RAID-1-specific information to the persistent-log module, or simply as an additional log type. So, if this is the direction everyone else would like to see MD and DM take, we'd be happy to help out. -- Kevin Corry kevcorry@us.ibm.com http://evms.sourceforge.net/ ^ permalink raw reply [flat|nested] 38+ messages in thread
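For readers unfamiliar with the mechanism Kevin refers to, the
suspend / reload / resume cycle is exposed through libdevmapper
roughly as below; it is the same sequence that dmsetup drives. Error
handling is trimmed, and the device name, sizes and target parameters
are made up for the example.

  /*
   * Sketch of suspending a device-mapper device, loading a
   * replacement table, and resuming it -- the sequence that allows a
   * mapping to be changed online.  Names, sizes and parameters are
   * invented.
   */
  #include <stdint.h>
  #include <libdevmapper.h>

  static int run_task(int type, const char *name,
                      uint64_t start, uint64_t len, const char *params)
  {
          struct dm_task *dmt;
          int r;

          if (!(dmt = dm_task_create(type)))
                  return 0;
          dm_task_set_name(dmt, name);
          if (params)             /* only the RELOAD step carries a table */
                  dm_task_add_target(dmt, start, len, "linear", params);
          r = dm_task_run(dmt);
          dm_task_destroy(dmt);
          return r;
  }

  int main(void)
  {
          /* quiesce: new I/O is queued, in-flight I/O is flushed */
          run_task(DM_DEVICE_SUSPEND, "vol0", 0, 0, NULL);

          /* load a replacement table; growing a linear device would
           * add further dm_task_add_target() calls for new segments */
          run_task(DM_DEVICE_RELOAD, "vol0", 0, 2097152, "/dev/sdb 0");

          /* switch to the new table and let I/O continue */
          run_task(DM_DEVICE_RESUME, "vol0", 0, 0, NULL);
          return 0;
  }
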
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:00 ` Kevin Corry @ 2004-03-25 18:42 ` Jeff Garzik 2004-03-25 18:48 ` Jeff Garzik ` (3 more replies) 2004-03-25 22:59 ` Justin T. Gibbs 1 sibling, 4 replies; 38+ messages in thread From: Jeff Garzik @ 2004-03-25 18:42 UTC (permalink / raw) To: Kevin Corry; +Cc: linux-kernel, Neil Brown, Justin T. Gibbs, linux-raid Kevin Corry wrote: > I'm guessing you're referring to EVMS in that comment, since we have done > *part* of what you just described. EVMS has always had a plugin to recognize > MD devices, and has been using the MD driver for quite some time (along with > using Device-Mapper for non-MD stuff). However, as of our most recent release > (earlier this month), we switched to using Device-Mapper for MD RAID-linear > and RAID-0 devices. Device-Mapper has always had a "linear" and a "striped" > module (both required to support LVM volumes), and it was a rather trivial > exercise to switch to activating these RAID devices using DM instead of MD. nod > This decision was not based on any real dislike of the MD driver, but rather > for the benefits that are gained by using Device-Mapper. In particular, > Device-Mapper provides the ability to change out the device mapping on the > fly, by temporarily suspending I/O, changing the table, and resuming the I/O > I'm sure many of you know this already. But I'm not sure everyone fully > understands how powerful a feature this is. For instance, it means EVMS can > now expand RAID-linear devices online. While that particular example may not [...] Sounds interesting but is mainly an implementation detail for the purposes of this discussion... Some of this emd may want to use, for example. > As for not posting this information on lkml and/or linux-raid, I do apologize > if this is something you would like to have been informed of. Most of the > recent mentions of EVMS on this list seem to fall on deaf ears, so I've taken > that to mean the folks on the list aren't terribly interested in EVMS > developments. And since EVMS is a completely user-space tool and this > decision didn't affect any kernel components, I didn't think it was really > relevent to mention here. We usually discuss such things on > evms-devel@lists.sf.net or dm-devel@redhat.com, but I'll be happy to > cross-post to lkml more often if it's something that might be pertinent. Understandable... for the stuff that impacts MD some mention of the work, on occasion, to linux-raid and/or linux-kernel would be useful. I'm mainly looking at it from a standpoint of making sure that all the various RAID efforts are not independent of each other. > We're obviously pretty keen on seeing MD and Device-Mapper "merge" at some > point in the future, primarily for some of the reasons I mentioned above. > Obviously linear.c and raid0.c don't really need to be ported. DM provides > equivalent functionality, the discovery/activation can be driven from > user-space, and no in-kernel status updating is necessary (unlike RAID-1 and > -5). And we've talked for a long time about wanting to port RAID-1 and RAID-5 > (and now RAID-6) to Device-Mapper targets, but we haven't started on any such > work, or even had any significant discussions about *how* to do it. I can't let's have that discussion :) > imagine we would try this without at least involving Neil and other folks > from linux-raid, since it would be nice to actually reuse as much of the > existing MD code as possible (especially for RAID-5 and -6). 
I have no desire > to try to rewrite those from scratch. <cheers> > Device-Mapper does currently contain a mirroring module (still just in Joe's > -udm tree), which has primarily been used to provide online-move > functionality in LVM2 and EVMS. They've recently added support for persistent > logs, so it's possible for a mirror to survive a reboot. Of course, MD RAID-1 > has some additional requirements for updating status in its superblock at > runtime. I'd hope that in porting RAID-1 to DM, the core of the DM mirroring > module could still be used, with the possibility of either adding > MD-RAID-1-specific information to the persistent-log module, or simply as an > additional log type. WRT specific implementation, I would hope for the reverse -- that the existing, known, well-tested MD raid1 code would be used. But perhaps that's a naive impression... Folks with more knowledge of the implementation can make that call better than I. I'd like to focus on the "additional requirements" you mention, as I think that is a key area for consideration. There is a certain amount of metadata that -must- be updated at runtime, as you recognize. Over and above what MD already cares about, DDF and its cousins introduce more items along those lines: event logs, bad sector logs, controller-level metadata... these are some of the areas I think Justin/Scott are concerned about. My take on things... the configuration of RAID arrays got a lot more complex with DDF and "host RAID" in general. Association of RAID arrays based on specific hardware controllers. Silently building RAID0+1 stacked arrays out of non-RAID block devices the kernel presents. Failing over when one of the drives the kernel presents does not respond. All that just screams "do it in userland". OTOH, once the devices are up and running, kernel needs update some of that configuration itself. Hot spare lists are an easy example, but any time the state of the overall RAID array changes, some host RAID formats, more closely tied to hardware than MD, may require configuration metadata changes when some hardware condition(s) change. I respectfully disagree with the EMD folks that a userland approach is impossible, given all the failure scenarios. In a userland approach, there -will- be some duplicated metadata-management code between userland and the kernel. But for configuration _above_ the single-raid-array level, I think that's best left to userspace. There will certainly be a bit of intra-raid-array management code in the kernel, including configuration updating. I agree to its necessity... but that doesn't mean that -all- configuration/autorun stuff needs to be in the kernel. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:42 ` Jeff Garzik @ 2004-03-25 18:48 ` Jeff Garzik 2004-03-25 23:46 ` Justin T. Gibbs 2004-03-25 22:04 ` Lars Marowsky-Bree ` (2 subsequent siblings) 3 siblings, 1 reply; 38+ messages in thread From: Jeff Garzik @ 2004-03-25 18:48 UTC (permalink / raw) To: linux-kernel; +Cc: Kevin Corry, Neil Brown, Justin T. Gibbs, linux-raid Jeff Garzik wrote: > My take on things... the configuration of RAID arrays got a lot more > complex with DDF and "host RAID" in general. Association of RAID arrays > based on specific hardware controllers. Silently building RAID0+1 > stacked arrays out of non-RAID block devices the kernel presents. > Failing over when one of the drives the kernel presents does not respond. > > All that just screams "do it in userland". Just so there is no confusion... the "failing over...in userland" thing I mention is _only_ during discovery of the root disk. Similar code would need to go into the bootloader, for controllers that do not present the entire RAID array as a faked BIOS INT drive. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:48 ` Jeff Garzik @ 2004-03-25 23:46 ` Justin T. Gibbs 2004-03-26 0:01 ` Jeff Garzik 0 siblings, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-25 23:46 UTC (permalink / raw) To: Jeff Garzik, linux-kernel; +Cc: Kevin Corry, Neil Brown, linux-raid > Jeff Garzik wrote: > > Just so there is no confusion... the "failing over...in userland" thing I > mention is _only_ during discovery of the root disk. None of the solutions being talked about perform "failing over" in userland. The RAID transforms which perform this operation are kernel resident in DM, MD, and EMD. Perhaps you are talking about spare activation and rebuild? > Similar code would need to go into the bootloader, for controllers that do > not present the entire RAID array as a faked BIOS INT drive. None of the solutions presented here are attempting to make RAID transforms operate from the boot loader environment without BIOS support. I see this as a completely tangental problem to what is being discussed. -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 23:46 ` Justin T. Gibbs @ 2004-03-26 0:01 ` Jeff Garzik 2004-03-26 0:10 ` Justin T. Gibbs 0 siblings, 1 reply; 38+ messages in thread From: Jeff Garzik @ 2004-03-26 0:01 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-kernel, Kevin Corry, Neil Brown, linux-raid Justin T. Gibbs wrote: >>Jeff Garzik wrote: >> >>Just so there is no confusion... the "failing over...in userland" thing I >>mention is _only_ during discovery of the root disk. > > > None of the solutions being talked about perform "failing over" in > userland. The RAID transforms which perform this operation are kernel > resident in DM, MD, and EMD. Perhaps you are talking about spare > activation and rebuild? This is precisely why I sent the second email, and made the qualification I did :) For a "do it in userland" solution, an initrd or initramfs piece examines the system configuration, and assembles physical disks into RAID arrays based on the information it finds. I was mainly implying that an initrd solution would have to provide some primitive failover initially, before the kernel is bootstrapped... much like a bootloader that supports booting off a RAID1 array would need to do. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 0:01 ` Jeff Garzik @ 2004-03-26 0:10 ` Justin T. Gibbs 2004-03-26 0:14 ` Jeff Garzik 0 siblings, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-26 0:10 UTC (permalink / raw) To: Jeff Garzik; +Cc: linux-kernel, Kevin Corry, Neil Brown, linux-raid >> None of the solutions being talked about perform "failing over" in >> userland. The RAID transforms which perform this operation are kernel >> resident in DM, MD, and EMD. Perhaps you are talking about spare >> activation and rebuild? > > This is precisely why I sent the second email, and made the qualification > I did :) > > For a "do it in userland" solution, an initrd or initramfs piece examines > the system configuration, and assembles physical disks into RAID arrays > based on the information it finds. I was mainly implying that an initrd > solution would have to provide some primitive failover initially, before > the kernel is bootstrapped... much like a bootloader that supports booting > off a RAID1 array would need to do. "Failover" (i.e. redirecting a read to a viable member) will not occur via userland at all. The initrd solution just has to present all available members to the kernel interface performing the RAID transform. There is no need for "special failover handling" during bootstrap in either case. -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 0:10 ` Justin T. Gibbs @ 2004-03-26 0:14 ` Jeff Garzik 0 siblings, 0 replies; 38+ messages in thread From: Jeff Garzik @ 2004-03-26 0:14 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: linux-kernel, Kevin Corry, Neil Brown, linux-raid Justin T. Gibbs wrote: >>>None of the solutions being talked about perform "failing over" in >>>userland. The RAID transforms which perform this operation are kernel >>>resident in DM, MD, and EMD. Perhaps you are talking about spare >>>activation and rebuild? >> >>This is precisely why I sent the second email, and made the qualification >>I did :) >> >>For a "do it in userland" solution, an initrd or initramfs piece examines >>the system configuration, and assembles physical disks into RAID arrays >>based on the information it finds. I was mainly implying that an initrd >>solution would have to provide some primitive failover initially, before >>the kernel is bootstrapped... much like a bootloader that supports booting >>off a RAID1 array would need to do. > > > "Failover" (i.e. redirecting a read to a viable member) will not occur > via userland at all. The initrd solution just has to present all available > members to the kernel interface performing the RAID transform. There > is no need for "special failover handling" during bootstrap in either > case. hmmm, yeah, agreed. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:42 ` Jeff Garzik 2004-03-25 18:48 ` Jeff Garzik @ 2004-03-25 22:04 ` Lars Marowsky-Bree 2004-03-26 19:19 ` Kevin Corry 2004-03-25 23:35 ` Justin T. Gibbs 2004-03-26 19:15 ` Kevin Corry 3 siblings, 1 reply; 38+ messages in thread From: Lars Marowsky-Bree @ 2004-03-25 22:04 UTC (permalink / raw) To: Jeff Garzik, Kevin Corry Cc: linux-kernel, Neil Brown, Justin T. Gibbs, linux-raid On 2004-03-25T13:42:12, Jeff Garzik <jgarzik@pobox.com> said: > >and -5). And we've talked for a long time about wanting to port RAID-1 and > >RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't started > >on any such work, or even had any significant discussions about *how* to > >do it. I can't > let's have that discussion :) Nice 2.7 material, and parts I've always wanted to work on. (Including making the entire partition scanning user-space on top of DM too.) KS material? > My take on things... the configuration of RAID arrays got a lot more > complex with DDF and "host RAID" in general. And then add all the other stuff, like scenarios where half of your RAID is "somewhere" on the network via nbd, iSCSI or whatever and all the other possible stackings... Definetely user-space material, and partly because it /needs/ to have the input from the volume managers to do the sane things. The point about this implying that the superblock parsing/updating logic needs to be duplicated between userspace and kernel land is valid too though, and I'm keen on resolving this in a way which doesn't suck... Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs | try again. fail again. fail better. Research & Development, SUSE LINUX AG \ -- Samuel Beckett - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 22:04 ` Lars Marowsky-Bree @ 2004-03-26 19:19 ` Kevin Corry 2004-03-31 17:07 ` Randy.Dunlap 0 siblings, 1 reply; 38+ messages in thread From: Kevin Corry @ 2004-03-26 19:19 UTC (permalink / raw) To: linux-kernel Cc: Lars Marowsky-Bree, Jeff Garzik, Neil Brown, Justin T. Gibbs, linux-raid On Thursday 25 March 2004 4:04 pm, Lars Marowsky-Bree wrote: > On 2004-03-25T13:42:12, > > Jeff Garzik <jgarzik@pobox.com> said: > > >and -5). And we've talked for a long time about wanting to port RAID-1 > > > and RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't > > > started on any such work, or even had any significant discussions about > > > *how* to do it. I can't > > > > let's have that discussion :) > > Nice 2.7 material, and parts I've always wanted to work on. (Including > making the entire partition scanning user-space on top of DM too.) Couldn't agree more. Whether using EVMS or kpartx or some other tool, I think we've already proved this is possible. We really only need to work on making early-userspace a little easier to use. > KS material? Sounds good to me. -- Kevin Corry kevcorry@us.ibm.com http://evms.sourceforge.net/ ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 19:19 ` Kevin Corry @ 2004-03-31 17:07 ` Randy.Dunlap 0 siblings, 0 replies; 38+ messages in thread From: Randy.Dunlap @ 2004-03-31 17:07 UTC (permalink / raw) To: Kevin Corry; +Cc: linux-kernel, lmb, jgarzik, neilb, gibbs, linux-raid On Fri, 26 Mar 2004 13:19:28 -0600 Kevin Corry wrote: | On Thursday 25 March 2004 4:04 pm, Lars Marowsky-Bree wrote: | > On 2004-03-25T13:42:12, | > | > Jeff Garzik <jgarzik@pobox.com> said: | > > >and -5). And we've talked for a long time about wanting to port RAID-1 | > > > and RAID-5 (and now RAID-6) to Device-Mapper targets, but we haven't | > > > started on any such work, or even had any significant discussions about | > > > *how* to do it. I can't | > > | > > let's have that discussion :) | > | > Nice 2.7 material, and parts I've always wanted to work on. (Including | > making the entire partition scanning user-space on top of DM too.) | | Couldn't agree more. Whether using EVMS or kpartx or some other tool, I think | we've already proved this is possible. We really only need to work on making | early-userspace a little easier to use. | | > KS material? | | Sounds good to me. Ditto. I didn't see much conclusion to this thread, other than Neil's good suggestions. (maybe on some other list that I don't read?) I wouldn't want this or any other projects to have to wait for the kernel summit. Email has worked well for many years...let's try to keep it working. :) -- ~Randy "You can't do anything without having to do something else first." -- Belefant's Law ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:42 ` Jeff Garzik 2004-03-25 18:48 ` Jeff Garzik 2004-03-25 22:04 ` Lars Marowsky-Bree @ 2004-03-25 23:35 ` Justin T. Gibbs 2004-03-26 0:13 ` Jeff Garzik 2004-03-26 19:15 ` Kevin Corry 3 siblings, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-25 23:35 UTC (permalink / raw) To: Jeff Garzik, Kevin Corry; +Cc: linux-kernel, Neil Brown, linux-raid > I respectfully disagree with the EMD folks that a userland approach is > impossible, given all the failure scenarios. I've never said that it was impossible, just unwise. I believe that a userland approach offers no benefit over allowing the kernel to perform all meta-data operations. The end result of such an approach (given feature and robustness parity with the EMD solution) is a larger resident side, code duplication, and more complicated configuration/management interfaces. -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 23:35 ` Justin T. Gibbs @ 2004-03-26 0:13 ` Jeff Garzik 2004-03-26 17:43 ` Justin T. Gibbs 0 siblings, 1 reply; 38+ messages in thread From: Jeff Garzik @ 2004-03-26 0:13 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid Justin T. Gibbs wrote: >>I respectfully disagree with the EMD folks that a userland approach is >>impossible, given all the failure scenarios. > > > I've never said that it was impossible, just unwise. I believe > that a userland approach offers no benefit over allowing the kernel > to perform all meta-data operations. The end result of such an > approach (given feature and robustness parity with the EMD solution) > is a larger resident side, code duplication, and more complicated > configuration/management interfaces. There is some code duplication, yes. But the right userspace solution does not have a larger RSS, and has _less_ complicated management interfaces. A key benefit of "do it in userland" is a clear gain in flexibility, simplicity, and debuggability (if that's a word). But it's hard. It requires some deep thinking. It's a whole lot easier to do everything in the kernel -- but that doesn't offer you the protections of userland, particularly separate address spaces from the kernel, and having to try harder to crash the kernel. :) Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 0:13 ` Jeff Garzik @ 2004-03-26 17:43 ` Justin T. Gibbs 2004-03-28 0:06 ` Lincoln Dale 2004-03-28 0:30 ` Jeff Garzik 0 siblings, 2 replies; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-26 17:43 UTC (permalink / raw) To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid >>> I respectfully disagree with the EMD folks that a userland approach is >>> impossible, given all the failure scenarios. >> >> >> I've never said that it was impossible, just unwise. I believe >> that a userland approach offers no benefit over allowing the kernel >> to perform all meta-data operations. The end result of such an >> approach (given feature and robustness parity with the EMD solution) >> is a larger resident side, code duplication, and more complicated >> configuration/management interfaces. > > There is some code duplication, yes. But the right userspace solution > does not have a larger RSS, and has _less_ complicated management > interfaces. > > A key benefit of "do it in userland" is a clear gain in flexibility, > simplicity, and debuggability (if that's a word). This is just as much hand waving as, 'All that just screams "do it in userland".' <sigh> I posted a rather detailed, technical, analysis of what I believe would be required to make this work correctly using a userland approach. The only response I've received is from Neil Brown. Please, point out, in a technical fashion, how you would address the feature set being proposed: o Rebuilds o Auto-array enumeration o Meta-data updates for topology changes (failed members, spare activation) o Meta-data updates for "safe mode" o Array creation/deletion o "Hot member addition" Only then can a true comparative analysis of which solution is "less complex", "more maintainable", and "smaller" be performed. > But it's hard. It requires some deep thinking. It's a whole lot easier > to do everything in the kernel -- but that doesn't offer you the > protections of userland, particularly separate address spaces from the > kernel, and having to try harder to crash the kernel. :) A crash in any component of a RAID solution that prevents automatic failover and rebuilds without customer intervention is unacceptable. Whether it crashes your kernel or not is really not that important other than the customer will probably notice that their data is no longer protected *sooner* if the system crashes. In other-words, the solution must be *correct* regardless of where it resides. Saying that doing a portion of it in userland allows it to safely be buggier seems a very strange argument. -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 17:43 ` Justin T. Gibbs @ 2004-03-28 0:06 ` Lincoln Dale 2004-03-30 17:54 ` Justin T. Gibbs 2004-03-28 0:30 ` Jeff Garzik 1 sibling, 1 reply; 38+ messages in thread From: Lincoln Dale @ 2004-03-28 0:06 UTC (permalink / raw) To: Justin T. Gibbs Cc: Jeff Garzik, Kevin Corry, linux-kernel, Neil Brown, linux-raid At 03:43 AM 27/03/2004, Justin T. Gibbs wrote: >I posted a rather detailed, technical, analysis of what I believe would >be required to make this work correctly using a userland approach. The >only response I've received is from Neil Brown. Please, point out, in >a technical fashion, how you would address the feature set being proposed: i'll have a go. your position is one of "put it all in the kernel". Jeff, Neil, Kevin et al is one of "it can live in userspace". to that end, i agree with the userspace approach. the way i personally believe that it SHOULD happen is that you tie your metadata format (and RAID format, if its different to others) into DM. you boot up using an initrd where you can start some form of userspace management daemon from initrd. you can have your binary (userspace) tools started from initrd which can populate the tables for all disks/filesystems, including pivoting to a new root filesystem if need-be. the only thing your BIOS/int13h redirection needs to do is be able to provide sufficient information to be capable of loading the kernel and the initial ramdisk. perhaps that means that you guys could provide enhancements to grub/lilo if they are insufficient for things like finding a secondary copy of initrd/vmlinuz. (if such issues exist, wouldn't it be better to do things the "open source way" and help improve the overall tools, if the end goal ends up being the same: enabling YOUR system to work better?) moving forward, perhaps initrd will be deprecated in favour of initramfs - but until then, there isn't any downside to this approach that i can see. with all this in mind, and the basic premise being that as a minimum, the kernel has booted, and initrd is working then answering your other points: > o Rebuilds userspace is running. rebuilds are simply a process of your userspace tools recognising that there are disk groups in a inconsistent state, and don't bring them online, but rather, do whatever is necessary to rebuild them. nothing says that you cannot have a KERNEL-space 'helper' to help do the rebuild.. > o Auto-array enumeration your userspace tool can receive notification (via udev/hotplug) when new disks/devices appear. from there, your userspace tool can read whatever metadata exists on the disk, and use that to enumerate whatever block devices exist. perhaps DM needs some hooks to be able to do this - but i believe that the DM v4 ioctls cover this already. > o Meta-data updates for topology changes (failed members, spare activation) a failed member may be as a result of a disk being pulled out. for such an event, udev/hotplug should tell your userspace daemon. a failed member may be as a result of lots of I/O errors. perhaps there is work needed in the linux block layer to indicate some form of hotplug event such as 'excessive errors', perhaps its something needed in the DM layer. in either case, it isn't out of the question that userspace can be notified. for a "spare activation", once again, that can be done entirely from userspace. > o Meta-data updates for "safe mode" seems implementation specific to me. 
> o Array creation/deletion

the short answer here is "how does one create or remove DM/LVM/MD
partitions today?"
it certainly isn't in the kernel ...

> o "Hot member addition"

this should also be possible today. i haven't looked too closely at
whether there are sufficient interfaces for quiescence of I/O or not -
but once again, if not, why not implement something that can be used
for all?

>Only then can a true comparative analysis of which solution is "less
>complex", "more maintainable", and "smaller" be performed.

there may be less lines of code involved in "entirely in kernel" for
YOUR hardware -- but what about when 4 other storage vendors come out
with such a card?

what if someone wants to use your card in conjunction with the storage
being multipathed or replicated automatically?
what about when someone wants to create snapshots for backups?

all that functionality has to then go into your EMD driver. Adaptec may
decide all that is too hard -- at which point, your product may become
obsolete as the storage paradigms have moved beyond what your EMD driver
is capable of.

if you could tie it into DM -- which i believe to be the de facto path
forward for lots of this cool functionality -- you gain this kind of
functionality gratis -- or at least with minimal effort to integrate.

better yet, Linux as a whole benefits from your involvement -- your
time/effort isn't put into something specific to your hardware -- but
rather your time/effort is put into something that can be used by all.

this conversation really sounds like the same one you had with James
about the SCSI Mid layer and why you just have to bypass items there
and do your own proprietary things.

in summary, i don't believe you should be focussing on a short-term
view of "but its more lines of code", but rather a more big-picture
view of "overall, there will be LESS lines of code" and "it will fit
better into the overall device-mapper/block-remapper functionality"
within the kernel.

cheers,

lincoln.

^ permalink raw reply [flat|nested] 38+ messages in thread
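[A rough, illustrative sketch of the hotplug-driven assembly flow described
above; it is not from the thread. "emd_scan" and its options are invented
stand-ins for whatever userland metadata tool would do the recognition;
only dmsetup is a real command, and the /sbin/hotplug calling convention
(subsystem in $1, ACTION/DEVPATH in the environment) is assumed.]

  #!/bin/sh
  # Hypothetical /sbin/hotplug handler: assemble an array when a new
  # block device appears.
  [ "$1" = "block" ] || exit 0
  [ "$ACTION" = "add" ] || exit 0

  DEV="/dev/${DEVPATH##*/}"            # e.g. /block/sdc -> /dev/sdc

  # Ask the (hypothetical) metadata scanner whether this disk completes
  # an array it knows about; exit quietly if it does not.
  ARRAY=$(emd_scan --owning-array "$DEV") || exit 0

  # The scanner emits a device-mapper table for the assembled array;
  # dmsetup create reads the table from stdin when no file is given.
  emd_scan --emit-table "$ARRAY" | dmsetup create "$ARRAY"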
* Re: "Enhanced" MD code avaible for review 2004-03-28 0:06 ` Lincoln Dale @ 2004-03-30 17:54 ` Justin T. Gibbs 0 siblings, 0 replies; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-30 17:54 UTC (permalink / raw) To: Lincoln Dale Cc: Jeff Garzik, Kevin Corry, linux-kernel, Neil Brown, linux-raid > At 03:43 AM 27/03/2004, Justin T. Gibbs wrote: >> I posted a rather detailed, technical, analysis of what I believe would >> be required to make this work correctly using a userland approach. The >> only response I've received is from Neil Brown. Please, point out, in >> a technical fashion, how you would address the feature set being proposed: > > i'll have a go. > > your position is one of "put it all in the kernel". > Jeff, Neil, Kevin et al is one of "it can live in userspace". Please don't misrepresent or over simplify my statements. What I have said is that meta-data reading and writing should occur in only one place. Since, as has already been acknowledged by many, meta-data updates are required in the kernel, that means this support should be handled in the kernel. Any other solution adds complexity and size to the solution. > to that end, i agree with the userspace approach. > the way i personally believe that it SHOULD happen is that you tie > your metadata format (and RAID format, if its different to others) into DM. Saying how you think something should happen without any technical argument for it, doesn't help me to understand the benefits of your approach. ... > perhaps that means that you guys could provide enhancements to grub/lilo > if they are insufficient for things like finding a secondary copy of > initrd/vmlinuz. (if such issues exist, wouldn't it be better to do things > the "open source way" and help improve the overall tools, if the end goal > ends up being the same: enabling YOUR system to work better?) I don't understand your argument. We have improved an already existing opensource driver to provide this functionality. This is not the OpenSource way? > then answering your other points: Again, you have presented strategies that may or may not work, but no technical arguments for their superiority over placing meta-data in the kernel. > there may be less lines of code involved in "entirely in kernel" for YOUR > hardware -- but what about when 4 other storage vendors come out with such > a card? There will be less lines of code total for any vendor that decides to add a new meta-data type. All the vendor has to do is provide a meta-data module. There are no changes to the userland utilities (they know nothing about specific meta-data formats), to the RAID transform modules, or to the core of EMD. If this were not the case, there would be little point to the EMD work. > what if someone wants to use your card in conjunction with the storage > being multipathed or replicated automatically? > what about when someone wants to create snapshots for backups? > > all that functionality has to then go into your EMD driver. No. DM already works on any block device exported to the kernel. EMD exports its devices as block devices. Thus, all of the DM functionality you are talking about is also available for EMD. -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 17:43 ` Justin T. Gibbs 2004-03-28 0:06 ` Lincoln Dale @ 2004-03-28 0:30 ` Jeff Garzik 1 sibling, 0 replies; 38+ messages in thread From: Jeff Garzik @ 2004-03-28 0:30 UTC (permalink / raw) To: Justin T. Gibbs; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid Justin T. Gibbs wrote: > o Rebuilds > 90% kernel, AFAICS, otherwise you have races with requests that the driver is actively satisfying > o Auto-array enumeration userspace > o Meta-data updates for "safe mode" unsure of the definition of safe mode > o Array creation/deletion of entire arrays? can mostly be done in userspace, but deletion also needs to update controller-wide metadata, which might be stored on active arrays. > o "Hot member addition" userspace prepares, kernel completes [moved this down in your list] > o Meta-data updates for topology changes (failed members, spare activation) [warning: this is a tangent from the userspace sub-thread/topic] the kernel, of course, must manage topology, otherwise things Don't Get Done, and requests don't do where they should. :) Part of the value of device mapper is that it provides container objects for multi-disk groups, and a common method of messing around with those container objects. You clearly recognized the same need in emd... but I don't think we want two different pieces of code doing the same basic thing. I do think that metadata management needs to be fairly cleanly separately (I like what emd did, there) such that a user needs three in-kernel pieces: * device mapper * generic raid1 engine * personality module "personality" would be where the specifics of the metadata management lived, and it would be responsible for handling the specifics of non-hot-path events that nonetheless still need to be in the kernel. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 18:42 ` Jeff Garzik ` (2 preceding siblings ...) 2004-03-25 23:35 ` Justin T. Gibbs @ 2004-03-26 19:15 ` Kevin Corry 2004-03-26 20:45 ` Justin T. Gibbs 3 siblings, 1 reply; 38+ messages in thread From: Kevin Corry @ 2004-03-26 19:15 UTC (permalink / raw) To: linux-kernel; +Cc: Jeff Garzik, Neil Brown, Justin T. Gibbs, linux-raid On Thursday 25 March 2004 12:42 pm, Jeff Garzik wrote: > > We're obviously pretty keen on seeing MD and Device-Mapper "merge" at > > some point in the future, primarily for some of the reasons I mentioned > > above. Obviously linear.c and raid0.c don't really need to be ported. DM > > provides equivalent functionality, the discovery/activation can be driven > > from user-space, and no in-kernel status updating is necessary (unlike > > RAID-1 and -5). And we've talked for a long time about wanting to port > > RAID-1 and RAID-5 (and now RAID-6) to Device-Mapper targets, but we > > haven't started on any such work, or even had any significant discussions > > about *how* to do it. I can't > > let's have that discussion :) Great! Where do we begin? :) > I'd like to focus on the "additional requirements" you mention, as I > think that is a key area for consideration. > > There is a certain amount of metadata that -must- be updated at runtime, > as you recognize. Over and above what MD already cares about, DDF and > its cousins introduce more items along those lines: event logs, bad > sector logs, controller-level metadata... these are some of the areas I > think Justin/Scott are concerned about. I'm sure these things could be accomodated within DM. Nothing in DM prevents having some sort of in-kernel metadata knowledge. In fact, other DM modules already do - dm-snapshot and the above mentioned dm-mirror both need to do some amount of in-kernel status updating. But I see this as completely separate from in-kernel device discovery (which we seem to agree is the wrong direction). And IMO, well designed metadata will make this "split" very obvious, so it's clear which parts of the metadata the kernel can use for status, and which parts are purely for identification (which the kernel thus ought to be able to ignore). The main point I'm trying to get across here is that DM provides a simple yet extensible kernel framework for a variety of storage management tasks, including a lot more than just RAID. I think it would be a huge benefit for the RAID drivers to make use of this framework to provide functionality beyond what is currently available. > My take on things... the configuration of RAID arrays got a lot more > complex with DDF and "host RAID" in general. Association of RAID arrays > based on specific hardware controllers. Silently building RAID0+1 > stacked arrays out of non-RAID block devices the kernel presents. By this I assume you mean RAID devices that don't contain any type of on-disk metadata (e.g. MD superblocks). I don't see this as a huge hurdle. As long as the device drivers (SCIS, IDE, etc) export the necessary identification info through sysfs, user-space tools can contain the policies necessary to allow them to detect which disks belong together in a RAID device, and then tell the kernel to activate said RAID device. This sounds a lot like how Christophe Varoqui has been doing things in his new multipath tools. > Failing over when one of the drives the kernel presents does not respond. > > All that just screams "do it in userland". 
> > OTOH, once the devices are up and running, kernel needs update some of > that configuration itself. Hot spare lists are an easy example, but any > time the state of the overall RAID array changes, some host RAID > formats, more closely tied to hardware than MD, may require > configuration metadata changes when some hardware condition(s) change. Certainly. Of course, I see things like adding and removing hot-spares and removing stale/faulty disks as something that can be driven from user-space. For example, for adding a new hot-spare, with DM it's as simple as loading a new mapping that contains the new disk, then telling DM to switch the device mapping (which implies a suspend/resume of I/O). And if necessary, such a user-space tool can be activated by hotplug events triggered by the insertion of a new disk into the system, making the process effectively transparent to the user. -- Kevin Corry kevcorry@us.ibm.com http://evms.sourceforge.net/ ^ permalink raw reply [flat|nested] 38+ messages in thread
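[To make the table-reload mechanism described above concrete, a minimal
sketch, not from the thread. Device names and sizes are made up, and the
mirror-target table syntax shown follows later device-mapper
documentation, so details may well differ from the 2004 development tree.]

  # Assumed: an existing 8388608-sector (4 GiB) 2-way mirror named "mir0"
  # on /dev/sdb1 and /dev/sdc1; /dev/sdd1 is the newly inserted spare.
  # Mirror table format (per later dm docs):
  #   start len mirror <log_type> <#log_args> <log_args...> <#legs> <dev> <off> ...
  dmsetup suspend mir0     # quiesce and flush outstanding I/O
  dmsetup reload mir0 --table \
    "0 8388608 mirror core 1 1024 3 /dev/sdb1 0 /dev/sdc1 0 /dev/sdd1 0"
  dmsetup resume mir0      # switch to the new 3-leg table

The table switch itself is the easy part; whether the new leg is then
treated as out-of-sync and recovered correctly is exactly the bookkeeping
that the rest of this sub-thread argues about.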
* Re: "Enhanced" MD code avaible for review 2004-03-26 19:15 ` Kevin Corry @ 2004-03-26 20:45 ` Justin T. Gibbs 2004-03-27 15:39 ` Kevin Corry 0 siblings, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-26 20:45 UTC (permalink / raw) To: Kevin Corry, linux-kernel; +Cc: Jeff Garzik, Neil Brown, linux-raid >> There is a certain amount of metadata that -must- be updated at runtime, >> as you recognize. Over and above what MD already cares about, DDF and >> its cousins introduce more items along those lines: event logs, bad >> sector logs, controller-level metadata... these are some of the areas I >> think Justin/Scott are concerned about. > > I'm sure these things could be accommodated within DM. Nothing in DM prevents > having some sort of in-kernel metadata knowledge. In fact, other DM modules > already do - dm-snapshot and the above mentioned dm-mirror both need to do > some amount of in-kernel status updating. But I see this as completely > separate from in-kernel device discovery (which we seem to agree is the wrong > direction). And IMO, well designed metadata will make this "split" very > obvious, so it's clear which parts of the metadata the kernel can use for > status, and which parts are purely for identification (which the kernel thus > ought to be able to ignore). We don't have control over the meta-data formats being used by the industry. Coming up with a solution that only works for "Linux Engineered Meta-data formats" removes any possibility of supporting things like DDF, Adaptec ASR, and a host of other meta-data formats that can be plugged into things like EMD. In the two cases we are supporting today with EMD, the records required for doing discovery reside in the same sectors as those that need to be updated at runtime from some "in-core" context. > The main point I'm trying to get across here is that DM provides a simple yet > extensible kernel framework for a variety of storage management tasks, > including a lot more than just RAID. I think it would be a huge benefit for > the RAID drivers to make use of this framework to provide functionality > beyond what is currently available. DM is a transform layer that has the ability to pause I/O while that transform is updated from userland. That's all it provides. As such, it is perfectly suited to some types of logical volume management applications. But that is as far as it goes. It does not have any support for doing "sync/resync/scrub" type operations or any generic support for doing anything with meta-data. In all of the examples you have presented so far, you have not explained how this part of the equation is handled. Sure, adding a member to a RAID1 is trivial. Just pause the I/O, update the transform, and let it go. Unfortunately, that new member is not in sync with the rest. The transform must be aware of this and only trust the member below the sync mark. How is this information communicated to the transform? Who updates the sync mark? Who copies the data to the new member while guaranteeing that an in-flight write does not occur to the area being synced? If you intend to add all of this to DM, then it is no longer any "simpler" or more extensible than EMD. Don't take my arguments the wrong way. I believe that DM is useful for what it was designed for: LVM. It does not, however, provide the machinery required for it to replace a generic RAID stack. Could you merge a RAID stack into DM. Sure. Its only software. 
But for it to be robust, the same types of operations MD/EMD perform in kernel space will have to be done there too. The simplicity of DM is part of why it is compelling. My belief is that merging RAID into DM will compromise this simplicity and divert DM from what it was designed to do - provide LVM transforms. As for RAID discovery, this is the trivial portion of RAID. For an extra 10% or less of code in a meta-data module, you get RAID discovery. You also get a single point of access to the meta-data, avoid duplicated code, and complex kernel/user interfaces. There seems to be a consistent feeling that it is worth compromising all of these benefits just to push this 10% of the meta-data handling code out of the kernel (and inflate it by 5 or 6 X duplicating code already in the kernel). Where are the benefits of this userland approach? -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-26 20:45 ` Justin T. Gibbs @ 2004-03-27 15:39 ` Kevin Corry 2004-03-28 9:11 ` [dm-devel] " christophe varoqui 2004-03-30 17:03 ` Justin T. Gibbs 0 siblings, 2 replies; 38+ messages in thread From: Kevin Corry @ 2004-03-27 15:39 UTC (permalink / raw) To: linux-kernel, Justin T. Gibbs Cc: Jeff Garzik, Neil Brown, linux-raid, dm-devel On Friday 26 March 2004 2:45 pm, Justin T. Gibbs wrote: > We don't have control over the meta-data formats being used by the > industry. Coming up with a solution that only works for "Linux Engineered > Meta-data formats" removes any possibility of supporting things like DDF, > Adaptec ASR, and a host of other meta-data formats that can be plugged into > things like EMD. In the two cases we are supporting today with EMD, the > records required for doing discovery reside in the same sectors as those > that need to be updated at runtime from some "in-core" context. Well, there's certainly no guarantee that the "industry" will get it right. In this case, it seems that they didn't. But even given that we don't have ideal metadata formats, it's still possible to do discovery and a number of other management tasks from user-space. > > The main point I'm trying to get across here is that DM provides a simple > > yet extensible kernel framework for a variety of storage management > > tasks, including a lot more than just RAID. I think it would be a huge > > benefit for the RAID drivers to make use of this framework to provide > > functionality beyond what is currently available. > > DM is a transform layer that has the ability to pause I/O while that > transform is updated from userland. That's all it provides. I think the DM developers would disagree with you on this point. > As such, > it is perfectly suited to some types of logical volume management > applications. But that is as far as it goes. It does not have any > support for doing "sync/resync/scrub" type operations or any generic > support for doing anything with meta-data. The core DM driver would not and should not be handling these operations. These are handled in modules specific to one type of mapping. There's no need for the DM core to know anything about any metadata. If one particular module (e.g. dm-mirror) needs to support one or more metadata formats, it's free to do so. On the other hand, DM *does* provide services that make "sync/resync" a great deal simpler for such a module. It provides simple services for performing synchronous or asynchronous I/O to pages or vm areas. It provides a service for performing copies from one block-device area to another. The dm-mirror module uses these for this very purpose. If we need additional "libraries" for common RAID tasks (e.g. parity calculations) we can certainly add them. > In all of the examples you > have presented so far, you have not explained how this part of the equation > is handled. Sure, adding a member to a RAID1 is trivial. Just pause the > I/O, update the transform, and let it go. Unfortunately, that new member > is not in sync with the rest. The transform must be aware of this and only > trust the member below the sync mark. How is this information communicated > to the transform? Who updates the sync mark? Who copies the data to the > new member while guaranteeing that an in-flight write does not occur to the > area being synced? 
Before the new disk is added to the raid1, user-space is responsible for
writing an initial state to that disk, effectively marking it as completely
dirty and unsynced. When the new table is loaded, part of the "resume" is
for the module to read any metadata and do any initial setup that's
necessary. In this particular example, it means the new disk would start
with all of its "regions" marked "dirty", and all the regions would need to
be synced from corresponding "clean" regions on another disk in the set.

If the previously-existing disks were part-way through a sync when the
table was switched, their metadata would indicate where the current "sync
mark" was located. The module could then continue the sync from where it
left off, including the new disk that was just added. When the sync
completed, it might have to scan back to the beginning of the new disk to
see if it had any remaining dirty regions that needed to be synced before
that disk was completely clean.

And of course the I/O-mapping path just has to be smart enough to know
which regions are dirty and avoid sending live I/O to those.

(And I'm sure Joe or Alasdair could provide a better in-depth explanation
of the current dm-mirror module than I'm trying to. This is obviously a
very high-level overview.)

This process is somewhat similar to how dm-snapshot works. If it reads an
empty header structure, it assumes it's a new snapshot, and starts with an
empty hash table. If it reads a previously existing header, it continues
to read the on-disk COW tables and constructs the necessary in-memory
hash-table to represent that initial state.

> If you intend to add all of this to DM, then it is no
> longer any "simpler" or more extensible than EMD.

Sure it is. Because very little (if any) of this needs to affect the core
DM driver, that core remains as simple and extensible as it currently is.
The extra complexity only really affects the new modules that would handle
RAID.

> Don't take my arguments the wrong way. I believe that DM is useful
> for what it was designed for: LVM. It does not, however, provide the
> machinery required for it to replace a generic RAID stack. Could
> you merge a RAID stack into DM. Sure. Its only software. But for
> it to be robust, the same types of operations MD/EMD perform in kernel
> space will have to be done there too.
>
> The simplicity of DM is part of why it is compelling. My belief is that
> merging RAID into DM will compromise this simplicity and divert DM from
> what it was designed to do - provide LVM transforms.

I disagree. The simplicity of the core DM driver really isn't at stake
here. We're only talking about adding a few relatively complex target
modules. And with DM you get the benefit of a very simple user/kernel
interface.

> As for RAID discovery, this is the trivial portion of RAID. For an extra
> 10% or less of code in a meta-data module, you get RAID discovery. You
> also get a single point of access to the meta-data, avoid duplicated code,
> and complex kernel/user interfaces. There seems to be a consistent feeling
> that it is worth compromising all of these benefits just to push this 10%
> of the meta-data handling code out of the kernel (and inflate it by 5 or
> 6 X duplicating code already in the kernel). Where are the benefits of
> this userland approach?

I've got to admit, this whole discussion is very ironic. Two years ago I
was exactly where you are today, pushing for in-kernel discovery, a variety
of metadata modules, internal opaque device stacking, etc, etc.
I can only imagine that hch is laughing his ass off now that I'm the one arguing for moving all this stuff to user-space. I don't honestly expect to suddenly change your mind on all these issues. A lot of work has obviously gone into EMD, and I definitely know how hard it can be when the community isn't greeting your suggestions with open arms. And I'm certainly not saying the EMD method isn't a potentially viable approach. But it doesn't seem to be the approach the community is looking for. We faced the same resistance two years ago. It took months of arguing with the community and arguing amongst ourselves before we finally decided to move EVMS to user-space and use MD and DM. It was a decision that meant essentially throwing away an enormous amount of work from several people. It was an incredibly hard choice, but I really believe now that it was the right decision. It was the direction the community wanted to move in, and the only way for our project to truely survive was to move with them. So feel free to continue to develop and promote EMD. I'm not trying to stop you and I don't mind having competition for finding the best way to do RAID in Linux. But I can tell you from experience that EMD is going to face a good bit of opposition based on its current design and you might want to take that into consideration. I am interested in discussing if and how RAID could be supported under Device-Mapper (or some other "merging" of these two drivers). Jeff and Lars have shown some interest, and I certainly hope we can convince Neil and Joe that this is a good direction. Maybe it can be done and maybe it can't. I personally think it can be, and I'd at least like to have that discussion and find out. -- Kevin Corry kevcorry@us.ibm.com http://evms.sourceforge.net/ ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [dm-devel] Re: "Enhanced" MD code avaible for review 2004-03-27 15:39 ` Kevin Corry @ 2004-03-28 9:11 ` christophe varoqui 2004-03-30 17:03 ` Justin T. Gibbs 1 sibling, 0 replies; 38+ messages in thread From: christophe varoqui @ 2004-03-28 9:11 UTC (permalink / raw) To: device-mapper development Cc: linux-kernel, Justin T. Gibbs, linux-raid, Jeff Garzik, Neil Brown Justin, I direct you to http://christophe.varoqui.free.fr/ for a well documented example of coordination between the device-mapper and the userspace multipath tools. I hope you'll see how robust and elegant the solution can be. regards, cvaroqui ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-27 15:39 ` Kevin Corry 2004-03-28 9:11 ` [dm-devel] " christophe varoqui @ 2004-03-30 17:03 ` Justin T. Gibbs 2004-03-30 17:15 ` Jeff Garzik 1 sibling, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-30 17:03 UTC (permalink / raw) To: Kevin Corry, linux-kernel; +Cc: Jeff Garzik, Neil Brown, linux-raid, dm-devel > Well, there's certainly no guarantee that the "industry" will get it right. In > this case, it seems that they didn't. But even given that we don't have ideal > metadata formats, it's still possible to do discovery and a number of other > management tasks from user-space. I have never proposed that management activities be performed solely within the kernel. My position has been that meta-data parsing and updating has to be core-resident for any solution that handles advanced RAID functionality and that spliting out any portion of those roles to userland just complicates the solution. >> it is perfectly suited to some types of logical volume management >> applications. But that is as far as it goes. It does not have any >> support for doing "sync/resync/scrub" type operations or any generic >> support for doing anything with meta-data. > > The core DM driver would not and should not be handling these operations. > These are handled in modules specific to one type of mapping. There's no > need for the DM core to know anything about any metadata. If one particular > module (e.g. dm-mirror) needs to support one or more metadata formats, it's > free to do so. That's unfortunate considering that the meta-data formats we are talking about already have the capability of expressing RAID 1(E),4,5,6. There has to be a common meta-data framework in order to avoid this duplication. >> In all of the examples you >> have presented so far, you have not explained how this part of the equation >> is handled. ... > Before the new disk is added to the raid1, user-space is responsible for > writing an initial state to that disk, effectively marking it as completely > dirty and unsynced. When the new table is loaded, part of the "resume" is for > the module to read any metadata and do any initial setup that's necessary. In > this particular example, it means the new disk would start with all of its > "regions" marked "dirty", and all the regions would need to be synced from > corresponding "clean" regions on another disk in the set. > > If the previously-existing disks were part-way through a sync when the table > was switched, their metadata would indicate where the current "sync mark" was > located. The module could then continue the sync from where it left off, > including the new disk that was just added. When the sync completed, it might > have to scan back to the beginning of the new disk to see if had any remaining > dirty regions that needed to be synced before that disk was completely clean. > > And of course the I/O-mapping path just has to be smart enough to know which > regions are dirty and avoid sending live I/O to those. > > (And I'm sure Joe or Alasdair could provide a better in-depth explanation of > the current dm-mirror module than I'm trying to. This is obviously a very > high-level overview.) So all of this complexity is still in the kernel. The only difference is that the meta-data can *also* be manipulated from userspace. 
In order for this to be safe, the mirror must be suspended (meta-data becomes stable), the meta-data must be re-read by the userland program, the meta-data must be updated, the mapping must be updated, the mirror must be resumed, and the mirror must revalidate all meta-data. How do you avoid deadlock in this process? Does the userland daemon, which must be core resident in this case, pre-allocate buffers for reading and writing the meta-data? The dm-raid1 module also appears to intrinsicly trust its mapping and the contents of its meta-data (simple magic number check). It seems to me that the kernel should validate all of its inputs regardless of whether the ioctls that are used to present them are only supposed to be used by a "trusted daemon". All of this adds up to more complexity. Your argument seems to be that, since DM avoids this complexity in its core, this is a better solution, but I am more interested in the least complex, most easily maintained total solution. >> The simplicity of DM is part of why it is compelling. My belief is that >> merging RAID into DM will compromise this simplicity and divert DM from >> what it was designed to do - provide LVM transforms. > > I disagree. The simplicity of the core DM driver really isn't at stake here. > We're only talking about adding a few relatively complex target modules. And > with DM you get the benefit of a very simple user/kernel interface. The simplicity of the user/kernel interface is not what is at stake here. With EMD, you can perform all of the same operations talked about above, in just as few ioctl calls. The only difference is that the kernel and only the kernel, reads and modifies the metadata. There are actually fewer steps for the userland application than before. This becomes even more evident as more meta-data modules are added. > I don't honestly expect to suddenly change your mind on all these issues. > A lot of work has obviously gone into EMD, and I definitely know how hard it > can be when the community isn't greeting your suggestions with open arms. I honestly don't care if the final solution is EMD, DM, or XYZ so long as that solution is correct, supportable, and covers all of the scenarios required for robust RAID support. That is the crux of the argument, not "please love my code". -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 17:03 ` Justin T. Gibbs @ 2004-03-30 17:15 ` Jeff Garzik 2004-03-30 17:35 ` Justin T. Gibbs 0 siblings, 1 reply; 38+ messages in thread From: Jeff Garzik @ 2004-03-30 17:15 UTC (permalink / raw) To: Justin T. Gibbs Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel Justin T. Gibbs wrote: > The dm-raid1 module also appears to intrinsicly trust its mapping and the > contents of its meta-data (simple magic number check). It seems to me that > the kernel should validate all of its inputs regardless of whether the > ioctls that are used to present them are only supposed to be used by a > "trusted daemon". The kernel should not be validating -trusted- userland inputs. Root is allowed to scrag the disk, violate limits, and/or crash his own machine. A simple example is requiring userland, when submitting ATA taskfiles via an ioctl, to specify the data phase (pio read, dma write, no-data, etc.). If the data phase is specified incorrectly, you kill the OS driver's ATA host state machine, and the results are very unpredictable. Since this is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get the required details right (just like following a spec). > I honestly don't care if the final solution is EMD, DM, or XYZ so long > as that solution is correct, supportable, and covers all of the scenarios > required for robust RAID support. That is the crux of the argument, not > "please love my code". hehe. I think we all agree here... Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review
2004-03-30 17:15 ` Jeff Garzik
@ 2004-03-30 17:35 ` Justin T. Gibbs
2004-03-30 17:46 ` Jeff Garzik
2004-03-30 18:11 ` Bartlomiej Zolnierkiewicz
0 siblings, 2 replies; 38+ messages in thread
From: Justin T. Gibbs @ 2004-03-30 17:35 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel

> The kernel should not be validating -trusted- userland inputs. Root is
> allowed to scrag the disk, violate limits, and/or crash his own machine.
>
> A simple example is requiring userland, when submitting ATA taskfiles via
> an ioctl, to specify the data phase (pio read, dma write, no-data, etc.).
> If the data phase is specified incorrectly, you kill the OS driver's ATA
> host state machine, and the results are very unpredictable. Since this
> is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get the
> required details right (just like following a spec).

That's unfortunate for those using ATA. A command submitted from userland
to the SCSI drivers I've written that causes a protocol violation will
be detected, result in appropriate recovery, and a nice diagnostic that
can be used to diagnose the problem. Part of this is because I cannot know
if the protocol violation stems from a target defect, the input from the
user or, for that matter, from the kernel. The main reason is for
robustness and ease of debugging. In the SCSI case, there is almost no
run-time cost, and the system will stop before data corruption occurs. In
the meta-data case we've been discussing in terms of EMD, there is no
runtime cost, the validation has to occur somewhere anyway, and in many
cases some validation is already required to avoid races with external
events. If the validation is done in the kernel, then you get the benefit
of nice diagnostics instead of strange crashes that are difficult to debug.

--
Justin

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 17:35 ` Justin T. Gibbs @ 2004-03-30 17:46 ` Jeff Garzik 2004-03-30 18:04 ` Justin T. Gibbs 2004-03-30 18:11 ` Bartlomiej Zolnierkiewicz 1 sibling, 1 reply; 38+ messages in thread From: Jeff Garzik @ 2004-03-30 17:46 UTC (permalink / raw) To: Justin T. Gibbs Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel Justin T. Gibbs wrote: >>The kernel should not be validating -trusted- userland inputs. Root is >>allowed to scrag the disk, violate limits, and/or crash his own machine. >> >>A simple example is requiring userland, when submitting ATA taskfiles via >>an ioctl, to specify the data phase (pio read, dma write, no-data, etc.). >>If the data phase is specified incorrectly, you kill the OS driver's ATA >>host wwtate machine, and the results are very unpredictable. Since this >>is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get the >>required details right (just like following a spec). > > > That's unfortunate for those using ATA. A command submitted from userland Required, since one cannot know the data phase of vendor-specific commands. > to the SCSI drivers I've written that causes a protocol violation will > be detected, result in appropriate recovery, and a nice diagnostic that > can be used to diagnose the problem. Part of this is because I cannot know > if the protocol violation stems from a target defect, the input from the > user or, for that matter, from the kernel. The main reason is for robustness Well, * the target is not _issuing_ commands, * any user issuing incorrect commands/cdbs is not your bug, * and kernel code issuing incorrect cmands/cdbs isn't your bug either Particularly, checking whether the kernel is doing something wrong, or wrong, just wastes cycles. That's not a scalable way to code... if every driver and Linux subsystem did that, things would be unbearable slow. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 17:46 ` Jeff Garzik @ 2004-03-30 18:04 ` Justin T. Gibbs 2004-03-30 21:47 ` Jeff Garzik 0 siblings, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-30 18:04 UTC (permalink / raw) To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel >> That's unfortunate for those using ATA. A command submitted from userland > > Required, since one cannot know the data phase of vendor-specific commands. So you are saying that this presents an unrecoverable situation? > Particularly, checking whether the kernel is doing something wrong, or wrong, > just wastes cycles. That's not a scalable way to code... if every driver > and Linux subsystem did that, things would be unbearable slow. Hmm. I've never had someone tell me that my SCSI drivers are slow. I don't think that your statement is true in the general case. My belief is that validation should occur where it is cheap and efficient to do so. More expensive checks should be pushed into diagnostic code that is disabled by default, but the code *should be there*. In any event, for RAID meta-data, we're talking about code that is *not* in the common or time critical path of the kernel. A few dozen lines of validation code there has almost no impact on the size of the kernel and yields huge benefits for debugging and maintaining the code. This is even more the case in Linux the end user is often your test lab. -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 18:04 ` Justin T. Gibbs @ 2004-03-30 21:47 ` Jeff Garzik 2004-03-30 22:12 ` Justin T. Gibbs 0 siblings, 1 reply; 38+ messages in thread From: Jeff Garzik @ 2004-03-30 21:47 UTC (permalink / raw) To: Justin T. Gibbs Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel Justin T. Gibbs wrote: >>>That's unfortunate for those using ATA. A command submitted from userland >> >>Required, since one cannot know the data phase of vendor-specific commands. > > > So you are saying that this presents an unrecoverable situation? No, I'm saying that the data phase need not have a bunch of in-kernel checks, it should be generated correctly from the source. >>Particularly, checking whether the kernel is doing something wrong, or wrong, >>just wastes cycles. That's not a scalable way to code... if every driver >>and Linux subsystem did that, things would be unbearable slow. > > > Hmm. I've never had someone tell me that my SCSI drivers are slow. This would be noticed in the CPU utilization area. Your drivers are probably a long way from being CPU-bound. > I don't think that your statement is true in the general case. My > belief is that validation should occur where it is cheap and efficient > to do so. More expensive checks should be pushed into diagnostic code > that is disabled by default, but the code *should be there*. In any event, > for RAID meta-data, we're talking about code that is *not* in the common > or time critical path of the kernel. A few dozen lines of validation code > there has almost no impact on the size of the kernel and yields huge > benefits for debugging and maintaining the code. This is even more > the case in Linux the end user is often your test lab. It doesn't scale terribly well, because the checks themselves become a source of bugs. Jeff ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 21:47 ` Jeff Garzik @ 2004-03-30 22:12 ` Justin T. Gibbs 2004-03-30 22:34 ` Jeff Garzik 0 siblings, 1 reply; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-30 22:12 UTC (permalink / raw) To: Jeff Garzik; +Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel >> So you are saying that this presents an unrecoverable situation? > > No, I'm saying that the data phase need not have a bunch of in-kernel > checks, it should be generated correctly from the source. The SCSI drivers validate the controller's data phase based on the expected phase presented to them from an upper layer. I never talked about adding checks that make little sense or are overly expensive. You seem to equate validation with huge expense. That is just not the general case. >> Hmm. I've never had someone tell me that my SCSI drivers are slow. > > This would be noticed in the CPU utilization area. Your drivers are > probably a long way from being CPU-bound. I very much doubt that. There are perhaps four or five tests in the I/O path where some value already in a cache line that has to be accessed anyway is compared against a constant. We're talking about something down in the noise of any type of profiling you could perform. As I said, validation makes sense where there is basically no-cost to do it. >> I don't think that your statement is true in the general case. My >> belief is that validation should occur where it is cheap and efficient >> to do so. More expensive checks should be pushed into diagnostic code >> that is disabled by default, but the code *should be there*. In any event, >> for RAID meta-data, we're talking about code that is *not* in the common >> or time critical path of the kernel. A few dozen lines of validation code >> there has almost no impact on the size of the kernel and yields huge >> benefits for debugging and maintaining the code. This is even more >> the case in Linux the end user is often your test lab. > > It doesn't scale terribly well, because the checks themselves become a > source of bugs. So now the complaint is that validation code is somehow harder to write and maintain than the rest of the code? -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review
2004-03-30 22:12 ` Justin T. Gibbs
@ 2004-03-30 22:34 ` Jeff Garzik
0 siblings, 0 replies; 38+ messages in thread
From: Jeff Garzik @ 2004-03-30 22:34 UTC (permalink / raw)
To: Justin T. Gibbs
Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel

Justin T. Gibbs wrote:
>>>So you are saying that this presents an unrecoverable situation?
>>
>>No, I'm saying that the data phase need not have a bunch of in-kernel
>>checks, it should be generated correctly from the source.
>
> The SCSI drivers validate the controller's data phase based on the
> expected phase presented to them from an upper layer. I never talked
> about adding checks that make little sense or are overly expensive. You
> seem to equate validation with huge expense. That is just not the
> general case.
>
>>>Hmm. I've never had someone tell me that my SCSI drivers are slow.
>>
>>This would be noticed in the CPU utilization area. Your drivers are
>>probably a long way from being CPU-bound.
>
> I very much doubt that. There are perhaps four or five tests in the
> I/O path where some value already in a cache line that has to be accessed
> anyway is compared against a constant. We're talking about something
> down in the noise of any type of profiling you could perform. As I said,
> validation makes sense where there is basically no-cost to do it.
>
>>>I don't think that your statement is true in the general case. My
>>>belief is that validation should occur where it is cheap and efficient
>>>to do so. More expensive checks should be pushed into diagnostic code
>>>that is disabled by default, but the code *should be there*. In any event,
>>>for RAID meta-data, we're talking about code that is *not* in the common
>>>or time critical path of the kernel. A few dozen lines of validation code
>>>there has almost no impact on the size of the kernel and yields huge
>>>benefits for debugging and maintaining the code. This is even more
>>>the case in Linux the end user is often your test lab.
>>
>>It doesn't scale terribly well, because the checks themselves become a
>>source of bugs.
>
> So now the complaint is that validation code is somehow harder to write
> and maintain than the rest of the code?

Actually, yes. Validation of random user input has always been a source
of bugs (usually in edge cases), in Linux and in other operating systems.
It is often the area where security bugs are found.

Basically you want to avoid adding checks for conditions that don't occur
in properly written software, and make sure that the kernel always
generates correct requests. Obviously that excludes anything on the
target side, but other than that... in userland, a privileged user is
free to do anything they wish, including violate protocols, cook their
disk, etc.

	Jeff

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-30 17:35 ` Justin T. Gibbs 2004-03-30 17:46 ` Jeff Garzik @ 2004-03-30 18:11 ` Bartlomiej Zolnierkiewicz 1 sibling, 0 replies; 38+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2004-03-30 18:11 UTC (permalink / raw) To: Justin T. Gibbs, Jeff Garzik Cc: Kevin Corry, linux-kernel, Neil Brown, linux-raid, dm-devel On Tuesday 30 of March 2004 19:35, Justin T. Gibbs wrote: > > The kernel should not be validating -trusted- userland inputs. Root is > > allowed to scrag the disk, violate limits, and/or crash his own machine. > > > > A simple example is requiring userland, when submitting ATA taskfiles via > > an ioctl, to specify the data phase (pio read, dma write, no-data, etc.). > > If the data phase is specified incorrectly, you kill the OS driver's ATA > > host wwtate machine, and the results are very unpredictable. Since this > > is a trusted operation, requiring CAP_RAW_IO, it's up to userland to get > > the required details right (just like following a spec). > > That's unfortunate for those using ATA. A command submitted from userland > to the SCSI drivers I've written that causes a protocol violation will > be detected, result in appropriate recovery, and a nice diagnostic that > can be used to diagnose the problem. Part of this is because I cannot know > if the protocol violation stems from a target defect, the input from the > user or, for that matter, from the kernel. The main reason is for > robustness and ease of debugging. In SCSI case, there is almost no > run-time cost, and the system will stop before data corruption occurs. In In ATA case detection of protocol violation is not possible w/o checking every possible command opcode. Even if implemented (notice that checking commands coming from kernel is out of question - for performance reasons) this breaks for future and vendor specific commands. > the meta-data case we've been discussing in terms of EMD, there is no > runtime cost, the validation has to occur somewhere anyway, and in many > cases some validation is already required to avoid races with external > events. If the validation is done in the kernel, then you get the benefit > of nice diagnostics instead of strange crashes that are difficult to debug. Unless code that crashes is the one doing validation. ;-) Bartlomiej ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review
2004-03-25 18:00 ` Kevin Corry
2004-03-25 18:42 ` Jeff Garzik
@ 2004-03-25 22:59 ` Justin T. Gibbs
2004-03-25 23:44 ` Lars Marowsky-Bree
1 sibling, 1 reply; 38+ messages in thread
From: Justin T. Gibbs @ 2004-03-25 22:59 UTC (permalink / raw)
To: Kevin Corry, linux-kernel; +Cc: Jeff Garzik, Neil Brown, linux-raid

>> Independent DM efforts have already started supporting MD raid0/1
>> metadata from what I understand, though these efforts don't seem to post
>> to linux-kernel or linux-raid much at all. :/
>
> I post on lkml.....occasionally. :)

...

> This decision was not based on any real dislike of the MD driver, but rather
> for the benefits that are gained by using Device-Mapper. In particular,
> Device-Mapper provides the ability to change out the device mapping on the
> fly, by temporarily suspending I/O, changing the table, and resuming the I/O
> I'm sure many of you know this already. But I'm not sure everyone fully
> understands how powerful a feature this is. For instance, it means EVMS can
> now expand RAID-linear devices online. While that particular example may not
> sound all that exciting, if things like RAID-1 and RAID-5 were "ported" to
> Device-Mapper, this feature would then allow you to do stuff like add new
> "active" members to a RAID-1 online (think changing from 2-way mirror to
> 3-way mirror). It would be possible to convert from RAID-0 to RAID-4 online
> simply by adding a new disk (assuming other limitations, e.g. a single
> stripe-zone). Unfortunately, these are things the MD driver can't do online,
> because you need to completely stop the MD device before making such changes
> (to prevent the kernel and user-space from trampling on the same metadata),
> and MD won't stop the device if it's open (i.e. if it's mounted or if you
> have other device (LVM) built on top of MD). Often times this means you need
> to boot to a rescue-CD to make these types of configuration changes.

We should be clear about your argument here. It is not that DM makes
generic morphing easy and possible, it is that with DM the most basic
types of morphing (no data striping or de-striping) are easily
accomplished. You cite two examples:

1) Adding another member to a RAID-1. While MD may not allow this to
   occur while the array is operational, EMD does. This is possible
   because there is only one entity controlling the meta-data.

2) Converting a RAID0 to a RAID4 while possible with DM is not
   particularly interesting from an end user perspective.

The fact of the matter is that neither EMD nor DM provide a generic
morphing capability. If this is desirable, we can discuss how it could
be achieved, but my initial belief is that attempting any type of
complicated morphing from userland would be slow, prone to deadlocks,
and thus difficult to achieve in a fashion that guaranteed no loss of
data in the face of unexpected system restarts.

--
Justin

^ permalink raw reply [flat|nested] 38+ messages in thread
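[The online RAID-linear expansion mentioned in the quoted text reduces to
the same suspend/reload/resume pattern; a rough sketch with made-up device
names and sizes, not from the thread.]

  # Assumed: "vol0" is an existing 4 GiB (8388608-sector) linear mapping
  # on /dev/sdb1.  Append another 4 GiB from /dev/sdc1 by staging a
  # two-segment table and switching to it.
  cat > /tmp/vol0.table <<'EOF'
  0 8388608 linear /dev/sdb1 0
  8388608 8388608 linear /dev/sdc1 0
  EOF

  dmsetup suspend vol0
  dmsetup reload vol0 /tmp/vol0.table   # stage the new table
  dmsetup resume vol0                   # switch mappings atomically
  # Any filesystem on vol0 still has to be grown separately afterwards.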
* Re: "Enhanced" MD code avaible for review 2004-03-25 22:59 ` Justin T. Gibbs @ 2004-03-25 23:44 ` Lars Marowsky-Bree 2004-03-26 0:03 ` Justin T. Gibbs 0 siblings, 1 reply; 38+ messages in thread From: Lars Marowsky-Bree @ 2004-03-25 23:44 UTC (permalink / raw) To: Justin T. Gibbs, Kevin Corry, linux-kernel Cc: Jeff Garzik, Neil Brown, linux-raid On 2004-03-25T15:59:00, "Justin T. Gibbs" <gibbs@scsiguy.com> said: > The fact of the matter is that neither EMD nor DM provide a generic > morphing capability. If this is desirable, we can discuss how it could > be achieved, but my initial belief is that attempting any type of > complicated morphing from userland would be slow, prone to deadlocks, > and thus difficult to achieve in a fashion that guaranteed no loss of > data in the face of unexpected system restarts. Uhm. DM sort of does (at least where the morphing amounts to resyncing a part of the stripe, ie adding a new mirror, RAID1->4, RAID5->6 etc). Freeze, load new mapping, continue. I agree that more complex morphings (RAID1->RAID5 or vice-versa in particular) are more difficult to get right, but are not that often needed online - or if they are, typically such scenarios will have enough temporary storage to create the new target, RAID1 over, disconnect the old part and free it, which will work just fine with DM. Sincerely, Lars Marowsky-Brée <lmb@suse.de> -- High Availability & Clustering \ ever tried. ever failed. no matter. SUSE Labs | try again. fail again. fail better. Research & Development, SUSE LINUX AG \ -- Samuel Beckett - To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: "Enhanced" MD code avaible for review 2004-03-25 23:44 ` Lars Marowsky-Bree @ 2004-03-26 0:03 ` Justin T. Gibbs 0 siblings, 0 replies; 38+ messages in thread From: Justin T. Gibbs @ 2004-03-26 0:03 UTC (permalink / raw) To: Lars Marowsky-Bree, Kevin Corry, linux-kernel Cc: Jeff Garzik, Neil Brown, linux-raid > Uhm. DM sort of does (at least where the morphing amounts to resyncing a > part of the stripe, ie adding a new mirror, RAID1->4, RAID5->6 etc). > Freeze, load new mapping, continue. The point is that these trivial "morphings" can be achieved with limited effort regardless of whether you do it via EMD or DM. Implementing this in EMD could be achieved with perhaps 8 hours work with no significant increase in code size or complexity. This is part of why I find them "uninteresting". If we really want to talk about generic morphing, I think you'll find that DM is no better suited to this task than MD or its derivatives. > I agree that more complex morphings (RAID1->RAID5 or vice-versa in > particular) are more difficult to get right, but are not that often > needed online - or if they are, typically such scenarios will have > enough temporary storage to create the new target, RAID1 over, > disconnect the old part and free it, which will work just fine with DM. The most common requests that we hear from customers are: o single -> R1 Equally possible with MD or DM assuming your singles are accessed via a volume manager. Without that support the user will have to dismount and remount storage. o R1 -> R10 This should require just double the number of active members. This is not possible today with either DM or MD. Only "migration" is possible. o R1 -> R5 o R5 -> R1 These typically occur when data access patterns change for the customer. Again not possible with DM or MD today. All of these are important to some subset of customers and are, to my mind, required if you want to claim even basic morphing capability. If you are allowing the "cop-out" of using a volume manager to substitute data-migration for true morphing, then MD is almost as well suited to that task as DM. -- Justin ^ permalink raw reply [flat|nested] 38+ messages in thread
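[For the single -> R1 case above, the "accessed via a volume manager"
caveat is the whole trick: a device-mapper or LVM node must already sit
on top of the single disk so that its table can be swapped for a mirror.
A hedged sketch, not from the thread, reusing the illustrative (and
possibly version-dependent) mirror-table syntax shown earlier.]

  # Assumed: the filesystem sits on DM device "vol0", currently a plain
  # linear map of /dev/sdb1; /dev/sdc1 is the new, empty second disk.
  dmsetup table vol0               # inspect the current one-segment table
  dmsetup suspend vol0
  dmsetup reload vol0 --table \
    "0 8388608 mirror core 1 1024 2 /dev/sdb1 0 /dev/sdc1 0"
  dmsetup resume vol0              # /dev/sdc1 must now be synced from sdb1
  # Had the filesystem been mounted on /dev/sdb1 directly, the table swap
  # would not be possible without dismounting and remounting the storage.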
end of thread, other threads:[~2004-03-31 17:07 UTC | newest] Thread overview: 38+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2004-03-19 20:19 "Enhanced" MD code avaible for review Justin T. Gibbs 2004-03-23 5:05 ` Neil Brown 2004-03-23 6:23 ` Justin T. Gibbs 2004-03-24 2:26 ` Neil Brown 2004-03-24 19:09 ` Matt Domsch 2004-03-25 2:21 ` Jeff Garzik 2004-03-25 18:00 ` Kevin Corry 2004-03-25 18:42 ` Jeff Garzik 2004-03-25 18:48 ` Jeff Garzik 2004-03-25 23:46 ` Justin T. Gibbs 2004-03-26 0:01 ` Jeff Garzik 2004-03-26 0:10 ` Justin T. Gibbs 2004-03-26 0:14 ` Jeff Garzik 2004-03-25 22:04 ` Lars Marowsky-Bree 2004-03-26 19:19 ` Kevin Corry 2004-03-31 17:07 ` Randy.Dunlap 2004-03-25 23:35 ` Justin T. Gibbs 2004-03-26 0:13 ` Jeff Garzik 2004-03-26 17:43 ` Justin T. Gibbs 2004-03-28 0:06 ` Lincoln Dale 2004-03-30 17:54 ` Justin T. Gibbs 2004-03-28 0:30 ` Jeff Garzik 2004-03-26 19:15 ` Kevin Corry 2004-03-26 20:45 ` Justin T. Gibbs 2004-03-27 15:39 ` Kevin Corry 2004-03-28 9:11 ` [dm-devel] " christophe varoqui 2004-03-30 17:03 ` Justin T. Gibbs 2004-03-30 17:15 ` Jeff Garzik 2004-03-30 17:35 ` Justin T. Gibbs 2004-03-30 17:46 ` Jeff Garzik 2004-03-30 18:04 ` Justin T. Gibbs 2004-03-30 21:47 ` Jeff Garzik 2004-03-30 22:12 ` Justin T. Gibbs 2004-03-30 22:34 ` Jeff Garzik 2004-03-30 18:11 ` Bartlomiej Zolnierkiewicz 2004-03-25 22:59 ` Justin T. Gibbs 2004-03-25 23:44 ` Lars Marowsky-Bree 2004-03-26 0:03 ` Justin T. Gibbs