linux-raid.vger.kernel.org archive mirror
* Re: "Enhanced" MD code avaible for review
@ 2004-03-19 20:19 Justin T. Gibbs
  2004-03-23  5:05 ` Neil Brown
  0 siblings, 1 reply; 57+ messages in thread
From: Justin T. Gibbs @ 2004-03-19 20:19 UTC (permalink / raw)
  To: linux-raid; +Cc: linux-kernel

[ CC trimmed since all those on the CC line appear to be on the lists ... ]

Let's take a step back and focus on a few points on which we can
hopefully all agree:

o Any successful solution will have to keep its "meta-data modules" for
  active arrays "core resident" in order to be robust.  This
  requirement stems from the need to avoid deadlock in error
  recovery scenarios that must block "normal I/O" to the array while
  meta-data operations take place.

o It is desirable for arrays to auto-assemble based on recorded
  meta-data.  This includes the ability to have a user hot-insert
  a "cold spare", have the system recognize it as a spare (based
  on the meta-data resident on it) and activate it if necessary to
  restore a degraded array.

o Child devices of an array should only be accessible through the
  array while the array is in a configured state (bd_claim'ed).
  This avoids situations where a user can subvert the integrity of
  the array by performing "rogue I/O" to an array member.

Concentrating on just these three, we come to the conclusion that
whether the solution comes via "early user fs" or kernel modules,
the resident size of the solution *will* include the cost for
meta-data support.  In either case, the user is able to tailor their
system to include only the support necessary for their individual
system to operate.

If we want to argue the merits of either approach based on just the
sheer size of resident code, I have little doubt that the kernel
module approach will prove smaller:

 o No need for "mdadm" or some other daemon to be locked resident in
   memory.  This alone saves you from having a locked copy of klibc or
   any other user libraries core resident.  The kernel modules
   leverage kernel APIs that already have to be core resident to
   satisfy other parts of the kernel, which also helps keep their
   size down.

 o Initial RAM disk data can be discarded after modules are loaded at
   boot time.

Putting the size argument aside for a moment, let's explore how a
userland solution could satisfy just the above three requirements.

How is meta-data updated on child members of an array while that
array is on-line?  Remember that these operations occur with some
frequency.  MD includes "safe-mode" support where redundant arrays
are marked clean any time writes cease for a predetermined, fairly
short, amount of time.  The userland app cannot access the component
devices directly since they are bd_claim'ed.  Even if that mechanism
is somehow subverted, how do we guarantee that these meta-data
writes do not cause a deadlock?  In the case of a transition from
Read-only to Write mode, all writes to the array are blocked (this
must be the case for the "Dirty" state to be accurate).  It seems to
me that you must then provide extra code to not only pre-allocate
buffers for the userland app to do its work, but also provide a
"back-door" interface for these operations to take place.

The argument has also been made that shifting some of this code out
to a userland app "simplifies" the solution and perhaps even makes
it easier to develop.  Comparing the two approaches we have:

UserFS:
      o Kernel Driver + "enhanced interface to userland daemon"
      o Userland Daemon (core resident)
      o Userland Meta-Data modules
      o Userland Management tool
	 - This tool needs to interface to the daemon and
	   perhaps also the kernel driver.

Kernel:
      o Kernel RAID Transform Drivers
      o Kernel Meta-Data modules
      o Simple Userland Management tool with no meta-data knowledge

So two questions arise from this analysis:

1) Are meta-data modules easier to code up or more robust as user
   or kernel modules?  I believe that doing these outside the kernel
   will make them larger and more complex while also losing the
   ability to have meta-data modules weigh in on rapidly occurring
   events without incurring performance tradeoffs.  Regardless of
   where they reside, these modules must be robust.  A kernel Oops
   or a segfault in the daemon is unacceptable to the end user.
   Saying that a segfault is less harmful in some way than an Oops
   when we're talking about the user's data completely misses the
   point of why people use RAID.

2) What added complexity is incurred by supporting both a core
   resident daemon as well as management interfaces to the daemon
   and potentially the kernel module?  I have not fully thought
   through the corner cases such an approach would expose, so I
   cannot quantify this cost.  There are certainly more components
   to get right and keep synchronized.

In the end, I find it hard to justify inventing all of the userland
machinery necessary to make this work just to keep roughly 2K lines
of code per meta-data module out of the kernel.
The ASR module for example, which is only required by those that
need support for this meta-data type, is only 19K with all of its
debugging printks and code enabled, unstripped.  Are there benefits
to the userland approach that I'm missing?

--
Justin


[parent not found: <1AOTW-4Vx-7@gated-at.bofh.it>]
* "Enhanced" MD code avaible for review
@ 2004-03-17 18:14 Justin T. Gibbs
  2004-03-17 19:18 ` Jeff Garzik
  0 siblings, 1 reply; 57+ messages in thread
From: Justin T. Gibbs @ 2004-03-17 18:14 UTC (permalink / raw)
  To: linux-raid; +Cc: justin_gibbs


[ I tried sending this last night from my Adaptec email address and have
  yet to see it on the list.  Sorry if this is a dup for any of you. ]

For the past few months, Adaptec Inc. has been working to enhance MD.
The goals of this project are:

	o Allow fully pluggable meta-data modules
	o Add support for Adaptec ASR (aka HostRAID) and DDF
	  (Disk Data Format) meta-data types.  Both of these
	  formats are understood natively by certain vendor
	  BIOSes, meaning that arrays can be booted from transparently.
	o Improve the ability of MD to auto-configure arrays.
	o Support multi-level arrays transparently yet allow
	  proper event notification across levels when the
	  topology is known to MD.
	o Create a more generic "work item" framework which is
	  used to support array initialization, rebuild, and
	  verify operations as well as miscellaneous tasks that
	  a meta-data or RAID personality may need to perform
	  from a thread context (e.g. spare activation where
	  meta-data records may need to be sequenced carefully).
	o Modify the MD ioctl interface to allow the creation
	  of management utilities that are meta-data format
	  agnostic.

A snapshot of this work is now available here:

	http://people.freebsd.org/~gibbs/linux/SRC/emd-0.7.0-tar.gz

This snapshot includes support for RAID0, RAID1, and the Adaptec
ASR and DDF meta-data formats.  Additional RAID personalities and
support for the Super90 and Super 1 meta-data formats will be added
in the coming weeks, the end goal being to provide a superset of
the functionality in the current MD.

A patch to fs/partitions/check.c is also required for this
release to function correctly:

	http://people.freebsd.org/~gibbs/linux/SRC/md_announce_whole_device.diff

As the file name implies, this patch exposes not only partitions
on devices, but all "base" block devices to MD.  This is required
to support meta-data formats like ASR and DDF that typically operate
on the whole device.  Nothing in the implementation prevents any
meta-data format from being used on a partition, but BIOS boot
support is only available in the non-partitioned mode.

Since the current MD notification scheme does not allow MD to receive
notifications unless it is statically compiled into the kernel, we
would like to work with the community to develop a more generic
notification scheme to which modules, such as MD, can dynamically
register.  Until that occurs, these EMD snapshots will require at
least md.c to be a static component of the kernel.

For those wanting to test out this snapshot with an Adaptec HostRAID
U320 SCSI controller, you will need to update your kernel to use
version 2.0.8 of the aic79xx driver.  This driver defaults to
attaching to 790X controllers operating in HostRAID mode in addition
to those in direct SCSI mode.  This feature can be disabled using
a module option or a kernel command-line option.  Driver source and BK send patches
for this driver can be found here:

http://people.freebsd.org/~gibbs/linux/SRC/aic79xx-linux-2.6-20040316-tar.gz
http://people.freebsd.org/~gibbs/linux/SRC/aic79xx-linux-2.6-20040316.bksend.gz

Architectural Notes
===================
The major areas of change in "EMD" can be categorized into:

1) "Object Oriented" Data structure changes 

	These changes are the basis for allowing RAID personalities
	to transparently operate on "disks" or "arrays" as member
	objects.  While it has always been possible to create
	multi-level arrays in MD using block layer stacking, our
	approach allows MD to also stack internally.  Once a given
	RAID or meta-data personality is converted to the new
	structures, this "feature" comes at no cost.  The benefit
	to stacking internally, which requires a meta-data format
	that supports this, is that array state can propagate up
	and down the topology without the loss of information
	inherent in using the block layer to traverse levels of an
	array.
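
	As a very rough sketch of what "stacking internally" can look
	like (standalone C with invented names, not the EMD source):

	enum member_type  { MEMBER_DISK, MEMBER_ARRAY };
	enum member_state { MEMBER_ACTIVE, MEMBER_DEGRADED, MEMBER_FAILED };

	struct member {
	    enum member_type   type;
	    enum member_state  state;
	    struct member     *parent;    /* NULL for a top-level array */
	    union {
	        struct {                  /* MEMBER_DISK */
	            const char *dev_name;
	        } disk;
	        struct {                  /* MEMBER_ARRAY: nested array */
	            struct member **children;
	            unsigned        nchildren;
	        } array;
	    } u;
	};

	/* State changes walk the in-core topology directly instead of
	 * being inferred through the block layer.  The policy here
	 * (a failed member degrades its parent) is simplified for
	 * illustration. */
	static void member_set_state(struct member *m, enum member_state st)
	{
	    m->state = st;
	    if (m->parent && st == MEMBER_FAILED)
	        member_set_state(m->parent, MEMBER_DEGRADED);
	}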

2) Opcode based interfaces.

	Rather than add additional method vectors to either the
	RAID personality or meta-data personality objects, the new
	code uses only a few methods that are parameterized.  This
	has allowed us to create a fairly rich interface between
	the core and the personalities without overly bloating
	personality "classes".
	
3) WorkItems

	Workitems provide a generic framework for queuing work to
	a thread context.  Workitems include a "control" method as
	well as a "handler" method.  This separation allows, for
	example, a RAID personality to use the generic sync handler
	while trapping the "open", "close", and "free" of any sync
	workitems.  Since both handlers can be tailored to the
	individual workitem that is queued, this removes the need
	to overload one or more interfaces in the personalities.
	It also means that any code in MD can make use of this
	framework - it is not tied to particular objects or modules
	in the system.
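
	The idea, roughly (illustrative standalone C, not the EMD code):

	enum workitem_ctl { WI_OPEN, WI_CLOSE, WI_FREE };

	struct workitem {
	    /* does the actual work, in thread context */
	    void (*handler)(struct workitem *wi);
	    /* lifecycle hook the submitter can use to trap events */
	    void (*control)(struct workitem *wi, enum workitem_ctl what);
	    void  *private_data;
	    struct workitem *next;      /* queue linkage */
	};

	/* What the worker thread does with a dequeued item. */
	static void workitem_run(struct workitem *wi)
	{
	    if (wi->control)
	        wi->control(wi, WI_OPEN);
	    wi->handler(wi);            /* e.g. the generic sync handler */
	    if (wi->control) {
	        wi->control(wi, WI_CLOSE);
	        wi->control(wi, WI_FREE);
	    }
	}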

4) "Syncable Volume" Support

	All of the transaction accounting necessary to support
	redundant arrays has been abstracted out into a few inline
	functions.  With the inclusion of a "sync support" structure
	in a RAID personality's private data structure area and the
	use of these functions, the generic sync framework is fully
	available.  The sync algorithm is also now more like that
	in 2.4.X - with some updates to improve performance.  Two
	contiguous sync ranges are employed so that sync I/O can
	remain pending while the lock range is extended, and new sync
	I/O is stalled waiting for normal I/O writes that might
	conflict with the new range to complete.  The syncer updates
	its stats more frequently than in the past so that it can
	more quickly react to changes in the normal I/O load.  Syncer
	backoff is also disabled anytime there is pending I/O blocked
	on the syncer's locked region.  RAID personalities have
	full control over the size of the sync windows used so that
	they can be optimized based on RAID layout policy.
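
	Roughly how two windows can be tracked (a standalone sketch with
	invented names; the real accounting lives in the sync support
	inlines mentioned above):

	typedef unsigned long long sector_t;

	struct sync_range {
	    sector_t start;
	    sector_t end;          /* exclusive */
	    unsigned pending_ios;  /* normal writes still outstanding */
	};

	struct sync_support {
	    struct sync_range cur;   /* sync I/O may be issued here now */
	    struct sync_range next;  /* extension; waits for writes to drain */
	};

	static int range_conflicts(const struct sync_range *r,
	                           sector_t s, sector_t e)
	{
	    return s < r->end && e > r->start;
	}

	/* A normal write overlapping a locked range is counted, so the
	 * syncer knows when it is safe to advance into "next". */
	static void account_write(struct sync_support *ss,
	                          sector_t s, sector_t e)
	{
	    if (range_conflicts(&ss->cur, s, e))
	        ss->cur.pending_ios++;
	    else if (range_conflicts(&ss->next, s, e))
	        ss->next.pending_ios++;
	}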

5) IOCTL Interface

	"EMD" now performs all of its configuration via an "mdctl"
	character device.  Since one of our goals is to remove any
	knowledge of meta-data type in the user control programs,
	initial meta-data stamping and configuration validation
	occurs in the kernel.  In general, the meta-data modules
	already need this validation code in order to support
	auto-configuration, so adding this capability adds little
	to the overall size of EMD.  It does, however, require a
	few additional ioctls to support things like querying the
	maximum "coerced" size of a disk targeted for a new array,
	or enumerating the names of installed meta-data modules,
	etc.
	
	This area of EMD is still in very active development and we expect
	to provide a drop of an "emdadm" utility later this week.   
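
	For illustration only, a user-space tool against such an
	interface might look like the following.  The device name and
	the ioctl definition here are invented, not taken from the EMD
	headers; the real interface is what the forthcoming "emdadm"
	utility will use:

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <unistd.h>
	#include <linux/ioctl.h>

	/* Hypothetical ioctl -- not from the EMD source. */
	struct emd_enum_md {
	    unsigned index;     /* in: which module to query  */
	    char     name[32];  /* out: meta-data module name */
	};
	#define EMD_IOC_ENUM_METADATA _IOWR('E', 1, struct emd_enum_md)

	int main(void)
	{
	    int fd = open("/dev/mdctl", O_RDONLY);
	    unsigned i;

	    if (fd < 0) {
	        perror("open /dev/mdctl");
	        return 1;
	    }
	    for (i = 0; ; i++) {
	        struct emd_enum_md e = { .index = i };

	        if (ioctl(fd, EMD_IOC_ENUM_METADATA, &e) < 0)
	            break;      /* no more modules */
	        printf("meta-data module %u: %s\n", i, e.name);
	    }
	    close(fd);
	    return 0;
	}

	Note that no meta-data knowledge appears in the tool itself;
	everything format-specific stays in the kernel modules.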

6) Meta-data and Topology State

	To support pluggable meta-data modules which may have diverse
	policies, all embedded knowledge of the MD SuperBlock formats
	has been removed.  In general, the meta-data modules "bid"
	on incoming devices that they can manage.  The high bidder
	is then asked to configure the disk into a reasonable
	topology that can be managed by a RAID personality and the
	MD core.  The bidding process allows a more "native" meta-data
	module to outbid a module that can handle the same format
	in "compatibility" mode.  It also allows the user to load
	a meta-data module update during install scenarios even if
	an older module is compiled statically into the kernel.
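
	The bidding interface, shape-wise (hypothetical names only):

	struct block_dev;      /* opaque here */
	struct md_topology;    /* opaque here */

	struct metadata_personality {
	    const char *name;
	    /* 0 = format not recognized; higher = more "native", so
	     * e.g. a newly loaded module can outbid an older one that
	     * is built statically into the kernel. */
	    unsigned (*bid)(struct block_dev *bdev);
	    int (*configure)(struct block_dev *bdev,
	                     struct md_topology *topo);
	};

	static struct metadata_personality *
	choose_owner(struct metadata_personality **mods, int n,
	             struct block_dev *bdev)
	{
	    struct metadata_personality *best = 0;
	    unsigned best_bid = 0;
	    int i;

	    for (i = 0; i < n; i++) {
	        unsigned b = mods[i]->bid(bdev);

	        if (b > best_bid) {
	            best_bid = b;
	            best = mods[i];
	        }
	    }
	    return best;  /* winner configures the disk into the topology */
	}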

	Once the topology is created, all information needed for
	normal operation is available to the MD core and/or RAID
	personalities via direct variable access (at times protected
	by locks or atomic ops of course).  Array or member state
	changes occur via calling into the meta-data personality
	associated with that object.  The meta-data personality is
	then responsible for changing the state visible to the rest
	of the code and notifying interested parties.  This async
	design means that a RAID module noticing an I/O failure on
	one member and posting that event to one meta-data module
	may cause a chain of notifications all the way to the
	top-level array object owned by another RAID/meta-data
	personality.

	The entire topology is reference counted such that objects
	will only disappear from the topology once they have
	transitioned to the FAILED state and all I/O (each I/O holds
	a reference) ceases.

7) Correction of RAID0 Transform

	The RAID0 transform's "merge function" assumes that the
	incoming bio's starting sector is the same as what will be
	presented to its make_request function.  In the case of a
	partitioned MD device, the starting sector is shifted by
	the partition offset before it reaches make_request.  Unfortunately,
	the merge functions are not notified of the partition
	transform, so RAID0 would often reject requests that span
	"chunk" boundaries once shifted.  The fix employed here is
	to determine if a partition transform will occur and take
	this into account in the merge function.
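
	A tiny standalone demonstration of the failure mode (invented
	values, not the actual merge-function code):

	#include <stdio.h>

	typedef unsigned long long sector_t;

	/* RAID0 can only accept a request that stays within one chunk. */
	static int fits_in_chunk(sector_t dev_sector, unsigned len,
	                         unsigned chunk_sectors)
	{
	    return (dev_sector % chunk_sectors) + len <= chunk_sectors;
	}

	int main(void)
	{
	    unsigned chunk_sectors = 128;  /* 64KiB chunks               */
	    sector_t part_start    = 63;   /* partition offset           */
	    sector_t bio_sector    = 0;    /* sector given to merge fn   */
	    unsigned len           = 100;

	    /* Unshifted, the merge function accepts the request...     */
	    printf("merge check, unshifted: %d\n",
	           fits_in_chunk(bio_sector, len, chunk_sectors));

	    /* ...but make_request sees the shifted sector, and the
	     * request now spans a chunk boundary.  The fix is to apply
	     * the same shift in the merge check itself.               */
	    printf("merge check, shifted:   %d\n",
	           fits_in_chunk(bio_sector + part_start, len,
	                         chunk_sectors));
	    return 0;
	}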

Adaptec is currently validating EMD through formal testing while
continuing the build-out of new features.  Our hope is to gather
feedback from the Linux community and adjust our approach to satisfy
the community's requirements.  We look forward to your comments,
suggestions, and review of this project.

--
Justin




end of thread

Thread overview: 57+ messages
2004-03-19 20:19 "Enhanced" MD code available for review Justin T. Gibbs
2004-03-23  5:05 ` Neil Brown
2004-03-23  6:23   ` Justin T. Gibbs
2004-03-24  2:26     ` Neil Brown
2004-03-24 19:09       ` Matt Domsch
2004-03-25  2:21       ` Jeff Garzik
2004-03-25 18:00         ` Kevin Corry
2004-03-25 18:42           ` Jeff Garzik
2004-03-25 18:48             ` Jeff Garzik
2004-03-25 23:46               ` Justin T. Gibbs
2004-03-26  0:01                 ` Jeff Garzik
2004-03-26  0:10                   ` Justin T. Gibbs
2004-03-26  0:14                     ` Jeff Garzik
2004-03-25 22:04             ` Lars Marowsky-Bree
2004-03-26 19:19               ` Kevin Corry
2004-03-31 17:07                 ` Randy.Dunlap
2004-03-25 23:35             ` Justin T. Gibbs
2004-03-26  0:13               ` Jeff Garzik
2004-03-26 17:43                 ` Justin T. Gibbs
2004-03-28  0:06                   ` Lincoln Dale
2004-03-30 17:54                     ` Justin T. Gibbs
2004-03-28  0:30                   ` Jeff Garzik
2004-03-26 19:15             ` Kevin Corry
2004-03-26 20:45               ` Justin T. Gibbs
2004-03-27 15:39                 ` Kevin Corry
2004-03-28  9:11                   ` [dm-devel] " christophe varoqui
2004-03-30 17:03                   ` Justin T. Gibbs
2004-03-30 17:15                     ` Jeff Garzik
2004-03-30 17:35                       ` Justin T. Gibbs
2004-03-30 17:46                         ` Jeff Garzik
2004-03-30 18:04                           ` Justin T. Gibbs
2004-03-30 21:47                             ` Jeff Garzik
2004-03-30 22:12                               ` Justin T. Gibbs
2004-03-30 22:34                                 ` Jeff Garzik
2004-03-30 18:11                         ` Bartlomiej Zolnierkiewicz
2004-03-25 22:59           ` Justin T. Gibbs
2004-03-25 23:44             ` Lars Marowsky-Bree
2004-03-26  0:03               ` Justin T. Gibbs
     [not found] <1AOTW-4Vx-7@gated-at.bofh.it>
     [not found] ` <1AOTW-4Vx-5@gated-at.bofh.it>
2004-03-18  1:33   ` Andi Kleen
2004-03-18  2:00     ` Jeff Garzik
2004-03-20  9:58       ` Jamie Lokier
  -- strict thread matches above, loose matches on Subject: below --
2004-03-17 18:14 Justin T. Gibbs
2004-03-17 19:18 ` Jeff Garzik
2004-03-17 19:32   ` Christoph Hellwig
2004-03-17 20:02     ` Jeff Garzik
2004-03-17 21:18   ` Scott Long
2004-03-17 21:35     ` Jeff Garzik
2004-03-17 21:45     ` Bartlomiej Zolnierkiewicz
2004-03-18  0:23       ` Scott Long
2004-03-18  1:55         ` Bartlomiej Zolnierkiewicz
2004-03-18  6:38         ` Stefan Smietanowski
2004-03-20 13:07         ` Arjan van de Ven
2004-03-21 23:42           ` Scott Long
2004-03-22  9:05             ` Arjan van de Ven
2004-03-22 21:59               ` Scott Long
2004-03-23  6:48                 ` Arjan van de Ven
2004-03-18  1:56     ` viro
