From: "Justin T. Gibbs" <gibbs@scsiguy.com>
To: linux-raid@vger.kernel.org
Cc: justin_gibbs@adaptec.com
Subject: "Enhanced" MD code avaible for review
Date: Wed, 17 Mar 2004 11:14:21 -0700
Message-ID: <459805408.1079547261@aslan.scsiguy.com>
[ I tried sending this last night from my Adaptec email address and have
yet to see it on the list. Sorry if this is a dup for any of you. ]
For the past few months, Adaptec Inc. has been working to enhance MD.
The goals of this project are:
o Allow fully pluggable meta-data modules
o Add support for Adaptec ASR (aka HostRAID) and DDF
(Disk Data Format) meta-data types. Both of these
formats are understood natively by certain vendor
BIOSes, meaning that arrays can be booted from transparently.
o Improve the ability of MD to auto-configure arrays.
o Support multi-level arrays transparently yet allow
proper event notification across levels when the
topology is known to MD.
o Create a more generic "work item" framework which is
used to support array initialization, rebuild, and
verify operations as well as miscellaneous tasks that
a meta-data or RAID personality may need to perform
from a thread context (e.g. spare activation where
meta-data records may need to be sequenced carefully).
o Modify the MD ioctl interface to allow the creation
of management utilities that are meta-data format
agnostic.
A snapshot of this work is now available here:
http://people.freebsd.org/~gibbs/linux/SRC/emd-0.7.0-tar.gz
This snapshot includes support for RAID0, RAID1, and the Adaptec
ASR and DDF meta-data formats. Additional RAID personalities and
support for the Super90 and Super 1 meta-data formats will be added
in the coming weeks, the end goal being to provide a superset of
the functionality in the current MD.
A patch to fs/partitions/check.c is also required for this
release to function correctly:
http://people.freebsd.org/~gibbs/linux/SRC/md_announce_whole_device.diff
As the file name implies, this patch exposes not only partitions
on devices, but all "base" block devices to MD. This is required
to support meta-data formats like ASR and DDF that typically operate
on the whole device. Nothing in the implementation prevents any
meta-data format from being used on a partition, but BIOS boot
support is only available in the non-partitioned mode.
Since the current MD notification scheme does not allow MD to receive
notifications unless it is statically compiled into the kernel, we
would like to work with the community to develop a more generic
notification scheme to which modules, such as MD, can dynamically
register. Until that occurs, these EMD snapshots will require at
least md.c to be a static component of the kernel.
For those wanting to test out this snapshot with an Adaptec HostRAID
U320 SCSI controller, you will need to update your kernel to use
version 2.0.8 of the aic79xx driver. This driver defaults to
attaching to 790X controllers operating in HostRAID mode in addition
to those in direct SCSI mode. This feature can be disabled using
a module or kernel command-line option. Driver source and BK send patches
for this driver can be found here:
http://people.freebsd.org/~gibbs/linux/SRC/aic79xx-linux-2.6-20040316-tar.gz
http://people.freebsd.org/~gibbs/linux/SRC/aic79xx-linux-2.6-20040316.bksend.gz
Architectural Notes
===================
The major areas of change in "EMD" can be categorized into:
1) "Object Oriented" Data structure changes
These changes are the basis for allowing RAID personalities
to transparently operate on "disks" or "arrays" as member
objects. While it has always been possible to create
multi-level arrays in MD using block layer stacking, our
approach allows MD to also stack internally. Once a given
RAID or meta-data personality is converted to the new
structures, this "feature" comes at no cost. The benefit
to stacking internally, which requires a meta-data format
that supports this, is that array state can propagate up
and down the topology without the loss of information
inherent in using the block layer to traverse levels of an
array.
2) Opcode based interfaces.
Rather than add additional method vectors to either the
RAID personality or meta-data personality objects, the new
code uses only a few methods that are parameterized. This
has allowed us to create a fairly rich interface between
the core and the personalities without overly bloating
personality "classes".
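As a rough illustration of the opcode idea, here is a minimal userspace sketch, not code from the EMD snapshot. The opcode names, struct layouts, and the array_control() helper are all hypothetical; the point is only that one parameterized method can stand in for several dedicated method vectors.

```c
#include <stddef.h>

/* Hypothetical opcodes; the real EMD opcode set is not listed in this post. */
enum pers_opcode {
    PERS_OP_START,
    PERS_OP_STOP,
    PERS_OP_GET_MEMBER_COUNT
};

struct array;

/* One parameterized entry point instead of one function pointer per operation. */
struct personality {
    const char *name;
    int (*control)(struct array *a, enum pers_opcode op, void *arg);
};

struct array {
    const struct personality *pers;
    int running;
    int member_count;
};

/* Example personality (hypothetical): handles only the opcodes it cares about. */
static int raid1_control(struct array *a, enum pers_opcode op, void *arg)
{
    switch (op) {
    case PERS_OP_START:
        a->running = 1;
        return 0;
    case PERS_OP_STOP:
        a->running = 0;
        return 0;
    case PERS_OP_GET_MEMBER_COUNT:
        if (arg == NULL)
            return -1;
        *(int *)arg = a->member_count;
        return 0;
    default:
        return -1;  /* unsupported opcode */
    }
}

static const struct personality raid1_personality = { "raid1", raid1_control };

/* Core-side helper: adding an opcode does not change the personality struct. */
static int array_control(struct array *a, enum pers_opcode op, void *arg)
{
    return a->pers->control(a, op, arg);
}
```

With this shape, a new operation costs one enum value and one switch case rather than a new function pointer in every personality.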
3) WorkItems
Workitems provide a generic framework for queuing work to
a thread context. Workitems include a "control" method as
well as a "handler" method. This separation allows, for
example, a RAID personality to use the generic sync handler
while trapping the "open", "close", and "free" of any sync
workitems. Since both handlers can be tailored to the
individual workitem that is queued, this removes the need
to overload one or more interfaces in the personalities.
It also means that any code in MD can make use of this
framework - it is not tied to particular objects or modules
in the system.
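The control/handler split can be sketched as follows. This is a simplified userspace model under stated assumptions, not the EMD code: the type and function names are invented, the queue is drained synchronously rather than by a kernel thread, and locking is omitted.

```c
#include <stddef.h>

/* Hypothetical control events, modelled on "open", "close" and "free" above. */
enum wi_event { WI_OPEN, WI_CLOSE, WI_FREE };

struct workitem;
typedef void (*wi_control_fn)(struct workitem *wi, enum wi_event ev);
typedef int  (*wi_handler_fn)(struct workitem *wi);  /* 0 means finished */

struct workitem {
    wi_control_fn control;  /* lifecycle hooks, may be personality specific */
    wi_handler_fn handler;  /* the actual work, may be generic */
    void *priv;
    struct workitem *next;
};

struct wi_queue { struct workitem *head, *tail; };

static void wi_enqueue(struct wi_queue *q, struct workitem *wi)
{
    wi->next = NULL;
    if (q->tail)
        q->tail->next = wi;
    else
        q->head = wi;
    q->tail = wi;
}

/* Stand-in for the worker thread body: drain the queue in FIFO order. */
static int wi_run_all(struct wi_queue *q)
{
    int ran = 0;
    struct workitem *wi;

    while ((wi = q->head) != NULL) {
        q->head = wi->next;
        if (q->head == NULL)
            q->tail = NULL;
        wi->control(wi, WI_OPEN);
        while (wi->handler(wi) != 0)
            ;  /* handler asks to be called again until it is done */
        wi->control(wi, WI_CLOSE);
        wi->control(wi, WI_FREE);  /* wi must not be touched after this */
        ran++;
    }
    return ran;
}

/* Generic handler: counts down the remaining steps stored in priv. */
static int generic_sync_handler(struct workitem *wi)
{
    int *remaining = wi->priv;

    if (*remaining > 0)
        (*remaining)--;
    return *remaining;
}

/* Personality-specific control: records which events it trapped. */
static int raid1_events_seen;

static void raid1_sync_control(struct workitem *wi, enum wi_event ev)
{
    (void)wi;
    raid1_events_seen |= 1 << ev;
}
```

Here the hypothetical RAID1 personality reuses generic_sync_handler unchanged and only supplies its own control function, which is the separation the text describes.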
4) "Syncable Volume" Support
All of the transaction accounting necessary to support
redundant arrays has been abstracted out into a few inline
functions. With the inclusion of a "sync support" structure
in a RAID personality's private data structure area and the
use of these functions, the generic sync framework is fully
available. The sync algorithm is also now more like that
in 2.4.X - with some updates to improve performance. Two
contiguous sync ranges are employed so that sync I/O can
be pending while the lock range is extended and new sync
I/O is stalled waiting for normal I/O writes that might
conflict with the new range to complete. The syncer updates
its stats more frequently than in the past so that it can
more quickly react to changes in the normal I/O load. Syncer
backoff is also disabled anytime there is pending I/O blocked
on the syncer's locked region. RAID personalities have
full control over the size of the sync windows used so that
they can be optimized based on RAID layout policy.
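One way to picture the two contiguous sync ranges is the sketch below. It is my reading of the description, not the EMD implementation: the struct, field names, and the advance policy are assumptions, and all locking and I/O accounting is left out.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

/*
 * Two back-to-back windows (hypothetical layout):
 *   [start, mid)  sync I/O may be in flight here
 *   [mid, end)    newly locked, waiting for conflicting writes to drain
 */
struct sync_windows {
    sector_t start, mid, end;
    sector_t window_size;  /* chosen by the RAID personality */
};

/* A normal write must wait if it overlaps any part of the locked region. */
static bool write_conflicts(const struct sync_windows *w,
                            sector_t sector, sector_t nsectors)
{
    return sector < w->end && sector + nsectors > w->start;
}

/* Retire the older window and extend the lock range by one window. */
static void sync_advance(struct sync_windows *w, sector_t array_size)
{
    w->start = w->mid;
    w->mid = w->end;
    w->end += w->window_size;
    if (w->end > array_size)
        w->end = array_size;
}
```

Because the older window keeps its sync I/O pending while the newer one waits for writes to drain, the syncer does not have to go idle each time the lock range moves.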
5) IOCTL Interface
"EMD" now performs all of its configuration via an "mdctl"
character device. Since one of our goals is to remove any
knowledge of meta-data type in the user control programs,
initial meta-data stamping and configuration validation
occurs in the kernel. In general, the meta-data modules
already need this validation code in order to support
auto-configuration, so adding this capability adds little
to the overall size of EMD. It does, however, require a
few additional ioctls to support things like querying the
maximum "coerced" size of a disk targeted for a new array,
or enumerating the names of installed meta-data modules,
etc.
This area of EMD is still in very active development and we expect
to provide a drop of an "emdadm" utility later this week.
6) Meta-data and Topology State
To support pluggable meta-data modules which may have diverse
policies, all embedded knowledge of the MD SuperBlock formats
has been removed. In general, the meta-data modules "bid"
on incoming devices that they can manage. The high bidder
is then asked to configure the disk into a reasonable
topology that can be managed by a RAID personality and the
MD core. The bidding process allows a more "native" meta-data
module to outbid a module that can handle the same format
in "compatibility" mode. It also allows the user to load
a meta-data module update during install scenarios even if
an older module is compiled statically into the kernel.
Once the topology is created, all information needed for
normal operation is available to the MD core and/or RAID
personalities via direct variable access (at times protected
by locks or atomic ops of course). Array or member state
changes occur via calling into the meta-data personality
associated with that object. The meta-data personality is
then responsible for changing the state visible to the rest
of the code and notifying interested parties. This async
design means that a RAID module noticing an I/O failure on
one member and posting that event to one meta-data module,
may cause a chain of notifications all the way to the
top-level array object owned by another RAID/meta-data
personality.
The entire topology is reference counted such that objects
will only disappear from the topology once they have
transitioned to the FAILED state and all I/O (each I/O holds
a reference) ceases.
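The bidding step described at the start of this item could look roughly like the following sketch. Everything here is illustrative: the struct names, the bid values, and EXAMPLE_MAGIC are made up, and a real module would inspect on-disk meta-data rather than a single field.

```c
#include <stddef.h>

#define EXAMPLE_MAGIC 0x1234u  /* made-up signature, not a real ASR or DDF value */

struct disk { unsigned int magic; };  /* stand-in for on-disk meta-data */

struct md_format {
    const char *name;
    /* Returns a bid > 0 if the module can manage the disk, 0 otherwise. */
    int (*bid)(const struct disk *d);
};

/* A "native" module bids high; a "compatibility" module bids low. */
static int native_bid(const struct disk *d)
{
    return d->magic == EXAMPLE_MAGIC ? 100 : 0;
}

static int compat_bid(const struct disk *d)
{
    return d->magic == EXAMPLE_MAGIC ? 10 : 0;
}

static const struct md_format native_format = { "native", native_bid };
static const struct md_format compat_format = { "compat", compat_bid };

/* Ask every registered module for a bid and return the highest bidder. */
static const struct md_format *
highest_bidder(const struct md_format *const *mods, size_t n,
               const struct disk *d)
{
    const struct md_format *best = NULL;
    int best_bid = 0;

    for (size_t i = 0; i < n; i++) {
        int b = mods[i]->bid(d);
        if (b > best_bid) {
            best_bid = b;
            best = mods[i];
        }
    }
    return best;  /* NULL if no module recognizes the disk */
}
```

In this model, a module loaded later simply has to bid higher than the statically compiled one to take over a format.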
7) Correction of RAID0 Transform
The RAID0 transform's "merge function" assumes that the
incoming bio's starting sector is the same as what will be
presented to its make_request function. In the case of a
partitioned MD device, the starting sector is shifted by
the partition's offset within the device. Unfortunately,
the merge functions are not notified of the partition
transform, so RAID0 would often reject requests that span
"chunk" boundaries once shifted. The fix employed here is
to determine if a partition transform will occur and take
this into account in the merge function.
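The arithmetic behind this fix can be shown with a small standalone sketch. It is not the patched merge function; the function name and parameters are hypothetical, and it only computes how many sectors remain before a chunk boundary once the partition offset is applied.

```c
#include <stdint.h>

typedef uint64_t sector_t;

/*
 * Number of sectors, starting at bio_sector, that fit before the next
 * chunk boundary. part_offset is the partition's start within the MD
 * device (0 for the whole device). chunk_sectors must be non-zero.
 */
static sector_t sectors_to_chunk_end(sector_t bio_sector,
                                     sector_t part_offset,
                                     sector_t chunk_sectors)
{
    /* This is the sector make_request will actually see. */
    sector_t dev_sector = bio_sector + part_offset;

    return chunk_sectors - (dev_sector % chunk_sectors);
}
```

For example, with 128-sector chunks and a partition starting at sector 63, a bio at partition-relative sector 120 lands on device sector 183, so 73 sectors remain in the chunk, not the 8 that an offset-unaware calculation would report.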
Adaptec is currently validating EMD through formal testing while
continuing the build-out of new features. Our hope is to gather
feedback from the Linux community and adjust our approach to satisfy
the community's requirements. We look forward to your comments,
suggestions, and review of this project.
--
Justin