From: Neil Brown <neilb@suse.de>
To: Krzysztof Wojcik <krzysztof.wojcik@intel.com>
Cc: linux-raid@vger.kernel.org, wojciech.neubauer@intel.com,
adam.kwolek@intel.com, dan.j.williams@intel.com,
ed.ciechanowski@intel.com
Subject: Re: [PATCH 11/13] Document the external reshape implementation
Date: Tue, 23 Nov 2010 15:52:29 +1100
Message-ID: <20101123155229.0e6c0511@notabene.brown>
In-Reply-To: <20101118092251.29508.44786.stgit@gklab-170-111.igk.intel.com>
On Thu, 18 Nov 2010 10:22:51 +0100
Krzysztof Wojcik <krzysztof.wojcik@intel.com> wrote:
> From: Dan Williams <dan.j.williams@intel.com>
>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> external-reshape-design.txt | 168 +++++++++++++++++++++++++++++++++++++++++++
*very* happy to get this sort of documentation.
Hopefully the holes will be filled in in due course(?).
There is no mention of starting mdmon when converting e.g. raid0 to raid10.
But applied as-is.
Thanks,
NeilBrown
> 1 files changed, 168 insertions(+), 0 deletions(-)
> create mode 100644 external-reshape-design.txt
>
> diff --git a/external-reshape-design.txt b/external-reshape-design.txt
> new file mode 100644
> index 0000000..d6fb98d
> --- /dev/null
> +++ b/external-reshape-design.txt
> @@ -0,0 +1,168 @@
> +External Reshape
> +
> +1 Problem statement
> +
> +External (third-party metadata) reshape differs from native-metadata
> +reshape in three key ways:
> +
> +1.1 Format specific constraints
> +
> +In the native case reshape is limited by what is implemented in the
> +generic reshape routine (Grow_reshape()) and what is supported by the
> +kernel. There are exceptional cases where Grow_reshape() may block
> +operations when it knows that the kernel implementation is broken, but
> +otherwise the kernel is relied upon to be the final arbiter of what
> +reshape operations are supported.
> +
> +In the external case the kernel, and the generic checks in
> +Grow_reshape(), become the super-set of what reshapes are possible. The
> +metadata format may not support, or may have yet to implement, a given
> +reshape type. The implication for Grow_reshape() is that it must query
> +the metadata handler and effect changes in the metadata before the new
> +geometry is posted to the kernel. The ->reshape_super method allows
> +Grow_reshape() to validate the requested operation and post the metadata
> +update.
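
A minimal sketch of the ordering described in 1.1, written as a toy Python model rather than mdadm's C code. ToyHandler, Geometry, grow_reshape and the shared log are hypothetical names for illustration; only the idea that ->reshape_super validates the request and posts the metadata update before the geometry reaches the kernel comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Geometry:
    level: int
    raid_disks: int


class ToyHandler:
    """Hypothetical stand-in for a metadata handler's ->reshape_super."""

    def __init__(self, supported, log):
        # (level, raid_disks) pairs this toy format can represent.
        self.supported = set(supported)
        self.log = log

    def reshape_super(self, geo):
        # Validate first; refuse anything the format cannot represent.
        if (geo.level, geo.raid_disks) not in self.supported:
            return False
        # Post the metadata update.
        self.log.append(("metadata", geo))
        return True


def grow_reshape(handler, log, geo):
    """Ask the handler first; post to the 'kernel' only if it agreed."""
    if not handler.reshape_super(geo):
        return False
    log.append(("kernel", geo))
    return True
```

In this model a request the format rejects never reaches the kernel, and an accepted request always shows the metadata entry ahead of the kernel entry in the log.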
> +
> +1.2 Scope of reshape
> +
> +Native metadata reshape is always performed at the array scope (no
> +metadata relationship with sibling arrays on the same disks). External
> +reshape, depending on the format, may not allow the number of member
> +disks to be changed in a subarray unless the change is simultaneously
> +applied to all subarrays in the container. For example the imsm format
> +requires all member disks to be a member of all subarrays, so a 4-disk
> +raid5 in a container that also houses a 4-disk raid10 array could not be
> +reshaped to 5 disks as the imsm format does not support a 5-disk raid10
> +representation. This requires the ->reshape_super method to check the
> +contents of the array and ask the user to run the reshape at container
> +scope (if both subarrays are agreeable to the change), or report an
> +error in the case where one subarray cannot support the change.
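
The scope decision in 1.2 could be modelled roughly as follows. This is a sketch under stated assumptions: toy_can_represent is an invented rule set (raid10 only at 4 disks, raid5 at 3 or more) chosen to reproduce the 4-disk raid5 plus raid10 example, and check_scope is a hypothetical helper, not an mdadm function.

```python
def toy_can_represent(level, raid_disks):
    """Assumed format rules, for illustration only."""
    if level == 10:
        return raid_disks == 4
    if level == 5:
        return raid_disks >= 3
    return raid_disks >= 1


def check_scope(subarray_levels, new_disks, can_represent=toy_can_represent):
    """Classify a request to change the member-disk count.

    Returns "error" if any subarray cannot take the new disk count,
    "container" if several subarraysadmit it and must change together,
    or "ok" if the container holds a single agreeable subarray.
    """
    if not all(can_represent(level, new_disks) for level in subarray_levels):
        return "error"
    if len(subarray_levels) > 1:
        return "container"
    return "ok"
```

With these toy rules, a container holding raid5 and raid10 rejects a move to 5 disks, two raid5 subarrays are told to reshape at container scope, and a lone raid5 may proceed.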
> +
> +1.3 Monitoring / checkpointing
> +
> +Reshape, unlike rebuild/resync, requires strict checkpointing to survive
> +interrupted reshape operations. For example when expanding a raid5
> +array the first few stripes of the array will be overwritten in a
> +destructive manner. When restarting the reshape process we need to know
> +the exact location of the last successfully written stripe, and we need
> +to restore the data in any partially overwritten stripe. Native
> +metadata stores this backup data in the unused portion of spares that
> +are being promoted to array members, or in an external backup file
> +(located on a non-involved block device).
> +
> +The kernel is in charge of recording checkpoints of reshape progress,
> +but mdadm is delegated the task of managing the backup space which
> +involves:
> +1/ Identifying what data will be overwritten in the next unit of reshape
> + operation
> +2/ Suspending access to that region so that a snapshot of the data can
> + be transferred to the backup space.
> +3/ Allowing the kernel to reshape the saved region and setting the
> + boundary for the next backup.
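
The three backup steps above can be sketched as a toy loop. This is an illustrative model only: a Python list stands in for the array, transform stands in for the kernel rewriting a block, and the backup dict stands in for the backup space. The names reshape_units, restore and fail_at are hypothetical, and suspending access appears only as a comment.

```python
def reshape_units(data, unit, transform, backup, start=0, fail_at=None):
    """Rewrite data in place, one unit at a time, snapshotting each unit first."""
    pos = start
    while pos < len(data):
        end = min(pos + unit, len(data))
        # 1/ identify the region the next unit of reshape will overwrite
        # 2/ (access to the region would be suspended here) snapshot it
        backup["start"] = pos
        backup["blocks"] = list(data[pos:end])
        # 3/ let the "kernel" reshape the saved region
        for i in range(pos, end):
            if i == fail_at:
                raise RuntimeError("interrupted mid-unit")
            data[i] = transform(data[i])
        pos = end  # boundary for the next backup
    return pos


def restore(data, backup):
    """Undo a partially rewritten unit; return the position to resume from."""
    start = backup["start"]
    data[start:start + len(backup["blocks"])] = backup["blocks"]
    return start
```

After an interruption, restore() puts the half-written unit back and reports where to resume, which is why the snapshot has to be taken before the unit is touched.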
> +
> +In the external reshape case we want to preserve this mdadm
> +'reshape-manager' arrangement, but have a third actor, mdmon, to
> +consider. It is tempting to give the role of managing reshape to mdmon,
> +but that is counter to its role as a monitor, and conflicts with the
> +existing capabilities and role of mdadm to manage the progress of
> +reshape. For clarity the external reshape implementation maintains the
> +role of mdmon as a (mostly) passive recorder of raid events, and mdadm
> +treats it as it would the kernel in the native reshape case (modulo
> +needing to send explicit metadata update messages and checking that
> +mdmon took the expected action).
> +
> +External reshape can use the generic md backup file as a fallback, but in the
> +optimal/firmware-compatible case the reshape-manager will use the metadata
> +specific areas for managing reshape. The implementation also needs to spawn a
> +reshape-manager per subarray when the reshape is being carried out at the
> +container level. For these two reasons the ->manage_reshape() method is
> +introduced. This method, in addition to the base tasks mentioned above:
> +1/ Spawns a manager per-subarray, when necessary
> +2/ Uses either generic routines in Grow.c for md-style backup file
> + support, or uses the metadata-format specific location for storing
> + recovery data.
> +This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
> +optionally take advantage of generic infrastructure in Grow.c.
> +
> +2 Details for specific reshape requests
> +
> +There are quite a few moving pieces spread out across md, mdadm, and mdmon for
> +the support of external reshape, and there are several different types of
> +reshape that need to be comprehended by the implementation. A rundown of
> +these details follows.
> +
> +2.0 General provisions:
> +
> +Obtain an exclusive open on the container to make sure we are not
> +running concurrently with a Create() event.
> +
> +2.1 Freezing sync_action
> +
> +2.2 Reshape size
> +
> + 1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
> + initializes st->update_tail
> + 2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the size change
> + is allowed (being performed at subarray scope / enough room) and prepares a
> + metadata update
> + 3/ mdadm::Grow_reshape(): flushes the metadata update (via
> + flush_metadata_update(), or ->sync_metadata())
> + 4/ mdadm::Grow_reshape(): posts the new size to the kernel
> +
> +
> +2.3 Reshape level (simple-takeover)
> +
> +"simple-takeover" implies the level change can be satisfied without touching
> +sync_action
> +
> + 1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
> + initializes st->update_tail
> + 2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level change
> + is allowed (being performed at subarray scope) and prepares a
> + metadata update
> + 2a/ raid10 --> raid0: degrade all mirror legs prior to calling
> + ->reshape_super
> + 3/ mdadm::Grow_reshape(): flushes the metadata update (via
> + flush_metadata_update(), or ->sync_metadata())
> + 4/ mdadm::Grow_reshape(): posts the new level to the kernel
> +
> +2.4 Reshape chunk, layout
> +
> +2.5 Reshape raid disks (grow)
> +
> + 1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
> + because only redundant raid levels can modify the number of raid disks
> + 2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the
> + change is allowed (being performed at proper scope / permissible
> + geometry / proper spares available in the container) and prepares a
> + metadata update.
> + 3/ mdadm::Grow_reshape(): Converts each subarray in the container to the
> + raid level that can perform the reshape and starts mdmon.
> + 4/ mdadm::Grow_reshape(): Pushes the update to mdmon...
> + 4a/ mdmon::process_update(): marks the array as reshaping
> + 4b/ mdmon::manage_member(): adds the spares (without assigning a slot)
> + 5/ mdadm::Grow_reshape(): Notes that mdmon has assigned spares and invokes
> + ->manage_reshape()
> + 6/ mdadm::<format>->manage_reshape(): (for each subarray) sets sync_max to
> + zero, starts the reshape, and pings mdmon
> + 6a/ mdmon::read_and_act(): notices that reshape has started and notifies
> + the metadata handler to record the slots chosen by the kernel
> + 7/ mdadm::<format>->manage_reshape(): saves data that will be overwritten by
> + the kernel to either the backup file or the metadata specific location,
> + advances sync_max, waits for reshape, pings mdmon, and repeats.
> + 7a/ mdmon::read_and_act(): records checkpoints
> + 8/ mdadm::<format>->manage_reshape(): Once reshape completes, changes the raid
> + level back to the nominal raid level (if necessary)
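
The sync_max hand-off between ->manage_reshape(), the kernel and mdmon in the steps above might be pictured with the toy model below. Everything here is an assumption made for illustration: ToyKernel, ToyMonitor, manage_reshape and save_backup are hypothetical Python stand-ins, and the real actors communicate through sysfs and metadata update messages rather than method calls.

```python
class ToyKernel:
    """Reshapes only as far as sync_max allows."""

    def __init__(self, size):
        self.size = size
        self.sync_max = 0
        self.position = 0

    def run(self):
        self.position = min(self.sync_max, self.size)


class ToyMonitor:
    """Passive recorder: notes the reshape position whenever it is pinged."""

    def __init__(self):
        self.checkpoints = []

    def ping(self, kernel):
        if not self.checkpoints or self.checkpoints[-1] != kernel.position:
            self.checkpoints.append(kernel.position)


def manage_reshape(kernel, monitor, step, save_backup):
    # Start with sync_max at zero so nothing is overwritten yet, then ping.
    kernel.sync_max = 0
    monitor.ping(kernel)
    while kernel.position < kernel.size:
        nxt = min(kernel.position + step, kernel.size)
        save_backup(kernel.position, nxt)  # save what will be overwritten
        kernel.sync_max = nxt              # advance sync_max
        kernel.run()                       # wait for the reshape to get there
        monitor.ping(kernel)               # monitor records the checkpoint
```

The point of the model is the division of labour: the manager decides how far the kernel may go and saves data first, while the monitor only records where the kernel has reached.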
> +
> + FIXME: native metadata does not have the capability to record the original
> + raid level in reshape-restart case because the kernel always records current
> + raid level to the metadata, whereas external metadata can masquerade at an
> + alternate level based on the reshape state.
> +
> +2.6 Reshape raid disks (shrink)
> +
> +3 TODO
> +
> +...
> +
> +[1]: Linux kernel design patterns - part 3, Neil Brown http://lwn.net/Articles/336262/
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html