From: jbrassow@sourceware.org <jbrassow@sourceware.org>
To: lvm-devel@redhat.com
Subject: LVM2/doc lvm_fault_handling.txt
Date: 26 Jul 2010 20:31:54 -0000
Message-ID: <20100726203154.32752.qmail@sourceware.org>
CVSROOT: /cvs/lvm2
Module name: LVM2
Changes by: jbrassow at sourceware.org 2010-07-26 20:31:54
Added files:
doc : lvm_fault_handling.txt
Log message:
Initial import of document describing LVM's policies
surrounding device faults/failures.
Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/LVM2/doc/lvm_fault_handling.txt.diff?cvsroot=lvm2&r1=NONE&r2=1.1
/cvs/lvm2/LVM2/doc/lvm_fault_handling.txt,v --> standard output
revision 1.1
--- LVM2/doc/lvm_fault_handling.txt
+++ - 2010-07-26 20:31:54.280508000 +0000
@@ -0,0 +1,221 @@
+LVM device fault handling
+=========================
+
+Introduction
+------------
+This document is to serve as the definitive source for information
+regarding the policies and procedures surrounding device failures
+in LVM. It codifies LVM's responses to device failures as well as
+the responsibilities of administrators.
+
+Device failures can be permanent or transient. A permanent failure
+is one where a device becomes inaccessible and will never be
+revived. A transient failure is a failure that can be recovered
+from (e.g. a power failure, intermittent network outage, block
+relocation, etc.). The policies for handling both types of failure
+are described herein.
+
+Available Operations During a Device Failure
+--------------------------------------------
+When there is a device failure, LVM behaves somewhat differently because
+only a subset of the available devices will be found for the particular
+volume group. The number of operations available to the administrator
+is diminished. It is not possible to create new logical volumes while
+PVs cannot be accessed, for example. Operations that create, convert, or
+resize logical volumes are disallowed, such as:
+- lvcreate
+- lvresize
+- lvreduce
+- lvextend
+- lvconvert (unless '--repair' is used)
+Operations that activate, deactivate, remove, report, or repair logical
+volumes are allowed, such as:
+- lvremove
+- vgremove (will remove all LVs, but not the VG until consistent)
+- pvs
+- vgs
+- lvs
+- lvchange -a [yn]
+- vgchange -a [yn]
+Operations specific to the handling of failed devices are allowed and
+are as follows:
+
+- 'vgreduce --removemissing <VG>': This action is designed to remove
+ the reference to a failed device from the LVM metadata stored on the
+ remaining devices. If there are (portions of) logical volumes on the
+ failed device, whether the operation can proceed depends on the
+ type of logical volumes found. If an image (i.e. leg or side)
+ of a mirror is located on the device, that image/leg of the mirror
+ is eliminated along with the failed device. The result of such a
+ mirror reduction could be a no-longer-redundant linear device. If
+ a linear, stripe, or snapshot device is located on the failed device,
+ the command will not proceed without the '--force' option. The result
+ of using the '--force' option is the entire removal and complete
+ loss of the non-redundant logical volume. Once this operation is
+ complete, the volume group will again have a complete and consistent
+ view of the devices it contains. Thus, all operations will be
+ permitted - including creation, conversion, and resizing operations.
+
+- 'lvconvert --repair <VG/LV>': This action is designed specifically
+ to operate on mirrored logical volumes. It is used on logical volumes
+ individually and does not remove the faulty device from the volume
+ group. If, for example, a failed device happened to contain the
+ images of four distinct mirrors, it would be necessary to run
+ 'lvconvert --repair' on each of them. The ultimate result is to leave
+ the faulty device in the volume group, but have no logical volumes
+ referencing it. In addition to removing mirror images that reside
+ on failed devices, 'lvconvert --repair' can also replace the failed
+ device if there are spare devices available in the volume group. When
+ run from the command line, the user is asked whether to simply remove
+ the failed portions of the mirror or to also allocate a replacement.
+ Optionally, the '--use-policies' flag can be specified, which will
+ cause the operation not to prompt the user, but instead respect
+ the policies outlined in the LVM configuration file - usually
+ /etc/lvm/lvm.conf. Once this operation is complete, mirrored logical
+ volumes will be consistent and I/O will be allowed to continue.
+ However, the volume group will still be inconsistent - due to the
+ referenced-but-missing device/PV - and operations will still be
+ restricted to the aforementioned actions until either the device is
+ restored or 'vgreduce --removemissing' is run.
+
+Device Revival (transient failures):
+------------------------------------
+The section above describes the limitations a user can expect during
+a device failure. However, if the device returns after a period of
+time, what to expect will depend on what happened while the device
+was unavailable. If no automated actions (described
+below) or user actions were necessary or performed, then no change in
+operations or logical volume layout will occur. However, if an
+automated action or one of the aforementioned repair commands was
+manually run, the returning device will be perceived as having stale
+LVM metadata. In this case, the user can expect to see a warning
+concerning inconsistent metadata. The metadata on the returning
+device will be automatically replaced with the latest copy of the
+LVM metadata - restoring consistency. Note that while most LVM commands
+will automatically update the metadata on a restored device, the
+following possible exceptions exist:
+- pvs (when it does not read/update VG metadata)
+
+Automated Target Response to Failures:
+--------------------------------------
+The only LVM target type (i.e. "personality") that has an automated
+response to failures is a mirrored logical volume. The other target
+types (linear, stripe, snapshot, etc.) will simply propagate the failure.
+[A snapshot becomes invalid if its underlying device fails, but the
+origin will remain valid - presuming the origin device has not failed.]
+There are three types of errors that a mirror can suffer - read, write,
+and resynchronization errors. Each is described in depth below.
+
+Mirror read failures:
+If a mirror is 'in-sync' (i.e. all images have been initialized and
+are identical), a read failure will only produce a warning. Data is
+simply pulled from one of the other images and the fault is recorded.
+Sometimes - like in the case of bad block relocation - read errors can
+be recovered from by the storage hardware. Therefore, it is up to the
+user to decide whether to reconfigure the mirror and remove the device
+that caused the error. Managing the composition of a mirror is done with
+'lvconvert' and removing a device from a volume group can be done with
+'vgreduce'.
+
+If a mirror is not 'in-sync', a read failure will produce an I/O error.
+This error will propagate all the way up to the applications above the
+logical volume (e.g. the file system). No automatic intervention will
+take place in this case either. It is up to the user to decide what
+can be done/salvaged in this scenario. If the user is confident that the
+images of the mirror are the same (or they are willing to simply attempt
+to retrieve whatever data they can), 'lvconvert' can be used to eliminate
+the failed image and proceed.
+
+Mirror resynchronization errors:
+A resynchronization error is one that occurs when trying to initialize
+all mirror images to be the same. It can happen due to a failure to
+read the primary image (the image considered to have the 'good' data), or
+due to a failure to write the secondary images. This type of failure
+only produces a warning, and it is up to the user to take action in this
+case. If the error is transient, the user can simply reactivate the
+mirrored logical volume to make another attempt at resynchronization.
+If attempts to finish resynchronization fail, 'lvconvert' can be used to
+remove the faulty device from the mirror.
+
+TODO...
+Some sort of response to this type of error could be automated.
+Since this document is the definitive source for how to handle device
+failures, the process should be defined here. If the process is defined
+but not implemented, it should be noted as such. One idea might be to
+suspend/resume the mirror once in order to retry the sync operation
+that failed. On the other hand, if there is
+a permanent failure, it may simply be best to wait for the user or the
+automated response that is sure to follow from a write failure.
+...TODO
+
+Mirror write failures:
+When a write error occurs on a mirror constituent device, an attempt
+to handle the failure is automatically made. This is done by calling
+'lvconvert --repair --use-policies'. The policies implied by this
+command are set in the LVM configuration file. They are:
+- mirror_log_fault_policy: This defines what action should be taken
+ if the device containing the log fails. The available options are
+ "remove" and "allocate". Either of these options will cause the
+ faulty log device to be removed from the mirror. The "allocate"
+ policy will attempt the further action of trying to replace the
+ failed disk log by using space that might be available in the
+ volume group. If the allocation fails (or the "remove" policy
+ is specified), the mirror log will be maintained in memory. Should
+ the machine be rebooted or the logical volume deactivated, a
+ complete resynchronization of the mirror will be necessary upon
+ the following activation - such is the nature of a mirror with a 'core'
+ log. The default policy for handling log failures is "allocate".
+ The service disruption incurred by replacing the failed log is
+ negligible, while the benefit of having a persistent log is
+ pronounced.
+- mirror_image_fault_policy: This defines what action should be taken
+ if a device containing an image fails. Again, the available options
+ are "remove" and "allocate". Both of these options will cause the
+ faulty image device to be removed - adjusting the logical volume
+ accordingly. For example, if one image of a 2-way mirror fails, the
+ mirror will be converted to a linear device. If one image of a
+ 3-way mirror fails, the mirror will be converted to a 2-way mirror.
+ The "allocate" policy takes the further action of trying to replace
+ the failed image using space that is available in the volume group.
+ Replacing a failed mirror image will incur the cost of
+ resynchronizing - degrading the performance of the mirror. The
+ default policy for handling an image failure is "remove". This
+ allows the mirror to still function, but gives the administrator the
+ choice of when to incur the extra performance costs of replacing
+ the failed image.
+
+TODO...
+The appropriate time to take permanent corrective action on a mirror
+should be driven by policy. There should be a directive that takes
+a time or percentage argument. Something like the following:
+- mirror_fault_policy_WHEN = "10sec"/"10%"
+A time value would signal the amount of time to wait for transient
+failures to resolve themselves. The percentage value would signal the
+amount a mirror could become out-of-sync before the faulty device is
+removed.
+
+A mirror cannot be used unless /some/ corrective action is taken,
+however. One option is to replace the failed mirror image with an
+error target, forgo the use of 'handle_errors', and simply let the
+out-of-sync regions accumulate and be tracked by the log. Mirrors
+that have more than 2 images would have to "stack" to perform the
+tracking, as each failed image would have to be associated with a
+log. If the failure is transient, the device would replace the
+error target that was holding its spot and the log that was tracking
+the deltas would be used to quickly restore the portions that changed.
+
+One unresolved issue with the above scheme is how to know which
+regions of the mirror are out-of-sync when a problem occurs. When
+a write failure occurs in the kernel, the log will contain those
+regions that are not in-sync. If the log is a disk log, that log
+could continue to be used to track differences. However, if the
+log was a core log - or if the log device failed at the same time
+as an image device - there would be no way to determine which
+regions are out-of-sync to begin with as we start to track the
+deltas for the failed image. I don't have a solution for this
+problem other than to only be able to handle errors in this way
+if conditions are right. These issues will have to be ironed out
+before proceeding. This could be another case where it is better
+to handle failures in the kernel by allowing the kernel to store
+updates in various metadata areas.
+...TODO
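
The two write-failure policies described under "Mirror write failures"
in the document above (mirror_log_fault_policy and
mirror_image_fault_policy) can be illustrated with a small Python model.
This is only a sketch of the documented outcomes, not LVM code: the
function names and the 'replaceable' and 'space_available' parameters are
hypothetical stand-ins for whatever free space the volume group actually
has, and a real 'lvconvert --repair --use-policies' run depends on the
real allocation state.

```python
# Toy model of the documented mirror fault policies.
# All names here are hypothetical; nothing below calls LVM.

POLICIES = ("remove", "allocate")


def images_after_fault(images, failed, policy="remove", replaceable=0):
    """Number of mirror images left after `failed` images are lost.

    Both policies drop the failed images. "allocate" additionally
    replaces as many of them as the volume group has room for
    (`replaceable`, an assumed input). Default is "remove", as in
    the document.
    """
    if policy not in POLICIES:
        raise ValueError("unknown policy: %r" % (policy,))
    if not 0 < failed < images:
        raise ValueError("at least one image must fail and one must survive")
    remaining = images - failed
    if policy == "allocate":
        remaining += min(failed, replaceable)
    return remaining


def layout(images):
    """Describe the logical volume that results from `images` images."""
    return "linear" if images == 1 else "%d-way mirror" % images


def log_after_fault(policy="allocate", space_available=True):
    """Log type after the log device fails.

    Both policies remove the faulty log. Only "allocate" with free
    space yields a new disk log; otherwise the log is kept in memory
    ('core'). Default is "allocate", as in the document.
    """
    if policy not in POLICIES:
        raise ValueError("unknown policy: %r" % (policy,))
    return "disk" if policy == "allocate" and space_available else "core"


if __name__ == "__main__":
    print(layout(images_after_fault(2, 1)))   # linear
    print(layout(images_after_fault(3, 1)))   # 2-way mirror
    print(log_after_fault("allocate", False)) # core
```

The model deliberately ignores resynchronization cost: with "allocate",
a replaced image still has to be resynchronized before the mirror is
redundant again, which is the reason the document gives for "remove"
being the default image policy.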