From: Joe Landman <joe.landman@gmail.com>
To: NeilBrown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Subject: Re: md road-map: 2011
Date: Wed, 16 Feb 2011 12:20:32 -0500	[thread overview]
Message-ID: <4D5C0760.4090304@gmail.com> (raw)
In-Reply-To: <20110216212751.51a294aa@notabene.brown>

On 02/16/2011 05:27 AM, NeilBrown wrote:
>
> Hi all,
>   I wrote this today and posted it at
> http://neil.brown.name/blog/20110216044002
>
> I thought it might be worth posting it here too...

Another request would be an incremental, on-demand build of the RAID.
That is, when we set up a RAID6, it would only compute blocks as they
are allocated and used.  This helps with things like thin provisioning
on remote target devices (among other nice things).

>
> NeilBrown
>
>
> -------------------------
>
>
> It is about 2 years since I last published a road-map[1] for md/raid
> so I thought it was time for another one.  Unfortunately quite a few
> things on the previous list remain undone, but there has been some
> progress.
>
> I think one of the problems with some to-do lists is that they aren't
> detailed enough.  High-level design, low level design, implementation,
> and testing are all very different sorts of tasks that seem to require
> different styles of thinking and so are best done separately.  As
> writing up a road-map is a high-level design task it makes sense to do
> the full high-level design at that point so that the tasks are
> detailed enough to be addressed individually with little reference to
> the other tasks in the list (except what is explicit in the road map).
>
> A particular need I am finding for this road map is to make explicit
> the required ordering and interdependence of certain tasks.  Hopefully
> that will make it easier to address them in an appropriate order, and
> mean that I waste less time saying "this is too hard, I might go read
> some email instead".
>
> So the following is a detailed road-map for md raid for the coming
> months.
>
> [1] http://neil.brown.name/blog/20090129234603
>
> Bad Block Log
> -------------
>
> As devices grow in capacity, the chance of finding a bad block
> increases, and the time taken to recover to a spare also increases.
> So the practice of ejecting a device from the array as soon as a
> write-error is detected is getting more and more problematic.
>
> For some time we have avoided ejecting devices for read errors, by
> computing the expected data from elsewhere and writing back to the
> device - hopefully fixing the read error.  However this cannot help
> degraded arrays and they will still eject a device (and hence fail the
> whole array) on a single read error.  This is not good.
>
> A particular problem is that when a device does fail and we need to
> recover the data, we typically read all of the blocks on all of the
> other devices.  If we are going to hit any read errors, this is the
> most likely time, and also the worst possible time: it means that the
> recovery doesn't complete, the array gets stuck in a degraded state,
> and it is very susceptible to substantial loss if another failure
> happens.
>
> Part of the answer to this is to implement a "bad block log".  This is
> a record of blocks that are known to be bad.  i.e. either a read or a
> write has recently failed.  Doing this allows us to just eject that
> block from the array rather than the whole device.  Similarly, instead
> of failing the whole array, we can fail just one stripe.  Certainly
> this can mean data loss, but the loss of a few K is much less
> traumatic than the loss of a terabyte.
>
> But using a bad block list isn't just about keeping the data loss
> small, it can be about keeping it to zero.  If we get a write error on
> a block in a non-degraded array, then recording the bad block means we
> lose redundancy in just that stripe rather than losing it across the
> whole array.  If we then lose a different block on a different drive,
> the ability to record the bad block means that we can continue without
> data loss.  Had we needed to eject both whole drives from the array we
> would have lost access to all of our data.
>
> The bad block list must be recorded to stable storage to be useful, so
> it really needs to be on the same drives that store the data.  The
> bad-block list for a particular device is only of any interest to that
> device.  Keeping information about one device on another is pointless.
> So we don't have a bad block list for the whole array, we keep
> multiple lists, one for each device.
>
> It would be best to keep at least two copies of the bad block list so
> that if the place where the list is stored goes bad we can keep
> working with the device.  The same logic applies to other metadata
> which currently
> cannot be duplicated.  So implementing this feature will not address
> metadata redundancy.  A separate feature should address metadata
> redundancy and it can duplicate the bad block list as well as other
> metadata.
>
> There are doubtlessly lots of ways that the bad block list could be
> stored, but we need to settle on one.  For externally managed metadata
> we need to make the list accessible via sysfs in a generic way so that
> a user-space program can store it as appropriate.
>
> So: for v0.90 we choose not to store a bad block list.  There isn't
> anywhere convenient to store it and new installations of v0.90 are not
> really encouraged.
>
> For v1.x metadata we record in the metadata an offset (from the
> superblock) and a size for a table, and a 'shift' value which can be
> used to shift from sector addresses to block numbers.  Thus the unit
> that is failed when an error is detected can be larger than one
> sector.
>
> Each entry in the table is 64 bits, stored little-endian.  The most
> significant 55 bits store a block number, which allows for 16 exbibytes
> with 512-byte blocks, or more if a larger shift size is used.  The
> remaining 9 bits store a length of the bad range which can range from
> 1 to 512.  As bad blocks can often be consecutive, this is expected to
> allow the list to be quite efficient.  A value of all 1's cannot
> correctly identify a bad range of blocks and so it is used to pad out
> the tail of the list.
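
As a rough sketch (not the actual md code), such an entry could be packed
and unpacked like this, assuming the length is stored as length-1 in the
low 9 bits and an all-ones word marks padding:

    #include <stdint.h>
    #include <endian.h>              /* htole64 / le64toh (glibc) */

    #define BB_PAD UINT64_C(0xffffffffffffffff)   /* all 1's: padding entry */

    /* Pack a bad range: block number in the top 55 bits,
     * length-1 (0..511, i.e. lengths 1..512) in the low 9 bits. */
    static uint64_t bb_pack(uint64_t block, unsigned int len)
    {
        return htole64((block << 9) | ((uint64_t)(len - 1) & 0x1ff));
    }

    /* Unpack an on-disk (little-endian) entry. */
    static void bb_unpack(uint64_t raw, uint64_t *block, unsigned int *len)
    {
        uint64_t v = le64toh(raw);

        *block = v >> 9;
        *len   = (unsigned int)(v & 0x1ff) + 1;
    }
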
>
> The bad block list is exposed through sysfs via a directory called
> 'badblocks' containing several attribute files.
>
> "shift" stores the 'shift' number described above and can be set as
> long as the bad block list is empty.
>
> "all" and "unacknowledged" each contains a list of bad ranges, the
> start (in blocks, not sectors) and the length (1-512).  Each can also
> be written to with a string of the same format as is read out.  This
> can be used to add bad blocks to the list or to acknowledge bad
> blocks.  Writing effectively says "this bad range is securely recorded
> on stable storage".
>
> All bad blocks appear in the "badblocks/all" file.  Only unacknowledged
> bad blocks appear in "badblocks/unacknowledged".  These are ranges
> which appear to be bad but are not known to be stored on stable
> storage.
>
> When md detects a write error or a read error which it cannot correct
> it adds the block to the list and marks the range it is part of as
> 'unacknowledged'.  Any write that depends on this block is then
> blocked until the range is acknowledged.  This ensures that an
> application isn't told that a write has succeeded until the data
> really is safe.
>
> If the bad block list is being managed by v1.x metadata internally,
> then the bad block list will be written out and the ranges will be
> acknowledged and writes unblocked automatically.
>
> If the bad block list is being managed externally, then the bad ranges
> will be reported in "unacknowledged_bad_blocks".  The metadata handler
> should read this, update the on-disk metadata and write the range back
> to "bad_blocks".  This completes the acknowledgment handshake and
> writes can continue.
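
For an externally-managed array, the handshake loop might look roughly
like this - the directory layout and attribute names are assumptions
based on the description above, not a definitive interface:

    #include <stdio.h>

    /* devdir is e.g. "/sys/block/md0/md/dev-sda1" (illustrative). */
    static int ack_bad_blocks(const char *devdir)
    {
        char path[256], line[64];
        FILE *in, *out;

        snprintf(path, sizeof(path), "%s/unacknowledged_bad_blocks", devdir);
        in = fopen(path, "r");
        if (!in)
            return -1;
        snprintf(path, sizeof(path), "%s/bad_blocks", devdir);
        out = fopen(path, "w");
        if (!out) {
            fclose(in);
            return -1;
        }

        /* Each line is "<start> <length>".  Once the range is safely in
         * our own on-disk metadata (not shown), write it back to
         * acknowledge it so that any blocked writes can proceed. */
        while (fgets(line, sizeof(line), in)) {
            /* ... record the range in the external metadata here ... */
            fputs(line, out);
            fflush(out);
        }
        fclose(in);
        fclose(out);
        return 0;
    }
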
>
> RAID1, RAID10 and RAID456 should all support bad blocks.  Every read
> or write should perform a lookup of the bad block list.  If a read
> finds a bad block, that device should be treated as failed for that
> read.  This includes reads that are part of resync or recovery.
>
> If a write finds a bad block there are two possible responses.  Either
> the block can be ignored as with reads, or we can try to write the
> data in the hope that it will fix the error.  Always taking the second
> action would seem best as it allows blocks to be removed from the
> bad-block list, but as a failing write can take a long time, there are
> plenty of cases where it would not be good.
>
> To choose between these we make the simple decision that once we see a
> write error we never try to write to bad blocks on that device again.
> This may not always be the perfect strategy, but it will effectively
> address common scenarios.  So if a bad block is marked bad due to a
> read error when the array was degraded, then a write (presumably from
> the filesystem) will have the opportunity to correct the error.
> However if it was marked bad due to a write error we don't risk paying
> the penalty of more write errors.
>
> This 'have seen a write error' status is not stored in the array
> metadata.  So when restarting an array with some bad blocks, each
> device will have one chance to prove that it can correctly handle
> writes to a bad block.  If it can, the bad block will be removed from
> the list and the data is that little bit safer.  If it cannot, no
> further writes to bad blocks will be tried on the device until the
> next array restart.
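
Expressed as pseudo-C, that policy is roughly a per-device flag that is
never persisted:

    /* Per-device state, reset every time the array is started. */
    struct dev_state {
        int seen_write_error;        /* set on the first write failure */
    };

    /* Should a write that overlaps a known-bad range be attempted on this
     * device, in the hope of clearing the bad-block entry? */
    static int try_write_to_bad_range(const struct dev_state *dev)
    {
        return !dev->seen_write_error;
    }
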
>
>
> Hot Replace
> -----------
>
> "Hot replace" is my name for the process of replacing one device in an
> array by another one without first failing the existing device.  Thus there
> can be two devices in an array filling the same 'role'.  One device
> will contain all of the data, the other device will only contain some
> of it and will be undergoing a 'recovery' process.  Once the second
> device is fully recovered it is expected that the first device will be
> removed from the array.
>
> This can be useful whenever you want to replace a working device with
> another device, without letting the array go degraded.  Two obvious
> cases are:
>   1/ when you want to replace a smaller device with a larger device
>   2/ when you have a device with a number of bad blocks and want to
>      replace it with a more reliable device.
>
> For '2' to be realised, the bad block log described above must be
> implemented, so it should be completed before this feature.
>
> Hot replace is really only needed for RAID10 and RAID456.  For RAID1,
> simply increasing the number of devices in the array while the new
> device recovers, then failing the old device and decreasing the number
> of devices in the array is sufficient.
>
> For RAID0 or LINEAR it would be sufficient to:
>   - stop the array
>   - make a RAID1 without superblocks for the old and new device
>   - re-assemble the array using the RAID1 in place of the old device.
>
> This is certainly not as convenient but is sufficient for a case that
> is not likely to be commonly needed.
>
> So for both the RAID10 and RAID456 modules we need:
>   - the ability to add a device as a hot-replace device for a specific
>     slot
>   - the ability to record hot-replace status in the metadata.
>   - a 'recovery' process to rebuild a device, preferably only reading
>     from the device to be replaced, though reading from elsewhere when
>     needed
>   - writes to go to both primary and secondary device.
>   - Reads to come from either if the second has recovered far enough.
>   - to promote a secondary device to primary when the primary device
>     (that has a hot-replace device) fails.
>
> It is not clear whether the primary should be automatically failed
> when the rebuild of the secondary completes.  Commonly this would be
> ideal, but if the secondary experienced any write errors (that were
> recorded in the bad block log) then it would be best to leave both in
> place until the sysadmin resolves the situation.   So in the first
> implementation this failing should not be automatic.
>
> The identification of a spare as a 'hot-replace' device is achieved
> through the 'md/dev-XXXX/slot' sysfs attribute.  This is usually
> 'none' or a small integer identifying which slot in the array is
> filled by this device.  If a number followed by a plus (e.g. '1+') is
> written, the device takes the role of a hot-replace device for that
> slot.  This syntax requires there be at most one hot-replace device
> per slot.  This is a
> deliberate decision to manage complexity in the code.  Allowing more
> would be of minimal value but require substantial extra complexity.
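
Assuming the device has already been added to the array, nominating it
as the hot-replace device for a slot is then a single sysfs write (the
paths here are illustrative):

    #include <stdio.h>

    /* e.g. make_hot_replace("/sys/block/md0/md/dev-sdc1", 1) writes "1+". */
    static int make_hot_replace(const char *devdir, int slot)
    {
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "%s/slot", devdir);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%d+\n", slot);
        return fclose(f);
    }
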
>
> v0.90 metadata is not supported.  v1.x sets a 'feature bit' on the
> superblock of any 'hot-replace' device and naturally records in
> 'recover_offset' how far recovery has progressed.  Externally managed
> metadata can support this, or not, as they choose.
>
>
> Reversible Reshape
> ------------------
>
> It is possible to start a reshape that cannot be reversed until the
> reshape has completed.  This is occasionally problematic.  While we
> might hope that users would never make errors, we should try to be as
> forgiving as possible.
>
> Reversing a reshape that changes the number of data-devices is
> possible as we support both growing and shrinking and these happen in
> opposite directions so one is the reverse of the other.  Thus at worst,
> such a reshape can be reversed by:
>   - stopping the array
>   - re-writing the metadata so it looks like the change is going in the
>     other direction
>   - restarting the array.
>
> However for a reshape that doesn't change the number of data devices,
> such as a RAID5->RAID6 conversion or a change of chunk-size, reversal
> is currently not possible as the change always goes in the same
> direction.
>
> This is currently only meaningful for RAID456, though at some later
> date it might be relevant for RAID10.
>
> A future change will make it possible to move the data_offset while
> performing a reshape, and that will sometimes require the reshape to
> progress in a certain direction.  It is only when the data_offset is
> unchanged and the number of data disks is unchanged that there is any
> doubt about direction.  In that case it needs to be explicitly stated.
>
> We need:
>   - some way to record in the metadata the direction of the reshape
>   - some way to ask for a reshape to be started in the reverse
>     direction
>   - some way to reverse a reshape that is currently happening.
>
> We have a new sysfs attribute "reshape_direction" which is
> "low-to-high" or "high-to-low".  This defaults to "low-to-high" but
> will be forced to "high-to-low" if the particular reshape requires it,
> or can be explicitly set by a 'write' before the reshape commences.
>
> Once the reshape has commenced, writing a new value to this field can
> flip the reshape causing it to be reverted.
>
> In both v0.90 and v1.x metadata we record a reversing reshape by
> setting the most significant bit in reshape_position.  For v0.90 we
> also increase the minor number to 91.  For v1.x we set a feature bit
> as well.
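
A sketch of that encoding, assuming reshape_position is a 64-bit sector
count with the top bit otherwise unused:

    #include <stdint.h>

    #define RESHAPE_BACKWARDS  (UINT64_C(1) << 63)

    static uint64_t encode_reshape_pos(uint64_t sectors, int backwards)
    {
        return backwards ? (sectors | RESHAPE_BACKWARDS) : sectors;
    }

    static uint64_t decode_reshape_pos(uint64_t raw, int *backwards)
    {
        *backwards = (raw & RESHAPE_BACKWARDS) != 0;
        return raw & ~RESHAPE_BACKWARDS;
    }
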
>
>
> Change data offset during reshape
> ---------------------------------
>
> One of the biggest problems with reshape currently is the need for the
> backup file.  This is a management problem as it cannot easily be
> found at restart, and it is a performance problem as the extra writing
> is expensive.
>
> In some cases we can avoid the need for a backup file completely by
> changing the data-offset.  i.e. the location on the devices where the
> array data starts.
>
> For reshapes that increase the number of devices, only a single backup
> is required at the start.  If the data_offset is moved just one chunk
> earlier we can do without a separate backup.  This obviously requires
> that space was left when the array was first created.  Recent versions
> of mdadm do leave some space with the default metadata, though more
> would probably be good.
>
> For reshapes that decrease the number of devices, only a small backup
> is required right at the end of the process (at the beginning of the
> devices).  If we move the data_offset forward by one chunk that backup
> too can be avoided.  As we are normally reducing the size of the array
> in this process, we just need to reduce it a little bit more.
>
> For reshapes that neither increase nor decrease the number of devices, a
> somewhat larger change in data_offset is needed to get reasonable
> performance.  A single chunk (of the larger chunk size) would work,
> but would require updating the metadata after each chunk which would
> be prohibitively slow unless chunks were very large.  A few megabytes
> is probably sufficient for reasonable performance, though testing
> would be helpful to be sure.  Current mdadm leaves no space at the
> start of 1.0, and about 1Meg at the start of 1.1 and 1.2 arrays.
>
> This will generally not be enough space.  In these cases it will
> probably be best to perform the reshape in the reverse direction
> (helped by the previous feature).  This will probably require
> shrinking the filesystem and the array slightly first.  Future
> versions of mdadm should aim to leave a few megabytes free at start
> and end to make these reshapes work better.
>
> Moving the data offset is not possible for 0.90 metadata as it does
> not record a data offset.
>
> For 1.x metadata it is possible to have a different data_offset on
> each device.  However for simplicity we will only support changing the
> data offset by the same amount on each device.  This amount will be
> stored in currently-unused space in the 1.x metadata.  There will be a
> sysfs attribute "delta_data_offset" which can be set to a number of
> sectors - positive or negative - to request a change in the data
> offset and thus avoid the need for a backup file.
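
Pulling the three cases above together, a helper for choosing the delta
might look like this - the 8 MiB figure for the same-number-of-devices
case is just the guess from the previous paragraph, and the sign in that
case really depends on the chosen reshape direction:

    #include <stdint.h>

    /* Suggest a delta_data_offset in 512-byte sectors.
     * delta_disks > 0: more data devices; one chunk of head-space suffices.
     * delta_disks < 0: fewer data devices; one chunk of tail-space suffices.
     * delta_disks == 0: chunk-size/layout change; use a few megabytes so
     * the metadata need not be updated after every chunk. */
    static int64_t suggest_delta_data_offset(int delta_disks,
                                             uint32_t chunk_sectors)
    {
        const int64_t few_megabytes = 8 * 1024 * 2;   /* 8 MiB in sectors */

        if (delta_disks > 0)
            return -(int64_t)chunk_sectors;  /* data starts one chunk earlier */
        if (delta_disks < 0)
            return (int64_t)chunk_sectors;   /* data starts one chunk later */
        return -few_megabytes;               /* direction is a free choice */
    }
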
>
>
> Bitmap of non-sync regions.
> ---------------------------
>
> There are a couple of reasons for having regions of an array that are known
> not to contain important data and are known to not necessarily be
> in-sync.
>
> 1/ When an array is first created it normally contains no valid data.
>     The normal process of a 'resync' to make all parity/copies correct
>     is largely a waste of time.
> 2/ When the filesystem uses a "discard" command to report that a
>     region of the device is no longer used it would be good to be able
>     to pass this down to the underlying devices.  To do this safely we
>     need to record at the md level that the region is unused so we
>     don't complain about inconsistencies and don't try to re-sync the
>     region after a crash.
>
> If we record which regions are not in-sync in a bitmap then we can meet
> both of these needs.
>
> A read to a non-in-sync region would always return 0s.
> A 'write' to a non-in-sync region should cause that region to be
> resynced.  Writing zeros would in some sense be ideal, but to do that
> we would have to block the write, which would be unfortunate.  As the
> fs should not be reading from that area anyway, it shouldn't really
> matter.
>
> The granularity of the bit is probably quite hard to get right.
> Having it match the block size would mean that no resync would be
> needed and that every discard request could be handled exactly.
> However it could result in a very large bitmap - 30 Megabytes for a 1
> terabyte device with a 4K block size.  This would need to be kept in
> memory and looked up for every access, which could be problematic.
>
> Having a very coarse granularity would make storage and lookups more
> efficient.  If we make sure the bitmap would fit in 4K, we would have
> about 32 megabytes per bit.  This would mean that each time we
> triggered a resync it would resync for a second or two which is
> probably a reasonable time as it wouldn't happen very often.  But it
> would also mean that we can only service a 'discard' request if it
> covers whole blocks of 32 megabytes, and I really don't know how
> likely that is.  Actually I'm not sure if anyone knows, the jury seems
> to still be out on how 'discard' will work long-term.
>
> So probably aiming for a few K to a few hundred K seems reasonable.
> That means that the in-memory representation will have to be a
> two-level array.  A page of pointers to other pages can cover (on a
> 64bit system) 512 pages or 2Meg of bitmap space which should be
> enough.
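
A minimal sketch of that two-level structure, assuming 4 KiB pages and
64-bit pointers (no locking or bounds checking, purely to show the shape
of the lookup):

    #include <stdint.h>
    #include <stdlib.h>

    #define NS_PAGE_SIZE      4096
    /* 512 pointers per page on a 64-bit system */
    #define NS_PTRS_PER_PAGE  (NS_PAGE_SIZE / sizeof(void *))
    #define NS_BITS_PER_PAGE  (NS_PAGE_SIZE * 8)              /* 32768 */

    struct nonsync_bitmap {
        unsigned char *pages[NS_PTRS_PER_PAGE]; /* up to 2 MiB of bitmap */
        unsigned int   chunk_shift;             /* sectors -> chunk number */
    };

    /* Non-zero if the chunk containing 'sector' is flagged not-in-sync.
     * A missing second-level page is treated as "in sync". */
    static int nonsync_test(const struct nonsync_bitmap *bm, uint64_t sector)
    {
        uint64_t chunk = sector >> bm->chunk_shift;
        const unsigned char *page = bm->pages[chunk / NS_BITS_PER_PAGE];

        if (!page)
            return 0;
        return page[(chunk % NS_BITS_PER_PAGE) / 8] & (1 << (chunk % 8));
    }

    static int nonsync_set(struct nonsync_bitmap *bm, uint64_t sector)
    {
        uint64_t chunk = sector >> bm->chunk_shift;
        unsigned char **pp = &bm->pages[chunk / NS_BITS_PER_PAGE];

        if (!*pp && !(*pp = calloc(1, NS_PAGE_SIZE)))
            return -1;
        (*pp)[(chunk % NS_BITS_PER_PAGE) / 8] |= 1 << (chunk % 8);
        return 0;
    }
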
>
> As always we need a way to:
>   - record the location and size of the bitmap in the metadata
>   - allow the granularity to be set via sysfs
>   - allow bits to be set via sysfs, and allow the current bitmap to
>     be read via sysfs.
>
> For v0.90 metadata we won't support this as there is no room.  We
> could possibly store about 32 bytes directly in the superblock
> allowing for 4Gig sections but this is unlikely to be really useful.
>
> For v1.x metadata we use 8 bytes from the 'array state info'.  4 bytes
> give an offset from the metadata of the start of the bitmap, 2 bytes
> give the space reserved for the bitmap (max 32Meg) and 2 bytes give a
> shift value from sectors to in-sync chunks.  The actual size of the
> bitmap must be computed from the known size of the array and the size
> of the chunks.
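
Read back as a structure, that is roughly the following - the field
names are mine, and I am guessing the reserved-space field is counted in
sectors, which is what gives the 32 Meg maximum:

    #include <stdint.h>

    /* The 8 bytes borrowed from the array-state info (hypothetical layout). */
    struct nonsync_sb_info {
        uint32_t bitmap_offset;  /* sectors from the superblock to the bitmap */
        uint16_t bitmap_space;   /* reserved space, in sectors (max 32 MiB)   */
        uint16_t chunk_shift;    /* shift from sectors to non-sync chunks     */
    };

    /* The bitmap size itself is implied by the array and chunk sizes. */
    static uint64_t nonsync_bitmap_bytes(uint64_t array_sectors,
                                         uint16_t chunk_shift)
    {
        uint64_t chunks = (array_sectors + (UINT64_C(1) << chunk_shift) - 1)
                          >> chunk_shift;

        return (chunks + 7) / 8;
    }
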
>
> We present the bitmap in sysfs similar to the way we present the bad
> block list.  A file 'non-sync/regions' contains start and size of regions
> (measured in sectors) that are known to not be in-sync.  A file
> 'non-sync/now-in-sync' lists ranges that actually are in sync but have
> not yet been recorded as such in the metadata.  User-space reads
> 'now-in-sync', updates the metadata, and writes to 'regions'.
>
> Another file 'non-sync/to-discard' lists ranges for which a discard
> request has been made.  These need to be recorded in the metadata.
> They are then written back to the file which allows the discard
> request to complete.
>
> The granularity can be set via sysfs by writing to
> 'non-sync/chunksize'.
>
>
> Assume-clean when increasing array --size
> -----------------------------------------
>
> When a RAID1 is created, --assume-clean can be given so that the
> largely-unnecessary initial resync can be avoided.  When extending the
> size of an array with --grow --size=, there is no way to specify
> --assume-clean.
>
> If a non-sync bitmap (see above) is configured this doesn't matter,
> as the extra space will simply be marked as non-in-sync.
> However if a non-sync bitmap is not supported by the metadata or is
> not configured it would be good if md/raid1 can be told not to sync
> the extra space - to assume that it is in-sync.
>
> So when a non-sync bitmap is not configured (the chunk-size is zero),
> writing to the non-sync/regions file tells md that we don't care about the
> region being in-sync.  So the sequence:
>   - freeze sync_action
>   - update size
>   - write range to non-sync/regions
>   - unfreeze sync_action
>
> will effect a "--grow --size=bigger --assume-clean" reshape.
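
In sysfs terms that sequence could be driven roughly as follows;
'frozen'/'idle' are the usual sync_action values, while the non-sync
attribute name and the units of the region string follow the proposal
above and are therefore assumptions:

    #include <stdio.h>

    /* Write a value to an attribute under /sys/block/md0/md/ (illustrative). */
    static int md_attr_write(const char *attr, const char *val)
    {
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/md0/md/%s", attr);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fputs(val, f);
        return fclose(f);
    }

    /* "--grow --size=bigger --assume-clean": freeze, resize, mark the new
     * space as don't-care, unfreeze. */
    static int grow_assume_clean(const char *new_size,
                                 const char *added_region /* "start length" */)
    {
        if (md_attr_write("sync_action", "frozen") ||
            md_attr_write("component_size", new_size) ||
            md_attr_write("non-sync/regions", added_region))
            return -1;
        return md_attr_write("sync_action", "idle");
    }
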
>
>
> Enable 'reshape' to also perform 'recovery'.
> --------------------------------------------
>
> As a 'reshape' re-writes all the data in the array it can quite easily
> be used to recover to a spare device.  Normally these two operations
> would happen separately.  However if a device fails during a reshape
> and a spare is available it makes sense to combine them.
>
> Currently if a device fails during a reshape (leaving the array
> degraded but functional) the reshape will continue and complete.  Then
> if a spare is available it will be recovered.  This means a longer
> total time until the array is optimal.
>
> When the device fails, the reshape actually aborts, and then restarts
> from where it left off.  If instead we allow spares to be added
> between the abort and the restart, and cause the 'reshape' to actually
> do a recovery until it reaches the point where it was already up to,
> then we minimise the time to getting an optimal array.
>
>
> When reshaping an array to fewer devices, allow 'size' to be increased
> --------------------------------------------------------------------
>
> The 'size' of an array is the amount of space on each device which is
> used by the array.  Normally the 'size' of an array cannot be set
> beyond the amount of space available on the smallest device.
>
> However when reshaping an array to have fewer devices it can be useful
> to be able to set the 'size' to be the smallest of the remaining
> devices - those that will still be in use after the reshape.
>
> Normally reshaping an array to have fewer devices will make the array
> size smaller.  However if we can simultaneously increase the size of
> the remaining devices, the array size can stay unchanged or even grow.
>
> This can be used after replacing (ideally using hot-replace) a few
> devices in the array with larger devices.  The net result will be a
> similar amount of storage using fewer drives, each larger than before.
>
> This should simply be a case of allowing size to be set larger when
> delta_disks is negative.  It also requires that when converting the
> excess devices to spares, we fail them if they are smaller than the new
> size.
>
> As a reshape can be reversed, we must make sure to revert the size
> change when reversing a reshape.
>
> Allow write-intent-bitmap to be added to an array during reshape/recovery.
> --------------------------------------------------------------------------
>
> Currently it is not possible to add a write-intent-bitmap to an array
> that is being reshaped/resynced/recovered.  There is no real
> justification for this, it was just easier at the time.
>
> Implementing this requires a review of all code relating to the
> bitmap, checking that a bitmap appearing - or disappearing - during
> these processes will not be a problem.  As the array is quiescent when
> the bitmap is added, no IO will actually be happening so it *should*
> be safe.
>
> This should also allow a reshape to be started while a bitmap is
> present, as long as the reshape doesn't change the implied size of the
> bitmap.
>
> Support resizing of write-intent-bitmap prior to reshape
> --------------------------------------------------------
>
> When we increase the 'size' of an array (the amount of the device
> used), that implies a change in size of the bitmap.  However the
> kernel cannot unilaterally resize the bitmap as there may not be room.
>
> Rather, mdadm needs to be able to resize the bitmap first.  This
> requires the sysfs interface to expose the size of the bitmap - which
> is currently implicit.
>
> Whether the bitmap coverage is increased by increasing the number of
> bits or increasing the chunk size, some updating of the bitmap storage
> will be necessary (particularly in the second case).
>
> So it makes sense to allow user-space to remove the bitmap then add a
> new bitmap with a different configuration.  If there is concern about
> a crash between these two, writes could be suspended for the (short)
> duration.
>
> Currently the 'sync_size' stored in the bitmap superblock is not used.
> We could stop updating that, and could allow the bitmap to
> automatically extend up to that boundary.
>
> So: we have a well defined 'sync_size' which can be set via the
> superblock or via sysfs.  A resize is permitted as long as there is no
> bitmap, or the existing bitmap has a sufficiently large sync_size.
>
> Support reshape of RAID10 arrays.
> ---------------------------------
>
> RAID10 arrays currently cannot be reshaped at all.  It is possible to
> convert a 'near' mode RAID10 to RAID0, but that is about all.   Some
> real reshape is possible and should be implemented.
>
> 1/ A 'near' or 'offset' layout can have the device size changed quite
>     easily.
>
> 2/ Device size of 'far' arrays cannot be changed easily.  Increasing device
>     size of 'far' would require re-laying out a lot of data.  We would
>     need to record the 'old' and 'new' sizes which metadata doesn't
>     currently allow.  If we spent 8 bytes on this we could possibly
>     manage a 'reverse reshape' style conversion here.
>
> 3/ Increasing the number of devices is much the same for all layouts.
>     The data needs to be copied to the new location.  As we currently
>     block IO while recovery is actually happening, we could just do
>     that for reshape as well, and make sure reshape happens in whole
>     chunks at a time (or whatever turns out to be the minimum
>     recordable unit).  We switch to 'clean' before doing any reshape so
>     a write will switch to 'dirty' and update the metadata.
>
> 4/ decreasing the number of devices is very much the reverse of
>     increasing.
>     Here is a weird thought:  We have introduced the idea that we can
>     increase the size of remaining devices when we decrease the number
>     of devices in the array.  For 'raid10-far', the re-layout for
>     increasing the device size is very much like that for decreasing
>     the number of devices - just that the number doesn't actually
>     decrease.
>
> 5/ changing layouts between 'near' and 'offset' should be manageable
>     providing enough 'backup' space is available.  We simply copy
>     a few chunks worth of data and move reshape_position.
>
> 6/ changing layout to or from 'far' is nearly impossible...
>     With a change in data_offset it might be possible to move one
>     stripe at a time, always into the place just vacated.
>     However keeping track of where we are and where it is safe to read
>     from would be a major headache - unless it fell out with some
>     really neat maths, which I don't think it does.
>     So this option will be left out.
>
>
> So the only 'instant' conversion possible is to increase the device
> size for 'near' and 'offset' arrays.
>
> 'reshape' conversions can modify chunk size, increase/decrease number of
> devices and swap between 'near' and 'offset' layout providing a
> suitable number of chunks of backup space is available.
>
> The device-size of a 'far' layout can also be changed by a reshape
> providing the number of devices is not increased.
>
>
> Better reporting of inconsistencies.
> ------------------------------------
>
> When a 'check' finds a data inconsistency it would be useful if it
> was reported.   That would allow a sysadmin to try to understand the
> cause and possibly fix it.
>
> One simple approach would be to simply log all inconsistencies through
> the kernel logs.  This would have to be limited to 'check' and
> possibly 'repair' passes, as logging during a 'sync' pass (which also
> finds inconsistencies) can be expected to be very noisy.
>
> Another approach is to use a sysfs file to export a list of
> addresses.  This would place some upper limit on the number of
> addresses that could be listed, but if there are more inconsistencies
> than that limit, then the details probably aren't all that important.
>
> It makes sense to follow both of these paths.
>   - some easy-to-parse logging of inconsistencies found.
>   - a sysfs file that lists as many inconsistencies as possible.
>
> Each inconsistency is listed as a simple sector offset.  For
> RAID4/5/6, it is an offset from the start of data on the individual
> devices.  For RAID1 and RAID10 it is an offset from the start of the
> array.  So this
> can only be interpreted with a full understanding of the array layout.
>
> The actual inconsistency may be in some sector immediately following
> the given sector, as md performs checks in blocks larger than one
> sector and doesn't bother refining.  So a process that uses this
> information should read forward from the address to make sure it has
> found all of the inconsistency.  For striped arrays, at most 1 chunk
> need be examined.  For non-striped (i.e. RAID1) the window size is
> currently 64K.  The actual size can be found by dividing
> 'mismatch_cnt' by the number of entries in the mismatch list.
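
So a consumer of the list would size its forward scan along these lines
(a sketch; both numbers come from sysfs):

    /* How many sectors after each reported address need examining:
     * mismatch_cnt divided by the number of entries in the mismatch list,
     * at most one chunk for striped arrays and 64K for RAID1. */
    static unsigned long mismatch_window_sectors(unsigned long mismatch_cnt,
                                                 unsigned long nr_entries)
    {
        return nr_entries ? mismatch_cnt / nr_entries : 0;
    }
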
>
> This has no dependencies on other features.  It relates slightly to
> the bad-block list as one way of dealing with an inconsistency is to
> tell md that a selected block in the stripe is 'bad'.

