From: Namhyung Kim <namhyung@gmail.com>
To: David Brown <david.brown@hesbynett.no>
Cc: linux-raid@vger.kernel.org
Subject: Re: [md PATCH 17/34] md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6
Date: Tue, 26 Jul 2011 22:23:27 +0900
Message-ID: <8739htfa5s.fsf@gmail.com>
In-Reply-To: <j0m22q$rur$1@dough.gmane.org> (David Brown's message of "Tue, 26 Jul 2011 11:40:42 +0200")
David Brown <david.brown@hesbynett.no> writes:
> On 26/07/11 03:52, NeilBrown wrote:
>> On Fri, 22 Jul 2011 18:10:33 +0900 Namhyung Kim<namhyung@gmail.com> wrote:
>>
>>> NeilBrown<neilb@suse.de> writes:
>>>
>>>> RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
>>>> also allowed to choose 'read-modify-write'.
>>>> Apart from this difference, handle_stripe_dirtying[56] are nearly
>>>> identical. So resolve these differences and create just one function.
>>>>
>>>> Signed-off-by: NeilBrown<neilb@suse.de>
>>>
>>> Reviewed-by: Namhyung Kim<namhyung@gmail.com>
>>>
>>> BTW, here is a question:
>>> Why doesn't RAID6 allow read-modify-write? I don't think it is
>>> impossible, so what prevents doing it? Performance? Complexity?
>>> Or is it in fact impossible? Why? :)
>>
>>
>> The primary reason in my mind is that when Peter Anvin wrote this code he
>> didn't implement read-modify-write and I have never seen a need to consider
>> changing that.
>>
>> You would need a largish array - at least 7 devices - before RMW could
>> possibly be a win, but some people do have arrays larger than that.
>>
>> The computation to "subtract" from the Q-syndrome might be a bit complex - I
>> don't know.
>>
>
> The "subtract from Q" isn't difficult in theory, but it involves more
> cpu time than the "subtract from P". It should be an overall win in
> the same way as for RAID5. However, the code would be more complex if
> you allow for RMW on more than one block.
>
> With raid5, if you have a set of data blocks D0, D1, ..., Dn and a
> parity block P, and you want to change Di_old to Di_new, you can do
> this:
>
> Read Di_old and P_old
> Calculate P_new = P_old xor Di_old xor Di_new
> Write Di_new and P_new
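
For illustration, here is a minimal C sketch of that single-block parity
update (the function and buffer names are hypothetical, not the actual md
implementation, which works on struct page buffers via xor_blocks()):

#include <stddef.h>
#include <stdint.h>

/* P_new = P_old xor Di_old xor Di_new, computed byte by byte.
 * 'p' holds P_old on entry and P_new on return. */
static void raid5_rmw_update_parity(uint8_t *p, const uint8_t *d_old,
                                    const uint8_t *d_new, size_t len)
{
        for (size_t k = 0; k < len; k++)
                p[k] ^= d_old[k] ^ d_new[k];
}
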
>
> You can easily extend this to changing several data blocks without
> reading in the whole old stripe. I don't know if the Linux raid5
> implementation does that or not (I understand the theory here, but I
> am far from an expert on the implementation). There is no doubt a
> balance between the speed gains of multiple-block RMW and the code
> complexity. As long as you are changing fewer than half the blocks in
> the stripe, the cpu time will be less than it would be for a whole-stripe
> write. For RMW writes that are more than half a stripe, the "best"
> choice depends on the balance between cpu time and disk bandwidth - a
> balance that is very different on today's servers compared to those
> when Linux raid5 was first written.
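
As a hedged sketch of that multi-block extension (building on the
previous sketch, again with hypothetical names), the update is just the
single-block loop folded over every changed block:

/* Fold several changed blocks into P in one pass; d_old/d_new are
 * arrays of pointers to the old and new contents of the changed
 * blocks.  Only the changed blocks and P are ever read. */
static void raid5_rmw_update_parity_multi(uint8_t *p,
                                          const uint8_t *const *d_old,
                                          const uint8_t *const *d_new,
                                          int nblocks, size_t len)
{
        for (int b = 0; b < nblocks; b++)
                for (size_t k = 0; k < len; k++)
                        p[k] ^= d_old[b][k] ^ d_new[b][k];
}
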
>
>
>
> With raid6, the procedure is similar:
>
> Read Di_old, P_old and Q_old.
> Calculate P_new = P_old xor Di_old xor Di_new
> Calculate Q_new = Q_old xor (2^i . Di_old) xor (2^i . Di_new)
> = Q_old xor (2^i . (Di_old xor Di_new))
> Write Di_new, P_new and Q_new
>
> The difference is simply that (Di_old xor Di_new) needs to be
> multiplied (over the GF field - not normal multiplication) by 2^i.
> You would do this by repeatedly applying the multiply-by-two function
> that already exists in the raid6 implementation.
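
A minimal sketch of that procedure, assuming the standard RAID6 field
GF(2^8) over the polynomial 0x11d (the helper names here are
hypothetical; the kernel keeps its own multiply-by-two in the raid6
code):

#include <stddef.h>
#include <stdint.h>

/* Multiply by 2 in GF(2^8) over x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
static uint8_t gf_mul2(uint8_t x)
{
        return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1d : 0));
}

/* Multiply by 2^i by applying gf_mul2() i times. */
static uint8_t gf_mul2n(uint8_t x, int i)
{
        while (i--)
                x = gf_mul2(x);
        return x;
}

/* Single-block RAID6 RMW for data block index i:
 *   P_new = P_old xor (Di_old xor Di_new)
 *   Q_new = Q_old xor 2^i . (Di_old xor Di_new)
 * computing the common subexpression (Di_old xor Di_new) once. */
static void raid6_rmw_update(uint8_t *p, uint8_t *q,
                             const uint8_t *d_old, const uint8_t *d_new,
                             int i, size_t len)
{
        for (size_t k = 0; k < len; k++) {
                uint8_t delta = d_old[k] ^ d_new[k];
                p[k] ^= delta;
                q[k] ^= gf_mul2n(delta, i);
        }
}
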
>
> I don't know whether or not it is worth using the common subexpression
> (Di_old xor Di_new), which turns up twice.
>
>
>
> Multi-block raid6 RMW is similar, but you have to keep track of how
> many times you should multiply by 2. For the general case, the code
> would probably be too messy - it's easier to simply handle the whole
> stripe. But the case of consecutive blocks is easier, and likely to be
> far more common in practice. If you want to change blocks Di through
> Dj, you can do:
>
> Read Di_old, D(i+1)_old, ..., Dj_old, P_old and Q_old.
> Calculate P_new = P_old xor Di_old xor Di_new
> xor D(i+1)_old xor D(i+1)_new
> ...
> xor Dj_old xor Dj_new
>
>
> Calculate Q_new = Q_old xor (2^i . (Di_old xor Di_new))
> xor (2^(i+1) . (D(i+1)_old xor D(i+1)_new))
> ...
> xor (2^j . (Dj_old xor Dj_new))
>                 = Q_old xor (2^i . (    (Di_old xor Di_new)
>                                     xor (2^1     . (D(i+1)_old xor D(i+1)_new))
>                                     xor (2^2     . (D(i+2)_old xor D(i+2)_new))
>                                     ...
>                                     xor (2^(j-i) . (Dj_old xor Dj_new)) ))
>
> Write Di_new, D(i+1)_new, ..., Dj_new, P_new and Q_new
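
A hedged sketch of this consecutive-block case, reusing gf_mul2() and
gf_mul2n() from the sketch above: for each byte offset, the bracketed
factored form can be evaluated with a Horner-style loop over the deltas
d_m = Dm_old xor Dm_new, running from m = j down to m = i, and the
result multiplied by 2^i once at the end.

/* RMW for the consecutive data blocks i..j.  d_old[m - i] and
 * d_new[m - i] point to the old and new contents of block m.
 * Horner evaluation:
 *   acc = d_j
 *   acc = d_(j-1) xor 2.acc
 *   ...
 *   acc = d_i     xor 2.acc  ==  d_i xor 2.d_(i+1) xor ... xor 2^(j-i).d_j
 * and then Q_new = Q_old xor 2^i . acc. */
static void raid6_rmw_update_range(uint8_t *p, uint8_t *q,
                                   const uint8_t *const *d_old,
                                   const uint8_t *const *d_new,
                                   int i, int j, size_t len)
{
        for (size_t k = 0; k < len; k++) {
                uint8_t acc = 0;
                for (int m = j; m >= i; m--) {
                        uint8_t delta = d_old[m - i][k] ^ d_new[m - i][k];
                        p[k] ^= delta;          /* P gets the plain xor */
                        acc = delta ^ gf_mul2(acc);
                }
                q[k] ^= gf_mul2n(acc, i);
        }
}
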
>
>
> The algorithm above looks a little messy (ASCII emails are not the
> best medium for mathematics), but it's not hard to see the pattern,
> and the loops needed. It should also be possible to merge such
> routines with the main raid6 parity calculation functions.
>
>
> mvh.,
>
> David
>
Thanks a lot for your detailed explanation, David!
I think it'd be good if we implemented RMW for small (e.g. single-disk)
writes on RAID6 too.
--
Regards,
Namhyung Kim