* Raid 4/5 small writes
@ 2006-04-02 15:37 Gaëtan LEURENT
2006-04-03 1:37 ` Neil Brown
0 siblings, 1 reply; 3+ messages in thread
From: Gaëtan LEURENT @ 2006-04-02 15:37 UTC (permalink / raw)
To: linux-raid
Hi,
I'm considering building a Raid4[#] array for my desktop and I have a
question about small writes with Raid4/Raid5: when a small part of a
block is modified, we have two options:
- read the whole stripe, compute the new checksum and write the data and
the checksum
- read only the part of the data that will be overwritten, and the
corresponding part of the checksum. Since the checksum is a simple
XOR, we have:
New Checksum = Old Checksum XOR Old Data XOR New Data
and we can write the new data and the new checksum without reading
more data.
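The XOR identity in the second option can be checked with a few lines of Python. This is only a sketch: the stripe layout (three 16-byte data blocks plus one parity block) and the helper name `xor_blocks` are assumptions made for illustration, not anything taken from the md driver.

```python
import os
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical stripe: three 16-byte data blocks and their XOR parity.
data = [os.urandom(16) for _ in range(3)]
parity = xor_blocks(*data)

# Overwrite block 1 using only the old parity, the old data and the new data.
new_block = os.urandom(16)
new_parity = xor_blocks(parity, data[1], new_block)

# Cross-check against recomputing the parity from the whole stripe.
data[1] = new_block
assert new_parity == xor_blocks(*data)
```

The final assertion holds because XOR-ing the old data into the old parity cancels its contribution, leaving room for the new data.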
Does the Linux kernel implement the second option?
[#] I'm not doing Raid5 because I have two 120GB drives and two 250GB
drives. I am thinking of making a Raid0 array out of the two small
drives, and using that as the parity drive for the Raid4. I believe
this should give better performance than Raid5.
Thanks,
--
Gaëtan LEURENT
* Re: Raid 4/5 small writes
2006-04-02 15:37 Raid 4/5 small writes Gaëtan LEURENT
@ 2006-04-03 1:37 ` Neil Brown
[not found] ` <444282BB.6080407@tmr.com>
0 siblings, 1 reply; 3+ messages in thread
From: Neil Brown @ 2006-04-03 1:37 UTC (permalink / raw)
To: Gaëtan LEURENT; +Cc: linux-raid
On Sunday April 2, gaetan.leurent@ens.fr wrote:
> Hi,
>
> I'm considering building a Raid4[#] array for my desktop and I have a
> question about small writes with Raid4/Raid5: when a small part of a
> block is modified, we have two options:
>
> - read the whole stripe, compute the new checksum and write the data and
> the checksum
>
> - read only the part of the data that will be overwritten, and the
> corresponding part of the checksum. Since the checksum is a simple
> XOR, we have:
> New Checksum = Old Checksum XOR Old Data XOR New Data
> and we can write the new data and the new checksum without reading
> more data.
>
> Does the Linux kernel implement the second option?
Linux/md does the right thing.
If you are writing all the blocks in a stripe, it doesn't read
anything. It just creates the parity block from the new data and
writes it and the data out.
If you are writing more than half the blocks in the stripe, it
pre-reads the blocks you aren't writing and uses them and the new
blocks to generate parity and writes it and the new data out.
If you are writing fewer than half the blocks in the stripe, it will
do as you suggest, read the old data, work out what the new parity
will be from them and the old parity and the new data, and write out
the parity and new data.
If you are writing exactly half the data in a stripe, I think it takes
the first option (Read the old unchanged data) as that is fewer reads
than reading the old changed data and the parity block.
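The rule described above amounts to picking whichever approach needs fewer pre-reads. A minimal Python sketch of that reasoning follows. The function name, the strategy labels and the tie-breaking toward reconstruct-write are assumptions for illustration; this is not the md code itself.

```python
from typing import Tuple

def choose_write_strategy(data_disks: int, blocks_written: int) -> Tuple[str, int]:
    """Return (strategy, pre-reads needed) for a write to one stripe.

    data_disks counts data blocks per stripe and excludes the parity block.
    """
    if not 0 < blocks_written <= data_disks:
        raise ValueError("blocks_written must be between 1 and data_disks")

    if blocks_written == data_disks:
        # Full-stripe write: parity comes from the new data alone.
        return ("full-stripe", 0)

    # Read-modify-write: read the old copy of each block being written,
    # plus the old parity block.
    rmw_reads = blocks_written + 1
    # Reconstruct-write: read the data blocks that are NOT being written.
    rcw_reads = data_disks - blocks_written

    if rmw_reads < rcw_reads:
        return ("read-modify-write", rmw_reads)
    # Assumption: ties go to reconstruct-write.
    return ("reconstruct-write", rcw_reads)

# With 8 data disks: 2 blocks -> read-modify-write (3 reads),
# 4 blocks (exactly half) -> reconstruct-write (4 reads instead of 5).
print(choose_write_strategy(8, 2))
print(choose_write_strategy(8, 4))
```

Note that comparing read counts is not quite the same as comparing against "half": on small arrays the two rules can differ by one block, so treat the thresholds here as illustrative.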
Does that make sense?
>
>
> [#] I'm not doing Raid5 because I have two 120GB drives and two 250GB
> drives. I am thinking of making a Raid0 array out of the two small
> drives, and using that as the parity drive for the Raid4. I believe
> this should give better performance than Raid5.
Certainly worth a try.
NeilBrown
* Re: Raid 4/5 small writes
[not found] ` <444282BB.6080407@tmr.com>
@ 2006-04-16 22:37 ` Neil Brown
0 siblings, 0 replies; 3+ messages in thread
From: Neil Brown @ 2006-04-16 22:37 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Gaëtan LEURENT, linux-raid
On Sunday April 16, davidsen@tmr.com wrote:
> Neil Brown wrote:
> >
> >If you are writing exactly half the data in a stripe, I think it takes
> >the first option (Read the old unchanged data) as that is fewer reads
> >than reading the old changed data and the parity block.
> >
> >Does that make sense?
> >
> >
> You can do some simulation of this, but consider this scenario: I have a
> write which changes five of eight data blocks of a RAID4/5.
> 8data+1parity. If I read the old data I read blocks from three drives
> and write to six, generating a seek+io of every drive in the array.
>
> If I read the old data and parity only on the drives which will be
> rewritten, I generate two seeks and two io on five data and one parity
> drive. However... I have not generated seeks on the other three drives,
> and the 2nd seek on each drive will either be a no-op if the data are on
> a single cylinder, or will be small, the seek from the end of the old
> data back to the start.
>
> I think the better performance would depend on the size of the io and
> seek (small, or it would write every data drive), and the load (leaving
> three drives to do user io could be a gain). Am I missing something? I
> don't disagree with the way it works, I'm just not sure it's optimal if
> the write doesn't reach every drive in the array.
>
Optimising for a single IO request is pretty pointless. The situation
that you want to optimise for is when there are lots of IO requests
coming and queueing up and keeping the array busy. In this case I
suspect you would want to spread load evenly over all devices, and
keep the total number of IOs to a minimum. That is what the current
code tries to do.
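For the 8 data + 1 parity example quoted above (five of eight data blocks written), the I/O totals of the two approaches can be tallied as follows. This is a rough sketch: the `IoCost` type and both function names are hypothetical, and it counts requests only, ignoring seek distance and queueing.

```python
from dataclasses import dataclass

@dataclass
class IoCost:
    reads: int
    writes: int
    drives_touched: int

def reconstruct_write_cost(data_disks: int, blocks_written: int) -> IoCost:
    # Read the untouched data blocks, write the new data plus parity.
    reads = data_disks - blocks_written
    writes = blocks_written + 1
    # Reads and writes land on different drives.
    return IoCost(reads, writes, reads + writes)

def read_modify_write_cost(data_disks: int, blocks_written: int) -> IoCost:
    # Read old data and old parity, then write both back.
    reads = blocks_written + 1
    writes = blocks_written + 1
    # The same drives are read and then written.
    return IoCost(reads, writes, writes)

rcw = reconstruct_write_cost(8, 5)
rmw = read_modify_write_cost(8, 5)
print(rcw)  # 3 reads + 6 writes = 9 IOs, spread over all 9 drives
print(rmw)  # 6 reads + 6 writes = 12 IOs, on 6 drives, 3 left idle
```

Reconstruct-write issues fewer requests in total, while read-modify-write leaves three drives free, which is the trade-off being discussed.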
Of course, this is pure theory, and so could be purely wrong. The only
way we could tell is to do some measurements on some interesting
loads.
It would be very easy to change the code to always do read-modify-write
or always do reconstruct-write, and then test performance for some
real benchmark. I would certainly be interested if any benchmarks
were significantly affected by the choice.
So I agree with your first comment - doing some simulations is best.
NeilBrown