linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* raid5: two writing algorithms
@ 2008-02-07 16:13 Keld Jørn Simonsen
  2008-02-07 20:25 ` Neil Brown
  0 siblings, 1 reply; 5+ messages in thread
From: Keld Jørn Simonsen @ 2008-02-07 16:13 UTC (permalink / raw)
  To: linux-raid

As I understand it, there are 2 valid algoritms for writing in raid5.

1. calculate the parity data by XOR'ing all data of the relevant data
chunks.

2. calculate the parity data by kind of XOR-subtracting the old data to
be changed, and then XOR-adding the new data. (XOR-subtract and XOR-add
is actually the same).

There are situations where method 1 is the fastest, and situations where
method 2 is the fastest.

My idea is then that the raid5 code in the kernel can calculate which
method is the faster. 

method 1 is faster, if all data is already available. I understand that
this method is employed in the current kernel. This would eg be the case
with sequential writes.

Method 2 is faster, if no data is available in core. It would require
2 reads and two writes, which always will be faster than n reads and 1
write, possibly except for n=2. method 2 is thus faster normally for
random writes.

I think that method 2 is not used in the kernel today. Mayby I am wrong,
but I did have a look in the kernel code.

So I hereby give the idea for inspiration to kernel hackers.

Yoyr kernel hacker wannabe
keld

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: raid5: two writing algorithms
  2008-02-07 16:13 raid5: two writing algorithms Keld Jørn Simonsen
@ 2008-02-07 20:25 ` Neil Brown
  2008-02-08  1:10   ` Keld Jørn Simonsen
  0 siblings, 1 reply; 5+ messages in thread
From: Neil Brown @ 2008-02-07 20:25 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: linux-raid

On Thursday February 7, keld@dkuug.dk wrote:
> As I understand it, there are 2 valid algoritms for writing in raid5.
> 
> 1. calculate the parity data by XOR'ing all data of the relevant data
> chunks.
> 
> 2. calculate the parity data by kind of XOR-subtracting the old data to
> be changed, and then XOR-adding the new data. (XOR-subtract and XOR-add
> is actually the same).
> 
> There are situations where method 1 is the fastest, and situations where
> method 2 is the fastest.
> 
> My idea is then that the raid5 code in the kernel can calculate which
> method is the faster. 
> 
> method 1 is faster, if all data is already available. I understand that
> this method is employed in the current kernel. This would eg be the case
> with sequential writes.
> 
> Method 2 is faster, if no data is available in core. It would require
> 2 reads and two writes, which always will be faster than n reads and 1
> write, possibly except for n=2. method 2 is thus faster normally for
> random writes.
> 
> I think that method 2 is not used in the kernel today. Mayby I am wrong,
> but I did have a look in the kernel code.

It is very odd that you would think something about the behaviour of
the kernel with actually having looked.

It also seems a little arrogant to have a clever idea and assume that
no one else has thought of it before.

> 
> So I hereby give the idea for inspiration to kernel hackers.

and I hereby invite you to read the code ;-)

Code reading is a good first step to being a
> 
> Yoyr kernel hacker wannabe
       ^^^^^^^^^^^^^

NeilBrown


> keld
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: raid5: two writing algorithms
  2008-02-07 20:25 ` Neil Brown
@ 2008-02-08  1:10   ` Keld Jørn Simonsen
  2008-02-08  1:51     ` Neil Brown
  0 siblings, 1 reply; 5+ messages in thread
From: Keld Jørn Simonsen @ 2008-02-08  1:10 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

On Fri, Feb 08, 2008 at 07:25:31AM +1100, Neil Brown wrote:
> On Thursday February 7, keld@dkuug.dk wrote:
> > As I understand it, there are 2 valid algoritms for writing in raid5.
> > 
> > 1. calculate the parity data by XOR'ing all data of the relevant data
> > chunks.
> > 
> > 2. calculate the parity data by kind of XOR-subtracting the old data to
> > be changed, and then XOR-adding the new data. (XOR-subtract and XOR-add
> > is actually the same).
> > 
> > There are situations where method 1 is the fastest, and situations where
> > method 2 is the fastest.
> > 
> > My idea is then that the raid5 code in the kernel can calculate which
> > method is the faster. 
> > 
> > method 1 is faster, if all data is already available. I understand that
> > this method is employed in the current kernel. This would eg be the case
> > with sequential writes.
> > 
> > Method 2 is faster, if no data is available in core. It would require
> > 2 reads and two writes, which always will be faster than n reads and 1
> > write, possibly except for n=2. method 2 is thus faster normally for
> > random writes.
> > 
> > I think that method 2 is not used in the kernel today. Mayby I am wrong,
> > but I did have a look in the kernel code.
> 
> It is very odd that you would think something about the behaviour of
> the kernel with actually having looked.
> 
> It also seems a little arrogant to have a clever idea and assume that
> no one else has thought of it before.

Oh well, I have to admit that I do not understand the code fully.
I am not a seasoned kernel hacker, as I also indicated in my ad hoc
signature.

> > So I hereby give the idea for inspiration to kernel hackers.
> 
> and I hereby invite you to read the code ;-)

I did some reading.  Is there somewhere a description of it, especially
the raid code, or are the comments and the code the best documentation?

Do you say that this is already implemented?

I am sorry if you think I am mailing too much on the list.
But I happen to think it is fun.
And I do try to give something back.

> Code reading is a good first step to being a
> > 
> > Yoyr kernel hacker wannabe
>        ^^^^^^^^^^^^^
> 
> NeilBrown

Well, I do have a hack in mind, on the raid10,f2.
I need to investigate some more, and possibly test out
what really happens. But maybe the code already does what I want it to.
You are possibly the one that knows the code best, so maybe you can tell
me if raid10,f2 always does its reading in the first part of the disks?

best regards
keld

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: raid5: two writing algorithms
  2008-02-08  1:10   ` Keld Jørn Simonsen
@ 2008-02-08  1:51     ` Neil Brown
  2008-02-08 10:25       ` Keld Jørn Simonsen
  0 siblings, 1 reply; 5+ messages in thread
From: Neil Brown @ 2008-02-08  1:51 UTC (permalink / raw)
  To: Keld Jørn Simonsen; +Cc: linux-raid

On Friday February 8, keld@dkuug.dk wrote:
> On Fri, Feb 08, 2008 at 07:25:31AM +1100, Neil Brown wrote:
> > On Thursday February 7, keld@dkuug.dk wrote:
> 
> > > So I hereby give the idea for inspiration to kernel hackers.
> > 
> > and I hereby invite you to read the code ;-)
> 
> I did some reading.  Is there somewhere a description of it, especially
> the raid code, or are the comments and the code the best documentation?

No.  If a description was written (and various people have tried to
describe various parts) it would be out of date within a few months :-(

Look for "READ_MODIFY_WRITE" and "RECONSTRUCT_WRITE" .... no.  That
only applied to raid6 code now..
Look instead for the 'rcw' and 'rmw' counters, and then at
'handle_write_operations5'  which does different things based on the
'rcw' variable.

It used to be a lot clearer before we implemented xor-offload.  The
xor-offload stuff is good, but it does make the code more complex.


> 
> Do you say that this is already implemented?

Yes.

> 
> I am sorry if you think I am mailing too much on the list.

You aren't.

> But I happen to think it is fun.

Good.

> And I do try to give something back.

We'll look forward to that.

> 
> > Code reading is a good first step to being a
> > > 
> > > Yoyr kernel hacker wannabe
> >        ^^^^^^^^^^^^^
> > 
> > NeilBrown
> 
> Well, I do have a hack in mind, on the raid10,f2.
> I need to investigate some more, and possibly test out
> what really happens. But maybe the code already does what I want it to.
> You are possibly the one that knows the code best, so maybe you can tell
> me if raid10,f2 always does its reading in the first part of the disks?

Yes, I know the code best.

No, raid10,f2 doesn't always use the first part of the disk.  Getting
it to do that would be a fairly small change in 'read_balance' in
md/raid10.c.

I'm not at all convinced that the read balancing code in raid10 (or
raid1) really does the best thing.  So any improvements - backed up
with broad testing - would be most welcome.

NeilBrown

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: raid5: two writing algorithms
  2008-02-08  1:51     ` Neil Brown
@ 2008-02-08 10:25       ` Keld Jørn Simonsen
  0 siblings, 0 replies; 5+ messages in thread
From: Keld Jørn Simonsen @ 2008-02-08 10:25 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

On Fri, Feb 08, 2008 at 12:51:39PM +1100, Neil Brown wrote:
> On Friday February 8, keld@dkuug.dk wrote:
> > On Fri, Feb 08, 2008 at 07:25:31AM +1100, Neil Brown wrote:
> > > On Thursday February 7, keld@dkuug.dk wrote:
> > 
> > > > So I hereby give the idea for inspiration to kernel hackers.
> > > 
> > > and I hereby invite you to read the code ;-)
> > 
> > I did some reading.  Is there somewhere a description of it, especially
> > the raid code, or are the comments and the code the best documentation?
> 
> No.  If a description was written (and various people have tried to
> describe various parts) it would be out of date within a few months :-(

OK, I was under the impression that some of the code did not change
much. Eg. you said that there had not been any work on optimizing
raid10 for performance since the 2.6.12 kernel I was using. And then at
least the raid5 code, the last copyright notice right in the top is 
Copyright (C) 2002, 2003 H. Peter Anvin. That is 5 years ago. 
And your name is not on it. So I did not look that much into that code,
thinking nothing had been done there for ages. Maybe you could add your
name on it, that would only be fair. The same comment goes for other
modules (for which it is relevant).

> Look for "READ_MODIFY_WRITE" and "RECONSTRUCT_WRITE" .... no.  That
> only applied to raid6 code now..
> Look instead for the 'rcw' and 'rmw' counters, and then at
> 'handle_write_operations5'  which does different things based on the
> 'rcw' variable.
> 
> It used to be a lot clearer before we implemented xor-offload.  The
> xor-offload stuff is good, but it does make the code more complex.

OK, I think it is fairly well documented here, I can at least follow the
logic, and then I think it is a good approach to have the flow
description/strategy included directly in the code. Given there are many
changes to the code, different files for code and description could
easily mix up the alignment of code and documentation badly.

> 
> 
> > 
> > Do you say that this is already implemented?
> 
> Yes.

That is very good!


Do you konw if other implementations of this, eg. commercial controller
code, have this facility? If not, we could list this as an advantage of 
linux raid. Anyway it would be implicit in performance documentation.
I do plan to write up something on performance, soonish. The howto
is hopelessly outdated.

IMHO such code should make the performance of raid5 random writes not
that bad. Better than the reputation that raid5 is hopelessly slow for
database writing. I think raid5 would be less than double as slow as
raid1 for random writing.

> > Well, I do have a hack in mind, on the raid10,f2.
> > I need to investigate some more, and possibly test out
> > what really happens. But maybe the code already does what I want it to.
> > You are possibly the one that knows the code best, so maybe you can tell
> > me if raid10,f2 always does its reading in the first part of the disks?
> 
> Yes, I know the code best.
> 
> No, raid10,f2 doesn't always use the first part of the disk.  Getting
> it to do that would be a fairly small change in 'read_balance' in
> md/raid10.c.
> 
> I'm not at all convinced that the read balancing code in raid10 (or
> raid1) really does the best thing.  So any improvements - backed up
> with broad testing - would be most welcome.

I think I know where to do my proposed changes, and how it could be done.
So maybe in a not too distant future I will have done my first kernel
hack!

Best regards
keld

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2008-02-08 10:25 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-02-07 16:13 raid5: two writing algorithms Keld Jørn Simonsen
2008-02-07 20:25 ` Neil Brown
2008-02-08  1:10   ` Keld Jørn Simonsen
2008-02-08  1:51     ` Neil Brown
2008-02-08 10:25       ` Keld Jørn Simonsen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).