* raid5 write performance
@ 2005-11-18 14:05 Jure Pečar
2005-11-18 19:19 ` Dan Stromberg
From: Jure Pečar @ 2005-11-18 14:05 UTC (permalink / raw)
To: linux-raid
Hi all,
Currently ZFS is major news in the storage world. It is very interesting to read the various details about it on the blogs of Sun employees. Among the more interesting posts I found was this one:
http://blogs.sun.com/roller/page/bonwick?entry=raid_z
The point the author makes is that it is impossible to atomically both write data and update parity, which leaves a crash window that could silently leave the on-disk data+parity in an inconsistent state. He then mentions that there are software-only workarounds for this, but that they are very, very slow.
It's interesting that my experience with Veritas RAID5, for example, is exactly that: slow to the point of being unusable. Now I'm wondering what kind of magic Linux md RAID5 does, since its write performance is quite good. Or does it actually do anything about this at all? :)
Neil?
--
Jure Pečar
http://jure.pecar.org
* Re: raid5 write performance
2005-11-18 14:05 Jure Pečar
@ 2005-11-18 19:19 ` Dan Stromberg
2005-11-18 19:23 ` Mike Hardy
From: Dan Stromberg @ 2005-11-18 19:19 UTC (permalink / raw)
To: Jure Pečar; +Cc: linux-raid, strombrg
Would it really be that much slower to have a journal of RAID 5 writes?
On Fri, 2005-11-18 at 15:05 +0100, Jure Pečar wrote:
> Hi all,
>
> Currently zfs is a major news in the storage area. It is very interesting to read various details about it on varios blogs of Sun employees. Among the more interesting I found was this:
>
> http://blogs.sun.com/roller/page/bonwick?entry=raid_z
>
> The point the guy makes is that it is impossible to atomically both write data and update parity, which leaves a window of crash that would silently leave on-disk data+paritiy in an inconsistent state. Then he mentions that there are software only workarounds for that but that they are very very slow.
>
> It's interesting that my expirience with veritas raid5 for example is just that: slow to the point of being unuseable. Now, I'm wondering what kind of magic does linux md raid5 does, since its write performance is quite good? Or, does it actually do something regarding this? :)
>
> Niel?
>
* Re: raid5 write performance
2005-11-18 19:19 ` Dan Stromberg
@ 2005-11-18 19:23 ` Mike Hardy
2005-11-19 4:40 ` Guy
From: Mike Hardy @ 2005-11-18 19:23 UTC (permalink / raw)
To: Dan Stromberg; +Cc: Jure Pečar, linux-raid
Moreover, and I'm sure Neil will chime in here, isn't the clean/unclean
thing designed to prevent this exact scenario?
The array is marked unclean immediately prior to a write, then the data
write and parity write happen, then the array is marked clean.
If you crash during the write but before the parity is correct, the array
is unclean and you resync (quickly now, thanks to intent logging, if you
have that turned on).
The non-parity blocks that were partially written are then the
responsibility of your journalling filesystem, which should make sure
there is no corruption, silent or otherwise.
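Roughly, the sequence I mean is something like this (an illustrative,
self-contained sketch with made-up names - not the actual md code):

#include <stdio.h>
#include <stdbool.h>

/* Toy model of the clean/unclean protocol described above. */
static bool array_clean = true;   /* what the superblock would record     */
static bool parity_ok   = true;   /* does on-disk parity match the data?  */

static void stripe_write(bool crash_mid_write)
{
	array_clean = false;      /* mark unclean before touching the disks */
	parity_ok = false;        /* data block written, parity not yet ... */
	if (crash_mid_write)
		return;           /* power lost between the two writes      */
	parity_ok = true;         /* ... parity written                     */
	array_clean = true;       /* both durable: mark the array clean     */
}

static void boot(void)
{
	if (!array_clean) {       /* unclean shutdown: parity is suspect    */
		parity_ok = true; /* resync recomputes parity from the data */
		array_clean = true;
	}
	printf("after boot: clean=%d parity_ok=%d\n", array_clean, parity_ok);
}

int main(void)
{
	stripe_write(true);       /* simulate a crash mid-stripe-write      */
	boot();                   /* the unclean flag triggers the resync   */
	return 0;
}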
If I'm misunderstanding that, I'd love to be corrected. I was under the
impression that the "silent corruption" issue was mythical at this point
and if it's not I'd like to know.
-Mike
Dan Stromberg wrote:
> Would it really be that much slower to have a journal of RAID 5 writes?
>
> On Fri, 2005-11-18 at 15:05 +0100, Jure Pečar wrote:
>
>>Hi all,
>>
>>Currently zfs is a major news in the storage area. It is very interesting to read various details about it on varios blogs of Sun employees. Among the more interesting I found was this:
>>
>>http://blogs.sun.com/roller/page/bonwick?entry=raid_z
>>
>>The point the guy makes is that it is impossible to atomically both write data and update parity, which leaves a window of crash that would silently leave on-disk data+paritiy in an inconsistent state. Then he mentions that there are software only workarounds for that but that they are very very slow.
>>
>>It's interesting that my expirience with veritas raid5 for example is just that: slow to the point of being unuseable. Now, I'm wondering what kind of magic does linux md raid5 does, since its write performance is quite good? Or, does it actually do something regarding this? :)
>>
>>Niel?
>>
>
>
* RE: raid5 write performance
2005-11-18 19:23 ` Mike Hardy
@ 2005-11-19 4:40 ` Guy
2005-11-19 4:57 ` Mike Hardy
From: Guy @ 2005-11-19 4:40 UTC (permalink / raw)
To: 'Mike Hardy', 'Dan Stromberg'
Cc: 'Jure Pečar', linux-raid
> -----Original Message-----
> From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> owner@vger.kernel.org] On Behalf Of Mike Hardy
> Sent: Friday, November 18, 2005 2:24 PM
> To: Dan Stromberg
> Cc: Jure Pečar; linux-raid@vger.kernel.org
> Subject: Re: raid5 write performance
>
>
> Moreover, and I'm sure Neil will chime in here, isn't the clean/unclean
> thing designed to prevent this exact scenario?
>
> The array is marked unclean immediately prior to write, then the write
> and parity write happens, then the array is marked clean.
>
> If you crash during the write but before parity is correct, the array is
> unclean and you resync (quickly now thanks to intent logging if you have
> that on)
>
> The non-parity blocks that were partially written are then the
> responsibility of your journalling filesystem, which should make sure
> there is no corruption, silent or otherwise.
>
> If I'm misunderstanding that, I'd love to be corrected. I was under the
> impression that the "silent corruption" issue was mythical at this point
> and if it's not I'd like to know.
>
> -Mike
It is not just a parity issue. If you have a 4-disk RAID5, you can't be
sure which disks, if any, have written their part of the stripe. Maybe the
parity was updated, but nothing else. Maybe the parity and 2 data disks
were, leaving 1 data disk with old data.
Beyond that, md does write caching. I don't think the file system can tell
when a write is truly complete. I don't recall ever having a Linux system
crash, so I am not worried. But power failures carry the same risk, or
maybe a greater one. I have seen power failures, even with a UPS!
Guy
>
> Dan Stromberg wrote:
> > Would it really be that much slower to have a journal of RAID 5 writes?
> >
> > On Fri, 2005-11-18 at 15:05 +0100, Jure Pečar wrote:
> >
> >>Hi all,
> >>
> >>Currently zfs is a major news in the storage area. It is very
> interesting to read various details about it on varios blogs of Sun
> employees. Among the more interesting I found was this:
> >>
> >>http://blogs.sun.com/roller/page/bonwick?entry=raid_z
> >>
> >>The point the guy makes is that it is impossible to atomically both
> write data and update parity, which leaves a window of crash that would
> silently leave on-disk data+paritiy in an inconsistent state. Then he
> mentions that there are software only workarounds for that but that they
> are very very slow.
> >>
> >>It's interesting that my expirience with veritas raid5 for example is
> just that: slow to the point of being unuseable. Now, I'm wondering what
> kind of magic does linux md raid5 does, since its write performance is
> quite good? Or, does it actually do something regarding this? :)
> >>
> >>Niel?
> >>
> >
> >
* Re: raid5 write performance
2005-11-19 4:40 ` Guy
@ 2005-11-19 4:57 ` Mike Hardy
2005-11-19 5:54 ` Neil Brown
2005-11-19 5:56 ` Guy
From: Mike Hardy @ 2005-11-19 4:57 UTC (permalink / raw)
To: Guy; +Cc: 'Dan Stromberg', 'Jure Pečar', linux-raid
Guy wrote:
> It is not just a parity issue. If you have a 4 disk RAID 5, you can't be
> sure which if any have written the stripe. Maybe the parity was updated,
> but nothing else. Maybe the parity and 2 data disks, leaving 1 data disk
> with old data.
>
> Beyond that, md does write caching. I don't think the file system can tell
> when a write is truly complete. I don't recall ever having a Linux system
> crash, so I am not worried. But power failures cause the same risk, or
> maybe more. I have seen power failures, even with a UPS!
Good points there Guy - I do like your example. I'll go further on the
crashing too and say that I actually do crash outright occasionally,
usually when building out new machines where I don't know the proper
driver tweaks, or with failing hardware - but it happens without power
loss. It's important to get this correct and well understood.
That said, unless I hear otherwise from someone who works in the code,
I think md won't report a write as complete to the upper layers until it
actually is. I don't believe it does write-caching, and even if it does,
it must not report completion until some durable representation of the
data is committed to hardware, and the parity must stay dirty until the
redundancy is committed.
Building on that, barring hardware write-caching, I think that with a
journalling FS like ext3, and md only reporting a write complete when it
really is, nothing will be trusted at the FS level unless it has been
durably written to hardware.
I think that's sufficient to prove consistency across crashes.
For example, even if you crash during an update to a file smaller than a
stripe, the stripe will be dirty so the bad parity will be discarded and
the filesystem won't trust the blocks that didn't get reported back as
written by md. So that file update is lost, but the FS is consistent and
all the data it can reach is consistent with what it thinks is there.
So, I continue to believe silent corruption is mythical. I'm still open
to a good explanation that it's not, though.
-Mike
* Re: raid5 write performance
2005-11-19 4:57 ` Mike Hardy
@ 2005-11-19 5:54 ` Neil Brown
2005-11-19 11:59 ` Farkas Levente
2005-11-19 19:52 ` Carlos Carvalho
2005-11-19 5:56 ` Guy
From: Neil Brown @ 2005-11-19 5:54 UTC (permalink / raw)
To: Mike Hardy
Cc: Guy, 'Dan Stromberg', 'Jure Pečar',
linux-raid
On Friday November 18, mhardy@h3c.com wrote:
>
> So, I continue to believe silent corruption is mythical. I'm still open
> to good explanation it's not though.
>
Silent corruption is not mythical, though it is probably talked about
more than it actually happens (but then as it is silent, I cannot be
certain :-).
Silent corruption can happen only if an unclean degraded array is
started.
md will not start an unclean degraded (raid 4/5/6) array (though I'm
going to add a module parameter to allow it), and mdadm will only start
such an array if given --force (in which case it modifies the metadata to
appear clean so that md will start it).
If your array is not degraded, or you always shut down cleanly, there
is no opportunity for raid5-level corruption (of course, the drives may
choose to corrupt things silently themselves...).
Note that an unclean degraded start doesn't imply corruption - you
could be in this situation and not have any corruption at all. But it
does allow it. It must, as 'unclean' means you cannot trust the
parity, and 'degraded' means that you have to.
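Expressed as a decision it is roughly this (an illustrative, self-contained
sketch; the names are made up and this is not md's or mdadm's actual code):

#include <stdio.h>

static int safe_to_start(int degraded, int unclean, int forced)
{
	if (!degraded)
		return 1;    /* all disks present: parity can simply be resynced */
	if (!unclean)
		return 1;    /* parity is trustworthy, reconstruction is safe    */
	return forced;       /* unclean AND degraded: refuse unless forced       */
}

int main(void)
{
	printf("degraded+unclean, not forced: start=%d\n", safe_to_start(1, 1, 0));
	printf("degraded+unclean, forced:     start=%d\n", safe_to_start(1, 1, 1));
	return 0;
}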
There are two solutions to this silent corruption problem (other than
'ignore it and hope it doesn't bite', which is a fairly widely used
solution, and I haven't seen any bite marks myself).
One is journalling, as has been mentioned. This could be done to a
mirrored pair, or to an ECC NVRAM card (the latter probably being the
best, though also the most expensive). You would write each data block as
it becomes available, and each parity block just before commencing a
write to the raid5. Obviously you also keep track of what you have
written.
I have toyed with the idea of implementing this, but I think demand is
sufficiently low that it isn't worth it.
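For what it's worth, the write ordering would be roughly the following (a
toy, self-contained sketch of the idea; none of the names or structures
below come from md or any real journal driver):

#include <stdio.h>
#include <string.h>

#define NDATA 2                        /* data disks in this toy stripe        */

struct journal_rec {
	unsigned char data[NDATA];
	unsigned char parity;
	int valid;                     /* set once the record is durable       */
};

static struct journal_rec nvram;       /* stands in for the NVRAM/mirror log   */
static unsigned char disk_data[NDATA], disk_parity;

static void journalled_stripe_write(const unsigned char *newd)
{
	/* 1. Log the new data blocks and the new parity before touching
	 *    the raid5 itself, and make the record durable.                   */
	memcpy(nvram.data, newd, NDATA);
	nvram.parity = newd[0] ^ newd[1];
	nvram.valid = 1;

	/* 2. The in-place update; a crash anywhere in here is recoverable.    */
	memcpy(disk_data, newd, NDATA);
	disk_parity = nvram.parity;

	/* 3. Retire the log entry only after every device has the new stripe. */
	nvram.valid = 0;
}

static void recover(void)
{
	if (nvram.valid) {             /* crash mid-write: replay from the log */
		memcpy(disk_data, nvram.data, NDATA);
		disk_parity = nvram.parity;
		nvram.valid = 0;
	}
}

int main(void)
{
	unsigned char newd[NDATA] = { 0x33, 0x22 };
	journalled_stripe_write(newd);
	recover();                     /* a no-op here, but repairs a torn write */
	printf("data=%02x %02x parity=%02x\n", disk_data[0], disk_data[1], disk_parity);
	return 0;
}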
The other is to use a filesystem that allows the problem to be avoided
by making sure that the only blocks that can be corrupted are dead
blocks.
This could be done with a copy-on-write filesystem that knows about the
raid5 geometry, and only ever writes to a stripe when no other blocks
on the stripe contain live data.
I've been working on a filesystem which does just this, and hope to
have it available in a year or two (it is a background 'hobby'
project).
I know that ZFS is a copy-on-write filesystem. It is entirely
possible that it can do the right thing for raid5.
And as an addendum, md/raid5 never reports a block write as complete to
the filesystem until the underlying devices have reported both the data
block and the parity block as safe. I.e., it has a write-through cache,
not a write-behind cache.
NeilBrown
* RE: raid5 write performance
2005-11-19 4:57 ` Mike Hardy
2005-11-19 5:54 ` Neil Brown
@ 2005-11-19 5:56 ` Guy
From: Guy @ 2005-11-19 5:56 UTC (permalink / raw)
To: 'Mike Hardy'
Cc: 'Dan Stromberg', 'Jure Pečar', linux-raid
> -----Original Message-----
> From: Mike Hardy [mailto:mhardy@h3c.com]
> Sent: Friday, November 18, 2005 11:57 PM
> To: Guy
> Cc: 'Dan Stromberg'; 'Jure Pečar'; linux-raid@vger.kernel.org
> Subject: Re: raid5 write performance
>
>
>
> Guy wrote:
>
> > It is not just a parity issue. If you have a 4 disk RAID 5, you can't
> be
> > sure which if any have written the stripe. Maybe the parity was
> updated,
> > but nothing else. Maybe the parity and 2 data disks, leaving 1 data
> disk
> > with old data.
> >
> > Beyond that, md does write caching. I don't think the file system can
> tell
> > when a write is truly complete. I don't recall ever having a Linux
> system
> > crash, so I am not worried. But power failures cause the same risk, or
> > maybe more. I have seen power failures, even with a UPS!
>
> Good points there Guy - I do like your example. I'll go further with
> crashing too and say that I actually crash outright occasionally.
> Usually when building out new machines where I don't know the proper
> driver tweaks, or failing hardware, but it happens without power loss.
> Its important to get this correct and well understood.
>
> That said, unless I hear otherwise from someone that works in the code,
> I think md won't report the write as complete to upper layers until it
> actually is. I don't believe it does write-caching, and regardless, if
> it does it must not do it until some durable representation of the data
> is committed to hardware and the parity stays dirty until redundancy is
> committed.
>
> Building on that, barring hardware write-caching, I think with a
> journalling FS like ext3 and md only reporting the write complete when
> it really is, things won't be trusted at the FS level unless they're
> durably written to hardware.
>
> I think that's sufficient to prove consistency across crashes.
>
> For example, even if you crash during an update to a file smaller than a
> stripe, the stripe will be dirty so the bad parity will be discarded and
> the filesystem won't trust the blocks that didn't get reported back as
> written by md. So that file update is lost, but the FS is consistent and
> all the data it can reach is consistent with what it thinks is there.
>
> So, I continue to believe silent corruption is mythical. I'm still open
> to good explanation it's not though.
>
> -Mike
I will take a stab at an explanation.
Assume a single stripe has data for 2 different files (A and B), and a
disk has failed. The file system writes a 4K chunk of data to file A. The
parity gets updated, but not the data - or the data gets updated but not
the parity. The system crashes or power fails. The system recovers, but
can't do anything about the parity with a failed disk, and the filesystem
does its thing. The disk is then replaced and added, and its data block is
reconstructed from a good block and a stale block. The parity now matches
the data, but the reconstructed block (file B) is wrong - and the file
using that block has not been changed for years. So, silent corruption.
But you could argue it was a double failure.
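To make the arithmetic concrete, here is a tiny worked example of that
scenario with one-byte "blocks" (purely illustrative; it assumes the failed
disk held file B's block):

#include <assert.h>
#include <stdio.h>

int main(void)
{
	unsigned char D_A = 0x11, D_B = 0x22;    /* blocks of files A and B      */
	unsigned char P = D_A ^ D_B;             /* consistent XOR parity        */

	unsigned char new_D_A = 0x33;
	P = new_D_A ^ D_B;                       /* parity updated ...           */
	/* ... crash before D_A itself is rewritten: the disk still holds 0x11. */

	/* The failed disk held D_B; after replacement it is reconstructed
	 * from the surviving (stale) D_A and the new parity.                   */
	unsigned char rebuilt_D_B = D_A ^ P;

	printf("original D_B = 0x%02x, rebuilt D_B = 0x%02x\n", D_B, rebuilt_D_B);
	assert(rebuilt_D_B != D_B);              /* file B is now silently wrong */
	return 0;
}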
Also, after a power failure, I have seen the system come back with a
single disk failed. I guess that one disk had an out-of-date superblock.
When you add that disk back, any stripes that were not up to date will be
reconstructed with invalid data. I don't know whether intent logging will
help here or not. Most likely, more than one disk will have an out-of-date
superblock. If you force md to assemble the array it resyncs, but I don't
know whether it rebuilds the parity or picks a disk to pretend was just
added. It shows a single disk as rebuilding; if that is what it really
does, silent corruption could occur.
Guy
* Re: raid5 write performance
2005-11-19 5:54 ` Neil Brown
@ 2005-11-19 11:59 ` Farkas Levente
2005-11-20 23:39 ` Neil Brown
2005-11-19 19:52 ` Carlos Carvalho
From: Farkas Levente @ 2005-11-19 11:59 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
Neil Brown wrote:
> The other is to use a filesystem that allows the problem to be avoided
> by making sure that the only blocks that can be corrupted are dead
> blocks.
> This could be done with a copy-on-write filesystem that knows about the
> raid5 geometry, and only ever writes to a stripe when no other blocks
> on the stripe contain live data.
> I've been working on a filesystem which does just this, and hope to
> have it available in a year or two (it is a back-ground 'hobby'
> project).
Why are you waiting so long? Why not just release the project plan and
any pre-pre-alpha code? That's the point of the cathedral and the
bazaar. Maybe others can help, find bugs, write code, etc.
--
Levente "Si vis pacem para bellum!"
* Re: raid5 write performance
2005-11-19 5:54 ` Neil Brown
2005-11-19 11:59 ` Farkas Levente
@ 2005-11-19 19:52 ` Carlos Carvalho
2005-11-20 19:54 ` Paul Clements
From: Carlos Carvalho @ 2005-11-19 19:52 UTC (permalink / raw)
To: linux-raid
Neil Brown (neilb@suse.de) wrote on 19 November 2005 16:54:
>There are two solutions to this silent corruption problem (other than
>'ignore it and hope it doesn't bite' which is a fair widely used
>solution, and I haven't seen any bite marks myself).
It happened to me several years ago when two disks failed almost
simultaneously due to SCSI bus problems. I had to re-assemble the
array anyway and some files got corrupted :-( That's why I ended up
putting each disk on an independent bus and cable...
>One is journalling, as has been mentioned. This could be done to a
>mirrored pair, or to a ECC NVRAM card (the latter being probably the
>best, though also most expensive). You would write each data block as
>it becomes available, and each parity block just before commencing a
>write to the raid5. Obviously you also keep track of what you have
>written.
>I have toyed with the idea of implementing this, but I think demand is
>sufficiently low that it isn't worth it.
>
>The other is to use a filesystem that allows the problem to be avoided
>by making sure that the only blocks that can be corrupted are dead
>blocks.
>This could be done with a copy-on-write filesystem that knows about the
>raid5 geometry, and only ever writes to a stripe when no other blocks
>on the stripe contain live data.
>I've been working on a filesystem which does just this, and hope to
>have it available in a year or two (it is a back-ground 'hobby'
>project).
I think the demand for any solution to the unclean degraded array
problem is indeed low because of the small probability of a double
failure. Those who want more reliability can use a spare drive that
resyncs automatically, or raid6 (or both).
* Re: raid5 write performance
2005-11-19 19:52 ` Carlos Carvalho
@ 2005-11-20 19:54 ` Paul Clements
From: Paul Clements @ 2005-11-20 19:54 UTC (permalink / raw)
To: Carlos Carvalho; +Cc: linux-raid
Carlos Carvalho wrote:
> I think the demand for any solution to the unclean array is indeed low
> because of the small probability of a double failure. Those that want
> more reliability can use a spare drive that resyncs automatically or
> raid6 (or both).
A spare disk would help, but note that raid6 does not decrease the
probability of the silent corruption problem. Losing one disk in a raid6
still means that you are degraded (i.e., you rely on parity to
recalculate data, so an incomplete stripe write means corruption).
--
Paul
* Re: raid5 write performance
2005-11-19 11:59 ` Farkas Levente
@ 2005-11-20 23:39 ` Neil Brown
From: Neil Brown @ 2005-11-20 23:39 UTC (permalink / raw)
To: Farkas Levente; +Cc: linux-raid
On Saturday November 19, lfarkas@bppiac.hu wrote:
> Neil Brown wrote:
>
> > The other is to use a filesystem that allows the problem to be avoided
> > by making sure that the only blocks that can be corrupted are dead
> > blocks.
> > This could be done with a copy-on-write filesystem that knows about the
> > raid5 geometry, and only ever writes to a stripe when no other blocks
> > on the stripe contain live data.
> > I've been working on a filesystem which does just this, and hope to
> > have it available in a year or two (it is a back-ground 'hobby'
> > project).
>
> why are you waiting so long? why not just release the project plan, and
> any pre-pre-alpha code? that's the point of the cathedral and the
> bazaar. may be others can help, find bugs, write code, etc..
Uhmm... maybe I'm just selfish and want to have all the fun to myself (it
is as much a learning exercise as anything else)... or maybe I want to
be a coder rather than a project manager...
Despite the 'release early / release often' mantra, I think it is
possible to release something too early. People look, find nothing
there, lose interest and never come back.
However, I'm possibly at the stage where it might be worth making the
project more visible... I'll think about it and let you know....
Thanks for your interest
NeilBrown
* raid5 write performance
@ 2006-07-02 14:02 Raz Ben-Jehuda(caro)
2006-07-02 22:35 ` Neil Brown
From: Raz Ben-Jehuda(caro) @ 2006-07-02 14:02 UTC (permalink / raw)
To: Linux RAID Mailing List; +Cc: Neil Brown
Hello Neil,
I have been looking at the raid5 code, trying to understand why write
performance is so poor.
If I am not mistaken, it seems that you issue writes in units of one
page and no more, no matter what buffer size I am using.
1. Is this page directed only to the parity disk?
2. How can I increase the write throughput?
Thank you
--
Raz
* Re: raid5 write performance
2006-07-02 14:02 raid5 write performance Raz Ben-Jehuda(caro)
@ 2006-07-02 22:35 ` Neil Brown
2006-08-13 13:19 ` Raz Ben-Jehuda(caro)
From: Neil Brown @ 2006-07-02 22:35 UTC (permalink / raw)
To: Raz Ben-Jehuda(caro); +Cc: Linux RAID Mailing List
On Sunday July 2, raziebe@gmail.com wrote:
> Neil hello.
>
> I have been looking at the raid5 code trying to understand why writes
> performance is so poor.
raid5 write performance is expected to be poor, as you often need to
pre-read data or parity before the write can be issued.
> If I am not mistaken here, It seems that you issue a write in size of
> one page an no more no matter what buffer size I am using .
I doubt the small write size would contribute more than a couple of
percent to the speed issue. Scheduling (when to write, when to
pre-read, when to wait a moment) is probably much more important.
>
> 1. Is this page is directed only to parity disk ?
No. All drives are written in one-page units. Each request is
divided into one-page chunks, these one-page chunks are gathered,
where possible, into strips, and the strips are handled as units
(where a strip is like a stripe, only one page wide rather than one
chunk wide - if that makes sense).
> 2. How can i increase the write throughout ?
Look at scheduling patterns - what order are the blocks getting
written, do we pre-read when we don't need to, things like that.
The current code tries to do the right thing, and it certainly has
been worse in the past, but I wouldn't be surprised if it could still
be improved.
NeilBrown
* Re: raid5 write performance
2006-07-02 22:35 ` Neil Brown
@ 2006-08-13 13:19 ` Raz Ben-Jehuda(caro)
2006-08-28 4:32 ` Neil Brown
From: Raz Ben-Jehuda(caro) @ 2006-08-13 13:19 UTC (permalink / raw)
To: Linux RAID Mailing List; +Cc: Neil Brown
Well ... me again.
Following your advice, I added a deadline to every WRITE stripe head when
it is created. In raid5_activate_delayed() I check whether the deadline
has expired, and if it has not, I do not yet move the sh to
preread-active mode.
This small fix (and a few other changes elsewhere in the code) reduced
the amount of reads to zero with dd, but with no improvement to
throughput. With random access to the raid (buffers aligned to the
stripe width and sized to the stripe width), however, there is an
improvement of at least 20%.
The problem is that a user must know what he is doing, else there will
be a reduction in performance if the deadline is too long (say 100 ms).
raz
On 7/3/06, Neil Brown <neilb@suse.de> wrote:
> On Sunday July 2, raziebe@gmail.com wrote:
> > Neil hello.
> >
> > I have been looking at the raid5 code trying to understand why writes
> > performance is so poor.
>
> raid5 write performance is expected to be poor, as you often need to
> pre-read data or parity before the write can be issued.
>
> > If I am not mistaken here, It seems that you issue a write in size of
> > one page an no more no matter what buffer size I am using .
>
> I doubt the small write size would contribute more than a couple of
> percent to the speed issue. Scheduling (when to write, when to
> pre-read, when to wait a moment) is probably much more important.
>
> >
> > 1. Is this page is directed only to parity disk ?
>
> No. All drives are written with one page units. Each request is
> divided into one-page chunks, these one page chunks are gathered -
> where possible - into strips, and the strips are handled as units
> (Where a strip is like a stripe, only 1 page wide rather then one chunk
> wide - if that makes sense).
>
> > 2. How can i increase the write throughout ?
>
> Look at scheduling patterns - what order are the blocks getting
> written, do we pre-read when we don't need to, things like that.
>
> The current code tries to do the right thing, and it certainly has
> been worse in the past, but I wouldn't be surprised if it could still
> be improved.
>
> NeilBrown
>
--
Raz
* Re: raid5 write performance
2006-08-13 13:19 ` Raz Ben-Jehuda(caro)
@ 2006-08-28 4:32 ` Neil Brown
2007-03-30 21:44 ` Raz Ben-Jehuda(caro)
From: Neil Brown @ 2006-08-28 4:32 UTC (permalink / raw)
To: Raz Ben-Jehuda(caro); +Cc: Linux RAID Mailing List
On Sunday August 13, raziebe@gmail.com wrote:
> well ... me again
>
> Following your advice....
>
> I added a deadline for every WRITE stripe head when it is created.
> in raid5_activate_delayed i checked if deadline is expired and if not i am
> setting the sh to prereadactive mode as .
>
> This small fix ( and in few other places in the code) reduced the
> amount of reads
> to zero with dd but with no improvement to throghput. But with random access to
> the raid ( buffers are aligned by the stripe width and with the size
> of stripe width )
> there is an improvement of at least 20 % .
>
> Problem is that a user must know what he is doing else there would be
> a reduction
> in performance if deadline line it too long (say 100 ms).
So if I understand you correctly, you are delaying write requests to
partial stripes slightly (your 'deadline') and this is sometimes
giving you a 20% improvement ?
I'm not surprised that you could get some improvement, but 20% is quite
surprising. It would be worth following through with this to make
the improvement generally available.
As you say, picking a time in milliseconds is very error-prone. We
really need to come up with something more natural.
I had hoped that the 'unplug' infrastructure would provide the right
thing, but apparently not. Maybe unplug is just being called too
often.
I'll see if I can duplicate this myself and find out what is really
going on.
Thanks for the report.
NeilBrown
* Re: raid5 write performance
2006-08-28 4:32 ` Neil Brown
@ 2007-03-30 21:44 ` Raz Ben-Jehuda(caro)
2007-03-31 21:28 ` Bill Davidsen
` (2 more replies)
From: Raz Ben-Jehuda(caro) @ 2007-03-30 21:44 UTC (permalink / raw)
To: Neil Brown; +Cc: Linux RAID Mailing List
[-- Attachment #1: Type: text/plain, Size: 12376 bytes --]
Please see below.
On 8/28/06, Neil Brown <neilb@suse.de> wrote:
> On Sunday August 13, raziebe@gmail.com wrote:
> > well ... me again
> >
> > Following your advice....
> >
> > I added a deadline for every WRITE stripe head when it is created.
> > in raid5_activate_delayed i checked if deadline is expired and if not i am
> > setting the sh to prereadactive mode as .
> >
> > This small fix ( and in few other places in the code) reduced the
> > amount of reads
> > to zero with dd but with no improvement to throghput. But with random access to
> > the raid ( buffers are aligned by the stripe width and with the size
> > of stripe width )
> > there is an improvement of at least 20 % .
> >
> > Problem is that a user must know what he is doing else there would be
> > a reduction
> > in performance if deadline line it too long (say 100 ms).
>
> So if I understand you correctly, you are delaying write requests to
> partial stripes slightly (your 'deadline') and this is sometimes
> giving you a 20% improvement ?
>
> I'm not surprised that you could get some improvement. 20% is quite
> surprising. It would be worth following through with this to make
> that improvement generally available.
>
> As you say, picking a time in milliseconds is very error prone. We
> really need to come up with something more natural.
> I had hopped that the 'unplug' infrastructure would provide the right
> thing, but apparently not. Maybe unplug is just being called too
> often.
>
> I'll see if I can duplicate this myself and find out what is really
> going on.
>
> Thanks for the report.
>
> NeilBrown
>
Hello Neil. I am sorry for the long interval; I was abruptly assigned to
a different project.
1.
I have taken another look at the raid5 delay patch I wrote a while
ago. I ported it to 2.6.17 and tested it. It appears to work, and
when used correctly it eliminates the read penalty.
2. Benchmarks.
Configuration:
I am testing a 3-disk raid5 with a 1MB chunk size. IOs are
synchronous and non-buffered (O_DIRECT), 2 MB in size, and always
aligned to the beginning of a stripe. The kernel is 2.6.17. The
stripe_deadline was set to 10 ms.
Attached is the simple_write code.
Command:
simple_write /dev/md1 2048 0 1000
simple_write issues raw writes (O_DIRECT) sequentially,
starting from offset zero, 2048 kilobytes at a time, 1000 times.
Benchmark before patch:
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda            1848.00      8384.00     50992.00       8384      50992
sdb            1995.00     12424.00     51008.00      12424      51008
sdc            1698.00      8160.00     51000.00       8160      51000
sdd               0.00         0.00         0.00          0          0
md0               0.00         0.00         0.00          0          0
md1             450.00         0.00    102400.00          0     102400
Benchmark after patch:
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             389.11         0.00    128530.69          0     129816
sdb             381.19         0.00    129354.46          0     130648
sdc             383.17         0.00    128530.69          0     129816
sdd               0.00         0.00         0.00          0          0
md0               0.00         0.00         0.00          0          0
md1            1140.59         0.00    259548.51          0     262144
As one can see, no additional reads were done. One can actually
calculate the raid's utilization as (n-1)/n * (single-disk throughput
with 1MB writes).
3. The patch code.
The kernel tested above was 2.6.17. The patch below is against 2.6.20.2,
because I noticed big code differences between .17 and .20.x. This patch
was not tested on 2.6.20.2, but it is essentially the same. I have not
(yet) tested degraded mode or any other non-common paths.
--- linux-2.6.20.2/drivers/md/raid5.c 2007-03-09 20:58:04.000000000 +0200
+++ linux-2.6.20.2-raid/drivers/md/raid5.c	2007-03-30 12:37:55.000000000 +0300
@@ -65,6 +65,7 @@
#define NR_HASH (PAGE_SIZE / sizeof(struct hlist_head))
#define HASH_MASK (NR_HASH - 1)
+
#define stripe_hash(conf, sect)	(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
/* bio's attached to a stripe+device for I/O are linked together in bi_sector
@@ -234,6 +235,8 @@
sh->sector = sector;
sh->pd_idx = pd_idx;
sh->state = 0;
+	sh->active_preread_jiffies =
+		msecs_to_jiffies(atomic_read(&conf->deadline_ms)) + jiffies;
sh->disks = disks;
@@ -628,6 +631,7 @@
clear_bit(R5_LOCKED, &sh->dev[i].flags);
set_bit(STRIPE_HANDLE, &sh->state);
+ sh->active_preread_jiffies = jiffies;
release_stripe(sh);
return 0;
}
@@ -1255,8 +1259,11 @@
bip = &sh->dev[dd_idx].towrite;
if (*bip == NULL && sh->dev[dd_idx].written == NULL)
firstwrite = 1;
- } else
+ } else{
bip = &sh->dev[dd_idx].toread;
+ sh->active_preread_jiffies = jiffies;
+ }
+
while (*bip && (*bip)->bi_sector < bi->bi_sector) {
if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
goto overlap;
@@ -2437,13 +2444,27 @@
-static void raid5_activate_delayed(raid5_conf_t *conf)
+static struct stripe_head* raid5_activate_delayed(raid5_conf_t *conf)
{
if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
while (!list_empty(&conf->delayed_list)) {
struct list_head *l = conf->delayed_list.next;
struct stripe_head *sh;
sh = list_entry(l, struct stripe_head, lru);
+
+ if( time_before(jiffies,sh->active_preread_jiffies) ){
+ PRINTK("deadline : no expire sec=%lld %8u %8u\n",
+ (unsigned long long) sh->sector,
+				jiffies_to_msecs(sh->active_preread_jiffies),
+ jiffies_to_msecs(jiffies));
+ return sh;
+ }
+ else{
+ PRINTK("deadline: expire:sec=%lld %8u %8u\n",
+ (unsigned long long)sh->sector,
+				jiffies_to_msecs(sh->active_preread_jiffies),
+ jiffies_to_msecs(jiffies));
+ }
list_del_init(l);
clear_bit(STRIPE_DELAYED, &sh->state);
if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE,
&sh->state))
@@ -2451,6 +2472,7 @@
list_add_tail(&sh->lru, &conf->handle_list);
}
}
+ return NULL;
}
static void activate_bit_delay(raid5_conf_t *conf)
@@ -3191,7 +3213,7 @@
*/
static void raid5d (mddev_t *mddev)
{
- struct stripe_head *sh;
+ struct stripe_head *sh,*delayed_sh=NULL;
raid5_conf_t *conf = mddev_to_conf(mddev);
int handled;
@@ -3218,8 +3240,10 @@
atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
!blk_queue_plugged(mddev->queue) &&
!list_empty(&conf->delayed_list))
- raid5_activate_delayed(conf);
-
+ delayed_sh=raid5_activate_delayed(conf);
+
+ if(delayed_sh) break;
+
while ((bio = remove_bio_from_retry(conf))) {
int ok;
spin_unlock_irq(&conf->device_lock);
@@ -3254,9 +3278,51 @@
unplug_slaves(mddev);
PRINTK("--- raid5d inactive\n");
+ if (delayed_sh){
+ long wakeup=delayed_sh->active_preread_jiffies-jiffies;
+ PRINTK("--- raid5d inactive sleep for %d\n",
+ jiffies_to_msecs(wakeup) );
+ if (wakeup>0)
+ mddev->thread->timeout = wakeup;
+ }
+}
+
+static ssize_t
+raid5_show_stripe_deadline(mddev_t *mddev, char *page)
+{
+ raid5_conf_t *conf = mddev_to_conf(mddev);
+ if (conf)
+ return sprintf(page, "%d\n", atomic_read(&conf->deadline_ms));
+ else
+ return 0;
}
static ssize_t
+raid5_store_stripe_deadline(mddev_t *mddev, const char *page, size_t len)
+{
+ raid5_conf_t *conf = mddev_to_conf(mddev);
+ char *end;
+ int new;
+ if (len >= PAGE_SIZE)
+ return -EINVAL;
+ if (!conf)
+ return -ENODEV;
+ new = simple_strtoul(page, &end, 10);
+ if (!*page || (*end && *end != '\n') )
+ return -EINVAL;
+ if (new < 0 || new > 10000)
+ return -EINVAL;
+ atomic_set(&conf->deadline_ms,new);
+ return len;
+}
+
+static struct md_sysfs_entry
+raid5_stripe_deadline = __ATTR(stripe_deadline, S_IRUGO | S_IWUSR,
+ raid5_show_stripe_deadline,
+ raid5_store_stripe_deadline);
+
+
+static ssize_t
raid5_show_stripe_cache_size(mddev_t *mddev, char *page)
{
raid5_conf_t *conf = mddev_to_conf(mddev);
@@ -3297,6 +3363,9 @@
return len;
}
+
+
+
static struct md_sysfs_entry
raid5_stripecache_size = __ATTR(stripe_cache_size, S_IRUGO | S_IWUSR,
raid5_show_stripe_cache_size,
@@ -3318,8 +3387,10 @@
static struct attribute *raid5_attrs[] = {
&raid5_stripecache_size.attr,
&raid5_stripecache_active.attr,
+ &raid5_stripe_deadline.attr,
NULL,
};
+
static struct attribute_group raid5_attrs_group = {
.name = NULL,
.attrs = raid5_attrs,
@@ -3567,6 +3638,8 @@
blk_queue_merge_bvec(mddev->queue, raid5_mergeable_bvec);
+ atomic_set(&conf->deadline_ms,0);
+
return 0;
abort:
if (conf) {
--- linux-2.6.20.2/include/linux/raid/raid5.h	2007-03-09 20:58:04.000000000 +0200
+++ linux-2.6.20.2-raid/include/linux/raid/raid5.h	2007-03-30 00:25:38.000000000 +0200
@@ -136,6 +136,7 @@
spinlock_t lock;
int bm_seq; /* sequence number for bitmap flushes */
int disks; /* disks in stripe */
+ unsigned long active_preread_jiffies;
struct r5dev {
struct bio req;
struct bio_vec vec;
@@ -254,6 +255,7 @@
* Free stripes pool
*/
atomic_t active_stripes;
+ atomic_t deadline_ms;
struct list_head inactive_list;
wait_queue_head_t wait_for_stripe;
wait_queue_head_t wait_for_overlap;
4.
I have also tested it over an XFS file system (I wrote a special
copy tool for XFS for this purpose, called r5cp). I am getting much
better numbers with this patch.
sdd holds the source file system and sd[abc] contain the raid; XFS is
mounted over /dev/md1.
stripe_deadline=0ms ( disabled)
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
hda 0.00 0.00 0.00 0 0
md0 0.00 0.00 0.00 0 0
sda 90.10 7033.66 37409.90 7104 37784
sdb 94.06 7168.32 37417.82 7240 37792
sdc 89.11 7215.84 37417.82 7288 37792
sdd 75.25 77053.47 0.00 77824 0
md1 319.80 0.00 77053.47 0 77824
stripe_deadline=10ms ( enabled)
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
hda 0.00 0.00 0.00 0 0
md0 0.00 0.00 0.00 0 0
sda 113.00 0.00 67648.00 0 67648
sdb 113.00 0.00 67648.00 0 67648
sdc 113.00 0.00 67648.00 0 67648
sdd 128.00 131072.00 0.00 131072 0
md1 561.00 0.00 135168.00 0 135168
XFS has not crashed nor suffered any other inconsistencies so far, but I
have only begun testing.
5.
I am going to work on this with other configurations, such as raid5
arrays with more disks, and raid50. I will be happy to hear your opinion
on this matter. What puzzles me is why the deadline must be as long as
10 ms - the shorter the deadline, the more reads I am getting.
Many thanks
Raz
[-- Attachment #2: simple_write.cpp --]
[-- Type: text/x-c++src, Size: 1291 bytes --]
#include <iostream>
#include <cstdio>
#include <cstring>
#include <cstdlib>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
using namespace std;
int main(int argc, char *argv[])
{
	if (argc < 5) {
		cout << "usage <device name> <size to write in kb> <offset in kb> <loop>" << endl;
		return 0;
	}
	char *dev_name = argv[1];
	/* O_DIRECT requires a suitably aligned buffer and transfer size. */
	int fd = open(dev_name, O_LARGEFILE | O_DIRECT | O_WRONLY, 0777);
	if (fd < 0) {
		perror("open ");
		return -1;
	}
	long long write_sz_bytes  = ((long long)atoi(argv[2])) << 10;
	long long offset_sz_bytes = ((long long)atoi(argv[3])) << 10;
	int loops = atoi(argv[4]);

	/* valloc() returns page-aligned memory, as O_DIRECT requires. */
	char *buffer = (char *)valloc(write_sz_bytes);
	if (!buffer) {
		perror("alloc : ");
		return -1;
	}
	memset(buffer, 0x00, write_sz_bytes);

	while (loops-- > 0) {
		ssize_t ret = pwrite64(fd, buffer, write_sz_bytes, offset_sz_bytes);
		if (ret < 0) {
			perror("failed to write: ");
			printf("write_sz_bytes=%lld offset_sz_bytes=%lld\n",
			       write_sz_bytes, offset_sz_bytes);
			return -1;
		}
		printf("writing %lld bytes at offset %lld\n",
		       write_sz_bytes, offset_sz_bytes);
		offset_sz_bytes += write_sz_bytes;
	}
	return 0;
}
* Re: raid5 write performance
2007-03-30 21:44 ` Raz Ben-Jehuda(caro)
@ 2007-03-31 21:28 ` Bill Davidsen
2007-03-31 23:03 ` Raz Ben-Jehuda(caro)
2007-04-01 23:08 ` Dan Williams
[not found] ` <17950.50209.580439.607958@notabene.brown>
From: Bill Davidsen @ 2007-03-31 21:28 UTC (permalink / raw)
To: Raz Ben-Jehuda(caro); +Cc: Neil Brown, Linux RAID Mailing List
Raz Ben-Jehuda(caro) wrote:
> Please see bellow.
>
> On 8/28/06, Neil Brown <neilb@suse.de> wrote:
>> On Sunday August 13, raziebe@gmail.com wrote:
>> > well ... me again
>> >
>> > Following your advice....
>> >
>> > I added a deadline for every WRITE stripe head when it is created.
>> > in raid5_activate_delayed i checked if deadline is expired and if
>> not i am
>> > setting the sh to prereadactive mode as .
>> >
>> > This small fix ( and in few other places in the code) reduced the
>> > amount of reads
>> > to zero with dd but with no improvement to throghput. But with
>> random access to
>> > the raid ( buffers are aligned by the stripe width and with the size
>> > of stripe width )
>> > there is an improvement of at least 20 % .
>> >
>> > Problem is that a user must know what he is doing else there would be
>> > a reduction
>> > in performance if deadline line it too long (say 100 ms).
>>
>> So if I understand you correctly, you are delaying write requests to
>> partial stripes slightly (your 'deadline') and this is sometimes
>> giving you a 20% improvement ?
>>
>> I'm not surprised that you could get some improvement. 20% is quite
>> surprising. It would be worth following through with this to make
>> that improvement generally available.
>>
>> As you say, picking a time in milliseconds is very error prone. We
>> really need to come up with something more natural.
>> I had hopped that the 'unplug' infrastructure would provide the right
>> thing, but apparently not. Maybe unplug is just being called too
>> often.
>>
>> I'll see if I can duplicate this myself and find out what is really
>> going on.
>>
>> Thanks for the report.
>>
>> NeilBrown
>>
>
> Neil Hello. I am sorry for this interval , I was assigned abruptly to
> a different project.
>
> 1.
> I'd taken a look at the raid5 delay patch I have written a while
> ago. I ported it to 2.6.17 and tested it. it makes sounds of working
> and when used correctly it eliminates the reads penalty.
>
> 2. Benchmarks .
> configuration:
> I am testing a raid5 x 3 disks with 1MB chunk size. IOs are
> synchronous and non-buffered(o_direct) , 2 MB in size and always
> aligned to the beginning of a stripe. kernel is 2.6.17. The
> stripe_delay was set to 10ms.
>
> Attached is the simple_write code.
>
> command :
> simple_write /dev/md1 2048 0 1000
> simple_write raw writes (O_DIRECT) sequentially
> starting from offset zero 2048 kilobytes 1000 times.
>
> Benchmark Before patch
>
> sda 1848.00 8384.00 50992.00 8384 50992
> sdb 1995.00 12424.00 51008.00 12424 51008
> sdc 1698.00 8160.00 51000.00 8160 51000
> sdd 0.00 0.00 0.00 0 0
> md0 0.00 0.00 0.00 0 0
> md1 450.00 0.00 102400.00 0 102400
>
>
> Benchmark After patch
>
> sda 389.11 0.00 128530.69 0 129816
> sdb 381.19 0.00 129354.46 0 130648
> sdc 383.17 0.00 128530.69 0 129816
> sdd 0.00 0.00 0.00 0 0
> md0 0.00 0.00 0.00 0 0
> md1 1140.59 0.00 259548.51 0 262144
>
> As one can see , no additional reads were done. One can actually
> calculate the raid's utilization: n-1/n * ( single disk throughput
> with 1M writes ) .
>
>
> 3. The patch code.
> Kernel tested above was 2.6.17. The patch is of 2.6.20.2
> because I have noticed a big code differences between 17 to 20.x .
> This patch was not tested on 2.6.20.2 but it is essentialy the same. I
> have not tested (yet) degraded mode or any other non-common pathes.
My weekend is pretty well taken, but I hope to try this patch against
2.6.21-rc6-git1 (or whatever is current Monday), to see not only how it
works with the test program, but also under some actual load. By eye,
my data should be safe, but I think I'll test on a well backed-up machine
anyway ;-)
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
* Re: raid5 write performance
2007-03-31 21:28 ` Bill Davidsen
@ 2007-03-31 23:03 ` Raz Ben-Jehuda(caro)
2007-04-01 2:16 ` Bill Davidsen
From: Raz Ben-Jehuda(caro) @ 2007-03-31 23:03 UTC (permalink / raw)
To: Bill Davidsen; +Cc: Neil Brown, Linux RAID Mailing List
On 3/31/07, Bill Davidsen <davidsen@tmr.com> wrote:
> Raz Ben-Jehuda(caro) wrote:
> > Please see bellow.
> >
> > On 8/28/06, Neil Brown <neilb@suse.de> wrote:
> >> On Sunday August 13, raziebe@gmail.com wrote:
> >> > well ... me again
> >> >
> >> > Following your advice....
> >> >
> >> > I added a deadline for every WRITE stripe head when it is created.
> >> > in raid5_activate_delayed i checked if deadline is expired and if
> >> not i am
> >> > setting the sh to prereadactive mode as .
> >> >
> >> > This small fix ( and in few other places in the code) reduced the
> >> > amount of reads
> >> > to zero with dd but with no improvement to throghput. But with
> >> random access to
> >> > the raid ( buffers are aligned by the stripe width and with the size
> >> > of stripe width )
> >> > there is an improvement of at least 20 % .
> >> >
> >> > Problem is that a user must know what he is doing else there would be
> >> > a reduction
> >> > in performance if deadline line it too long (say 100 ms).
> >>
> >> So if I understand you correctly, you are delaying write requests to
> >> partial stripes slightly (your 'deadline') and this is sometimes
> >> giving you a 20% improvement ?
> >>
> >> I'm not surprised that you could get some improvement. 20% is quite
> >> surprising. It would be worth following through with this to make
> >> that improvement generally available.
> >>
> >> As you say, picking a time in milliseconds is very error prone. We
> >> really need to come up with something more natural.
> >> I had hopped that the 'unplug' infrastructure would provide the right
> >> thing, but apparently not. Maybe unplug is just being called too
> >> often.
> >>
> >> I'll see if I can duplicate this myself and find out what is really
> >> going on.
> >>
> >> Thanks for the report.
> >>
> >> NeilBrown
> >>
> >
> > Neil Hello. I am sorry for this interval , I was assigned abruptly to
> > a different project.
> >
> > 1.
> > I'd taken a look at the raid5 delay patch I have written a while
> > ago. I ported it to 2.6.17 and tested it. it makes sounds of working
> > and when used correctly it eliminates the reads penalty.
> >
> > 2. Benchmarks .
> > configuration:
> > I am testing a raid5 x 3 disks with 1MB chunk size. IOs are
> > synchronous and non-buffered(o_direct) , 2 MB in size and always
> > aligned to the beginning of a stripe. kernel is 2.6.17. The
> > stripe_delay was set to 10ms.
> >
> > Attached is the simple_write code.
> >
> > command :
> > simple_write /dev/md1 2048 0 1000
> > simple_write raw writes (O_DIRECT) sequentially
> > starting from offset zero 2048 kilobytes 1000 times.
> >
> > Benchmark Before patch
> >
> > sda 1848.00 8384.00 50992.00 8384 50992
> > sdb 1995.00 12424.00 51008.00 12424 51008
> > sdc 1698.00 8160.00 51000.00 8160 51000
> > sdd 0.00 0.00 0.00 0 0
> > md0 0.00 0.00 0.00 0 0
> > md1 450.00 0.00 102400.00 0 102400
> >
> >
> > Benchmark After patch
> >
> > sda 389.11 0.00 128530.69 0 129816
> > sdb 381.19 0.00 129354.46 0 130648
> > sdc 383.17 0.00 128530.69 0 129816
> > sdd 0.00 0.00 0.00 0 0
> > md0 0.00 0.00 0.00 0 0
> > md1 1140.59 0.00 259548.51 0 262144
> >
> > As one can see , no additional reads were done. One can actually
> > calculate the raid's utilization: n-1/n * ( single disk throughput
> > with 1M writes ) .
> >
> >
> > 3. The patch code.
> > Kernel tested above was 2.6.17. The patch is of 2.6.20.2
> > because I have noticed a big code differences between 17 to 20.x .
> > This patch was not tested on 2.6.20.2 but it is essentialy the same. I
> > have not tested (yet) degraded mode or any other non-common pathes.
> My weekend is pretty taken, but I hope to try putting this patch against
> 2.6.21-rc6-git1 (or whatever is current Monday), to see not only how it
> works against the test program, but also under some actual load. By eye,
> my data should be safe, but I think I'll test on a well backed machine
> anyway ;-)
Bill,
this test program WRITES data to a raw device; it will
destroy everything you have on the RAID.
If you want a file system test instead, as mentioned
I have one for the XFS file system.
> --
> bill davidsen <davidsen@tmr.com>
> CTO TMR Associates, Inc
> Doing interesting things with small computers since 1979
>
>
--
Raz
* Re: raid5 write performance
2007-03-31 23:03 ` Raz Ben-Jehuda(caro)
@ 2007-04-01 2:16 ` Bill Davidsen
From: Bill Davidsen @ 2007-04-01 2:16 UTC (permalink / raw)
To: Raz Ben-Jehuda(caro); +Cc: Neil Brown, Linux RAID Mailing List
Raz Ben-Jehuda(caro) wrote:
> On 3/31/07, Bill Davidsen <davidsen@tmr.com> wrote:
>> Raz Ben-Jehuda(caro) wrote:
>> > Please see bellow.
>> >
>> > On 8/28/06, Neil Brown <neilb@suse.de> wrote:
>> >> On Sunday August 13, raziebe@gmail.com wrote:
>> >> > well ... me again
>> >> >
>> >> > Following your advice....
>> >> >
>> >> > I added a deadline for every WRITE stripe head when it is created.
>> >> > in raid5_activate_delayed i checked if deadline is expired and if
>> >> not i am
>> >> > setting the sh to prereadactive mode as .
>> >> >
>> >> > This small fix ( and in few other places in the code) reduced the
>> >> > amount of reads
>> >> > to zero with dd but with no improvement to throghput. But with
>> >> random access to
>> >> > the raid ( buffers are aligned by the stripe width and with the
>> size
>> >> > of stripe width )
>> >> > there is an improvement of at least 20 % .
>> >> >
>> >> > Problem is that a user must know what he is doing else there
>> would be
>> >> > a reduction
>> >> > in performance if deadline line it too long (say 100 ms).
>> >>
>> >> So if I understand you correctly, you are delaying write requests to
>> >> partial stripes slightly (your 'deadline') and this is sometimes
>> >> giving you a 20% improvement ?
>> >>
>> >> I'm not surprised that you could get some improvement. 20% is quite
>> >> surprising. It would be worth following through with this to make
>> >> that improvement generally available.
>> >>
>> >> As you say, picking a time in milliseconds is very error prone. We
>> >> really need to come up with something more natural.
>> >> I had hopped that the 'unplug' infrastructure would provide the right
>> >> thing, but apparently not. Maybe unplug is just being called too
>> >> often.
>> >>
>> >> I'll see if I can duplicate this myself and find out what is really
>> >> going on.
>> >>
>> >> Thanks for the report.
>> >>
>> >> NeilBrown
>> >>
>> >
>> > Neil Hello. I am sorry for this interval , I was assigned abruptly to
>> > a different project.
>> >
>> > 1.
>> > I'd taken a look at the raid5 delay patch I have written a while
>> > ago. I ported it to 2.6.17 and tested it. it makes sounds of working
>> > and when used correctly it eliminates the reads penalty.
>> >
>> > 2. Benchmarks .
>> > configuration:
>> > I am testing a raid5 x 3 disks with 1MB chunk size. IOs are
>> > synchronous and non-buffered(o_direct) , 2 MB in size and always
>> > aligned to the beginning of a stripe. kernel is 2.6.17. The
>> > stripe_delay was set to 10ms.
>> >
>> > Attached is the simple_write code.
>> >
>> > command :
>> > simple_write /dev/md1 2048 0 1000
>> > simple_write raw writes (O_DIRECT) sequentially
>> > starting from offset zero 2048 kilobytes 1000 times.
>> >
>> > Benchmark Before patch
>> >
>> > sda 1848.00 8384.00 50992.00 8384 50992
>> > sdb 1995.00 12424.00 51008.00 12424 51008
>> > sdc 1698.00 8160.00 51000.00 8160 51000
>> > sdd 0.00 0.00 0.00 0 0
>> > md0 0.00 0.00 0.00 0 0
>> > md1 450.00 0.00 102400.00 0 102400
>> >
>> >
>> > Benchmark After patch
>> >
>> > sda 389.11 0.00 128530.69 0 129816
>> > sdb 381.19 0.00 129354.46 0 130648
>> > sdc 383.17 0.00 128530.69 0 129816
>> > sdd 0.00 0.00 0.00 0 0
>> > md0 0.00 0.00 0.00 0 0
>> > md1 1140.59 0.00 259548.51 0 262144
>> >
>> > As one can see , no additional reads were done. One can actually
>> > calculate the raid's utilization: n-1/n * ( single disk throughput
>> > with 1M writes ) .
>> >
>> >
>> > 3. The patch code.
>> > Kernel tested above was 2.6.17. The patch is of 2.6.20.2
>> > because I have noticed a big code differences between 17 to 20.x .
>> > This patch was not tested on 2.6.20.2 but it is essentialy the same. I
>> > have not tested (yet) degraded mode or any other non-common pathes.
>> My weekend is pretty taken, but I hope to try putting this patch against
>> 2.6.21-rc6-git1 (or whatever is current Monday), to see not only how it
>> works against the test program, but also under some actual load. By eye,
>> my data should be safe, but I think I'll test on a well backed machine
>> anyway ;-)
> Bill.
> This test program WRITES data to a raw device, it will
> destroy everything you have on the RAID.
> If you want to use a file system test unit, as mentioned
> I have one for XFS file system.
I realize how it works, and I have some disks to run tests on. But
changing the RAID code puts all my RAID filesystems at risk to some
extent, when logs get written, etc. When I play with low-level stuff I am
careful - I've been burned...
--
bill davidsen <davidsen@tmr.com>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
* Re: raid5 write performance
2007-03-30 21:44 ` Raz Ben-Jehuda(caro)
2007-03-31 21:28 ` Bill Davidsen
@ 2007-04-01 23:08 ` Dan Williams
2007-04-02 14:13 ` Raz Ben-Jehuda(caro)
[not found] ` <17950.50209.580439.607958@notabene.brown>
From: Dan Williams @ 2007-04-01 23:08 UTC (permalink / raw)
To: Raz Ben-Jehuda(caro); +Cc: Neil Brown, Linux RAID Mailing List
On 3/30/07, Raz Ben-Jehuda(caro) <raziebe@gmail.com> wrote:
> Please see bellow.
>
> On 8/28/06, Neil Brown <neilb@suse.de> wrote:
> > On Sunday August 13, raziebe@gmail.com wrote:
> > > well ... me again
> > >
> > > Following your advice....
> > >
> > > I added a deadline for every WRITE stripe head when it is created.
> > > in raid5_activate_delayed i checked if deadline is expired and if not i am
> > > setting the sh to prereadactive mode as .
> > >
> > > This small fix (and a few other changes elsewhere in the code) reduced the
> > > number of reads to zero with dd, but with no improvement in throughput.
> > > With random access to the raid (buffers aligned to the stripe width
> > > and sized to the stripe width), however, there is an improvement of at least 20%.
> > >
> > > The problem is that a user must know what he is doing; otherwise there
> > > would be a reduction in performance if the deadline is too long (say 100 ms).
> >
> > So if I understand you correctly, you are delaying write requests to
> > partial stripes slightly (your 'deadline') and this is sometimes
> > giving you a 20% improvement ?
> >
> > I'm not surprised that you could get some improvement. 20% is quite
> > surprising. It would be worth following through with this to make
> > that improvement generally available.
> >
> > As you say, picking a time in milliseconds is very error prone. We
> > really need to come up with something more natural.
> > I had hoped that the 'unplug' infrastructure would provide the right
> > thing, but apparently not. Maybe unplug is just being called too
> > often.
> >
> > I'll see if I can duplicate this myself and find out what is really
> > going on.
> >
> > Thanks for the report.
> >
> > NeilBrown
> >
>
> Neil Hello. I am sorry for the interval; I was abruptly assigned to
> a different project.
>
> 1.
> I took another look at the raid5 delay patch I wrote a while
> ago. I ported it to 2.6.17 and tested it. It appears to work,
> and when used correctly it eliminates the read penalty.
>
> 2. Benchmarks.
> configuration:
> I am testing a 3-disk raid5 with a 1MB chunk size. IOs are
> synchronous and non-buffered (O_DIRECT), 2 MB in size and always
> aligned to the beginning of a stripe. kernel is 2.6.17. The
> stripe_delay was set to 10ms.
>
> Attached is the simple_write code.
>
> command :
> simple_write /dev/md1 2048 0 1000
> simple_write raw writes (O_DIRECT) sequentially
> starting from offset zero 2048 kilobytes 1000 times.
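(The attached simple_write itself is not reproduced in this archive. As a rough illustration of what the command above describes, a minimal O_DIRECT sequential writer, not the original tool, could look like the following; buffer alignment and argument handling are assumptions.)

#define _GNU_SOURCE             /* assumption: glibc/Linux, needed for O_DIRECT */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc < 5) {
                fprintf(stderr, "usage: %s <dev> <kb per write> <offset kb> <count>\n", argv[0]);
                return 1;
        }
        size_t len = (size_t)atoll(argv[2]) << 10;   /* e.g. 2048 KB = one full stripe */
        off_t  off = (off_t)atoll(argv[3]) << 10;
        long   n   = atol(argv[4]);

        int fd = open(argv[1], O_WRONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, len)) { perror("posix_memalign"); return 1; }
        memset(buf, 0, len);

        for (long i = 0; i < n; i++) {
                /* back-to-back writes, each aligned to the start of a stripe */
                if (pwrite(fd, buf, len, off) != (ssize_t)len) {
                        perror("pwrite");
                        return 1;
                }
                off += len;
        }
        close(fd);
        free(buf);
        return 0;
}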
>
> Benchmark Before patch
>
> sda 1848.00 8384.00 50992.00 8384 50992
> sdb 1995.00 12424.00 51008.00 12424 51008
> sdc 1698.00 8160.00 51000.00 8160 51000
> sdd 0.00 0.00 0.00 0 0
> md0 0.00 0.00 0.00 0 0
> md1 450.00 0.00 102400.00 0 102400
>
>
> Benchmark After patch
>
> sda 389.11 0.00 128530.69 0 129816
> sdb 381.19 0.00 129354.46 0 130648
> sdc 383.17 0.00 128530.69 0 129816
> sdd 0.00 0.00 0.00 0 0
> md0 0.00 0.00 0.00 0 0
> md1 1140.59 0.00 259548.51 0 262144
>
> As one can see, no additional reads were done. One can actually
> calculate the raid's utilization: (n-1)/n * (single disk throughput
> with 1M writes).
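(Reading the iostat columns above as the usual tps / Blk_read/s / Blk_wrtn/s / Blk_read / Blk_wrtn with 512-byte blocks, which is an assumption about how iostat was invoked, the post-patch numbers roughly bear this out:

    per member disk :  ~128530 blk/s * 512 B  ~=  63 MB/s
    array (md1)     :  ~259548 blk/s * 512 B  ~= 127 MB/s
    (n-1)/n of the raw aggregate:  2/3 * (3 * 63 MB/s)  ~= 126 MB/s

i.e. with n = 3 the array delivers about (n-1) times the per-disk write rate, which is the (n-1)/n utilization figure quoted above.)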
>
>
> 3. The patch code.
> Kernel tested above was 2.6.17. The patch is against 2.6.20.2
> because I noticed big code differences between .17 and .20.x.
> This patch was not tested on 2.6.20.2, but it is essentially the same. I
> have not tested (yet) degraded mode or any other uncommon paths.
>
This is along the same lines as what I am working on, new cache
policies for raid5/6, so I want to give it a try as well.
Unfortunately gmail has mangled your patch. Can you resend it as an
attachment?
patch: **** malformed patch at line 10:
(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
Thanks,
Dan
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: raid5 write performance
2007-04-01 23:08 ` Dan Williams
@ 2007-04-02 14:13 ` Raz Ben-Jehuda(caro)
0 siblings, 0 replies; 23+ messages in thread
From: Raz Ben-Jehuda(caro) @ 2007-04-02 14:13 UTC (permalink / raw)
To: Dan Williams; +Cc: Neil Brown, Linux RAID Mailing List
[-- Attachment #1: Type: text/plain, Size: 6706 bytes --]
On 4/2/07, Dan Williams <dan.j.williams@intel.com> wrote:
> On 3/30/07, Raz Ben-Jehuda(caro) <raziebe@gmail.com> wrote:
> > Please see below.
> >
> > On 8/28/06, Neil Brown <neilb@suse.de> wrote:
> > > On Sunday August 13, raziebe@gmail.com wrote:
> > > > well ... me again
> > > >
> > > > Following your advice....
> > > >
> > > > I added a deadline for every WRITE stripe head when it is created.
> > > > in raid5_activate_delayed I checked whether the deadline had expired, and if not I
> > > > set the sh to preread-active mode.
> > > >
> > > > This small fix (and a few other changes elsewhere in the code) reduced the
> > > > number of reads to zero with dd, but with no improvement in throughput.
> > > > With random access to the raid (buffers aligned to the stripe width
> > > > and sized to the stripe width), however, there is an improvement of at least 20%.
> > > >
> > > > The problem is that a user must know what he is doing; otherwise there
> > > > would be a reduction in performance if the deadline is too long (say 100 ms).
> > >
> > > So if I understand you correctly, you are delaying write requests to
> > > partial stripes slightly (your 'deadline') and this is sometimes
> > > giving you a 20% improvement ?
> > >
> > > I'm not surprised that you could get some improvement. 20% is quite
> > > surprising. It would be worth following through with this to make
> > > that improvement generally available.
> > >
> > > As you say, picking a time in milliseconds is very error prone. We
> > > really need to come up with something more natural.
> > > I had hoped that the 'unplug' infrastructure would provide the right
> > > thing, but apparently not. Maybe unplug is just being called too
> > > often.
> > >
> > > I'll see if I can duplicate this myself and find out what is really
> > > going on.
> > >
> > > Thanks for the report.
> > >
> > > NeilBrown
> > >
> >
> > Neil Hello. I am sorry for the interval; I was abruptly assigned to
> > a different project.
> >
> > 1.
> > I took another look at the raid5 delay patch I wrote a while
> > ago. I ported it to 2.6.17 and tested it. It appears to work,
> > and when used correctly it eliminates the read penalty.
> >
> > 2. Benchmarks.
> > configuration:
> > I am testing a 3-disk raid5 with a 1MB chunk size. IOs are
> > synchronous and non-buffered (O_DIRECT), 2 MB in size and always
> > aligned to the beginning of a stripe. kernel is 2.6.17. The
> > stripe_delay was set to 10ms.
> >
> > Attached is the simple_write code.
> >
> > command :
> > simple_write /dev/md1 2048 0 1000
> > simple_write raw writes (O_DIRECT) sequentially
> > starting from offset zero 2048 kilobytes 1000 times.
> >
> > Benchmark Before patch
> >
> > sda 1848.00 8384.00 50992.00 8384 50992
> > sdb 1995.00 12424.00 51008.00 12424 51008
> > sdc 1698.00 8160.00 51000.00 8160 51000
> > sdd 0.00 0.00 0.00 0 0
> > md0 0.00 0.00 0.00 0 0
> > md1 450.00 0.00 102400.00 0 102400
> >
> >
> > Benchmark After patch
> >
> > sda 389.11 0.00 128530.69 0 129816
> > sdb 381.19 0.00 129354.46 0 130648
> > sdc 383.17 0.00 128530.69 0 129816
> > sdd 0.00 0.00 0.00 0 0
> > md0 0.00 0.00 0.00 0 0
> > md1 1140.59 0.00 259548.51 0 262144
> >
> > As one can see, no additional reads were done. One can actually
> > calculate the raid's utilization: (n-1)/n * (single disk throughput
> > with 1M writes).
> >
> >
> > 3. The patch code.
> > Kernel tested above was 2.6.17. The patch is against 2.6.20.2
> > because I noticed big code differences between .17 and .20.x.
> > This patch was not tested on 2.6.20.2, but it is essentially the same. I
> > have not tested (yet) degraded mode or any other uncommon paths.
> >
> This is along the same lines as what I am working on, new cache
> policies for raid5/6, so I want to give it a try as well.
> Unfortunately gmail has mangled your patch. Can you resend it as an
> attachment?
>
> patch: **** malformed patch at line 10:
> (&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
>
> Thanks,
> Dan
>
Dan hello.
Attached are the patches. Also, I have added another test unit: random_writev.
It is not much code, but it does the work. It tests writing a
vector; it shows the same results as writing with a single buffer.
What are the new cache policies?
Please note!
I haven't indented the patch nor followed the instructions in the
SubmittingPatches document. If Neil approves this patch or parts
of it, I will do so.
# Benchmark 3: Testing an 8-disk raid5.
Tyan NUMA dual-CPU (AMD) machine with 8 SATA Maxtor disks; the controller
is a Promise in JBOD mode.
raid conf:
md1 : active raid5 sda2[0] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3]
sdc1[2] sdb2[1]
3404964864 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]
In order to achieve zero reads I had to tune the deadline to 20ms (so
long?). stripe_cache_size is 256, which is exactly what is needed to
perform a full stripe hit with this configuration.
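(That 256 figure is also what simple arithmetic suggests, assuming 4 KB pages: each stripe_head covers one page per member disk, so buffering one full chunk-wide stripe of a 1024 KB chunk takes 1024 KB / 4 KB = 256 stripe_heads, independent of the number of disks.)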
> command: random_writev /dev/md1 7168 0 3000 10000
iostats snapshot
avg-cpu: %user %nice %sys %iowait %idle
0.00 0.00 21.00 29.00 50.00
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
hda 0.00 0.00 0.00 0 0
md0 0.00 0.00 0.00 0 0
sda 234.34 0.00 50400.00 0 49896
sdb 235.35 0.00 50658.59 0 50152
sdc 242.42 0.00 51014.14 0 50504
sdd 246.46 0.00 50755.56 0 50248
sde 248.48 0.00 51272.73 0 50760
sdf 245.45 0.00 50755.56 0 50248
sdg 244.44 0.00 50755.56 0 50248
sdh 245.45 0.00 50755.56 0 50248
md1 1407.07 0.00 347741.41 0 344264
Try setting the stripe_cache_size to 255 and you will notice the delay.
Try lowering the stripe_deadline and you will notice how the number
of reads grows.
Cheers
--
Raz
[-- Attachment #2: raid5_write.c.patch --]
[-- Type: text/x-patch, Size: 5182 bytes --]
diff -ruN -X linux-2.6.20.2/Documentation/dontdiff linux-2.6.20.2/drivers/md/raid5.c linux-2.6.20.2-raid/drivers/md/raid5.c
--- linux-2.6.20.2/drivers/md/raid5.c 2007-03-09 20:58:04.000000000 +0200
+++ linux-2.6.20.2-raid/drivers/md/raid5.c 2007-03-30 12:37:55.000000000 +0300
@@ -65,6 +65,7 @@
#define NR_HASH (PAGE_SIZE / sizeof(struct hlist_head))
#define HASH_MASK (NR_HASH - 1)
+
#define stripe_hash(conf, sect) (&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
/* bio's attached to a stripe+device for I/O are linked together in bi_sector
@@ -234,6 +235,8 @@
sh->sector = sector;
sh->pd_idx = pd_idx;
sh->state = 0;
+ sh->active_preread_jiffies =
+ msecs_to_jiffies( atomic_read(&conf->deadline_ms) )+ jiffies;
sh->disks = disks;
@@ -628,6 +631,7 @@
clear_bit(R5_LOCKED, &sh->dev[i].flags);
set_bit(STRIPE_HANDLE, &sh->state);
+ sh->active_preread_jiffies = jiffies;
release_stripe(sh);
return 0;
}
@@ -1255,8 +1259,11 @@
bip = &sh->dev[dd_idx].towrite;
if (*bip == NULL && sh->dev[dd_idx].written == NULL)
firstwrite = 1;
- } else
+ } else{
bip = &sh->dev[dd_idx].toread;
+ sh->active_preread_jiffies = jiffies;
+ }
+
while (*bip && (*bip)->bi_sector < bi->bi_sector) {
if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
goto overlap;
@@ -2437,13 +2444,27 @@
-static void raid5_activate_delayed(raid5_conf_t *conf)
+static struct stripe_head* raid5_activate_delayed(raid5_conf_t *conf)
{
if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
while (!list_empty(&conf->delayed_list)) {
struct list_head *l = conf->delayed_list.next;
struct stripe_head *sh;
sh = list_entry(l, struct stripe_head, lru);
+
+ if( time_before(jiffies,sh->active_preread_jiffies) ){
+ PRINTK("deadline : no expire sec=%lld %8u %8u\n",
+ (unsigned long long) sh->sector,
+ jiffies_to_msecs(sh->active_preread_jiffies),
+ jiffies_to_msecs(jiffies));
+ return sh;
+ }
+ else{
+ PRINTK("deadline: expire:sec=%lld %8u %8u\n",
+ (unsigned long long)sh->sector,
+ jiffies_to_msecs(sh->active_preread_jiffies),
+ jiffies_to_msecs(jiffies));
+ }
list_del_init(l);
clear_bit(STRIPE_DELAYED, &sh->state);
if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
@@ -2451,6 +2472,7 @@
list_add_tail(&sh->lru, &conf->handle_list);
}
}
+ return NULL;
}
static void activate_bit_delay(raid5_conf_t *conf)
@@ -3191,7 +3213,7 @@
*/
static void raid5d (mddev_t *mddev)
{
- struct stripe_head *sh;
+ struct stripe_head *sh,*delayed_sh=NULL;
raid5_conf_t *conf = mddev_to_conf(mddev);
int handled;
@@ -3218,8 +3240,10 @@
atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
!blk_queue_plugged(mddev->queue) &&
!list_empty(&conf->delayed_list))
- raid5_activate_delayed(conf);
-
+ delayed_sh=raid5_activate_delayed(conf);
+
+ if(delayed_sh) break;
+
while ((bio = remove_bio_from_retry(conf))) {
int ok;
spin_unlock_irq(&conf->device_lock);
@@ -3254,9 +3278,51 @@
unplug_slaves(mddev);
PRINTK("--- raid5d inactive\n");
+ if (delayed_sh){
+ long wakeup=delayed_sh->active_preread_jiffies-jiffies;
+ PRINTK("--- raid5d inactive sleep for %d\n",
+ jiffies_to_msecs(wakeup) );
+ if (wakeup>0)
+ mddev->thread->timeout = wakeup;
+ }
+}
+
+static ssize_t
+raid5_show_stripe_deadline(mddev_t *mddev, char *page)
+{
+ raid5_conf_t *conf = mddev_to_conf(mddev);
+ if (conf)
+ return sprintf(page, "%d\n", atomic_read(&conf->deadline_ms));
+ else
+ return 0;
}
static ssize_t
+raid5_store_stripe_deadline(mddev_t *mddev, const char *page, size_t len)
+{
+ raid5_conf_t *conf = mddev_to_conf(mddev);
+ char *end;
+ int new;
+ if (len >= PAGE_SIZE)
+ return -EINVAL;
+ if (!conf)
+ return -ENODEV;
+ new = simple_strtoul(page, &end, 10);
+ if (!*page || (*end && *end != '\n') )
+ return -EINVAL;
+ if (new < 0 || new > 10000)
+ return -EINVAL;
+ atomic_set(&conf->deadline_ms,new);
+ return len;
+}
+
+static struct md_sysfs_entry
+raid5_stripe_deadline = __ATTR(stripe_deadline, S_IRUGO | S_IWUSR,
+ raid5_show_stripe_deadline,
+ raid5_store_stripe_deadline);
+
+
+static ssize_t
raid5_show_stripe_cache_size(mddev_t *mddev, char *page)
{
raid5_conf_t *conf = mddev_to_conf(mddev);
@@ -3297,6 +3363,9 @@
return len;
}
+
+
+
static struct md_sysfs_entry
raid5_stripecache_size = __ATTR(stripe_cache_size, S_IRUGO | S_IWUSR,
raid5_show_stripe_cache_size,
@@ -3318,8 +3387,10 @@
static struct attribute *raid5_attrs[] = {
&raid5_stripecache_size.attr,
&raid5_stripecache_active.attr,
+ &raid5_stripe_deadline.attr,
NULL,
};
+
static struct attribute_group raid5_attrs_group = {
.name = NULL,
.attrs = raid5_attrs,
@@ -3567,6 +3638,8 @@
blk_queue_merge_bvec(mddev->queue, raid5_mergeable_bvec);
+ atomic_set(&conf->deadline_ms,0);
+
return 0;
abort:
if (conf) {
[-- Attachment #3: raid5_write.h.patch --]
[-- Type: text/x-patch, Size: 765 bytes --]
diff -ruN -X linux-2.6.20.2/Documentation/dontdiff linux-2.6.20.2/include/linux/raid/raid5.h linux-2.6.20.2-raid/include/linux/raid/raid5.h
--- linux-2.6.20.2/include/linux/raid/raid5.h 2007-03-09 20:58:04.000000000 +0200
+++ linux-2.6.20.2-raid/include/linux/raid/raid5.h 2007-03-30 00:25:38.000000000 +0200
@@ -136,6 +136,7 @@
spinlock_t lock;
int bm_seq; /* sequence number for bitmap flushes */
int disks; /* disks in stripe */
+ unsigned long active_preread_jiffies;
struct r5dev {
struct bio req;
struct bio_vec vec;
@@ -254,6 +255,7 @@
* Free stripes pool
*/
atomic_t active_stripes;
+ atomic_t deadline_ms;
struct list_head inactive_list;
wait_queue_head_t wait_for_stripe;
wait_queue_head_t wait_for_overlap;
[-- Attachment #4: random_writev.cpp --]
[-- Type: text/x-c++src, Size: 2203 bytes --]
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for O_DIRECT */
#endif
#define _LARGEFILE64_SOURCE
#include <iostream>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <unistd.h>
using namespace std;
int main (int argc, char *argv[])
{
	if (argc < 6) {
		cout << "usage: <device name> <size to write in kb> <offset in kb> <diskSizeGB> <loops>" << endl;
		return 0;
	}
	char* dev_name = argv[1];
	int fd = open(dev_name, O_LARGEFILE | O_DIRECT | O_WRONLY, 0777);
	if (fd < 0) {
		perror("open");
		return -1;
	}
	long long write_sz_bytes  = ((long long)atoi(argv[2])) << 10;
	long long offset_sz_bytes = ((long long)atoi(argv[3])) << 10;
	long long diskSizeBytes   = ((long long)atoi(argv[4])) << 30;
	int loops = atoi(argv[5]);
	/* build an iovec of page-aligned 1MB buffers covering the write size;
	   O_DIRECT requires the alignment */
	struct iovec vec[10];
	int blocks = write_sz_bytes >> 20;
	if (blocks < 1 || blocks > 10) {
		printf("write size must be between 1MB and 10MB\n");
		return -1;
	}
	for (int i = 0; i < blocks; i++) {
		char* buffer = (char*)valloc(1 << 20);
		if (!buffer) {
			perror("alloc");
			return -1;
		}
		vec[i].iov_base = buffer;
		vec[i].iov_len  = 1048576;
		memset(buffer, 0x00, 1048576);
	}
	long long ret = 0;
	while (loops-- > 0) {
		if (lseek64(fd, offset_sz_bytes, SEEK_SET) < 0) {
			printf("%s: failed on lseek offset=%lld\n", dev_name, offset_sz_bytes);
			return 0;
		}
		ret = writev(fd, vec, blocks);
		if (ret != write_sz_bytes) {
			perror("failed to write");
			printf("write size=%lld offset=%lld\n", write_sz_bytes, offset_sz_bytes);
			return -1;
		}
		/* pick the next offset at random, aligned to the write size and
		   wrapped back into the device */
		long long rnd = (long long)random();
		offset_sz_bytes = write_sz_bytes * (rnd % diskSizeBytes);
		if (offset_sz_bytes > diskSizeBytes) {
			offset_sz_bytes = (offset_sz_bytes - diskSizeBytes) % diskSizeBytes;
			offset_sz_bytes = (offset_sz_bytes / write_sz_bytes) * write_sz_bytes;
		}
		printf("writing %lld bytes at offset %lld\n", ret, offset_sz_bytes);
	}
	return 0;
}
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: raid5 write performance
[not found] ` <5d96567b0704161329n5c3ca008p56df00baaa16eacb@mail.gmail.com>
@ 2007-04-19 8:28 ` Raz Ben-Jehuda(caro)
2007-04-19 9:20 ` Neil Brown
0 siblings, 1 reply; 23+ messages in thread
From: Raz Ben-Jehuda(caro) @ 2007-04-19 8:28 UTC (permalink / raw)
To: Neil Brown; +Cc: Linux RAID Mailing List
On 4/16/07, Raz Ben-Jehuda(caro) <raziebe@gmail.com> wrote:
> On 4/13/07, Neil Brown <neilb@suse.de> wrote:
> > On Saturday March 31, raziebe@gmail.com wrote:
> > >
> > > 4.
> > > I am going to work on this with other configurations, such as raid5's
> > > with more disks and raid50. I will be happy to hear your opinion on
> > > this matter. What puzzles me is why the deadline must be as long as 10 ms;
> > > the shorter the deadline, the more reads I get.
> >
> > I've finally had a bit of a look at this.
> >
> > The extra reads are being caused by the 3msec unplug
> > timeout. Once you plug a queue it will automatically get unplugged 3
> > msec later. When this happens, any stripes that are on the pending
> > list (waiting to see if more blocks will be written to them) get
> > processed and some pre-reading happens.
> >
> > If you remove the 3msec timeout (I changed it to 300msec) in
> > block/ll_rw_blk.c, the reads go away. However that isn't a good
> > solution.
> >
> > Your patch effectively ensures that a stripe gets to last at least N
> > msec before being unplugged and pre-reading starts.
> > Why does it need to be 10 msec? Let's see.
> >
> > When you start writing, you will quickly fill up the stripe cache and
> > then have to wait for stripes to be fully written and become free
> > before you can start attaching more write requests.
> > You could have to wait for a full chunk-wide stripe to be written
> > before another chunk of stripes can proceed. The first blocks of the
> > second stripe could stay in the stripe cache for the time it takes to
> > write out a stripe.
> >
> > With a 1024K chunk size and 30Meg/second write speed it will take 1/30
> > of a second to write out a chunk-wide stripe, or about 33msec. So I'm
> > surprised you get by with a deadline of 'only' 10msec. Maybe there is
> > some over-lapping of chunks that I wasn't taking into account (I did
> > oversimplify the model a bit).
> >
> > So, what is the right heuristic to use to determine when we should
> > start write-processing on an incomplete stripe? Obviously '3msec' is
> > bad.
> >
> > It seems we don't want to start processing incomplete stripes while
> > there are full stripes being written, but we also don't want to hold
> > up incomplete stripes forever if some other thread is successfully
> > writing complete stripes.
> >
> > So maybe something like this:
> > - We keep a (cyclic) counter of the number of stripes on which we
> > have started write, and the number which have completed.
> > - every time we add a write request to a stripe, we set the deadline
> > to 3msec in the future, and we record in the stripe the current
> > value of the number that have started write.
> > - We process a stripe requiring preread when both the deadline
> > has expired, and the count of completed writes reaches the recorded
> > count of commenced writes.
> >
> > Does that make sense? Would you like to try it?
> >
> > NeilBrown
> >
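(A rough sketch of the bookkeeping Neil describes above; every name here is hypothetical, none of this is from a posted patch. It only illustrates the two conditions: the deadline has expired AND all writes that had started when this stripe last received a bio have since completed.)

#include <linux/jiffies.h>

/* hypothetical fields; in a real patch they would live in raid5_conf_t
 * and struct stripe_head */
struct heuristic_conf {
        unsigned int writes_started;    /* stripes on which a write has started */
        unsigned int writes_completed;  /* stripes whose write has completed    */
};

struct heuristic_stripe {
        unsigned long preread_deadline; /* jiffies: now + 3 msec                */
        unsigned int  started_snapshot; /* writes_started when a bio was added  */
};

/* call whenever a write bio is attached to the stripe */
static void note_write_added(struct heuristic_conf *conf,
                             struct heuristic_stripe *sh)
{
        sh->preread_deadline = jiffies + msecs_to_jiffies(3);
        sh->started_snapshot = conf->writes_started;
}

/* a partial stripe may start pre-reading only when both conditions hold;
 * the subtraction keeps the comparison valid across counter wrap-around */
static int may_preread(struct heuristic_conf *conf,
                       struct heuristic_stripe *sh)
{
        return time_after_eq(jiffies, sh->preread_deadline) &&
               (int)(conf->writes_completed - sh->started_snapshot) >= 0;
}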
Neil Hello
I have been doing some thinking. I feel we should take a different path here.
In my tests I actually accumulate the user's buffers and, when ready, submit
them, an elevator-like algorithm.
The main problem is the number of IOs the stripe cache can hold, which is
too small. My suggestion is to add an elevator of bios in front of the
stripe cache, postponing allocation of a new stripe for as long as needed.
This way we will be able to move as many IOs as possible to the "raid logic"
without congesting it, while still filling stripes where possible.
Pseudo code:
make_request()
...
if IO direction is WRITE and IO not in stripe cache
add IO to raid elevator
..
raid5d()
...
Is there a set of IOs in raid elevator such that they make a full stripe
move IOs to raid handling
while oldest IO in raid elevator is deadlined( 3ms ? )
move IO to raid handling
....
Does it make any sense ?
thank you
--
Raz
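(Roughly fleshing out the pseudo code above in kernel-style C; all names are invented for illustration, and the "full stripe set" test, which is the genuinely hard part, is left out.)

#include <linux/bio.h>
#include <linux/list.h>
#include <linux/jiffies.h>

struct parked_bio {
        struct list_head list;
        struct bio *bio;
        unsigned long deadline;         /* jiffies + ~3 msec */
};

/* make_request() side: park a write bio that does not hit the stripe cache */
static void raid_elevator_add(struct list_head *elevator,
                              struct parked_bio *pb, struct bio *bio)
{
        pb->bio = bio;
        pb->deadline = jiffies + msecs_to_jiffies(3);
        list_add_tail(&pb->list, elevator);     /* kept in arrival order */
}

/* raid5d() side: hand over bios whose deadline has expired (the check for
 * "a set of bios that forms a full stripe" would also go here) */
static void raid_elevator_release(struct list_head *elevator,
                                  void (*handle)(struct bio *))
{
        struct parked_bio *pb, *next;

        list_for_each_entry_safe(pb, next, elevator, list) {
                if (time_before(jiffies, pb->deadline))
                        break;          /* list is oldest-first */
                list_del(&pb->list);
                handle(pb->bio);        /* caller owns and frees pb */
        }
}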
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: raid5 write performance
2007-04-19 8:28 ` Raz Ben-Jehuda(caro)
@ 2007-04-19 9:20 ` Neil Brown
0 siblings, 0 replies; 23+ messages in thread
From: Neil Brown @ 2007-04-19 9:20 UTC (permalink / raw)
To: Raz Ben-Jehuda(caro); +Cc: Linux RAID Mailing List
On Thursday April 19, raziebe@gmail.com wrote:
>
> Neil Hello
> I have been doing some thinking. I feel we should take a different path here.
> In my tests I actually accumulate the user's buffers and, when ready, submit
> them, an elevator-like algorithm.
>
> The main problem is the number of IOs the stripe cache can hold, which is
> too small. My suggestion is to add an elevator of bios in front of the
> stripe cache, postponing allocation of a new stripe for as long as needed.
> This way we will be able to move as many IOs as possible to the "raid logic"
> without congesting it, while still filling stripes where possible.
>
> Pseudo code:
>
> make_request()
> ...
> if IO direction is WRITE and IO not in stripe cache
> add IO to raid elevator
> ..
>
> raid5d()
> ...
> Is there a set of IOs in raid elevator such that they make a full stripe
> move IOs to raid handling
> while oldest IO in raid elevator is deadlined( 3ms ? )
> move IO to raid handling
> ....
>
> Does it make any sense ?
Yes.
The "Is there a set of IOs in raid elevator such that they make a full
stripe" would be hard to calculate.
However the concept is still fine.
In make request, if we cannot get a stripe_head without blocking, just
add the request to a list.
Once the number of active stripes drops below 75% (or 50% or
whatever), raid5 reprocesses all the bios on the list; some will get
added, some might not until next time.
The fact that a single bio can require many stripe_heads adds an
awkwardness. You would have to be able to store a partially-processed
request on the list, but we do that in retry_aligned_read, so we know
it is possible. Possibly the same code can be used for
retry_aligned_read and for retry_delayed_write.
And we can treat writes and reads the same - if no stripe_head is
available, stick it on a queue.
Another issue to be aware of is that write-throttling in the VM
depends on the fact that each device has a limited queue. Just
sticking everything on a list defeats that. So we do need to impose
some limit on the number of requests in the queue. Possibly we limit
the requests on the queue to some multiple of a full stripe.
NeilBrown
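(As an illustration only, not code from raid5.c and with all field and helper names invented, the scheme above could look roughly like this; bios are chained through bi_next, the same trick the existing retry_aligned_read path uses.)

#include <linux/bio.h>

/* hypothetical state; in a real patch it would live in raid5_conf_t */
static struct bio *deferred_head, *deferred_tail;
static int deferred_count;
static int deferred_limit = 64;         /* some multiple of a full stripe */

/* make_request() side: instead of sleeping for a free stripe_head,
 * park the (possibly partially processed) bio, as long as the queue
 * stays bounded so VM write throttling still works */
static int defer_bio(struct bio *bi)
{
        if (deferred_count >= deferred_limit)
                return 0;               /* caller blocks, as today */
        bi->bi_next = NULL;
        if (deferred_tail)
                deferred_tail->bi_next = bi;
        else
                deferred_head = bi;
        deferred_tail = bi;
        deferred_count++;
        return 1;
}

/* raid5d() side: once active stripes fall below ~75% of the cache,
 * re-feed the parked bios through the normal request path */
static void replay_deferred(int active_stripes, int max_stripes,
                            void (*requeue)(struct bio *))
{
        if (active_stripes * 4 >= max_stripes * 3)
                return;
        while (deferred_head) {
                struct bio *bi = deferred_head;
                deferred_head = bi->bi_next;
                if (!deferred_head)
                        deferred_tail = NULL;
                bi->bi_next = NULL;
                deferred_count--;
                requeue(bi);
        }
}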
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2007-04-19 9:20 UTC | newest]
Thread overview: 23+ messages
2006-07-02 14:02 raid5 write performance Raz Ben-Jehuda(caro)
2006-07-02 22:35 ` Neil Brown
2006-08-13 13:19 ` Raz Ben-Jehuda(caro)
2006-08-28 4:32 ` Neil Brown
2007-03-30 21:44 ` Raz Ben-Jehuda(caro)
2007-03-31 21:28 ` Bill Davidsen
2007-03-31 23:03 ` Raz Ben-Jehuda(caro)
2007-04-01 2:16 ` Bill Davidsen
2007-04-01 23:08 ` Dan Williams
2007-04-02 14:13 ` Raz Ben-Jehuda(caro)
[not found] ` <17950.50209.580439.607958@notabene.brown>
[not found] ` <5d96567b0704161329n5c3ca008p56df00baaa16eacb@mail.gmail.com>
2007-04-19 8:28 ` Raz Ben-Jehuda(caro)
2007-04-19 9:20 ` Neil Brown
-- strict thread matches above, loose matches on Subject: below --
2005-11-18 14:05 Jure Pečar
2005-11-18 19:19 ` Dan Stromberg
2005-11-18 19:23 ` Mike Hardy
2005-11-19 4:40 ` Guy
2005-11-19 4:57 ` Mike Hardy
2005-11-19 5:54 ` Neil Brown
2005-11-19 11:59 ` Farkas Levente
2005-11-20 23:39 ` Neil Brown
2005-11-19 19:52 ` Carlos Carvalho
2005-11-20 19:54 ` Paul Clements
2005-11-19 5:56 ` Guy