* Re: experiences with raid5: stripe_queue patches
2007-10-15 15:03 experiences with raid5: stripe_queue patches Bernd Schubert
@ 2007-10-15 16:40 ` Justin Piszcz
2007-10-16 2:01 ` Neil Brown
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: Justin Piszcz @ 2007-10-15 16:40 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Dan Williams, linux-raid, neilb
On Mon, 15 Oct 2007, Bernd Schubert wrote:
> Hi,
>
> in order to tune raid performance I did some benchmarks with and without the
> stripe queue patches. 2.6.22 is only for comparison to rule out other
> effects, e.g. the new scheduler, etc.
> It seems there is a regression with these patches regarding the re-write
> performance; as you can see, it's almost 50% of what it should be.
>
> write re-write read re-read
> 480844.26 448723.48 707927.55 706075.02 (2.6.22 w/o SQ patches)
> 487069.47 232574.30 709038.28 707595.09 (2.6.23 with SQ patches)
> 469865.75 438649.88 711211.92 703229.00 (2.6.23 without SQ patches)
>
> Benchmark details:
>
> 3xraid5 over 4 partitions of the very same hardware raid (in the end that's
> raid65: raid6 in hardware and raid5 in software; we need to do that).
>
> chunk size: 8192
> stripe_cache_size: 8192 each
> readahead of the md*: 65535 (well actually it limits itself to 65528)
> readahead of the underlying partitions: 16384
>
> filesystem: xfs
>
> Testsystem: 2 x Quadcore Xeon 1.86 GHz (E5320)
>
> An interesting effect to notice: Without these patches the pdflush daemons
> will take a lot of CPU time, with these patches, pdflush almost doesn't
> appear in the 'top' list.
>
> Actually we would prefer one single raid5 array, but then one single raid5
> thread will run with 100% CPU time, leaving 7 CPUs in idle state; the status of
> the hardware raid says its utilization is only at about 50% and we only see
> writes at about 200 MB/s.
> On the contrary, with 3 different software raid5 sets the i/o to the hardware
> raid systems is the bottleneck.
>
> Is there any chance to parallelize the raid5 code? I think almost everything is
> done in raid5.c make_request(), but the main loop there is spin_locked by
> prepare_to_wait(). Would it be possible not to lock this entire loop?
>
>
> Thanks,
> Bernd
>
> --
> Bernd Schubert
> Q-Leap Networks GmbH
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Excellent questions! I look forward to reading this thread :)
Justin.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: experiences with raid5: stripe_queue patches
2007-10-15 15:03 experiences with raid5: stripe_queue patches Bernd Schubert
2007-10-15 16:40 ` Justin Piszcz
@ 2007-10-16 2:01 ` Neil Brown
[not found] ` <BAY125-W2D0CD53AC925A85655321A59C0@phx.gbl>
2007-10-16 17:31 ` Dan Williams
3 siblings, 0 replies; 6+ messages in thread
From: Neil Brown @ 2007-10-16 2:01 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Dan Williams, linux-raid
On Monday October 15, bs@q-leap.de wrote:
> Hi,
>
> in order to tune raid performance I did some benchmarks with and without the
> stripe queue patches. 2.6.22 is only for comparison to rule out other
> effects, e.g. the new scheduler, etc.
Thanks!
> It seems there is a regression with these patches regarding the re-write
> performance; as you can see, it's almost 50% of what it should be.
>
> write re-write read re-read
> 480844.26 448723.48 707927.55 706075.02 (2.6.22 w/o SQ patches)
> 487069.47 232574.30 709038.28 707595.09 (2.6.23 with SQ patches)
> 469865.75 438649.88 711211.92 703229.00 (2.6.23 without SQ patches)
I wonder if it is a fairness issue. One concern I have about that new
code is that it seems to allow full stripes to bypass incomplete
stripes in the queue indefinitely. So an incomplete stripe might be
delayed a very long time.
I've had a bit of time to think about these patches and experiment a
bit.
I think we should think about the stripe queue in four parts:
A/ those that have scheduled some write requests
B/ those that have scheduled some pre-read requests
C/ those that can start writing without any preread
D/ those that need some preread before we write.
The original code lets C flow directly to A, and lets D move into B in
bursts, i.e. once B becomes empty, all of D moves to B.
The new code further restricts D to only move to B when the total size
of A+B is below some limit.
I think that including the size of A is good, as it gives stripes on D
more chance to move to C by getting more blocks attached. However, it
is bad because it makes it easier for stripes on C to overtake
stripes on D.
I made a tiny change to raid5_activate_delayed so that the while loop
aborts if "atomic_read(&conf->active_stripes) < 32".
This (in a very coarse way) limits D moving to B when A+B is more than
a certain size, and it had a similar effect to the SQ patches on a
simple sequential write test. But it still allowed some pre-read
requests (that shouldn't be needed) to slip through.
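To illustrate, here is roughly where that abort sits; this is a sketch
abbreviated from the 2.6.23 raid5.c, not the exact change I tested:

/* Rough sketch only, abbreviated from 2.6.23 raid5.c; not the exact patch. */
static void raid5_activate_delayed(raid5_conf_t *conf)
{
	if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
		while (!list_empty(&conf->delayed_list)) {
			struct list_head *l = conf->delayed_list.next;
			struct stripe_head *sh;

			/* the tiny change: bail out of the loop under the
			 * condition quoted above */
			if (atomic_read(&conf->active_stripes) < 32)
				break;

			sh = list_entry(l, struct stripe_head, lru);
			list_del_init(l);
			clear_bit(STRIPE_DELAYED, &sh->state);
			if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
				atomic_inc(&conf->preread_active_stripes);
			list_add_tail(&sh->lru, &conf->handle_list);
		}
	}
}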
I think we should:
Keep a precise count of the size of A
Only allow the D->B transition when A < one-full-stripe
Limit the extent to which C can leapfrog D.
I'm not sure how best to do this yet. Something simple but fair
is needed.
>
> An interesting effect to notice: Without these patches the pdflush daemons
> will take a lot of CPU time, with these patches, pdflush almost doesn't
> appear in the 'top' list.
Maybe the patches move processing time from make_request into raid5d,
thus moving it from pdflush to raid5d. Does raid5d appear higher in
the list....
>
> Actually we would prefer one single raid5 array, but then one single raid5
> thread will run with 100% CPU time, leaving 7 CPUs in idle state; the status of
> the hardware raid says its utilization is only at about 50% and we only see
> writes at about 200 MB/s.
> On the contrary, with 3 different software raid5 sets the i/o to the hardware
> raid systems is the bottleneck.
>
> Is there any chance to parallelize the raid5 code? I think almost everything is
> done in raid5.c make_request(), but the main loop there is spin_locked by
> prepare_to_wait(). Would it be possible not to lock this entire loop?
I think you want multiple raid5d threads - that is where most of the
work is done. That is just a case of creating them and keeping track
of them so they can be destroyed when appropriate, and - possibly the
trickiest bit - waking them up at the right time, so they share the
load without wasteful wakeups.
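For example, the bookkeeping half might look something like the sketch
below, using the existing md_register_thread()/md_unregister_thread()
interface; the pool structure, constant and function name are invented
for illustration, and the wakeup policy - the interesting part - is left
completely open:

/* Hypothetical sketch only: register several raid5d-style threads for one
 * array using the existing md thread API.  The pool structure, the constant
 * and the function name are invented for illustration; how and when each
 * worker gets woken is the open question and is not addressed here. */
#define RAID5_NR_WORKERS	4

struct raid5_worker_pool {
	mdk_thread_t *worker[RAID5_NR_WORKERS];
};

static int raid5_start_workers(mddev_t *mddev, struct raid5_worker_pool *pool)
{
	int i;

	for (i = 0; i < RAID5_NR_WORKERS; i++) {
		/* each thread runs the normal raid5d() main loop */
		pool->worker[i] = md_register_thread(raid5d, mddev, "%s_raid5");
		if (!pool->worker[i])
			goto abort;
	}
	return 0;

abort:
	while (i--)
		md_unregister_thread(pool->worker[i]);
	return -ENOMEM;
}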
NeilBrown
^ permalink raw reply [flat|nested] 6+ messages in thread
[parent not found: <BAY125-W2D0CD53AC925A85655321A59C0@phx.gbl>]
* RE: experiences with raid5: stripe_queue patches
[not found] ` <BAY125-W2D0CD53AC925A85655321A59C0@phx.gbl>
@ 2007-10-16 2:04 ` Neil Brown
0 siblings, 0 replies; 6+ messages in thread
From: Neil Brown @ 2007-10-16 2:04 UTC (permalink / raw)
To: 彭席汉; +Cc: Bernd Schubert, linux-raid, dan.j.williams
On Tuesday October 16, pengxihan@hotmail.com wrote:
> In my opinion, one of the ways to increase raid5 performance may be to
> change STRIPE_SIZE to a bigger value. It is now equal to PAGE_SIZE. But we
> can't just change it simply; some other code must be modified. I haven't
> tried :). Maybe Neil can give us some advice!
I really don't think increasing STRIPE_SIZE would have much effect. You
might save a bit of CPU (though it might cost you too), but it shouldn't
change the order of requests, so it wouldn't affect disk io much.
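(For context, the snippet below is abbreviated from the 2.6.23-era
include/linux/raid/raid5.h; it shows why bumping STRIPE_SIZE is not a
one-line change.)

/* Abbreviated from include/linux/raid/raid5.h (2.6.23 era).  Each
 * per-device slot (struct r5dev) inside a stripe_head caches exactly one
 * struct page of data, so growing STRIPE_SIZE means reworking that
 * page-per-slot assumption and the copy/xor paths, not just editing a
 * constant. */
#define STRIPE_SIZE		PAGE_SIZE
#define STRIPE_SHIFT		(PAGE_SHIFT - 9)
#define STRIPE_SECTORS		(STRIPE_SIZE>>9)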
I really think this is a scheduling problem. We need to make sure to
send the requests in the right order without delaying things
needlessly, and avoid reads whenever possible.
NeilBrown
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: experiences with raid5: stripe_queue patches
2007-10-15 15:03 experiences with raid5: stripe_queue patches Bernd Schubert
` (2 preceding siblings ...)
[not found] ` <BAY125-W2D0CD53AC925A85655321A59C0@phx.gbl>
@ 2007-10-16 17:31 ` Dan Williams
2007-10-17 16:59 ` Bernd Schubert
3 siblings, 1 reply; 6+ messages in thread
From: Dan Williams @ 2007-10-16 17:31 UTC (permalink / raw)
To: Bernd Schubert; +Cc: linux-raid, neilb
On Mon, 2007-10-15 at 08:03 -0700, Bernd Schubert wrote:
> Hi,
>
> in order to tune raid performance I did some benchmarks with and without
> the stripe queue patches. 2.6.22 is only for comparison to rule out other
> effects, e.g. the new scheduler, etc.
Thanks for testing!
> It seems there is a regression with these patches regarding the re-write
> performance; as you can see, it's almost 50% of what it should be.
>
> write re-write read re-read
> 480844.26 448723.48 707927.55 706075.02 (2.6.22 w/o SQ patches)
> 487069.47 232574.30 709038.28 707595.09 (2.6.23 with SQ patches)
> 469865.75 438649.88 711211.92 703229.00 (2.6.23 without SQ patches)
A quick way to verify that it is a fairness issue is to simply not
promote full stripe writes to their own list, debug patch follows:
---
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index eb7fd10..755aafb 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -162,7 +162,7 @@ static void __release_queue(raid5_conf_t *conf, struct stripe_queue *sq)
 	if (to_write &&
 	    io_weight(sq->overwrite, disks) == data_disks) {
-		list_add_tail(&sq->list_node, &conf->io_hi_q_list);
+		list_add_tail(&sq->list_node, &conf->io_lo_q_list);
 		queue_work(conf->workqueue, &conf->stripe_queue_work);
 	} else if (io_weight(sq->to_read, disks)) {
 		list_add_tail(&sq->list_node, &conf->io_lo_q_list);
---
<snip>
>
> An interesting effect to notice: Without these patches the pdflush daemons
> will take a lot of CPU time, with these patches, pdflush almost doesn't
> appear in the 'top' list.
>
> Actually we would prefer one single raid5 array, but then one single raid5
> thread will run with 100% CPU time, leaving 7 CPUs in idle state; the status
> of the hardware raid says its utilization is only at about 50% and we only
> see writes at about 200 MB/s.
> On the contrary, with 3 different software raid5 sets the i/o to the
> hardware raid systems is the bottleneck.
>
> Is there any chance to parallelize the raid5 code? I think almost everything
> is done in raid5.c make_request(), but the main loop there is spin_locked by
> prepare_to_wait(). Would it be possible not to lock this entire loop?
I made a rough attempt at multi-threading raid5 [1] a while back.
However, this configuration only helps affinity; it does not address the
cases where the load needs to be further rebalanced between CPUs.
>
>
> Thanks,
> Bernd
>
[1] http://marc.info/?l=linux-raid&m=117262977831208&w=2
Note this implementation incorrectly handles the raid6 spare_page; we
would need a spare_page per CPU.
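(A per-cpu spare page might look roughly like the sketch below; the array
and helper name are invented, and the user has to stay on one cpu, e.g.
with preempt_disable(), while the page is in use.)

/* Hypothetical sketch: one raid6 spare page per CPU instead of the single
 * conf->spare_page.  The spare_page[] array (NR_CPUS entries, assumed
 * zeroed, e.g. part of a kzalloc'ed conf) and the function name are
 * invented here for illustration. */
static int raid6_alloc_spare_pages(struct page **spare_page)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		spare_page[cpu] = alloc_page(GFP_KERNEL);
		if (!spare_page[cpu])
			goto out_free;
	}
	return 0;

out_free:
	/* free whatever was already allocated */
	for_each_possible_cpu(cpu) {
		if (spare_page[cpu]) {
			put_page(spare_page[cpu]);
			spare_page[cpu] = NULL;
		}
	}
	return -ENOMEM;
}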
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: experiences with raid5: stripe_queue patches
2007-10-16 17:31 ` Dan Williams
@ 2007-10-17 16:59 ` Bernd Schubert
0 siblings, 0 replies; 6+ messages in thread
From: Bernd Schubert @ 2007-10-17 16:59 UTC (permalink / raw)
To: Dan Williams; +Cc: linux-raid, neilb
Hello Dan, hello Neil,
thanks for your help!
On Tuesday 16 October 2007 19:31:08 Dan Williams wrote:
> On Mon, 2007-10-15 at 08:03 -0700, Bernd Schubert wrote:
> > Hi,
> >
> > in order to tune raid performance I did some benchmarks with and without
> > the stripe queue patches. 2.6.22 is only for comparison to rule out other
> > effects, e.g. the new scheduler, etc.
>
> Thanks for testing!
>
> > It seems there is a regression with these patches regarding the re-write
> > performance; as you can see, it's almost 50% of what it should be.
> >
> > write re-write read re-read
> > 480844.26 448723.48 707927.55 706075.02 (2.6.22 w/o SQ patches)
> > 487069.47 232574.30 709038.28 707595.09 (2.6.23 with SQ patches)
> > 469865.75 438649.88 711211.92 703229.00 (2.6.23 without SQ patches)
>
> A quick way to verify that it is a fairness issue is to simply not
> promote full stripe writes to their own list, debug patch follows:
I tested with that and the rewrite performance is better, but still not
perfect:
write re-write read re-read
461794.14 377896.27 701793.81 693018.02
[...]
> I made a rough attempt at multi-threading raid5 [1] a while back.
> However, this configuration only helps affinity; it does not address the
> cases where the load needs to be further rebalanced between CPUs.
>
> > Thanks,
> > Bernd
>
> [1] http://marc.info/?l=linux-raid&m=117262977831208&w=2
> Note this implementation incorrectly handles the raid6 spare_page; we
> would need a spare_page per CPU.
Ah great, I will test this on Friday.
Thanks,
Bernd
--
Bernd Schubert
Q-Leap Networks GmbH
^ permalink raw reply [flat|nested] 6+ messages in thread