From: Lee Ward <lee@sandia.gov>
To: lustre-devel@lists.lustre.org
Subject: [Lustre-devel] SeaStar message priority
Date: Wed, 1 Apr 2009 08:26:45 -0600
Message-ID: <1238596005.5091.31.camel@wheel>
In-Reply-To: <74EA92D6-25E8-42DA-A4FD-BDDCED233244@Sun.COM>

On Tue, 2009-03-31 at 22:43 -0600, Oleg Drokin wrote:
> Hello!
> 
>    It came to my attention that seastar network does not implement  
> message priorities for various reasons.

That is incorrect. The SeaStar network does implement at least one
priority scheme, based on age. If I remember right, it's not something
an application can play with.

>    I really think there is very valid case for the priorities of some  
> sort to allow MPI and other
>    latency-critical traffic to go in front of bulk IO traffic on the  
> wire.

That would be very difficult to implement without making starvation
scenarios trivial.

>    Consider this test I was running the other day on Jaguar. The  
> application writes 250M of data from every
>    core with plain write() system call, the write() syscall returns  
> very fast (less than 0.5 sec == 400+Mb/sec
>    app-perceived bandwidth) because the data just goes to the memory  
> cache to be flushed later.
>    Then I do 2 barriers one by one with nothing in between.
>    If I run it at sufficient scale (say 1200 cores), the first barrier  
> takes 4.5 seconds to complete and
>    the second one 1.5 seconds, all due to MPI RPCs being stuck behind  
> huge bulk data requests on the clients,
>    presumably (I do not have any other good explanations at least).
>    This makes for a lot of wasted time in applications that would like  
> to use the buffering capabilities provided
>    by the OS.

I strongly suspect OS jitter, probably related to FS activity, is a much
more likely explanation for the above. If just one node has the
process/rank suspended then it can't service the barrier; all will wait
until it can.
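
To make the "all will wait" point concrete, here is a minimal sketch in
Python. It assumes an idealized barrier that releases every rank the
moment the last one arrives and ignores network latency entirely. The
function name and the example arrival times are hypothetical, not
measurements.

```python
def barrier_wait_times(arrival_times):
    """Return how long each rank waits in an idealized barrier.

    Assumption: the barrier releases all ranks at the instant the
    slowest rank arrives; message latency is ignored.
    """
    release = max(arrival_times)
    return [release - t for t in arrival_times]


if __name__ == "__main__":
    # Hypothetical case: three ranks arrive on time, one was suspended
    # (for example by FS-related OS activity) for 4.5 seconds.
    arrivals = [0.0, 0.0, 0.0, 4.5]
    print(barrier_wait_times(arrivals))
```

Every on-time rank waits for the full delay of the single late rank, so
one suspended process is enough to stretch the barrier for the whole
job.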

Jitter gets a bad rap, usually for good reason. However, in this case,
it doesn't seem like something to worry much about, as it will cease.
Your test says the 1st barrier after the write completes in 4.5 sec and
the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty
rapidly. Jitter is really only bad when it is chronic.

To me, you are worrying way too much about the situation immediately
after a write. Checkpoints are relatively rare, with long periods
between. Why worry about something that's only going to affect a very
small portion of the overall job? As long as the jitter dissipates in a
short time, things will work out fine.

Maybe you could convince yourself of the efficacy of write-back caching
in this scenario by altering the app to do an fsync() after the write
phase on the node but before the barrier? If the app can get back to
computing, even with the jitter-disrupted barrier, faster than it could
by waiting for the outstanding dirty buffers to be flushed, then it's a
net win to just live with the jitter, no?
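
A rough sketch of the per-node half of that experiment, in Python for
brevity. It only times the buffered write and the fsync() that follows.
The function name, the chunk size and the file name are made up for
illustration, and the barrier itself is left out; in the real test it
would follow the fsync().

```python
import os
import tempfile
import time


def time_write_then_fsync(path, nbytes, chunk=1 << 20):
    """Return (write_seconds, fsync_seconds) for writing nbytes to path.

    write_seconds is what the app perceives with write-back caching;
    fsync_seconds is the extra time spent waiting for dirty buffers.
    """
    buf = b"\0" * chunk
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        t0 = time.perf_counter()
        remaining = nbytes
        while remaining > 0:
            # os.write may write fewer bytes than asked; loop until done.
            n = os.write(fd, buf[:min(chunk, remaining)])
            remaining -= n
        t1 = time.perf_counter()
        os.fsync(fd)  # block until the data has been flushed
        t2 = time.perf_counter()
    finally:
        os.close(fd)
    return t1 - t0, t2 - t1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        # Small hypothetical size; the original test wrote 250M per core.
        w, f = time_write_then_fsync(os.path.join(d, "ckpt.dat"), 8 << 20)
        print(f"write: {w:.3f}s  fsync: {f:.3f}s")
        # The barrier under test would be issued here.
```

Comparing write + fsync + barrier against write + barrier alone, at
scale, is what would show whether living with the jitter is the cheaper
option.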

		--Lee

> 
>    Do you think something like this could be organized if not for  
> current revision then at least for the next
>    version?
> 
> Bye,
>      Oleg
