All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nic Henke <nic@cray.com>
To: lustre-devel@lists.lustre.org
Subject: [Lustre-devel] SeaStar message priority
Date: Wed, 01 Apr 2009 07:55:38 -0500	[thread overview]
Message-ID: <49D3644A.9090001@cray.com> (raw)
In-Reply-To: <74EA92D6-25E8-42DA-A4FD-BDDCED233244@Sun.COM>

Oleg Drokin wrote:
> Hello!
>
>    It came to my attention that seastar network does not implement  
> message priorities for various reasons.
>    I really think there is very valid case for the priorities of some  
> sort to allow MPI and other
>    latency-critical traffic to go in front of bulk IO traffic on the  
> wire.
>   
In the ptllnd, the bulk traffic is setup via short messages, so if the
barrier is sent right after the write() returns, it really isn't backed
up behind the bulk data.

>    Consider this test I was running the other day on Jaguar. The  
> application writes 250M of data from every
>    core with plain write() system call, the write() syscall returns  
> very fast (less than 0.5 sec == 400+Mb/sec
>    app-perceived bandwidth) because the data just goes to the memory  
> cache to be flushed later.
>    Then I do 2 barriers one by one with nothing in between.
>    If I run it at sufficient scale (say 1200 cores), the first barrier  
> takes 4.5 seconds to complete and
>    the second one 1.5 seconds, all due to MPI RPCs being stuck behind  
> huge bulk data requests on the clients,
>    presumably (I do not have any other good explanations at least).
>    This makes for a lot of wasted time in applications that would like  
> to use the buffering capabilities provided
>    by the OS.
>   
This sounds much more like barrier jitter than backup. The network is
capable of servicing the 250M in < .15s. It would be my guess that some
of the writes() are taking longer than others and this is causing the
barrier to be delayed.

A few questions:
- how many OSS/OSTs are you writing to ?
- can you post the MPI app you are using to do this ?

The application folks @ ORNL should be able to help you use Craypat or
Apprentice to get some runtime data on this app to find where the time
is going. Until we have hard data, I don't think we can blame the network.

Cheers,
Nic

  parent reply	other threads:[~2009-04-01 12:55 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-04-01  4:43 [Lustre-devel] SeaStar message priority Oleg Drokin
2009-04-01  5:10 ` Andrew C. Uselton
2009-04-01 12:55 ` Nic Henke [this message]
2009-04-01 15:02   ` Oleg Drokin
2009-04-01 14:26 ` Lee Ward
2009-04-01 15:14   ` Oleg Drokin
2009-04-01 15:58     ` Lee Ward
2009-04-01 16:20       ` Eric Barton
2009-04-01 16:35       ` Oleg Drokin
2009-04-01 19:13         ` Lee Ward
2009-04-01 20:17           ` Oleg Drokin
2009-04-02  2:46             ` Oleg Drokin
2009-04-02  4:28               ` Lee Ward
2009-04-01 19:15         ` Nicholas Henke
2009-04-01 19:26           ` Oleg Drokin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=49D3644A.9090001@cray.com \
    --to=nic@cray.com \
    --cc=lustre-devel@lists.lustre.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.