[Lustre-devel] SeaStar message priority

From: Nicholas Henke <nic@cray.com>
To: lustre-devel@lists.lustre.org
Subject: [Lustre-devel] SeaStar message priority
Date: Wed, 01 Apr 2009 14:15:53 -0500
Message-ID: <49D3BD69.1070900@cray.com>
In-Reply-To: <F939F52D-614F-4CA8-A80F-50CE31E3303A@Sun.COM>

Oleg Drokin wrote:
> Hello!
> 
> On Apr 1, 2009, at 11:58 AM, Lee Ward wrote:
>> I think my point was that there is already a priority scheme in the
>> Seastar. Are there additional bits related to priority that you might
>> use, also?
> 
> But if we cannot use it, there is none.
> Like we want mpi rpcs go out first to some degree.

If we have to deal with ordering, we are already sunk. The Lustre RPCs will
still go out and affect MPI latency to some degree, introducing jitter into the
calls and affecting application performance.

> 
> But since the only thing I have in my app inside barriers is write call,
> there is no much way to desynchronize.

Incorrect: you are running your app on all 4 CPUs on the node at the same time
that Lustre is sending RPCs. The kernel threads will get scheduled and run,
pushing your app aside and desynchronizing the barrier for the app as a whole.

> No, I do not think they would like the idea to forfeit 1/4 of their
> CPU just so io is better.
> If the jitter is due to cpu occupied with io, and apps stalled due to this
> (though I have hard time believing an app to be not given a cpu for
> 4.5 seconds, even though there are potentially 4 idle cpus, or even 3
> (remember other cores are also idle waiting on a barrier).

This gets easier to swallow in the future with 12-core and larger nodes: giving
up 1/12 of the CPUs is much easier than giving up 1/4.

What we really need to "prove" is where the delay is occurring. The MPI_Barrier
messages are 0-byte sends, which effectively turns them into bare Portals
headers, and these are sent and processed very fast. In fact, the total amount
of data being sent is _much_ less than the NIC is capable of. A rough estimate
of what 2 nodes talking to each other can sustain is 1700 MB/s and 50K LNet
pings/s.
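
To put those rough numbers in perspective, here is a minimal Python sketch of
the headroom arithmetic. The two ceilings are the estimates quoted above
(taking MB as 10^6 bytes, which is an assumption); the function name and the
sample load in the usage note are made up for illustration and are not
measurements from this thread.

```python
# Rough per-link ceilings from the estimate above. Treating "MB" as 10**6
# bytes is an assumption; the original figure does not say.
NIC_BANDWIDTH_BYTES_PER_S = 1700 * 10**6
NIC_MESSAGES_PER_S = 50_000


def nic_headroom(bytes_per_s, messages_per_s):
    """Return the fraction of each rough ceiling that a given load uses.

    Both arguments describe a hypothetical per-node load; nothing here is
    measured on real hardware.
    """
    return {
        "bandwidth_fraction": bytes_per_s / NIC_BANDWIDTH_BYTES_PER_S,
        "message_fraction": messages_per_s / NIC_MESSAGES_PER_S,
    }
```

For example, nic_headroom(100 * 10**6, 1000) describes a placeholder load of
100 MB/s and 1,000 messages/s. If both fractions come out small for the real
workload, the NIC itself is an unlikely place for a multi-second barrier delay.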

One thing to try is changing your aprun to use fewer CPUs per node, running it
with -N set to 1, 2, and 3 in turn:

  aprun -n 1200 -N [1,2,3] -cc 1-3

The -cc 1-3 will keep it off CPU 0, a known location for some IRQs and other
servicing.

You should also try to capture compute-node stats like CPU usage, number of
threads active during the barrier, etc., to help narrow down where the time is
going.
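
One possible way to grab those stats, as a minimal Python sketch: it samples
/proc/stat twice and reports the busy fraction per CPU plus the kernel's
procs_running count. This assumes a Linux compute node where Python is
available and /proc/stat is readable; the function names are mine, and this is
not an existing Cray or Lustre tool.

```python
import time


def parse_proc_stat(text):
    """Parse /proc/stat text.

    Returns ({cpu_name: (busy_ticks, total_ticks)}, procs_running).
    Only per-CPU lines (cpu0, cpu1, ...) are kept; the aggregate "cpu"
    line is skipped.
    """
    cpus = {}
    procs_running = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[0]
        if name.startswith("cpu") and name != "cpu":
            # user nice system idle iowait irq softirq steal; older kernels
            # may report fewer columns, and guest columns are left out
            # because guest time is already counted in user time.
            ticks = [int(v) for v in fields[1:9]]
            total = sum(ticks)
            idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
            cpus[name] = (total - idle, total)
        elif name == "procs_running":
            procs_running = int(fields[1])
    return cpus, procs_running


def busy_fraction(before, after):
    """Fraction of elapsed ticks each CPU spent non-idle between samples."""
    result = {}
    for name, (busy1, total1) in after.items():
        busy0, total0 = before[name]
        elapsed = total1 - total0
        result[name] = (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return result


def sample(interval=1.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart.

    Returns (per-CPU busy fractions, procs_running from the second sample).
    """
    with open(path) as f:
        before, _ = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after, running = parse_proc_stat(f.read())
    return busy_fraction(before, after), running
```

Calling sample() in a loop while the app sits in the barrier should show
whether the cores are actually idle or busy running kernel threads at that
point. Note that procs_running counts runnable tasks node-wide, so it is only a
coarse proxy for "threads active during the barrier".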

Nic
