* [Lustre-devel] SeaStar message priority
@ 2009-04-01 4:43 Oleg Drokin
2009-04-01 5:10 ` Andrew C. Uselton
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Oleg Drokin @ 2009-04-01 4:43 UTC (permalink / raw)
To: lustre-devel
Hello!
It came to my attention that the SeaStar network does not implement
message priorities, for various reasons.
I really think there is a very valid case for priorities of some sort,
to allow MPI and other latency-critical traffic to go in front of bulk
IO traffic on the wire.
Consider this test I was running the other day on Jaguar. The
application writes 250M of data from every core with a plain write()
system call. The write() syscall returns very quickly (in less than
0.5 sec, i.e. 400+ MB/sec of app-perceived bandwidth) because the data
just goes to the memory cache, to be flushed later.
Then I do 2 barriers back to back with nothing in between.
If I run it at sufficient scale (say 1200 cores), the first barrier
takes 4.5 seconds to complete and the second one 1.5 seconds,
presumably all due to MPI RPCs being stuck behind huge bulk data
requests on the clients (at least, I do not have any other good
explanation).
This makes for a lot of wasted time in applications that would like
to use the buffering capabilities provided by the OS.
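The shape of the test described above (a buffered write, then two timed
barriers with nothing in between) can be sketched locally as follows.
This is only an illustration and not the attached test program: threads
stand in for MPI ranks and threading.Barrier for MPI_Barrier, which are
assumptions made so the sketch runs on a single machine, and the name
run_write_then_barriers is hypothetical.

```python
import os
import tempfile
import threading
import time


def run_write_then_barriers(num_workers=4, bytes_per_worker=1 << 20):
    """Time a buffered write followed by two back-to-back barriers.

    Threads and threading.Barrier are stand-ins (assumptions) for MPI
    ranks and MPI_Barrier; the real test ran across many nodes.
    """
    barrier = threading.Barrier(num_workers)
    results = [None] * num_workers
    payload = b"\0" * bytes_per_worker

    def worker(rank, directory):
        path = os.path.join(directory, f"rank{rank}.dat")
        barrier.wait()                    # start all writes together
        t0 = time.perf_counter()
        with open(path, "wb") as f:
            f.write(payload)              # goes to the page cache, no fsync
        t1 = time.perf_counter()
        barrier.wait()                    # first barrier after the write
        t2 = time.perf_counter()
        barrier.wait()                    # second barrier, nothing in between
        t3 = time.perf_counter()
        results[rank] = {
            "write": t1 - t0,
            "barrier1": t2 - t1,
            "barrier2": t3 - t2,
        }

    with tempfile.TemporaryDirectory() as directory:
        threads = [
            threading.Thread(target=worker, args=(rank, directory))
            for rank in range(num_workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return results
```

Timing the write and each barrier separately per worker is what lets
slow writes be told apart from slow barriers.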
Do you think something like this could be organized, if not for the
current revision then at least for the next version?
Bye,
Oleg
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Lustre-devel] SeaStar message priority
2009-04-01 4:43 [Lustre-devel] SeaStar message priority Oleg Drokin
@ 2009-04-01 5:10 ` Andrew C. Uselton
2009-04-01 12:55 ` Nic Henke
2009-04-01 14:26 ` Lee Ward
2 siblings, 0 replies; 15+ messages in thread
From: Andrew C. Uselton @ 2009-04-01 5:10 UTC (permalink / raw)
To: lustre-devel
I wonder if that scenario may have some bearing on the results I've
mentioned at:
http://www.nersc.gov/~uselton/frank_jag/
It would be interesting to step through the logic if anyone is
interested in doing so. The web page itself is terse, so feel free to
bug me for details if you have not seen this before.
Cheers,
Andrew
Oleg Drokin wrote:
> Hello!
>
> It came to my attention that seastar network does not implement
> message priorities for various reasons.
> I really think there is very valid case for the priorities of some
> sort to allow MPI and other
> latency-critical traffic to go in front of bulk IO traffic on the
> wire.
> Consider this test I was running the other day on Jaguar. The
> application writes 250M of data from every
> core with plain write() system call, the write() syscall returns
> very fast (less than 0.5 sec == 400+Mb/sec
> app-perceived bandwidth) because the data just goes to the memory
> cache to be flushed later.
> Then I do 2 barriers one by one with nothing in between.
> If I run it at sufficient scale (say 1200 cores), the first barrier
> takes 4.5 seconds to complete and
> the second one 1.5 seconds, all due to MPI RPCs being stuck behind
> huge bulk data requests on the clients,
> presumably (I do not have any other good explanations at least).
> This makes for a lot of wasted time in applications that would like
> to use the buffering capabilities provided
> by the OS.
>
> Do you think something like this could be organized if not for
> current revision then at least for the next
> version?
>
> Bye,
> Oleg
>
>
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
* [Lustre-devel] SeaStar message priority
2009-04-01 4:43 [Lustre-devel] SeaStar message priority Oleg Drokin
2009-04-01 5:10 ` Andrew C. Uselton
@ 2009-04-01 12:55 ` Nic Henke
2009-04-01 15:02 ` Oleg Drokin
2009-04-01 14:26 ` Lee Ward
2 siblings, 1 reply; 15+ messages in thread
From: Nic Henke @ 2009-04-01 12:55 UTC (permalink / raw)
To: lustre-devel
Oleg Drokin wrote:
> Hello!
>
> It came to my attention that seastar network does not implement
> message priorities for various reasons.
> I really think there is very valid case for the priorities of some
> sort to allow MPI and other
> latency-critical traffic to go in front of bulk IO traffic on the
> wire.
>
In the ptllnd, the bulk traffic is set up via short messages, so if the
barrier is sent right after the write() returns, it really isn't backed
up behind the bulk data.
> Consider this test I was running the other day on Jaguar. The
> application writes 250M of data from every
> core with plain write() system call, the write() syscall returns
> very fast (less than 0.5 sec == 400+Mb/sec
> app-perceived bandwidth) because the data just goes to the memory
> cache to be flushed later.
> Then I do 2 barriers one by one with nothing in between.
> If I run it at sufficient scale (say 1200 cores), the first barrier
> takes 4.5 seconds to complete and
> the second one 1.5 seconds, all due to MPI RPCs being stuck behind
> huge bulk data requests on the clients,
> presumably (I do not have any other good explanations at least).
> This makes for a lot of wasted time in applications that would like
> to use the buffering capabilities provided
> by the OS.
>
This sounds much more like barrier jitter than backup. The network is
capable of servicing the 250M in < .15s. My guess is that some of the
write() calls are taking longer than others and this is causing the
barrier to be delayed.
A few questions:
- How many OSS/OSTs are you writing to?
- Can you post the MPI app you are using to do this?
The application folks @ ORNL should be able to help you use Craypat or
Apprentice to get some runtime data on this app to find where the time
is going. Until we have hard data, I don't think we can blame the network.
Cheers,
Nic
* [Lustre-devel] SeaStar message priority
2009-04-01 4:43 [Lustre-devel] SeaStar message priority Oleg Drokin
2009-04-01 5:10 ` Andrew C. Uselton
2009-04-01 12:55 ` Nic Henke
@ 2009-04-01 14:26 ` Lee Ward
2009-04-01 15:14 ` Oleg Drokin
2 siblings, 1 reply; 15+ messages in thread
From: Lee Ward @ 2009-04-01 14:26 UTC (permalink / raw)
To: lustre-devel
On Tue, 2009-03-31 at 22:43 -0600, Oleg Drokin wrote:
> Hello!
>
> It came to my attention that seastar network does not implement
> message priorities for various reasons.
That is incorrect. The seastar network does implement at least one
priority scheme based on age. It's not something an application can play
with if I remember right.
> I really think there is very valid case for the priorities of some
> sort to allow MPI and other
> latency-critical traffic to go in front of bulk IO traffic on the
> wire.
That would be very difficult to implement without making starvation
scenarios trivial.
> Consider this test I was running the other day on Jaguar. The
> application writes 250M of data from every
> core with plain write() system call, the write() syscall returns
> very fast (less than 0.5 sec == 400+Mb/sec
> app-perceived bandwidth) because the data just goes to the memory
> cache to be flushed later.
> Then I do 2 barriers one by one with nothing in between.
> If I run it at sufficient scale (say 1200 cores), the first barrier
> takes 4.5 seconds to complete and
> the second one 1.5 seconds, all due to MPI RPCs being stuck behind
> huge bulk data requests on the clients,
> presumably (I do not have any other good explanations at least).
> This makes for a lot of wasted time in applications that would like
> to use the buffering capabilities provided
> by the OS.
I strongly suspect OS jitter, probably related to FS activity, is a much
more likely explanation for the above. If just one node has the
process/rank suspended then it can't service the barrier; all will wait
until it can.
Jitter gets a bad rap. Usually for good reason. However, in this case,
it doesn't seem something to worry overly much about as it will cease.
Your test says the 1st barrier after the write completes in 4.5 sec and
the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty
rapidly. Jitter is really only bad when it is chronic.
To me, you are worrying way too much about the situation immediately
after a write. Checkpoints are relatively rare, with long periods
between. Why worry about something that's only going to affect a very
small portion of the overall job? As long as the jitter dissipates in a
short time, things will work out fine.
Maybe you could convince yourself of the efficacy of write-back caching
in this scenario by altering the app to do an fsync() after the write
phase on the node but before the barrier? If the app can get back to
computing, even with the jitter-disrupted barrier, faster than it could
by waiting for the outstanding dirty buffers to be flushed then it's a
net win to just live with the jitter, no?
--Lee
>
> Do you think something like this could be organized if not for
> current revision then at least for the next
> version?
>
> Bye,
> Oleg
* [Lustre-devel] SeaStar message priority
2009-04-01 12:55 ` Nic Henke
@ 2009-04-01 15:02 ` Oleg Drokin
0 siblings, 0 replies; 15+ messages in thread
From: Oleg Drokin @ 2009-04-01 15:02 UTC (permalink / raw)
To: lustre-devel
Hello!
On Apr 1, 2009, at 8:55 AM, Nic Henke wrote:
>> It came to my attention that seastar network does not implement
>> message priorities for various reasons.
>> I really think there is very valid case for the priorities of some
>> sort to allow MPI and other
>> latency-critical traffic to go in front of bulk IO traffic on the
>> wire.
> In the ptllnd, the bulk traffic is setup via short messages, so if the
> barrier is sent right after the write() returns, it really isn't
> backed
> up behind the bulk data.
Yes, it is.
Lustre starts to send RPCs as soon as 1M (+16) pages of data per RPC
become available for sending.
So by the time the write() syscall for 250M returns, I potentially
already have 16 (stripe count) * 4 (core count) * 8 (RPCs in flight)
= 512 MB in flight from this particular node (since chances are the
OSTs have already accepted the transfers if there are free threads).
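As a quick check of that estimate, the product can be written out as
below. This is a minimal sketch: the function name max_in_flight_mb is
hypothetical, and the 1 MB RPC size is taken from the "1M ... per RPC"
figure above.

```python
def max_in_flight_mb(stripe_count, cores_per_node, rpcs_in_flight, rpc_size_mb=1):
    """Upper bound on bulk data in flight from one node, in MB.

    Assumes every core writes to stripe_count OSTs and each of those
    streams keeps rpcs_in_flight RPCs of rpc_size_mb outstanding.
    """
    return stripe_count * cores_per_node * rpcs_in_flight * rpc_size_mb


# Figures from the message: 16 stripes, 4 cores, 8 RPCs in flight, 1 MB each.
print(max_in_flight_mb(16, 4, 8))  # 512
```

With 4 cores each writing 250M, that bound is roughly half of the
1000 MB the node has to send in total.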
> This sounds much more like barrier jitter than backup. The network is
> capable of servicing the 250M in < .15s. It would be my guess that
> some
> of the writes() are taking longer than others and this is causing the
> barrier to be delayed.
No.
I time each individual write separately.
I know all writes start at the same time (there is a barrier before
them), and I know that each write finishes in approx. 0.5 sec as well.
> A few questions:
> - how many OSS/OSTs are you writing to ?
up to 16 * 4 from a single node.
> - can you post the MPI app you are using to do this ?
Sure.
Attached. (with example output)
> The application folks @ ORNL should be able to help you use Craypat or
> Apprentice to get some runtime data on this app to find where the time
> is going. Until we have hard data, I don't think we can blame the
> network.
Interesting idea.
Please notice that if I run the code at a scale of 4, the barrier is
instant. As I scale up the node count, the barrier time begins to rise.
In the output you can see I run the code twice in a row.
This is done to make sure the grant is primed, in case it was not, so
that the entire amount of data is taken into the cache (otherwise in
some runs some individual writes take significant time to complete,
invalidating the test).
Another thing of note is that, since I did not want to take any
chances, the working files are precreated externally so that no files
share any OST for a single node, and the app itself just opens the
files rather than creating them.
Bye,
Oleg
-------------- next part --------------
A non-text attachment was scrubbed...
Name: writespeed-big.c
Type: application/octet-stream
Size: 3955 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20090401/adcae1f6/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: writespeed_big.o555745
Type: application/octet-stream
Size: 320418 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20090401/adcae1f6/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: writespeed_big.pbs
Type: application/octet-stream
Size: 362 bytes
Desc: not available
URL: <http://lists.lustre.org/pipermail/lustre-devel-lustre.org/attachments/20090401/adcae1f6/attachment-0002.obj>
* [Lustre-devel] SeaStar message priority
2009-04-01 14:26 ` Lee Ward
@ 2009-04-01 15:14 ` Oleg Drokin
2009-04-01 15:58 ` Lee Ward
0 siblings, 1 reply; 15+ messages in thread
From: Oleg Drokin @ 2009-04-01 15:14 UTC (permalink / raw)
To: lustre-devel
Hello!
On Apr 1, 2009, at 10:26 AM, Lee Ward wrote:
>> It came to my attention that seastar network does not implement
>> message priorities for various reasons.
> That is incorrect. The seastar network does implement at least one
> priority scheme based on age. It's not something an application can
> play
> with if I remember right.
Well, then it's as good as none for our purposes, I think?
> I strongly suspect OS jitter, probably related to FS activity, is a
> much
> more likely explanation for the above. If just one node has the
> process/rank suspended then it can't service the barrier; All will
> wait
> until it can.
That's of course right and possible too.
Though given how nothing else is running on the nodes, I would think
it is somewhat irrelevant, since there is nothing else to give
resources to.
The Lustre processing of the outgoing queue is pretty fast in itself at
this phase.
Do you think it would be useful if I just ran 1 thread per node, so
that there would be 3 empty cores to absorb all the jitter there might
be?
> Jitter gets a bad rap. Usually for good reason. However, in this case,
> it doesn't seem something to worry overly much about as it will cease.
> Your test says the 1st barrier after the write completes in 4.5 sec
> and
> the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty
> rapidly. Jitter is really only bad when it is chronic.
Well, 4.5*1200 = 1.5 hours of completely wasted CPU time for my
specific job.
So I thought it would be a good idea to get to the root of it.
We hear many arguments here at the lab along the lines of "what good
is buffered IO to me when my app performance is degraded if I don't do
a sync? I'll just do the sync and be done with it". Of course I believe
there is still a benefit to not doing the sync, but that's just me.
> To me, you are worrying way too much about the situation immediately
> after a write. Checkpoints are relatively rare, with long periods
> between. Why worry about something that's only going to affect a very
> small portion of the overall job? As long as the jitter dissipates
> in a
> short time, things will work out fine.
I worry about it specifically because users tend to do a sync after
the write, and that wastes a lot of time. So as a result, I want as
much data as possible to enter the cache and then trickle out all by
itself, and I want users not to see any bad effects (or otherwise to
show them that there are still benefits).
> Maybe you could convince yourself of the efficacy of write-back
> caching
> in this scenario by altering the app to do an fsync() after the write
> phase on the node but before the barrier? If the app can get back to
> computing, even with the jitter-disrupted barrier, faster than it
> could
> by waiting for the outstanding dirty buffers to be flushed then it's a
> net win to just live with the jitter, no?
I do not need to convince myself. It's the app programmers that are
fixated on "oh, look, my program is slower after the write if I do not
do sync, I must do sync!"
Bye,
Oleg
* [Lustre-devel] SeaStar message priority
2009-04-01 15:14 ` Oleg Drokin
@ 2009-04-01 15:58 ` Lee Ward
2009-04-01 16:20 ` Eric Barton
2009-04-01 16:35 ` Oleg Drokin
0 siblings, 2 replies; 15+ messages in thread
From: Lee Ward @ 2009-04-01 15:58 UTC (permalink / raw)
To: lustre-devel
On Wed, 2009-04-01 at 09:14 -0600, Oleg Drokin wrote:
> Hello!
>
> On Apr 1, 2009, at 10:26 AM, Lee Ward wrote:
> >> It came to my attention that seastar network does not implement
> >> message priorities for various reasons.
> > That is incorrect. The seastar network does implement at least one
> > priority scheme based on age. It's not something an application can
> > play
> > with if I remember right.
>
> Well, then it's as good as none for our purposes, I think?
Other than that traffic moves (only very roughly) in a fair manner and
that packets from different nodes can arrive out of order, I guess.
I think my point was that there is already a priority scheme in the
Seastar. Are there additional bits related to priority that you might
use, also?
>
> > I strongly suspect OS jitter, probably related to FS activity, is a
> > much
> > more likely explanation for the above. If just one node has the
> > process/rank suspended then it can't service the barrier; All will
> > wait
> > until it can.
>
> That's of course right and possible too.
> Though given how nothing else is running on the nodes, I would think
> it is somewhat irrelevant, since there is nothing else to give
> resources to.
How and where memory is used on two nodes is different. How, where,
when, scheduling occurs on two nodes is different. Any two nodes, even
running the same app with barrier synchronization, perform things at
different times outside of the barriers; They very quickly desynchronize
in the presence of jitter.
> The Lustre processing of the outgoing queue is pretty fast in itself at
> this phase.
> Do you think it would be useful if I just run 1 thread per node, there
> would be
> 3 empty cores to adsorb all the jitter there might be then?
You will still get jitter. I would hope less, though, so it wouldn't
hurt to try to leave at least one idle core. We've toyed with the idea
of leaving a core idle for IO and other background processing in the
past. The idea was a non-starter with our apps folks though. Maybe the
ORNL folks will feel differently?
>
> > Jitter gets a bad rap. Usually for good reason. However, in this case,
> > it doesn't seem something to worry overly much about as it will cease.
> > Your test says the 1st barrier after the write completes in 4.5 sec
> > and
> > the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty
> > rapidly. Jitter is really only bad when it is chronic.
>
> Well, 4.5*1200 = 1.5 hours of completely wasted cputime for my
> specific job.
That 1200 is the number of checkpoints? If so, I agree. If it's the
number of nodes, I do not.
> So I thought it would be a good idea to get to the root of it.
> We hear many arguments here at the lab that "what good the buffered io
> is for
> me when my app performance is degraded if I don't do sync. I'll just do
> the sync and be over with it". Of course I believe there is still
> benefit to not
> doing the sync, but that's just me.
If the time to settle the jitter is on the order of 10 seconds but it
takes 15 seconds to sync, it would be better to live with the jitter,
no? I suggested an experiment to make this comparison. Why argue with
them? Just do the experiment and you will know which strategy is better.
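Once both numbers have been measured, the comparison is a one-liner. A
minimal sketch, assuming the two inputs come from real timing runs; the
function name better_strategy and the tie-breaking choice are
illustrative assumptions.

```python
def better_strategy(jitter_settle_s, sync_s):
    """Pick the cheaper way to get back to computing after a write phase.

    jitter_settle_s: measured time for post-write barrier delays to settle.
    sync_s: measured time to flush the dirty buffers with a sync.
    Returns "write-back" when living with the jitter costs less,
    otherwise "sync" (ties go to "sync", an arbitrary choice).
    """
    return "write-back" if jitter_settle_s < sync_s else "sync"


# The hypothetical figures from the paragraph above: 10 s of jitter vs 15 s to sync.
print(better_strategy(10, 15))  # write-back
```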
>
> > To me, you are worrying way too much about the situation immediately
> > after a write. Checkpoints are relatively rare, with long periods
> > between. Why worry about something that's only going to affect a very
> > small portion of the overall job? As long as the jitter dissipates
> > in a
> > short time, things will work out fine.
>
> I worry abut it specifically because users tend to do sync after the
> write and that
> wastes a lot of time. So as a result - I want as much of data to enter
> into cache
> and then trickle out all by itself and I want users not to see any bad
> effects
> (or otherwise to show to them that there are still benefits).
Users tend to do sync for more reasons than making the IO deterministic.
They should be doing it so that they can have some faith that the last
checkpoint is actually persistent when interrupted.
However, they should do the sync right before they enter the IO phase,
in order to also get the benefits of write-back caching. Not after the
IO phase. In the event of an interrupt, this forces them to throw away
an in-progress checkpoint and the last one before that, to be safe, but
the one before the last should be good.
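One way to read "sync right before the IO phase" is sketched below:
flush the previous checkpoint file, then write the new one into the
cache without waiting for it. This is an illustration under
assumptions, not code from any application in this thread: it uses
fsync() on one file per checkpoint, and the file naming and function
names are hypothetical.

```python
import os


def checkpoint_path(directory, step):
    """Hypothetical naming scheme: one file per checkpoint step."""
    return os.path.join(directory, f"ckpt_{step}.dat")


def write_checkpoint(directory, step, data):
    """Flush the previous checkpoint, then write the new one unsynced.

    Returns the step whose file was just flushed with fsync(), or None
    if there was no previous checkpoint to flush.
    """
    flushed_step = None
    previous = checkpoint_path(directory, step - 1)
    if os.path.exists(previous):
        fd = os.open(previous, os.O_RDWR)
        try:
            os.fsync(fd)          # wait for the previous checkpoint only
        finally:
            os.close(fd)
        flushed_step = step - 1
    with open(checkpoint_path(directory, step), "wb") as f:
        f.write(data)             # new checkpoint stays in the write-back cache
    return flushed_step
```

The new checkpoint has the whole compute phase to trickle out, so the
fsync() at the start of the next IO phase usually has little left to
wait for.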
The apps could also be more reasonable about their checkpoints, I've
noticed. Often, for us anyway, the machine just behaves. An app could
begin by assuming the machine is unreliable but, as it runs for longer
and longer periods, it could (I argue should) allow the period between
checkpoints to grow. If the idea is to make progress, as I'm told, then
on a well behaved machine far fewer checkpoints are required. Most apps,
though, just use a fixed period and waste a lot of time doing their
checkpoints when the machine is being nice to them.
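A growing checkpoint period of that kind could look like the sketch
below. It is only one possible policy: the doubling factor, the 10
minute starting period, the 2 hour cap, and the reset on failure are
all assumptions chosen for illustration.

```python
def next_checkpoint_interval(current_s, failed, base_s=600.0, growth=2.0, cap_s=7200.0):
    """Return the wait before the next checkpoint, in seconds.

    Start pessimistic (base_s), lengthen the period while the machine
    behaves, and fall back to base_s after any failure. All defaults
    are illustrative assumptions.
    """
    if failed:
        return base_s
    return min(current_s * growth, cap_s)


# A well-behaved run: 600 -> 1200 -> 2400 -> 4800 -> 7200 (capped).
interval = 600.0
for _ in range(4):
    interval = next_checkpoint_interval(interval, failed=False)
print(interval)  # 7200.0
```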
>
> > Maybe you could convince yourself of the efficacy of write-back
> > caching
> > in this scenario by altering the app to do an fsync() after the write
> > phase on the node but before the barrier? If the app can get back to
> > computing, even with the jitter-disrupted barrier, faster than it
> > could
> > by waiting for the outstanding dirty buffers to be flushed then it's a
> > net win to just live with the jitter, no?
>
> I do not need to convince myself. IT's the app programmers that are
> fixated
> on "oh, look, my program is slower after the write if I do not do
> sync, I must
> do sync!"
Try the experiment. Show them the data. They are, in theory, reasoning
people, right?
In some cases, your app programmers will be unfortunately correct. An
app that uses so much memory that the system cannot buffer the entire
write will incur at least some issues while doing IO; Some of the IO
must move synchronously and that amount will differ from node to node.
This will have the effect of magnifying this post-IO jitter they are so
worried about. It is also why I wrote in the original requirements for
Lustre that if write-back caching is employed there must be a way to
turn it off.
If they aren't sizing their app for the node's physical memory, though,
I would think that the experiment should show that write-back caching is
a win.
--Lee
>
> Bye,
> Oleg
>
* [Lustre-devel] SeaStar message priority
2009-04-01 15:58 ` Lee Ward
@ 2009-04-01 16:20 ` Eric Barton
2009-04-01 16:35 ` Oleg Drokin
1 sibling, 0 replies; 15+ messages in thread
From: Eric Barton @ 2009-04-01 16:20 UTC (permalink / raw)
To: lustre-devel
Lee,
I completely agree with your comments on measurement. I'd
really, really like to see some.
Cheers,
Eric
* [Lustre-devel] SeaStar message priority
2009-04-01 15:58 ` Lee Ward
2009-04-01 16:20 ` Eric Barton
@ 2009-04-01 16:35 ` Oleg Drokin
2009-04-01 19:13 ` Lee Ward
2009-04-01 19:15 ` Nicholas Henke
1 sibling, 2 replies; 15+ messages in thread
From: Oleg Drokin @ 2009-04-01 16:35 UTC (permalink / raw)
To: lustre-devel
Hello!
On Apr 1, 2009, at 11:58 AM, Lee Ward wrote:
> I think my point was that there is already a priority scheme in the
> Seastar. Are there additional bits related to priority that you might
> use, also?
But if we cannot use it, there is none.
For example, we want MPI RPCs to go out first, to some degree.
>>> I strongly suspect OS jitter, probably related to FS activity, is a
>>> much
>>> more likely explanation for the above. If just one node has the
>>> process/rank suspended then it can't service the barrier; All will
>>> wait
>>> until it can.
>> That's of course right and possible too.
>> Though given how nothing else is running on the nodes, I would think
>> it is somewhat irrelevant, since there is nothing else to give
>> resources to.
> How and where memory is used on two nodes is different. How, where,
That's irrelevant.
> when, scheduling occurs on two nodes is different. Any two nodes, even
> running the same app with barrier synchronization, perform things at
> different times outside of the barriers; They very quickly
> desynchronize
> in the presence of jitter.
But since the only thing I have in my app between the barriers is the
write call, there is not much room to desynchronize.
>> The Lustre processing of the outgoing queue is pretty fast in
>> itself at
>> this phase.
>> Do you think it would be useful if I just run 1 thread per node,
>> there
>> would be
>> 3 empty cores to adsorb all the jitter there might be then?
> You will still get jitter. I would hope less, though, so it wouldn't
> hurt to try to leave at least one idle core. We've toyed with the idea
> of leaving a core idle for IO and other background processing in the
> past. The idea was a non-starter with our apps folks though. Maybe the
> ORNL folks will feel differently?
No, I do not think they would like the idea of forfeiting 1/4 of their
CPUs just so IO is better.
And that assumes the jitter is due to the CPU being occupied with IO,
with the apps stalled as a result, though I have a hard time believing
an app would not be given a CPU for 4.5 seconds when there are
potentially 4 idle CPUs, or at least 3 (remember, the other cores are
also idle, waiting on the barrier).
>>> Jitter gets a bad rap. Usually for good reason. However, in this
>>> case,
>>> it doesn't seem something to worry overly much about as it will
>>> cease.
>>> Your test says the 1st barrier after the write completes in 4.5 sec
>>> and
>>> the 2nd in 1.5 sec. That seems to imply the jitter is settling
>>> pretty
>>> rapidly. Jitter is really only bad when it is chronic.
>> Well, 4.5*1200 = 1.5 hours of completely wasted cputime for my
>> specific job.
> That 1200 is the number of checkpoints? If so, I agree. If it's the
> number of nodes, I do not.
1200 is the number of cores waiting on the barrier.
Every core spends 4.5 seconds there, so the total wasted single-core
CPU time is 1.5 hours.
And the more often this happens, the worse it gets.
>> So I thought it would be a good idea to get to the root of it.
>> We hear many arguments here at the lab that "what good the buffered
>> io
>> is for
>> me when my app performance is degraded if I don't do sync. I'll
>> just do
>> the sync and be over with it". Of course I believe there is still
>> benefit to not
>> doing the sync, but that's just me.
> If the time to settle the jitter is on the order of 10 seconds but it
> takes 15 seconds to sync, it would be better to live with the jitter,
> no? I suggested an experiment to make this comparison. Why argue with
> them? just do the experiment and you can know which strategy is
> better.
I know which one is better. I did the experiment (though I have no
realistic way to measure when the "jitter" settles out).
>>> To me, you are worrying way too much about the situation immediately
>>> after a write. Checkpoints are relatively rare, with long periods
>>> between. Why worry about something that's only going to affect a
>>> very
>>> small portion of the overall job? As long as the jitter dissipates
>>> in a
>>> short time, things will work out fine.
>> I worry abut it specifically because users tend to do sync after the
>> write and that
>> wastes a lot of time. So as a result - I want as much of data to
>> enter
>> into cache
>> and then trickle out all by itself and I want users not to see any
>> bad
>> effects
>> (or otherwise to show to them that there are still benefits).
> Users tend to do sync for more reasons than making the IO
> deterministic.
> They should be doing it so that they can have some faith that the last
> checkpoint is actually persistent when interrupted.
For that they only need to do fsync before their next checkpoint,
to make sure that the previous one completed.
> However, they should do the sync right before they enter the IO phase,
> in order to also get the benefits of write-back caching. Not after the
> IO phase. In the event of an interrupt, this forces them to throw away
> an in-progress checkpoint and the last one before that, to be safe,
> but
> the one before the last should be good.
Right.
Yet they run some microbenchmark and decide it is a bad idea.
Besides, reducing jitter, or whatever the cause of the delays is,
would still be useful.
> In some cases, your app programmers will be unfortunately correct. An
> app that uses so much memory that the system cannot buffer the entire
> write will incur at least some issues while doing IO; Some of the IO
> must move synchronously and that amount will differ from node to node.
> This will have the effect of magnifying this post-IO jitter they are
> so
> worried about. It is also why I wrote in the original requirements for
Why would it? There is still potentially a benefit, up to the available
cache size.
> Lustre that if write-back caching is employed there must be a way to
> turn it off.
There are around 3 ways to do that that I am aware of.
Bye,
Oleg
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Lustre-devel] SeaStar message priority
2009-04-01 16:35 ` Oleg Drokin
@ 2009-04-01 19:13 ` Lee Ward
2009-04-01 20:17 ` Oleg Drokin
2009-04-01 19:15 ` Nicholas Henke
1 sibling, 1 reply; 15+ messages in thread
From: Lee Ward @ 2009-04-01 19:13 UTC (permalink / raw)
To: lustre-devel
On Wed, 2009-04-01 at 10:35 -0600, Oleg Drokin wrote:
> Hello!
>
> On Apr 1, 2009, at 11:58 AM, Lee Ward wrote:
> > I think my point was that there is already a priority scheme in the
> > Seastar. Are there additional bits related to priority that you might
> > use, also?
>
> But if we cannot use it, there is none.
> Like we want mpi rpcs go out first to some degree.
If you don't want to follow up, I'm ok with that. It's up to you.
I understood what you want. There are at least two things I can imagine
that would better the situation without trying to leverage something in
the network, itself.
1) Partition the adapter CAM so that there is always room to accommodate
a user-space receive.
2) Prioritize injection to favor sends originating from user-space.
One or both of these might already be implemented. I don't know.
>
> >>> I strongly suspect OS jitter, probably related to FS activity, is a
> >>> much
> >>> more likely explanation for the above. If just one node has the
> >>> process/rank suspended then it can't service the barrier; All will
> >>> wait
> >>> until it can.
> >> That's of course right and possible too.
> >> Though given how nothing else is running on the nodes, I would think
> >> it is somewhat irrelevant, since there is nothing else to give
> >> resources to.
> > How and where memory is used on two nodes is different. How, where,
>
> That's irrelevant.
>
> > when, scheduling occurs on two nodes is different. Any two nodes, even
> > running the same app with barrier synchronization, perform things at
> > different times outside of the barriers; They very quickly
> > desynchronize
> > in the presence of jitter.
>
> But since the only thing I have in my app inside barriers is write call,
> there is no much way to desynchronize.
Modify your test to report the length of time each node spent in the
barrier (not just rank 0, as it is written now) immediately after the
write call, then? If you are correct, they will all be roughly the same.
If they have desynchronized, most will have very long wait times but at
least one will be relatively short.
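A minimal sketch of why per-rank timings separate the two cases, assuming
an idealized barrier that releases every rank a fixed time after the last
one arrives. The function name and the arrival times are made up for
illustration; they are not measurements.

```python
def barrier_waits(arrivals, release_cost=0.0):
    """Per-rank time spent in an idealized barrier.

    arrivals: time (s) at which each rank enters the barrier.
    release_cost: assumed fixed cost of the barrier once every rank is in.
    """
    release = max(arrivals) + release_cost
    return [release - t for t in arrivals]


# Case 1: ranks arrive together, but the barrier itself is slow
# (e.g. its messages are stuck in a queue).
in_sync = barrier_waits([0.00, 0.02, 0.05, 0.03], release_cost=4.5)

# Case 2: one rank arrives late (desynchronized), the barrier itself is cheap.
out_of_sync = barrier_waits([0.00, 0.02, 4.50, 0.03], release_cost=0.05)

for name, waits in (("in sync", in_sync), ("desynchronized", out_of_sync)):
    print(f"{name:15s} min={min(waits):.2f}s max={max(waits):.2f}s")
```

In the first case every rank reports roughly the same long wait; in the
second, the late rank reports a short wait while all the others report a
long one.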
>
> >> The Lustre processing of the outgoing queue is pretty fast in
> >> itself at
> >> this phase.
> >> Do you think it would be useful if I just run 1 thread per node,
> >> there
> >> would be
> >> 3 empty cores to adsorb all the jitter there might be then?
> > You will still get jitter. I would hope less, though, so it wouldn't
> > hurt to try to leave at least one idle core. We've toyed with the idea
> > of leaving a core idle for IO and other background processing in the
> > past. The idea was a non-starter with our apps folks though. Maybe the
> > ORNL folks will feel differently?
>
> No, I do not think they would like the idea to forfeit 1/4 of their
> CPU just so io is better.
> If the jitter is due to cpu occupied with io, and apps stalled due to
> this
> (though I have hard time believing an app to be not given a cpu for
> 4.5 seconds,
> even though there are potentially 4 idle cpus, or even 3 (remember
> other cores are
> also idle waiting on a barrier).
Oh, I'm sure they're getting the CPU. They just won't come out of the
barrier until all have processed the operation. The rates at which the
nodes reach the barrier will be different. The rates at which they
proceed through will be different. The only invariant after a barrier is
that all the involved ranks *have* reached that point. Nothing about
when that happened is stated or implied.
>
> >>> Jitter gets a bad rap. Usually for good reason. However, in this
> >>> case,
> >>> it doesn't seem something to worry overly much about as it will
> >>> cease.
> >>> Your test says the 1st barrier after the write completes in 4.5 sec
> >>> and
> >>> the 2nd in 1.5 sec. That seems to imply the jitter is settling
> >>> pretty
> >>> rapidly. Jitter is really only bad when it is chronic.
> >> Well, 4.5*1200 = 1.5 hours of completely wasted cputime for my
> >> specific job.
> > That 1200 is the number of checkpoints? If so, I agree. If it's the
> > number of nodes, I do not.
>
> 1200 is number of cores waiting on a barrier.
> Every core spends 4.5 seconds == total wasted single-cpu core time is
> 1.5 hours.
It doesn't work that way. The barrier operation is implemented as a
collective on the Cray. What you are missing in the math above is that
every core waited during the *same* 4.5 second period. Total wasted time
is only 4.5 seconds then.
> And the more often this happens the worse.
>
> >> So I thought it would be a good idea to get to the root of it.
> >> We hear many arguments here at the lab that "what good the buffered
> >> io
> >> is for
> >> me when my app performance is degraded if I don't do sync. I'll
> >> just do
> >> the sync and be over with it". Of course I believe there is still
> >> benefit to not
> >> doing the sync, but that's just me.
> > If the time to settle the jitter is on the order of 10 seconds but it
> > takes 15 seconds to sync, it would be better to live with the jitter,
> > no? I suggested an experiment to make this comparison. Why argue with
> > them? just do the experiment and you can know which strategy is
> > better.
>
> I know which one is better. I did the experiment. (though I have no
> realistic way
> to measure when "jitter" settles out).
Which was better then? By how much? Were you just measuring a barrier or
do those numbers still work out when the app uses the network heavily
after doing its writes?
>
> >>> To me, you are worrying way too much about the situation immediately
> >>> after a write. Checkpoints are relatively rare, with long periods
> >>> between. Why worry about something that's only going to affect a
> >>> very
> >>> small portion of the overall job? As long as the jitter dissipates
> >>> in a
> >>> short time, things will work out fine.
> >> I worry abut it specifically because users tend to do sync after the
> >> write and that
> >> wastes a lot of time. So as a result - I want as much of data to
> >> enter
> >> into cache
> >> and then trickle out all by itself and I want users not to see any
> >> bad
> >> effects
> >> (or otherwise to show to them that there are still benefits).
> > Users tend to do sync for more reasons than making the IO
> > deterministic.
> > They should be doing it so that they can have some faith that the last
> > checkpoint is actually persistent when interrupted.
>
> For that they only need to do fsync before their next checkpoint,
> to make sure that the previous one completed.
>
> > However, they should do the sync right before they enter the IO phase,
> > in order to also get the benefits of write-back caching. Not after the
> > IO phase. In the event of an interrupt, this forces them to throw away
> > an in-progress checkpoint and the last one before that, to be safe,
> > but
> > the one before the last should be good.
>
> Right.
> Yet they do some microbenchmark and decide it is bad idea.
> Besides, reducing jitter, or whatever is the cause for the delays
> would still be useful.
You're making a wonderful argument for Catamount :)
>
> > In some cases, your app programmers will be unfortunately correct. An
> > app that uses so much memory that the system cannot buffer the entire
> > write will incur at least some issues while doing IO; Some of the IO
> > must move synchronously and that amount will differ from node to node.
> > This will have the effect of magnifying this post-IO jitter they are
> > so
> > worried about. It is also why I wrote in the original requirements for
>
> Why would it? There still is potentially a benefit for the available
> cache size.
In a fitted application, there is no useful amount of memory left over
for the cache. Using it, then, is just unnecessary overhead.
As I said, there's a very real possibility your app programmers are
correct. It goes beyond memory. Any resource under intense pressure due
to contention offers the possibility that it can take longer to perform
its requests independently than to serialize them. For instance, if an
app does not use all of memory then there is plenty of room for Lustre
to cache. Since these apps presumably are going to communicate after the
IO phase (why else the barrier after the IO?) then they will contend
heavily with the Lustre client for the network interface and that
interface does not deal well with such a situation on the Cray. I can
easily believe it would take longer for the app to get back to computing
because of the asynchronous network traffic from the write-back than it
would to just force the IO phase to complete, via fsync, and, after, do
what it needs to do to get back to work. If, instead, an app does use
all of the memory then it's blocked for a long time in the IO calls
waiting for a free buffer, before the sync. If, when, that happens then
the fsync is nearly a no-op as most of the dirty data have already been
written.
Were I an app programmer, I could easily come to the conclusion that the
fsync is either useful or does not hurt.
The only cooperative app I can think of that seems to be able to win
universally is the one structured to:
	for (;;) {
		barrier
		fsync
		checkpoint
		for (n = 0; n < TIME_STEPS_TWEEN_CHECKPOINT; n++) {
			compute
			communicate
		}
	}
I don't know any that work that way though :(
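For illustration only, here is a single-process sketch of that structure.
Everything in it is an assumption made for the example: the barrier is a
no-op stand-in for a real one (e.g. MPI_Barrier), "compute" and
"communicate" are reduced to a counter, and the names run, ckpt.N and
prev_fd are hypothetical. The point it shows is that the fsync issued at
the top of an iteration covers the previous checkpoint, which has had a
whole compute phase to trickle out of the cache.

```python
import os
import tempfile

TIME_STEPS_TWEEN_CHECKPOINT = 3  # illustrative value


def barrier():
    """Placeholder for a real barrier; a no-op in this sketch."""


def run(num_checkpoints, directory):
    prev_fd = None   # descriptor of the previous checkpoint, not yet synced
    synced = []      # checkpoints known to have been fsync'ed
    state = 0
    for ckpt in range(num_checkpoints):
        barrier()
        if prev_fd is not None:
            os.fsync(prev_fd)          # sync the previous checkpoint first
            os.close(prev_fd)
            synced.append(ckpt - 1)
        path = os.path.join(directory, f"ckpt.{ckpt}")
        prev_fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        os.write(prev_fd, str(state).encode())   # buffered checkpoint write
        for _ in range(TIME_STEPS_TWEEN_CHECKPOINT):
            state += 1                 # stands in for compute + communicate
    if prev_fd is not None:            # final sync at exit, added for the sketch
        os.fsync(prev_fd)
        os.close(prev_fd)
        synced.append(num_checkpoints - 1)
    return synced, state


with tempfile.TemporaryDirectory() as d:
    synced, state = run(4, d)
    print(synced, state)
```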
>
> > Lustre that if write-back caching is employed there must be a way to
> > turn it off.
>
> There is around 3 ways to do that that I am aware of.
That's nice. It was a requirement, after all. ;)
--Lee
>
> Bye,
> Oleg
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Lustre-devel] SeaStar message priority
2009-04-01 16:35 ` Oleg Drokin
2009-04-01 19:13 ` Lee Ward
@ 2009-04-01 19:15 ` Nicholas Henke
2009-04-01 19:26 ` Oleg Drokin
1 sibling, 1 reply; 15+ messages in thread
From: Nicholas Henke @ 2009-04-01 19:15 UTC (permalink / raw)
To: lustre-devel
Oleg Drokin wrote:
> Hello!
>
> On Apr 1, 2009, at 11:58 AM, Lee Ward wrote:
>> I think my point was that there is already a priority scheme in the
>> Seastar. Are there additional bits related to priority that you might
>> use, also?
>
> But if we cannot use it, there is none.
> Like we want mpi rpcs go out first to some degree.
If we have to deal with ordering - we are already sunk. The Lustre RPCs will go
out and affect MPI latency to some degree, introducing jitter into the calls and
affecting application performance.
>
> But since the only thing I have in my app inside barriers is write call,
> there is no much way to desynchronize.
Incorrect - you are running your app on all 4 CPUs on the node at the same time
Lustre is sending RPCs. The kernel threads will get scheduled and run, pushing
your app to the side and desynchronizing the barrier for the app as a whole.
> No, I do not think they would like the idea to forfeit 1/4 of their
> CPU just so io is better.
> If the jitter is due to cpu occupied with io, and apps stalled due to
> this
> (though I have hard time believing an app to be not given a cpu for
> 4.5 seconds,
> even though there are potentially 4 idle cpus, or even 3 (remember
> other cores are
> also idle waiting on a barrier).
This gets easier to swallow in the future with 12-core and larger nodes - 1/12 is
much easier to sacrifice.
What we really need to "prove" is where the delay is occurring. The MPI_Barrier
messages are 0-byte sends, effectively turning them into Portals headers and
these are sent and processed very fast. In fact, the total amount of data being
sent is _much_ less than the NIC is capable of. A rough estimate for 2 nodes
talking to each other is 1700 MB/s and 50K lnet pings/s.
One thing to try is changing your aprun to use fewer CPUs per node:
aprun -n 1200 -N [1,2,3] -cc 1-3.
The -cc 1-3 will keep it off cpu 0 - a known location for some IRQs and other
servicing.
You should also try to capture compute-node stats like cpu usage, # of threads
active during barrier, etc to help narrow down where the time is going.
Nic
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Lustre-devel] SeaStar message priority
2009-04-01 19:15 ` Nicholas Henke
@ 2009-04-01 19:26 ` Oleg Drokin
0 siblings, 0 replies; 15+ messages in thread
From: Oleg Drokin @ 2009-04-01 19:26 UTC (permalink / raw)
To: lustre-devel
Hello!
On Apr 1, 2009, at 3:15 PM, Nicholas Henke wrote:
>> But since the only thing I have in my app inside barriers is write
>> call,
>> there is no much way to desynchronize.
> Incorrect - you are running your app on all 4 CPUs on the node at
> the same time Lustre is sending RPCs. The kernel threads will get
> scheduled and run, pushing your app to the side and desynchronizing
> the barrier for the app as a whole.
But I am measuring each write, and I see that none of them
significantly exceeds 0.5 seconds.
Say the difference between them is 0.1 seconds.
Then 4.5 seconds - 0.1 seconds for the write time difference = 4.4
seconds still unaccounted for.
> What we really need to "prove" is where the delay is occurring. The
> MPI_Barrier messages are 0-byte sends, effectively turning them into
> Portals headers and these are sent and processed very fast. In fact,
> the total amount of data being sent is _much_ less than the NIC is
> capable of. A rough estimate for 2 nodes talking to each other is
> 1700 MB/s and 50K lnet pings/s.
Yes. I understand this point.
> One thing to try is changing your aprun to use fewer CPUs per node:
> aprun -n 1200 -N [1,2,3] -cc 1-3.
I just ran with 1 CPU per node, 1200 threads, leaving 3 CPUs/cores per
node for the kernel and whatnot.
The actual write syscall return time decreased, but the barrier time
did not, even though we know that less data is in flight at any given
time now (due to only 16 OSTs accessed per node, not 16*4).
So something is going on, but I do not think we can blindly attribute
it to just "ah, the kernel ate your CPU for important things like
pushing the data".
0: barrier after write time: 4.528383 sec
0: barrier after write 2 time: 4.043252 sec
The pre-write barrier took only 0.096675 sec (to rule out general
network congestion).
Bye,
Oleg
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Lustre-devel] SeaStar message priority
2009-04-01 19:13 ` Lee Ward
@ 2009-04-01 20:17 ` Oleg Drokin
2009-04-02 2:46 ` Oleg Drokin
0 siblings, 1 reply; 15+ messages in thread
From: Oleg Drokin @ 2009-04-01 20:17 UTC (permalink / raw)
To: lustre-devel
Hello!
On Apr 1, 2009, at 3:13 PM, Lee Ward wrote:
>> But if we cannot use it, there is none.
>> Like we want mpi rpcs go out first to some degree.
> If you don't want to follow up, I'm ok with that. It's up to you.
> I understood what you want. There are at least two things I can
> imagine
> that would better the situation without trying to leverage something
> in
> the network, itself.
> 1) Partition the adapter CAM so that there is always room to
> accommodate
> a user-space receive.
I cannot really comment on this option.
> 2) Prioritize injection to favor sends originating from user-space.
This is what I am talking about, actually; perhaps I was not able to
explain myself clearly. Except that perhaps just "userspace" is too
generic and somewhat more fine-grained controls would be more
beneficial.
> One or both of these might already be implemented. I don't know.
The second option does not look like it is implemented.
>>> when, scheduling occurs on two nodes is different. Any two nodes,
>>> even
>>> running the same app with barrier synchronization, perform things at
>>> different times outside of the barriers; They very quickly
>>> desynchronize
>>> in the presence of jitter.
>> But since the only thing I have in my app inside barriers is write
>> call,
>> there is no much way to desynchronize.
> Modify your test to report the length of time each node spent in the
> barrier (not just rank 0, as it is written now) immediately after the
> write call, then? If you are correct, they will all be roughly the
> same.
> If they have desynchronized, most will have very long wait times but
> at
> least one will be relatively short.
That's a fair point. I just scheduled the run.
> Oh, I'm sure they're getting the CPU. They just won't come out of the
> barrier until all have processed the operation. The rates at which the
> nodes reach the barrier will be different. The rates at which they
I believe the rates at which they come to the barrier are approximately
the same.
I do time the write system call, and the barrier comes right after it.
The write system call has relatively small variability in time, so we
can assume that all barriers start within 0.1 seconds of each other.
> proceed through will be different. The only invariant after a
> barrier is
> that all the involved ranks *have* reached that point. Nothing about
> when that happened is stated or implied.
Ok, I did not realize that, though that makes sense.
I believe that in my test the problem is on the sending side, i.e. the
bottleneck keeps the nodes from reporting promptly that the point was
reached by every thread.
But as soon as all nodes have gathered, whatever control node there is
sends the messages (which of course could be delayed in the queue if it
is also doing the IO; hm, I wonder what node, or set of nodes,
coordinates this. Rank 0?), and once injected, they should be processed
instantly since we do not have any significant incoming traffic on the
nodes.
Don't take my word for it of course, the test is already scheduled and
I'll share the results.
>>>>> Jitter gets a bad rap. Usually for good reason. However, in this
>>>>> case,
>>>>> it doesn't seem something to worry overly much about as it will
>>>>> cease.
>>>>> Your test says the 1st barrier after the write completes in 4.5
>>>>> sec
>>>>> and
>>>>> the 2nd in 1.5 sec. That seems to imply the jitter is settling
>>>>> pretty
>>>>> rapidly. Jitter is really only bad when it is chronic.
>>>> Well, 4.5*1200 = 1.5 hours of completely wasted cputime for my
>>>> specific job.
>>> That 1200 is the number of checkpoints? If so, I agree. If it's the
>>> number of nodes, I do not.
>> 1200 is number of cores waiting on a barrier.
>> Every core spends 4.5 seconds == total wasted single-cpu core time is
>> 1.5 hours.
> It doesn't work that way. The barrier operation is implemented as a
> collective on the Cray. What you are missing in the math above is that
> every core waited during the *same* 4.5 second period. Total wasted
> time
> is only 4.5 seconds then.
I have a feeling we are talking about different things here.
You are talking about wall-clock time. I am talking about total CPU
cycles wasted across all nodes.
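To make the two readings concrete, here is the arithmetic with the
numbers from this thread (the variable names are just for illustration):

```python
cores = 1200
barrier_wait_s = 4.5

# Wall-clock view: every core waits during the same 4.5 second interval.
wall_clock_s = barrier_wait_s

# Aggregate view: idle time summed over all cores in the job.
core_seconds = cores * barrier_wait_s
core_hours = core_seconds / 3600

print(f"wall-clock: {wall_clock_s} s")
print(f"aggregate:  {core_seconds:.0f} core-seconds = {core_hours} core-hours")
```

Both figures describe the same event; which one matters depends on
whether the job is judged by time to completion or by allocation used.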
>>> If the time to settle the jitter is on the order of 10 seconds but
>>> it
>>> takes 15 seconds to sync, it would be better to live with the
>>> jitter,
>>> no? I suggested an experiment to make this comparison. Why argue
>>> with
>>> them? just do the experiment and you can know which strategy is
>>> better.
>> I know which one is better. I did the experiment. (though I have no
>> realistic way
>> to measure when "jitter" settles out).
> Which was better then? By how much? Were you just measuring a
> barrier or
> do those numbers still work out when the app uses the network heavily
> after doing it's writes?
Unfortunately I do not have any real applications instrumented, so my
barrier is a substitute for "network-heavy app activity". I started
with it because the app programmers I spoke with complained about how
their network latency is affected if they do buffered writes.
The fsync takes upward of 10 seconds, depending on other load in the
system, I guess. I have no easy way to measure the jitter.
I do not think writeout with or without fsync would take significantly
different time, because the underlying IO paths don't change, but
that's non-scientific.
Unfortunately it is not enough to just do a write, time the fsync, then
do the write again, wait the same amount of time as the fsync took, do
another fsync and see if it returns instantly: Lustre only eagerly
writes out 1M chunks, and VM pressure only ensures that data older than
30 seconds gets pushed out. And that is before taking into account
possible variability of the load on the OSTs over time (since I cannot
have the entire Jaguar all to myself).
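For reference, the basic measurement behind these numbers (time the
buffered write, then time the fsync separately) can be sketched as below.
This is not the actual test program; the function name, file name and the
4 MiB size are placeholders, and on a local filesystem the timings say
nothing about Lustre.

```python
import os
import tempfile
import time


def time_write_and_fsync(path, size):
    """Return (write_seconds, fsync_seconds, bytes_written) for one buffered write."""
    data = memoryview(bytes(size))
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        t0 = time.perf_counter()
        written = 0
        while written < size:          # os.write may write fewer bytes than asked
            written += os.write(fd, data[written:])
        t1 = time.perf_counter()       # data is, at best, in the page cache here
        os.fsync(fd)                   # blocks until the kernel reports it flushed
        t2 = time.perf_counter()
    finally:
        os.close(fd)
    return t1 - t0, t2 - t1, written


with tempfile.TemporaryDirectory() as tmp:
    write_s, fsync_s, nbytes = time_write_and_fsync(
        os.path.join(tmp, "out.dat"), 4 << 20)
    print(f"wrote {nbytes} bytes: write() {write_s:.4f}s, fsync() {fsync_s:.4f}s")
```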
>>> However, they should do the sync right before they enter the IO
>>> phase,
>>> in order to also get the benefits of write-back caching. Not after
>>> the
>>> IO phase. In the event of an interrupt, this forces them to throw
>>> away
>>> an in-progress checkpoint and the last one before that, to be safe,
>>> but
>>> the one before the last should be good.
>>
>> Right.
>> Yet they do some microbenchmark and decide it is bad idea.
>> Besides, reducing jitter, or whatever is the cause for the delays
>> would still be useful.
> You're making a wonderful argument for Catamount :)
Actually, Catamount definitely has its strong points, but there are
drawbacks as well. With Linux it's just another set of benefits and
drawbacks.
>>> In some cases, your app programmers will be unfortunately correct.
>>> An
>>> app that uses so much memory that the system cannot buffer the
>>> entire
>>> write will incur at least some issues while doing IO; Some of the IO
>>> must move synchronously and that amount will differ from node to
>>> node.
>>> This will have the effect of magnifying this post-IO jitter they are
>>> so
>>> worried about. It is also why I wrote in the original requirements
>>> for
>> Why would it? There still is potentially a benefit for the available
>> cache size.
> In a fitted application, there is no useful amount of memory left over
> for the cache. Using it, then, is just unnecessary overhead.
Right. In this case it is even better to do non-caching io (directio
style)
to reduce the memory copy overhead as well.
> As I said, there's a very real possibility your app programmers are
> correct. It goes beyond memory. Any resource under intense pressure
> due
> to contention offers the possibility that it can take longer to
> perform
> it's requests independently than to serialize them. For instance, if
> an
> app does not use all of memory then there is plenty of room for Lustre
> to cache. Since these apps presumably are going to communicate after
> the
> IO phase (why else the barrier after the IO?) then they will contend
> heavily with the Lustre client for the network interface and that
> interface does not deal well with such a situation on the Cray. I can
> easily believe it would take longer for the app to get back to
> computing
> because of the asynchronous network traffic from the write-back than
> it
> would to just force the IO phase to complete, via fsync, and, after,
> do
> what it needs to do to get back to work. If, instead, an app does use
> all of the memory then it's blocked for a long time in the IO calls
> waiting for a free buffer, before the sync. If, when, that happens
> then
> the fsync is nearly a no-op as most of the dirty data have already
> been
> written.
This is all very true.
Currently I am only focusing on applications that do leave enough space
for the fs cache, since that's where the possible benefit is and there
is
no drawback for applications that don't do cached io. (And this is the
case for the app programmers I spoke with).
> The only cooperative app I can think of that seems to be able to win
> universally is the one structured to:
>
> for (;;) {
> barrier
> fsync
> checkpoint
> for (n = 0; n < TIME_STEPS_TWEEN_CHECKPOINT; n++) {
> compute
> communicate
> }
> }
> I don't know any that work that way though :(
We here at ORNL are trying hard to convince app programmers that this is
indeed beneficial.
Unfortunately it is not all that clear-cut; part of the problem is that
the machine itself behaves differently every time, due to all the
different workloads going on.
Of course we stand in our own way too, with the default 32 MB limit of
dirty cache per OSC: in order to get a meaningful cache size we need to
stripe the files over way too many OSTs, and as a result the overall IO
performance is degraded compared to the case of every job just writing
to a single OST, due to reduced randomness in the IO pattern.
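A rough worked example of that trade-off, using the 250M per-core write
from the start of this thread and the 16 OSTs per node mentioned earlier.
It assumes the per-OSC dirty limit is the only constraint and that a
writer's stripes do not share OSCs with other writers on the same node,
which may not hold in practice.

```python
import math

write_per_core_mb = 250       # per-core write size from the original test
dirty_limit_per_osc_mb = 32   # default per-OSC dirty cache limit

# Fewest OSCs (stripes on distinct OSTs) whose combined limit covers the write.
min_oscs = math.ceil(write_per_core_mb / dirty_limit_per_osc_mb)

# Dirty data that can be cached when a file is striped over 16 OSTs.
cache_with_16_osts_mb = 16 * dirty_limit_per_osc_mb

print(f"need at least {min_oscs} OSCs to cache {write_per_core_mb} MB")
print(f"16 OSTs allow up to {cache_with_16_osts_mb} MB of dirty data")
```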
Bye,
Oleg
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Lustre-devel] SeaStar message priority
2009-04-01 20:17 ` Oleg Drokin
@ 2009-04-02 2:46 ` Oleg Drokin
2009-04-02 4:28 ` Lee Ward
0 siblings, 1 reply; 15+ messages in thread
From: Oleg Drokin @ 2009-04-02 2:46 UTC (permalink / raw)
To: lustre-devel
Hello!
On Apr 1, 2009, at 4:17 PM, Oleg Drokin wrote:
>>>> when, scheduling occurs on two nodes is different. Any two nodes,
>>>> even
>>>> running the same app with barrier synchronization, perform things
>>>> at
>>>> different times outside of the barriers; They very quickly
>>>> desynchronize
>>>> in the presence of jitter.
>>> But since the only thing I have in my app inside barriers is write
>>> call,
>>> there is no much way to desynchronize.
>> Modify your test to report the length of time each node spent in the
>> barrier (not just rank 0, as it is written now) immediately after the
>> write call, then? If you are correct, they will all be roughly the
>> same.
>> If they have desynchronized, most will have very long wait times
>> but at
>> least one will be relatively short.
> That's a fair point. I just scheduled the run.
Ok.
The results are in. I scheduled 2 runs: one at 4 threads/node and one
at 1 thread/node.
For the 4 threads/node case the 1st barrier took anywhere from 1.497
sec to 3.025 sec, with rank 0 reporting 1.627 sec.
The second barrier took 0.916 to 2.758 seconds, with rank 0 reporting
1.992 sec.
For barrier 2 I can actually clearly observe that threads finish in
groups of 4 with very close times, and the rank numbers suggest those
threads are on the same nodes. On the 1st barrier this trend is much
less visible, though.
In the 1 thread/node case the fastest 1st barrier was 7.515 seconds and
the slowest was 10.176.
For the 2nd barrier, the fastest was 0.085 and the slowest 2.756, which
is pretty close to the difference between the fastest and slowest 1st
barrier. Since the amount of data written per node in this case is 4
times smaller, I guess we just flushed all the data to disk before the
1st barrier finished, and the difference in waiting was due to the
differences in start times.
As you can see, the numbers tend to jump around, but there are still
relatively big delays due to something other than just threads getting
out of sync.
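The spreads are easy to check from the fastest/slowest figures above (the
labels are only for illustration):

```python
# (fastest, slowest) barrier times in seconds, as reported above.
runs = {
    "4 threads/node, barrier 1": (1.497, 3.025),
    "4 threads/node, barrier 2": (0.916, 2.758),
    "1 thread/node, barrier 1": (7.515, 10.176),
    "1 thread/node, barrier 2": (0.085, 2.756),
}

# Spread = slowest - fastest, rounded to the precision of the inputs.
spread = {name: round(slow - fast, 3) for name, (fast, slow) in runs.items()}

for name, s in spread.items():
    print(f"{name}: spread {s:.3f} s")
```

In the 1 thread/node run, the slowest 2nd barrier (2.756 s) is indeed
close to the 1st-barrier spread (2.661 s).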
Bye,
Oleg
^ permalink raw reply [flat|nested] 15+ messages in thread
* [Lustre-devel] SeaStar message priority
2009-04-02 2:46 ` Oleg Drokin
@ 2009-04-02 4:28 ` Lee Ward
0 siblings, 0 replies; 15+ messages in thread
From: Lee Ward @ 2009-04-02 4:28 UTC (permalink / raw)
To: lustre-devel
On Wed, 2009-04-01 at 20:46 -0600, Oleg Drokin wrote:
> Hello!
>
> On Apr 1, 2009, at 4:17 PM, Oleg Drokin wrote:
>
> >>>> when, scheduling occurs on two nodes is different. Any two nodes, even
> >>>> running the same app with barrier synchronization, perform things at
> >>>> different times outside of the barriers; they very quickly desynchronize
> >>>> in the presence of jitter.
> >>> But since the only thing I have in my app inside barriers is the write
> >>> call, there is not much way to desynchronize.
> >> Modify your test to report the length of time each node spent in the
> >> barrier (not just rank 0, as it is written now) immediately after the
> >> write call, then? If you are correct, they will all be roughly the same.
> >> If they have desynchronized, most will have very long wait times but at
> >> least one will be relatively short.
> > That's a fair point. I just scheduled the run.
>
> Ok.
> The results are in. I scheduled 2 runs. One at 4 threads/node and one
> at 1 thread/node.
>
> For the 4 threads/node case the 1st barrier took anywhere from 1.497 sec
> to 3.025 sec with rank 0 reporting 1.627 sec.
> The second barrier took 0.916 to 2.758 seconds with rank 0 reporting
> 1.992 sec.
> For barrier 2 I can clearly observe that threads terminate in groups of 4
> with very close times, and the ranks suggest those nids are on the same
> nodes. On the 1st barrier this trend is much less visible, though.
>
> In the 1 thread/node case the fastest 1st barrier was 7.515 seconds and the
> slowest was 10.176.
> For the 2nd barrier, the fastest was 0.085 and the slowest 2.756, which is
> pretty close to the difference between the fastest and slowest 1st barrier.
> Since the amount of data written per node in this case is 4 times smaller, I
> guess we just flushed all the data to disk before the 1st barrier finished,
> and the difference in waiting was due to the differences in start times.
>
> As you can see, the numbers tend to jump around, but there are still
> relatively big delays due to something other than just threads getting
> out of sync.
Agreed. It's something more than simple jitter.
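One way to read the per-rank numbers quoted above, as a rough sketch only: if the delay were purely desynchronization, the last rank to arrive would leave the barrier almost at once, so the shortest wait would be close to zero. The helper name `barrier_summary` and the 0.1 sec threshold are assumptions made for illustration, not part of Oleg's test, and the inputs are just the extremes reported above rather than the full per-rank data.

```python
def barrier_summary(waits, near_zero=0.1):
    """Summarize per-rank barrier wait times (seconds).

    near_zero is an assumed threshold for "left the barrier almost at once".
    """
    fastest = min(waits)
    slowest = max(waits)
    return {
        "fastest": fastest,
        "slowest": slowest,
        "spread": slowest - fastest,
        # Pure desynchronization predicts that the last arrival barely waits.
        "desync_alone_plausible": fastest <= near_zero,
    }

# Extremes and rank 0 reported for the 4 threads/node run, 1st barrier.
first_4tpn = barrier_summary([1.497, 1.627, 3.025])
# Extremes reported for the 1 thread/node run, 2nd barrier.
second_1tpn = barrier_summary([0.085, 2.756])
```

With these inputs the 4 threads/node 1st barrier has a floor of 1.497 sec, which skewed arrival times alone would not produce, while the 1 thread/node 2nd barrier (0.085 sec floor) fits the reading that it mostly reflects differences in start times.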
From everything you have described, the nodes are otherwise idle. The
only other thing I can think of, then, would be one or more Lustre
client threads injecting traffic into the network, which is where you
started.
A useful test might be to grab the MPI ping-pong from the test suite and
modify it to slow it down a bit, say to 4 times a second. Augment it to
report the ping-pong time and a time stamp. Augment your existing test
to report time stamps for the beginning of the write call. Launch one
of each on your set of nodes, i.e., each node has both your write
test and the ping-pong running at the same time. This presumes you can
launch two MPI jobs onto your set of nodes. If not, come up with an
equivalent that is supported?
If the ping-pong latency goes way up at the write calls you can claim a
correlation. Not definitive as correlation does not equal cause but it
is pretty strong.
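Once both logs exist, the comparison could be sketched roughly as below. Everything here is an assumption for illustration: the function names, the idea of a fixed `window` after each write start, and the requirement that both jobs stamp their output from comparable clocks. The sample numbers are made up to exercise the code and are not measurements.

```python
from statistics import mean

def split_by_write_windows(samples, write_starts, window):
    """Split ping-pong latencies by whether they fall in a write window.

    samples: list of (timestamp, latency) pairs from the ping-pong job.
    write_starts: time stamps logged at the beginning of each write call.
    window: assumed seconds after a write start during which flush
            traffic may still be on the wire (a guess, tune per run).
    """
    during, outside = [], []
    for ts, latency in samples:
        in_window = any(start <= ts < start + window for start in write_starts)
        (during if in_window else outside).append(latency)
    return during, outside

def latency_ratio(samples, write_starts, window):
    """Mean latency inside write windows divided by mean latency outside."""
    during, outside = split_by_write_windows(samples, write_starts, window)
    if not during or not outside:
        return None  # not enough data on one side to compare
    return mean(during) / mean(outside)

# Made-up samples at 4 per second, with one write starting at t = 1.0.
samples = [(0.00, 10e-6), (0.25, 11e-6), (0.50, 9e-6), (0.75, 10e-6),
           (1.00, 400e-6), (1.25, 380e-6), (1.50, 420e-6), (1.75, 10e-6)]
ratio = latency_ratio(samples, write_starts=[1.0], window=0.75)
```

A ratio far above 1 would be the "latency goes way up at the write calls" signal; a ratio near 1 would argue against contention on the adapter.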
If there is correlation, it means Cray has kind of messed up the portals
implementation. The portals implementation would be attempting to send
*everything* in order. All portals needs is for traffic to go in order
per nid and pid pair. An implementation is free to mix in unrelated
traffic, and should, to prevent one process from starving others.
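To make the ordering point concrete, here is a toy model, not the SeaStar firmware or Cray's driver: messages stay FIFO within each (nid, pid) stream, and the sender round-robins across streams. The class name `FairSendQueue` and its methods are hypothetical.

```python
from collections import OrderedDict, deque

class FairSendQueue:
    """Toy send queue: FIFO per (nid, pid) stream, round-robin across streams."""

    def __init__(self):
        self._streams = OrderedDict()  # (nid, pid) -> deque of messages

    def enqueue(self, nid, pid, message):
        self._streams.setdefault((nid, pid), deque()).append(message)

    def next_message(self):
        """Return (key, message) for the next send, or None if idle."""
        if not self._streams:
            return None
        key, queue = next(iter(self._streams.items()))
        message = queue.popleft()
        del self._streams[key]
        if queue:
            # Re-append at the back so other streams get a turn first.
            self._streams[key] = queue
        return key, message

q = FairSendQueue()
for bulk in ("bulk0", "bulk1", "bulk2"):
    q.enqueue(10, 1, bulk)          # bulk IO to one target
q.enqueue(20, 2, "barrier")         # small message to another target
order = [q.next_message() for _ in range(4)]
```

In this model the barrier message goes out second instead of last, while the three bulk messages still leave in their original order.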
An idea... Does the Lustre service side restrict the number of
simultaneous get operations it issues? I don't just mean to a particular
client, but to all from a single server, be it OST or MDS. If not,
consider it. If there are too many outstanding receives an arriving
message may miss the corresponding CAM entry due to a flush. What
happens after that can't be pretty. At one time, it caused the client to
resend. Does it still? If so, and resends are occurring, the affected
clients have their bandwidth reduced by more than 50% for the affected
operations. Since there is a barrier operation stuck behind it, well...
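If such a restriction does not exist, the idea could look something like the following sketch. This is not Lustre code and says nothing about what the servers do today; `GetLimiter`, `max_outstanding`, and the method names are all hypothetical.

```python
from collections import deque

class GetLimiter:
    """Toy cap on the number of get operations a server has in flight."""

    def __init__(self, max_outstanding):
        self.max_outstanding = max_outstanding  # assumed tunable, not a Lustre option
        self.outstanding = 0
        self.waiting = deque()

    def submit(self, get_id):
        """Issue get_id now if under the cap, else hold it. Returns True if issued."""
        if self.outstanding < self.max_outstanding:
            self.outstanding += 1
            return True
        self.waiting.append(get_id)
        return False

    def complete(self):
        """Record one finished get; return the held get issued in its place, if any."""
        self.outstanding -= 1
        if self.waiting:
            self.outstanding += 1
            return self.waiting.popleft()
        return None

limiter = GetLimiter(max_outstanding=2)
issued = [limiter.submit(g) for g in ("a", "b", "c")]
released = limiter.complete()
```

Here the third get is held until one of the first two completes, so the number of receives outstanding across all clients never exceeds the cap.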
Mr. Booth has suggested that the portals client might offer to send less
data per transfer. This would allow latency sensitive sends to reach the
front of the queue more quickly. It would also, I think, lower overall
throughput. It's an idea worth considering but is a case of two evils.
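The trade-off can be put in rough numbers. The sketch below assumes a simple model in which a small message waits behind at most one in-flight transfer and every transfer pays a fixed setup cost; the link rate and overhead are placeholder values, not measured SeaStar figures, and the 250M per core from the original test is taken as 250 MiB.

```python
def head_of_line_wait(transfer_bytes, link_bytes_per_sec):
    """Worst-case seconds a small message waits behind one in-flight transfer."""
    return transfer_bytes / link_bytes_per_sec

def transfer_time(total_bytes, chunk_bytes, link_bytes_per_sec, per_transfer_overhead):
    """Time to move total_bytes in chunks, paying a fixed cost per chunk."""
    chunks = -(-total_bytes // chunk_bytes)  # ceiling division
    return total_bytes / link_bytes_per_sec + chunks * per_transfer_overhead

LINK = 1_000_000_000      # assumed 1 GB/s, not a measured SeaStar figure
OVERHEAD = 20e-6          # assumed fixed cost per transfer, in seconds
TOTAL = 250 * 2**20       # the 250M written per core, taken as 250 MiB

for chunk in (2**20, 64 * 2**10):
    print(chunk, head_of_line_wait(chunk, LINK),
          transfer_time(TOTAL, chunk, LINK, OVERHEAD))
```

Under these assumptions the smaller chunk cuts the worst-case wait by the ratio of the chunk sizes, while the total time for the bulk data grows with the number of transfers, which is the two-evils trade described above.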
Can this be mitigated by peeking at the portals send queue in some way?
If Lustre can identify outbound traffic in the queue that it didn't
present then it could respond as Mr. Booth has suggested or back off on
the rate at which it presents traffic, or both even? Initial latencies
would be unchanged but would get better as the app did more
communication, especially if it used the one-sided calls and overlapped
them.
I'm sorry, if it's contention for the adapter I don't see a workaround
without changing Lustre or Cray changing the driver to more fairly
service the independent streams.
In any case, right now, your apps guys' suspicions probably have merit if
it is indeed contention on the network adapter. They may really be
better off forcing the IO to complete before moving to the next phase if
that phase involves the network. How sad.
You do need to do the test, though, before you try to "fix" anything.
Right now, it's only supposition that contention for the network adapter
is the evil here.
--Lee
>
> Bye,
> Oleg
>