From mboxrd@z Thu Jan 1 00:00:00 1970
From: Eric Barton
Date: Wed, 01 Apr 2009 17:20:36 +0100
Subject: [Lustre-devel] SeaStar message priority
In-Reply-To: <1238601506.5091.65.camel@wheel>
References: <74EA92D6-25E8-42DA-A4FD-BDDCED233244@Sun.COM> <1238596005.5091.31.camel@wheel> <1238601506.5091.65.camel@wheel>
Message-ID: <004101c9b2e5$cc74ee80$655ecb80$@com>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: lustre-devel@lists.lustre.org

Lee,

I completely agree with your comments on measurement. I'd really, really like to see some.

Cheers,
Eric

> -----Original Message-----
> From: lustre-devel-bounces at lists.lustre.org [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf Of Lee Ward
> Sent: 01 April 2009 4:58 PM
> To: Oleg Drokin
> Cc: Lustre Development Mailing List
> Subject: Re: [Lustre-devel] SeaStar message priority
>
> On Wed, 2009-04-01 at 09:14 -0600, Oleg Drokin wrote:
> > Hello!
> >
> > On Apr 1, 2009, at 10:26 AM, Lee Ward wrote:
> > >> It came to my attention that seastar network does not implement
> > >> message priorities for various reasons.
> > > That is incorrect. The seastar network does implement at least one
> > > priority scheme based on age. It's not something an application can play
> > > with if I remember right.
> >
> > Well, then it's as good as none for our purposes, I think?
>
> Other than that traffic moves (only very roughly) in a fair manner and
> that packets from different nodes can arrive out of order, I guess.
>
> I think my point was that there is already a priority scheme in the
> Seastar. Are there additional bits related to priority that you might
> use, also?
>
> >
> > > I strongly suspect OS jitter, probably related to FS activity, is a much
> > > more likely explanation for the above. If just one node has the
> > > process/rank suspended then it can't service the barrier; All will wait
> > > until it can.
> >
> > That's of course right and possible too.
> > Though given how nothing else is running on the nodes, I would think
> > it is somewhat irrelevant, since there is nothing else to give
> > resources to.
>
> How and where memory is used on two nodes is different. How, where,
> when, scheduling occurs on two nodes is different. Any two nodes, even
> running the same app with barrier synchronization, perform things at
> different times outside of the barriers; They very quickly desynchronize
> in the presence of jitter.
>
> > The Lustre processing of the outgoing queue is pretty fast in itself at
> > this phase.
> > Do you think it would be useful if I just run 1 thread per node, there would be
> > 3 empty cores to absorb all the jitter there might be then?
>
> You will still get jitter. I would hope less, though, so it wouldn't
> hurt to try to leave at least one idle core. We've toyed with the idea
> of leaving a core idle for IO and other background processing in the
> past. The idea was a non-starter with our apps folks though. Maybe the
> ORNL folks will feel differently?
>
> >
> > > Jitter gets a bad rap. Usually for good reason. However, in this case,
> > > it doesn't seem something to worry overly much about as it will cease.
> > > Your test says the 1st barrier after the write completes in 4.5 sec and
> > > the 2nd in 1.5 sec. That seems to imply the jitter is settling pretty
> > > rapidly. Jitter is really only bad when it is chronic.
> >
> > Well, 4.5*1200 = 1.5 hours of completely wasted cputime for my
> > specific job.
>
> That 1200 is the number of checkpoints? If so, I agree. If it's the
> number of nodes, I do not.
>
> > So I thought it would be a good idea to get to the root of it.
> > We hear many arguments here at the lab that "what good the buffered io is for
> > me when my app performance is degraded if I don't do sync. I'll just do
> > the sync and be over with it".
> > Of course I believe there is still benefit to not
> > doing the sync, but that's just me.
>
> If the time to settle the jitter is on the order of 10 seconds but it
> takes 15 seconds to sync, it would be better to live with the jitter,
> no? I suggested an experiment to make this comparison. Why argue with
> them? Just do the experiment and you can know which strategy is better.
>
> >
> > > To me, you are worrying way too much about the situation immediately
> > > after a write. Checkpoints are relatively rare, with long periods
> > > between. Why worry about something that's only going to affect a very
> > > small portion of the overall job? As long as the jitter dissipates in a
> > > short time, things will work out fine.
> >
> > I worry about it specifically because users tend to do sync after the write and that
> > wastes a lot of time. So as a result - I want as much of data to enter into cache
> > and then trickle out all by itself and I want users not to see any bad effects
> > (or otherwise to show to them that there are still benefits).
>
> Users tend to do sync for more reasons than making the IO deterministic.
> They should be doing it so that they can have some faith that the last
> checkpoint is actually persistent when interrupted.
>
> However, they should do the sync right before they enter the IO phase,
> in order to also get the benefits of write-back caching. Not after the
> IO phase. In the event of an interrupt, this forces them to throw away
> an in-progress checkpoint and the last one before that, to be safe, but
> the one before the last should be good.
>
> The apps could also be more reasonable about their checkpoints, I've
> noticed. Often, for us anyway, the machine just behaves. If the app
> began by assuming the machine was unreliable but as it ran for longer
> and longer periods, it could (I argue should) allow the period between
> checkpoints to grow.
> If the idea is to make progress, as I'm told, then
> on a well behaved machine far fewer checkpoints are required. Most apps,
> though, just use a fixed period and waste a lot of time doing their
> checkpoints when the machine is being nice to them.
>
> >
> > > Maybe you could convince yourself of the efficacy of write-back caching
> > > in this scenario by altering the app to do an fsync() after the write
> > > phase on the node but before the barrier? If the app can get back to
> > > computing, even with the jitter-disrupted barrier, faster than it could
> > > by waiting for the outstanding dirty buffers to be flushed then it's a
> > > net win to just live with the jitter, no?
> >
> > I do not need to convince myself. It's the app programmers that are fixated
> > on "oh, look, my program is slower after the write if I do not do sync, I must
> > do sync!"
>
> Try the experiment. Show them the data. They are, in theory, reasoning
> people, right?
>
> In some cases, your app programmers will be unfortunately correct. An
> app that uses so much memory that the system cannot buffer the entire
> write will incur at least some issues while doing IO; Some of the IO
> must move synchronously and that amount will differ from node to node.
> This will have the effect of magnifying this post-IO jitter they are so
> worried about. It is also why I wrote in the original requirements for
> Lustre that if write-back caching is employed there must be a way to
> turn it off.
>
> If they aren't sizing their app for the node's physical memory, though,
> I would think that the experiment should show that write-back caching is
> a win.
>
> --Lee
>
> >
> > Bye,
> > Oleg
> >
> >
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
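
The arithmetic in the quoted thread can be written out as a rough sketch. The figures are the ones quoted above: Oleg's 4.5 second first barrier and his count of 1200, and Lee's hypothetical 10 seconds to settle the jitter against 15 seconds to sync. The function names are made up for illustration, and the per-checkpoint cost model is a simplifying assumption; real numbers would have to come from the measurement Lee asks for.

```python
def stall_hours(stall_seconds: float, occurrences: int) -> float:
    """Total stall time in hours if a stall of `stall_seconds`
    repeats `occurrences` times (e.g. once per checkpoint)."""
    return stall_seconds * occurrences / 3600.0


def cheaper_strategy(jitter_settle_seconds: float, sync_seconds: float) -> str:
    """Compare the per-checkpoint cost of living with post-write jitter
    against the cost of an explicit sync. Assumes (hypothetically) that
    each cost is paid once per checkpoint and nothing else differs."""
    if jitter_settle_seconds < sync_seconds:
        return "live with jitter"
    return "sync"


# Oleg's figure: a 4.5 s barrier, 1200 times over.
print(stall_hours(4.5, 1200))        # 1.5

# Lee's hypothetical: 10 s to settle versus 15 s to sync.
print(cheaper_strategy(10.0, 15.0))  # live with jitter
```

Whether the 1.5 hours is the right total depends on Lee's question about what the 1200 counts: the multiplication only gives wall-clock time lost if the stall recurs 1200 times in sequence.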