From: Shawn Bohrer <shawn.bohrer@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: xfs@oss.sgi.com
Subject: Re: High latencies writing to a memory mapped file
Date: Tue, 3 Aug 2010 17:03:53 -0500
Message-ID: <20100803220336.GA7282@BohrerMBP.rgmadvisors.com>
In-Reply-To: <20100726232224.GJ7362@dastard>
Hey Dave,
Thanks for the suggestions.
On Tue, Jul 27, 2010 at 09:22:24AM +1000, Dave Chinner wrote:
> On Mon, Jul 26, 2010 at 05:09:09PM -0500, Shawn Bohrer wrote:
> > On Fri, Jul 23, 2010 at 09:28:46AM +1000, Dave Chinner wrote:
> > > On Thu, Jul 22, 2010 at 09:47:06AM -0500, Shawn Bohrer wrote:
> > > > Hello,
> > > >
> > > > We have an application that receives data over the network, and then
> > > > writes data out to a shared memory mapped file on xfs. Previously
> > > > everything worked great when we were running a CentOS 5.3 kernel
> > > > (2.6.18-128.7.1.el5), but when we upgraded to the newer CentOS 5.4 or
> > > > 5.5 kernels we started to see occasional spikes (up to 1 second) in
> > > > the time it takes to write the data to the memory mapped file. I
> > > > noticed that between the CentOS 5.3 and 5.4 kernels the upstream xfs
> > > > changes were backported. Testing on 2.6.32.16 shows the same
> > > > occasional latency spikes.
> > > >
> > > > After doing some debugging on 2.6.32.16 I have found the cause of the
> > > > latency spikes. The spikes occur every time pdflush runs to flush the
> > > > dirty pages. Note our app only writes data at about 1.5 MB/s and we
> > > > are well below both the dirty_background_ratio and the dirty_ratio so
> > > > these flushes only cover pages that have been dirty for longer
> > > > than dirty_expire_centisecs.
> > > > Capturing one of the spikes with ftrace shows:
> > > >
> > > > ...
> > > > flush-8:0-601 [011] 1291592.127590: funcgraph_entry: | xfs_ilock() {
> > > > flush-8:0-601 [011] 1291592.127590: funcgraph_entry: | down_write() {
> > > > flush-8:0-601 [011] 1291592.127590: funcgraph_entry: 0.213 us | _cond_resched();
> > > > flush-8:0-601 [011] 1291592.127591: funcgraph_exit: 0.657 us | }
> > > > flush-8:0-601 [011] 1291592.127591: funcgraph_exit: 1.174 us | }
> > > > flush-8:0-601 [011] 1291592.127764: sched_switch: task flush-8:0:601 [120] (D) ==> swapper:0 [120]
> > > > ...
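
For reference, the writeback knobs involved live under /proc/sys/vm,
and a quick way to inspect them is below; the values shown are the
usual 2.6.32 defaults, so a given box may differ:

# grep . /proc/sys/vm/dirty_expire_centisecs \
      /proc/sys/vm/dirty_writeback_centisecs
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_writeback_centisecs:500

With those defaults the flusher wakes every 5 seconds and writes back
pages that have been dirty for more than 30 seconds.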
> > >
> > > What is the lead-in context to this hunk? i.e. how does xfs_ilock()
> > > get called? That will tell me if it is blocking on the iolock or the
> > > ilock...
> >
> > I don't have the lead-in context here. I'm using the function_graph
> > tracer and with stalls up to a second in length it easily overflowed
> > the ftrace ring buffer. Once I found my process was blocking in
> > xfs_ilock(), I limited my traces to include xfs_ilock, xfs_iunlock,
> > handle_mm_fault, and the scheduler events. If you are still
> > interested in seeing some more detail here I can try to collect some
> > more captures.
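
For reference, the filtering I used was along these lines; this is a
sketch from memory, assuming the standard debugfs mount, and exact
event paths may vary by kernel:

# cd /sys/kernel/debug/tracing
# echo function_graph > current_tracer
# echo 'xfs_ilock xfs_iunlock handle_mm_fault' > set_ftrace_filter
# echo 1 > events/sched/enable
# echo 1 > tracing_on

That keeps the trace down to the xfs lock functions, the fault
handler, and the scheduler events.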
>
> I'm not sure if the tracepoint buffer is the same as the ftrace
> buffer, but if it is you can increase the size by doing this sort of
> thing:
>
> # echo 32768 > /sys/kernel/debug/tracing/buffer_size_kb
>
> I often do that to capture long XFS traces during testing....
>
> [snip]
>
> > > Hmmm - perhaps the issue is that the transaction commit has not been
> > > able to get an iclog buffer to write the transaction into and so is
> > > sleeping waiting for an iclog IO to complete. That would be woken by
> > > the xfslogd, and then the transaction would write into the
> > > now-available iclog and unlock the inode.
> > >
> > > So, I'd say that in this case your application is waiting for log IO
> > > completion to occur. Nothing you can do to avoid that, but you can
> > > minimise the impact - increase the log buffer size via the mount
> > > option logbsize (i.e. -o logbsize=262144) to reduce the frequency of
> > > iclog buffer IOs and make it less likely the system will stall
> > > after running out of iclog space.
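
For anyone following the thread, the suggested mount would look
something like this; the device and mount point are illustrative,
and logbsize cannot be changed on a live remount, so the filesystem
has to be unmounted first:

# umount /data
# mount -o logbufs=8,logbsize=262144 /dev/sdb1 /data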
> > >
> > > The other possibility is that the log IO is being slowed by barriers
> > > being issued with the log writes. If you've got non-volatile caches
> > > on your storage, you can turn off barriers and that should help
> > > reduce log IO latency significantly.
> >
> > I've tried mounting with "noatime,nobarrier,logbufs=8,logbsize=262144"
> > but I don't notice much difference and still see the stalls.
>
> Ok. One other possibility came to mind - the transaction commit is
> blocking while locking the superblock buffer to apply free block
> count changes. Do you have the "lazy-count" option set? (xfs_info
> <mtpt> will tell you.) If it isn't set, transaction commit latency
> can be quite high if the sb buffer is being written to disk when the
> transaction tries to lock it; in that case you can use xfs_admin to
> set that bit (mkfs has defaulted to lazy-count=1 for some time now).
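
For the record, this is roughly what I ran; the device and mount
point are illustrative, and xfs_admin needs the filesystem unmounted:

# xfs_info /data | grep lazy-count
# umount /data
# xfs_admin -c 1 /dev/sdb1
# mount /dev/sdb1 /data
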
Setting lazy-count=1 looks like it has helped quite a bit to reduce
the frequency of the spikes. I am still seeing spikes
of up to a second, but I haven't had time to capture any more traces
to verify if they are the same issue. Overall though this may be good
enough.
Thanks,
Shawn