public inbox for linux-xfs@vger.kernel.org
From: Dave Chinner <david@fromorbit.com>
To: Shawn Bohrer <shawn.bohrer@gmail.com>
Cc: xfs@oss.sgi.com
Subject: Re: Latencies writing to memory mapped files
Date: Wed, 22 Sep 2010 09:15:31 +1000	[thread overview]
Message-ID: <20100921231531.GE2614@dastard> (raw)
In-Reply-To: <20100921180541.GA6877@BohrerMBP.rgmadvisors.com>

On Tue, Sep 21, 2010 at 01:05:41PM -0500, Shawn Bohrer wrote:
> On Tue, Sep 21, 2010 at 08:48:33AM +1000, Dave Chinner wrote:
> > On Mon, Sep 20, 2010 at 05:17:26PM -0500, Shawn Bohrer wrote:
> > > On Mon, Sep 20, 2010 at 10:05:35AM +1000, Dave Chinner wrote:
> > > > FWIW, if you can work out the magic necessary (perhaps systemtap) it
> > > > would be really interesting to know whether it is the same buffer
> > > > that is causing the latency every time (i.e. get the bp->b_bn field
> > > > from the metadata buffer). With the block number, we can use xfs_db
> > > > to dig out what the buffer actually is....
.....
> > Hmmm - it would be good to know which one produced the latency,
> > given there does not appear to be a pattern in the block numbers.
> 
> OK here is a little more information which may be relevant.  I've
> currently got 12 processes that read data from a socket and each write
> to a different memory mapped file.  The apps are only appending to the
> file, but they don't write the data in one sequential chunk so it is
> quite possible that it appears as random IO for a short time period.
> 
> This time I made some more modifications so that I would only capture
> the bp->b_bn when one of my processes sees a delay of 300ms or greater.
> There still is a small chance for false positives, but most of these
> should have caused real delays.  All of the following came from the
> same process:
> 
> [001] 81758.886627: _xfs_buf_find: xfs_buf_lock blocked for 871ms on bp->b_bn: 474487328
> [001] 81920.808163: _xfs_buf_find: xfs_buf_lock blocked for 1038ms on bp->b_bn: 474487568
> [001] 82933.428627: _xfs_buf_find: xfs_buf_lock blocked for 767ms on bp->b_bn: 542273864
> [000] 83048.558413: _xfs_buf_find: xfs_buf_lock blocked for 363ms on bp->b_bn: 474488080
> [001] 86717.251189: _xfs_buf_find: xfs_buf_lock blocked for 514ms on bp->b_bn: 474487560
> [001] 88466.635802: _xfs_buf_find: xfs_buf_lock blocked for 380ms on bp->b_bn: 542274544
> [000] 88972.122062: _xfs_buf_find: xfs_buf_lock blocked for 521ms on bp->b_bn: 474487560
> [001] 89271.988589: _xfs_buf_find: xfs_buf_lock blocked for 496ms on bp->b_bn: 474487560
> [001] 91046.566309: _xfs_buf_find: xfs_buf_lock blocked for 540ms on bp->b_bn: 474487560
> [001] 91047.280042: _xfs_buf_find: xfs_buf_lock blocked for 713ms on bp->b_bn: 542274216
> [001] 91212.458472: _xfs_buf_find: xfs_buf_lock blocked for 490ms on bp->b_bn: 542274544
> [001] 92397.667196: _xfs_buf_find: xfs_buf_lock blocked for 549ms on bp->b_bn: 474487560
> 
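That answers my earlier question about whether it is the same buffer
every time. Tallying the block numbers from your trace lines (a
throwaway Python sketch, numbers copied from above) shows one block
recurring:

```python
from collections import Counter

# bp->b_bn values pulled from the trace lines above, in order
b_bn = [474487328, 474487568, 542273864, 474488080, 474487560,
        542274544, 474487560, 474487560, 474487560, 542274216,
        542274544, 474487560]

# count how often each block number shows up
for bn, n in Counter(b_bn).most_common():
    print(bn, n)
```

474487560 shows up five times and 542274544 twice, so your process is
repeatedly blocking on a small set of buffers.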
> > > What do I need to do to convert the block numbers to a buffer?
> > 
> > # xfs_db -r -c "daddr 812730376" -c "print" <dev>
> > 
> > Will dump the sector at that address. That should be enough to tell
> > us what is in the buffer (by the magic number).
> 
> So I have no idea what I'm looking at but here is the output for the
> above numbers (duplicates removed):
> 
> xfs_db -r -c "daddr 474487328" -c "print" /dev/sda5
> 000: 424d4150 0000007f 00000000 07000082 00000000 07000092 00000000 0039a000
       ^^^^^^^^
        B M A P

#define XFS_BMAP_MAGIC  0x424d4150      /* 'BMAP' */

So these are inode extent btree blocks your application is getting
stuck on.  These only get written back as a result of either log
pressure (i.e.  tail pushing) or by the xfsbufd based on age. They
aren't actually flushed with the data because changes are logged.
IOWs, the writeback of the bmap btree blocks is asynchronous to any
operation that modifies them, so there's no direct connection
between modification and writeback.
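For the other block numbers you dumped, the same trick applies: look at
the first 32-bit word of the sector. A quick sketch (Python; the magic
values are from memory of the kernel's XFS headers, so treat the table
as an assumption to check against your tree):

```python
# Magic numbers as defined in the kernel XFS headers (values here are
# from memory -- verify against your kernel source before relying on them).
XFS_MAGIC = {
    0x424d4150: "BMAP  (bmap btree block)",
    0x41425442: "ABTB  (free space btree, keyed by block)",
    0x41425443: "ABTC  (free space btree, keyed by count)",
    0x49414254: "IABT  (inode allocation btree)",
    0x58414746: "XAGF  (AG free space header)",
    0x58414749: "XAGI  (AG inode header)",
}

def classify(first_word):
    """Identify a metadata block from the first hex word of an
    xfs_db -r -c "daddr N" -c print dump."""
    magic = int(first_word, 16)
    return XFS_MAGIC.get(magic, "unknown magic 0x%08x" % magic)

print(classify("424d4150"))  # the dump above -> bmap btree block
```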

I'm not sure that there is anything that can really be done to
prevent this. If the cause of writeback is age-based flushing on the
metadata buffers, you could try increasing the xfsbufd writeback age
so that only log pressure will cause them to be flushed.
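IIRC the knob on kernels of this vintage is the
fs.xfs.age_buffer_centisecs sysctl (default 1500, i.e. 15s), with
fs.xfs.xfsbufd_centisecs controlling how often xfsbufd wakes up --
check Documentation/filesystems/xfs.txt for your kernel. A hedged
sketch (the demo writes to a scratch file so it runs without root; on a
real system you would point it at the /proc path):

```python
import pathlib, tempfile

# Real path on a live system (needs root and the xfs module loaded):
AGE_SYSCTL = "/proc/sys/fs/xfs/age_buffer_centisecs"

def set_flush_age(centisecs, path=AGE_SYSCTL):
    """Raise the age at which xfsbufd flushes dirty metadata buffers."""
    pathlib.Path(path).write_text("%d\n" % centisecs)

# demo against a scratch file so the sketch runs anywhere
scratch = tempfile.NamedTemporaryFile(delete=False)
set_flush_age(72000, scratch.name)  # 12 minutes instead of the 15s default
print(pathlib.Path(scratch.name).read_text().strip())
```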

Alternatively, you could change your application to pre-fault pages
in an async thread so the latency of allocation during the page
fault is not taken by the main writer...
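A minimal sketch of that pre-fault idea in Python (your app is
presumably C, so treat this as pseudocode with working syntax; the file
name and sizes are invented): a background thread dirties each page in
the region the writer is about to append into, so the page fault -- and
any allocation behind it -- is taken off the hot path:

```python
import mmap, os, tempfile, threading

PAGE = mmap.PAGESIZE

def prefault(mm, start, length):
    # touch each page with a write so the fault (and any allocation
    # behind it) happens here, not in the latency-sensitive writer
    for off in range(start, min(start + length, len(mm)), PAGE):
        mm[off] = mm[off]

# stand-in for the real memory-mapped output file
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 16 * PAGE)
mm = mmap.mmap(fd, 16 * PAGE)

# fault the next chunk in ahead of the writer, off the hot path
t = threading.Thread(target=prefault, args=(mm, 0, 16 * PAGE))
t.start()
t.join()

mm[0:5] = b"hello"     # the writer now lands on already-resident pages
mm.flush()
mm.close()
os.close(fd)
```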

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


Thread overview: 10+ messages
2010-09-15 15:26 Latencies writing to memory mapped files Shawn Bohrer
2010-09-16  0:18 ` Dave Chinner
2010-09-17 15:45   ` Shawn Bohrer
2010-09-20  0:05     ` Dave Chinner
2010-09-20 22:17       ` Shawn Bohrer
2010-09-20 22:48         ` Dave Chinner
2010-09-21 18:05           ` Shawn Bohrer
2010-09-21 23:15             ` Dave Chinner [this message]
2010-09-29 15:31               ` Shawn Bohrer
2010-09-29 23:19                 ` Dave Chinner
