From: Brian Foster <bfoster@redhat.com>
To: Mao Cheng <chengmao2010@gmail.com>
Cc: david@fromorbit.com, linux-xfs@vger.kernel.org
Subject: Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
Date: Wed, 24 Oct 2018 08:11:13 -0400
Message-ID: <20181024121112.GB46681@bfoster>
In-Reply-To: <CAGiyNfBMjQPD0bxqHNMeUVZzr_uVdqRqMZpDOkVUEnenEobxNQ@mail.gmail.com>
On Wed, Oct 24, 2018 at 05:02:11PM +0800, Mao Cheng wrote:
> Hi,
> On Wed, Oct 24, 2018 at 12:34 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Oct 24, 2018 at 11:01:13AM +0800, Mao Cheng wrote:
> > > Hi Brian,
> > > Thanks for your response.
> > > On Tue, Oct 23, 2018 at 10:53 PM, Brian Foster <bfoster@redhat.com> wrote:
> > > >
> > > > On Tue, Oct 23, 2018 at 03:56:51PM +0800, Mao Cheng wrote:
> > > > > Sorry for the trouble again. I wrote the wrong function name in the
> > > > > previous email, so I'm resending it.
> > > > > If you have received the previous email, please ignore it. Thanks.
> > > > >
> > > > > We have an XFS filesystem made with mkfs "-k" and mounted with the
> > > > > default options (rw,relatime,attr2,inode64,noquota); it is about
> > > > > 2.2TB in size and exported via Samba.
> > > > >
> > > > > [root@test1 home]# xfs_info /dev/sdk
> > > > > meta-data=/dev/sdk isize=512 agcount=4, agsize=131072000 blks
> > > > > = sectsz=4096 attr=2, projid32bit=1
> > > > > = crc=1 finobt=0 spinodes=0
> > > > > data = bsize=4096 blocks=524288000, imaxpct=5
> > > > > = sunit=0 swidth=0 blks
> > > > > naming =version 2 bsize=4096 ascii-ci=0 ftype=1
> > > > > log =internal bsize=4096 blocks=256000, version=2
> > > > > = sectsz=4096 sunit=1 blks, lazy-count=1
> > > > > realtime =none extsz=4096 blocks=0, rtextents=0
> > > > >
> > > > > free space about allocation groups:
> > > > > from to extents blocks pct
> > > > > 1 1 9 9 0.00
> > > > > 2 3 14291 29124 0.19
> > > > > 4 7 5689 22981 0.15
> > > > > 8 15 119 1422 0.01
> > > > > 16 31 754657 15093035 99.65
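(For reference, a freesp report like the one above buckets free extents by power-of-two size ranges and counts extents and blocks per bucket. A toy sketch of that bucketing, purely for illustration and not the actual xfs_db code:)

```c
#include <assert.h>

/*
 * Toy reproduction of an xfs_db "freesp"-style histogram: bucket free
 * extent lengths into power-of-two ranges [2^k, 2^(k+1)-1] and count
 * extents and blocks per bucket. Simplified sketch only.
 */
#define NBUCKETS 32

struct bucket {
	unsigned long extents;
	unsigned long blocks;
};

static void histogram(const unsigned long *lens, int n, struct bucket *b)
{
	int i, k;

	for (i = 0; i < n; i++) {
		/* find k such that 2^k <= len < 2^(k+1) */
		for (k = 0; (1UL << (k + 1)) <= lens[i]; k++)
			;
		b[k].extents++;
		b[k].blocks += lens[i];
	}
}
```

The report above shows almost all free space (99.65% of blocks) sitting in the 16-31 block bucket, which is what makes any larger allocation fall back to the locality search.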
> >
> > 750,000 fragmented free extents means something like 1600 btree
> > leaf blocks to hold them all.....
> >
> > > > xfs_alloc_ag_vextent_near() is one of the several block allocation
> > > > algorithms in XFS. That function itself includes a couple different
> > > > algorithms for the "near" allocation based on the state of the AG. One
> > > > looks like an intra-block search of the by-size free space btree (if not
> > > > many suitably sized extents are available) and the second looks like an
> > > > outward sweep of the by-block free space btree to find a suitably sized
> > > > extent.
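(To make the second algorithm concrete: the locality fallback can be modelled as an outward sweep over extents sorted by start block, like the by-block free space btree. This is a simplified flat-array sketch, not the kernel code; the real search walks btree cursors and also handles busy extents:)

```c
#include <assert.h>

struct extent {
	unsigned long start;	/* first block of the free extent */
	unsigned long len;	/* length in blocks */
};

/*
 * Toy model of the locality search: extents[] is sorted by start block.
 * Sweep outward from the record nearest 'target' until an extent of at
 * least 'minlen' blocks is found in each direction, then pick the one
 * starting closer to the target. Returns the index of the chosen
 * extent, or -1 if no extent is large enough.
 */
static int find_near(const struct extent *extents, int n,
		     unsigned long target, unsigned long minlen)
{
	int left = -1, right = -1;
	int pivot = n;
	int i;

	/* index of the first extent starting at or after target */
	for (i = 0; i < n; i++) {
		if (extents[i].start >= target) {
			pivot = i;
			break;
		}
	}

	/* sweep left for a large-enough extent */
	for (i = pivot - 1; i >= 0; i--) {
		if (extents[i].len >= minlen) {
			left = i;
			break;
		}
	}
	/* sweep right likewise */
	for (i = pivot; i < n; i++) {
		if (extents[i].len >= minlen) {
			right = i;
			break;
		}
	}

	if (left < 0)
		return right;
	if (right < 0)
		return left;
	/* prefer whichever candidate starts closer to the target */
	return (target - extents[left].start <= extents[right].start - target)
		? left : right;
}
```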
> >
> > Yup, just like the free inode allocation search, which is capped
> > at about 10 btree blocks left and right to prevent searching the
> > entire tree for the one free inode that remains in it.
> >
> > The problem here is that the first algorithm fails immediately
> > because there isn't a contiguous free space large enough for the
> > allocation being requested, and so it finds the largest extent whose
> > /location/ is less than the target block as the start point for the
> > search for the nearest large-enough freespace.
> >
> > IOW, we do an expanding radius size search based on physical
> > locality rather than finding a free space based on size. Once we
> > find a good extent to either the left or right, we then stop that
> > search and try to find a better extent to the other direction
> > (xfs_alloc_find_best_extent()). That search is not bounded, so it can
> > search the entire tree in the remaining direction without finding a
> > better match.
> >
> > We can't cut the initial left/right search shorter - we've got to
> > find a large enough free extent to succeed, but we can chop
> > xfs_alloc_find_best_extent() short, similar to searchdistance in
> > xfs_dialloc_ag_inobt(). The patch below does that.
> >
> > Really, though, I think what we need is a better size-based search
> > before falling back to a locality-based search. This is more complex,
> > so it's not a few minutes' work and requires a lot more thought and
> > testing.
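(The capping idea can be illustrated outside the kernel with a flat-array model of the bounded secondary search; the names and array representation here are illustrative only, since the real code walks a btree cursor and compares block-number distances:)

```c
#include <assert.h>

/*
 * Toy model of a capped secondary search: having already found a "good"
 * extent at distance 'gdiff' from the target, walk at most
 * 'searchdistance' records in the other direction looking for a closer
 * extent of sufficient size. dist[] holds each record's distance from
 * the target and increases monotonically as the sweep moves outward.
 * Returns the index of a better record, or -1 to keep the good one.
 */
static int find_best_capped(const unsigned long *dist,
			    const unsigned long *len,
			    int n, unsigned long minlen,
			    unsigned long gdiff, int searchdistance)
{
	int i;

	for (i = 0; i < n && searchdistance-- > 0; i++) {
		if (dist[i] >= gdiff)
			return -1;	/* records only get farther; give up */
		if (len[i] >= minlen)
			return i;	/* closer and large enough: better */
	}
	return -1;			/* cap exhausted, keep the good extent */
}
```

The point of the cap is exactly the tradeoff described above: past a few hundred records, the CPU spent searching costs more than the IO penalty of a slightly worse allocation.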
> >
> > > We share an XFS filesystem with Windows clients via the SMB protocol.
> > > There are about 5 Windows clients copying small files to the Samba
> > > share at the same time. The main problem is that the throughput
> > > periodically degrades from 30MB/s to around 10KB/s and recovers about
> > > 5s later. A kworker consumes 100% of one CPU when the throughput
> > > degrades, and that kworker task is writeback.
> > > /proc/vmstat shows nr_dirty is very close to nr_dirty_threshold and
> > > nr_writeback is very small (does that mean there are too many dirty
> > > pages in the page cache that can't be written out to disk?).
> >
> > Incoming writes are throttled at the rate at which writeback makes
> > progress, hence the system will sit at the threshold. This is normal.
> > Writeback is just slow because of the free space fragmentation in the
> > filesystem.
> Does running xfs_fsr periodically alleviate this problem?
> And is it advisable to run xfs_fsr regularly to reduce the
> fragmentation in xfs filesystems?
>
I think xfs_fsr is more likely to contribute to this problem than
alleviate it: xfs_fsr defragments files, whereas the problem here is
fragmentation of free space.

Could you determine whether Dave's patch helps with performance at all?
Also, would you be able to share a metadump of this filesystem?

Brian
> Regards,
>
> Mao.
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
> >
> >
> > xfs: cap search distance in xfs_alloc_ag_vextent_near()
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Don't waste too much CPU time finding the perfect free extent when
> > we don't have a large enough contiguous free space and there are
> > many, many small free spaces that we'd do a linear search through.
> > Modelled on searchdistance in xfs_dialloc_ag_inobt() which solved
> > the same problem with the cost of finding the last free inodes in
> > the inode allocation btree.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> > fs/xfs/libxfs/xfs_alloc.c | 13 ++++++++++---
> > 1 file changed, 10 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index e1c0c0d2f1b0..c0c0a018e3bb 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -886,8 +886,14 @@ xfs_alloc_ag_vextent_exact(
> > }
> >
> > /*
> > - * Search the btree in a given direction via the search cursor and compare
> > - * the records found against the good extent we've already found.
> > + * Search the btree in a given direction via the search cursor and compare the
> > + * records found against the good extent we've already found.
> > + *
> > + * We cap this search to a number of records to prevent searching hundreds of
> > + * thousands of records in a potentially futile search for a larger freespace
> > + * when free space is really badly fragmented. Spending more CPU time than the
> > + * IO cost of a sub-optimal allocation is a bad tradeoff - cap it at searching
> > + * a full btree block (~500 records on a 4k block size fs).
> > */
> > STATIC int
> > xfs_alloc_find_best_extent(
> > @@ -906,6 +912,7 @@ xfs_alloc_find_best_extent(
> > int error;
> > int i;
> > unsigned busy_gen;
> > + int searchdistance = args->mp->m_alloc_mxr[0];
> >
> > /* The good extent is perfect, no need to search. */
> > if (!gdiff)
> > @@ -963,7 +970,7 @@ xfs_alloc_find_best_extent(
> > error = xfs_btree_decrement(*scur, 0, &i);
> > if (error)
> > goto error0;
> > - } while (i);
> > + } while (i && searchdistance-- > 0);
> >
> > out_use_good:
> > xfs_btree_del_cursor(*scur, XFS_BTREE_NOERROR);