* xfs_alloc_ag_vextent_near() takes about 30ms to complete
@ 2018-10-23 7:56 Mao Cheng
From: Mao Cheng @ 2018-10-23 7:56 UTC
To: linux-xfs
Sorry for the trouble again. I wrote the wrong function name in my
previous message, so I am resending it. If you received the previous
email, please ignore it. Thanks.
We have an XFS filesystem created with the mkfs "-k" option and mounted
with the default options (rw,relatime,attr2,inode64,noquota). It is
about 2.2TB in size and exported via Samba.
[root@test1 home]# xfs_info /dev/sdk
meta-data=/dev/sdk isize=512 agcount=4, agsize=131072000 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=0 spinodes=0
data = bsize=4096 blocks=524288000, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal bsize=4096 blocks=256000, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Free space for each allocation group (AG 0 through AG 3, in order):

AG 0:
from to extents blocks pct
1 1 9 9 0.00
2 3 14291 29124 0.19
4 7 5689 22981 0.15
8 15 119 1422 0.01
16 31 754657 15093035 99.65
32 63 1 33 0.00
total free extents 774766
total free blocks 15146604
average free extent size 19.5499

AG 1:
from to extents blocks pct
1 1 253 253 0.00
2 3 7706 16266 0.21
4 7 7718 30882 0.39
8 15 24 296 0.00
16 31 381976 7638130 96.71
32 63 753 38744 0.49
131072 262143 1 173552 2.20
total free extents 398431
total free blocks 7898123
average free extent size 19.8231

AG 2:
from to extents blocks pct
1 1 370 370 0.00
2 3 2704 5775 0.01
4 7 1016 4070 0.01
8 15 24 254 0.00
16 31 546614 10931743 20.26
32 63 19191 1112600 2.06
64 127 2 184 0.00
131072 262143 1 163713 0.30
524288 1048575 2 1438626 2.67
1048576 2097151 4 5654463 10.48
2097152 4194303 1 3489060 6.47
4194304 8388607 2 12656637 23.46
16777216 33554431 1 18502975 34.29
total free extents 569932
total free blocks 53960470
average free extent size 94.6788

AG 3:
from to extents blocks pct
1 1 8 8 0.00
2 3 5566 11229 0.06
4 7 9622 38537 0.21
8 15 57 686 0.00
16 31 747242 14944852 80.31
32 63 570 32236 0.17
2097152 4194303 1 3582074 19.25
total free extents 763066
total free blocks 18609622
average free extent size 24.38
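
These per-AG histograms match the output of xfs_db's "freesp" command;
a sketch of how they can be regenerated read-only (same device as above):

    # print a free space histogram for each of the four allocation groups
    for ag in 0 1 2 3; do xfs_db -r -c "freesp -a $ag" /dev/sdk; done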
We copy small files (about 150KB each) from Windows to the XFS
filesystem via the SMB protocol. Sometimes a kworker process consumes
100% of one CPU, and "perf top" shows xfs_extent_busy_trim() and
xfs_btree_increment() consuming too much CPU; ftrace also shows
xfs_alloc_ag_vextent_near() taking about 30ms to complete.
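
For reference, a per-call latency like that 30ms figure can be captured
with the function_graph tracer; a minimal sketch (assuming debugfs is
mounted at /sys/kernel/debug and the symbol is traceable on this kernel):

    cd /sys/kernel/debug/tracing
    echo xfs_alloc_ag_vextent_near > set_graph_function
    echo function_graph > current_tracer
    echo 1 > tracing_on; sleep 10; echo 0 > tracing_on
    less trace   # closing-brace lines show the per-call duration in us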
In addition, all tests were performed on CentOS 7.4
(3.10.0-693.el7.x86_64).
Any suggestions are welcome.

* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete

From: Brian Foster @ 2018-10-23 14:53 UTC
To: Mao Cheng; +Cc: linux-xfs

On Tue, Oct 23, 2018 at 03:56:51PM +0800, Mao Cheng wrote:
> We have an XFS filesystem created with the mkfs "-k" option and mounted
> with the default options (rw,relatime,attr2,inode64,noquota). It is
> about 2.2TB in size and exported via Samba.
>
> [...]
>
> Free space for each allocation group:
> [...]

So it looks like free space in 3 out of 4 AGs is mostly fragmented to
16-31 block extents. Those same AGs appear to have a much higher number
(~15k-20k) of even smaller extents.

> We copy small files (about 150KB each) from Windows to the XFS
> filesystem via the SMB protocol. Sometimes a kworker process consumes
> 100% of one CPU, and "perf top" shows xfs_extent_busy_trim() and
> xfs_btree_increment() consuming too much CPU; ftrace also shows
> xfs_alloc_ag_vextent_near() taking about 30ms to complete.
>

This is kind of a vague performance report. Some process consumes a full
CPU and this is a problem for some (??) reason given unknown CPU and
unknown storage (with unknown amount of RAM). I assume that kworker task
is writeback, but you haven't really specified that either.

xfs_alloc_ag_vextent_near() is one of the several block allocation
algorithms in XFS. That function itself includes a couple of different
algorithms for the "near" allocation based on the state of the AG. One
looks like an intra-block search of the by-size free space btree (if not
many suitably sized extents are available) and the second looks like an
outward sweep of the by-block free space btree to find a suitably sized
extent. I could certainly see the latter taking some time for certain
sized allocation requests under fragmented free space conditions. If you
wanted more detail on what's going on here, I'd suggest capturing a
sample of the xfs_alloc* (and perhaps xfs_extent_busy*) tracepoints
during the workload.

That aside, it's probably best to step back and describe for the list
the overall environment, workload and performance problem you observed
that caused this level of digging in the first place. For example, has
throughput degraded over time? Has latency increased? How many writers
are active at once? Is preallocation involved (I thought Samba/Windows
triggered it in certain cases, but I don't recall)?

Brian

> In addition, all tests were performed on CentOS 7.4
> (3.10.0-693.el7.x86_64).
>
> Any suggestions are welcome.
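
As a hedged illustration of that suggestion, a tracepoint sample could be
gathered with trace-cmd along these lines (the event globs match the
xfs_alloc*/xfs_extent_busy* tracepoints named above; the 30-second window
is arbitrary):

    # record XFS allocation and busy-extent tracepoints while the copy runs
    trace-cmd record -e 'xfs:xfs_alloc*' -e 'xfs:xfs_extent_busy*' sleep 30
    trace-cmd report > alloc-trace.txt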

* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete

From: Mao Cheng @ 2018-10-24 3:01 UTC
To: bfoster; +Cc: linux-xfs

Hi Brian,
Thanks for your response.

Brian Foster <bfoster@redhat.com> wrote on Tuesday, Oct 23, 2018 at 10:53 PM:
> On Tue, Oct 23, 2018 at 03:56:51PM +0800, Mao Cheng wrote:
> [...]
>
> This is kind of a vague performance report. Some process consumes a full
> CPU and this is a problem for some (??) reason given unknown CPU and
> unknown storage (with unknown amount of RAM). I assume that kworker task
> is writeback, but you haven't really specified that either.

Yes, the kworker task is writeback, and the storage is the XFS-formatted
disk that is also the target we copy files to.

> [...]
>
> That aside, it's probably best to step back and describe for the list
> the overall environment, workload and performance problem you observed
> that caused this level of digging in the first place. For example, has
> throughput degraded over time? Has latency increased? How many writers
> are active at once? Is preallocation involved (I thought Samba/Windows
> triggered it in certain cases, but I don't recall)?

We share an XFS filesystem to Windows via the SMB protocol.
There are about 5 Windows clients copying small files to the Samba share
at the same time.
The main problem is that the throughput degrades from 30MB/s to around
10KB/s periodically and recovers about 5s later.
The kworker consumes 100% of one CPU when the throughput degrades, and
the kworker task is writeback.
/proc/vmstat shows nr_dirty is very close to nr_dirty_threshold and
nr_writeback is too small (does that mean there are too many dirty pages
in the page cache that can't be written out to disk?).

Mao

> Brian
> [...]
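
The counters described above can be sampled directly while a stall is in
progress; a minimal sketch (counter names as they appear in /proc/vmstat
on this kernel):

    # watch the dirty/writeback counters once per second during a stall
    watch -n 1 'grep -E "nr_dirty|nr_writeback" /proc/vmstat'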

* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete

From: Dave Chinner @ 2018-10-24 4:34 UTC
To: Mao Cheng; +Cc: bfoster, linux-xfs

On Wed, Oct 24, 2018 at 11:01:13AM +0800, Mao Cheng wrote:
> [...]
> > > AG 0:
> > > from to extents blocks pct
> > > 1 1 9 9 0.00
> > > 2 3 14291 29124 0.19
> > > 4 7 5689 22981 0.15
> > > 8 15 119 1422 0.01
> > > 16 31 754657 15093035 99.65

750,000 fragmented free extents means something like 1600 btree
leaf blocks to hold them all.....

> > xfs_alloc_ag_vextent_near() is one of the several block allocation
> > algorithms in XFS. That function itself includes a couple of different
> > algorithms for the "near" allocation based on the state of the AG. One
> > looks like an intra-block search of the by-size free space btree (if not
> > many suitably sized extents are available) and the second looks like an
> > outward sweep of the by-block free space btree to find a suitably sized
> > extent.

Yup, just like the free inode allocation search, which is capped
at about 10 btree blocks left and right to prevent searching the
entire tree for the one free inode that remains in it.

The problem here is that the first algorithm fails immediately
because there isn't a contiguous free space large enough for the
allocation being requested, and so it finds the largest block whose
/location/ is less than the target block as the start point for the
nearest largest freespace.

IOW, we do an expanding radius size search based on physical
locality rather than finding a free space based on size. Once we
find a good extent to either the left or right, we then stop that
search and try to find a better extent in the other direction
(xfs_alloc_find_best_extent()). That search is not bounded, so it can
search the rest of the tree in that remaining direction without
finding a better match.

We can't cut the initial left/right search shorter - we've got to
find a large enough free extent to succeed, but we can chop
xfs_alloc_find_best_extent() short, similar to searchdistance in
xfs_dialloc_ag_inobt(). The patch below does that.

Really, though, I think what we need is a better size based search
before falling back to a locality based search. This is more
complex, so not a few minutes work, and requires a lot more thought
and testing.

> We share an XFS filesystem to Windows via the SMB protocol.
> There are about 5 Windows clients copying small files to the Samba
> share at the same time.
> The main problem is that the throughput degrades from 30MB/s to around
> 10KB/s periodically and recovers about 5s later.
> The kworker consumes 100% of one CPU when the throughput degrades, and
> the kworker task is writeback.
> /proc/vmstat shows nr_dirty is very close to nr_dirty_threshold and
> nr_writeback is too small (does that mean there are too many dirty
> pages in the page cache that can't be written out to disk?).

Incoming writes are throttled at the rate writeback makes progress,
hence the system will sit at the threshold. This is normal.
Writeback is just slow because of the freespace fragmentation in the
filesystem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


xfs: cap search distance in xfs_alloc_ag_vextent_near()

From: Dave Chinner <dchinner@redhat.com>

Don't waste too much CPU time finding the perfect free extent when
we don't have a large enough contiguous free space and there are
many, many small free spaces that we'd do a linear search through.
Modelled on searchdistance in xfs_dialloc_ag_inobt() which solved
the same problem with the cost of finding the last free inodes in
the inode allocation btree.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index e1c0c0d2f1b0..c0c0a018e3bb 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -886,8 +886,14 @@ xfs_alloc_ag_vextent_exact(
 }
 
 /*
- * Search the btree in a given direction via the search cursor and compare
- * the records found against the good extent we've already found.
+ * Search the btree in a given direction via the search cursor and compare the
+ * records found against the good extent we've already found.
+ *
+ * We cap this search to a number of records to prevent searching hundreds of
+ * thousands of records in a potentially futile search for a larger freespace
+ * when free space is really badly fragmented. Spending more CPU time than the
+ * IO cost of a sub-optimal allocation is a bad tradeoff - cap it at searching
+ * a full btree block (~500 records on a 4k block size fs).
 */
STATIC int
xfs_alloc_find_best_extent(
@@ -906,6 +912,7 @@ xfs_alloc_find_best_extent(
 	int			error;
 	int			i;
 	unsigned		busy_gen;
+	int			searchdistance = args->mp->m_alloc_mxr[0];
 
 	/* The good extent is perfect, no need to search. */
 	if (!gdiff)
@@ -963,7 +970,7 @@ xfs_alloc_find_best_extent(
 		error = xfs_btree_decrement(*scur, 0, &i);
 		if (error)
 			goto error0;
-	} while (i);
+	} while (i && searchdistance-- > 0);
 
 out_use_good:
 	xfs_btree_del_cursor(*scur, XFS_BTREE_NOERROR);
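
For context on the "~500 records" number in that comment: each free space
btree record is 8 bytes (a 4-byte startblock plus a 4-byte blockcount),
and on a v5 (CRC) filesystem like the one reported here a short-form
btree block carries a 56-byte header (the header length is taken from the
usual on-disk format constants), so on a 4k block size filesystem:

    m_alloc_mxr[0] = (4096 - 56) / 8 = 505 records per leaf block

A non-CRC filesystem uses a 16-byte header instead, giving 510 records;
either way the cap amounts to roughly one full leaf block of records.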

* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete

From: Mao Cheng @ 2018-10-24 9:02 UTC
To: david; +Cc: bfoster, linux-xfs

Hi,

Dave Chinner <david@fromorbit.com> wrote on Wednesday, Oct 24, 2018 at 12:34 PM:
> On Wed, Oct 24, 2018 at 11:01:13AM +0800, Mao Cheng wrote:
> [...]
>
> Incoming writes are throttled at the rate writeback makes progress,
> hence the system will sit at the threshold. This is normal.
> Writeback is just slow because of the freespace fragmentation in the
> filesystem.

Does running xfs_fsr periodically alleviate this problem?
And is it advisable to run xfs_fsr regularly to reduce the
fragmentation in XFS filesystems?

Regards,

Mao.

> Cheers,
>
> Dave.
> [...]
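
As a side note on telling the two kinds of fragmentation apart, both can
be inspected read-only with xfs_db; a sketch using the device from
earlier in the thread:

    xfs_db -r -c frag /dev/sdk          # file (extent) fragmentation factor
    xfs_db -r -c 'freesp -s' /dev/sdk   # free space histogram and summary

xfs_fsr addresses the first number; the stalls in this thread come from
the second.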

* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete

From: Brian Foster @ 2018-10-24 12:11 UTC
To: Mao Cheng; +Cc: david, linux-xfs

On Wed, Oct 24, 2018 at 05:02:11PM +0800, Mao Cheng wrote:
> [...]
> > Incoming writes are throttled at the rate writeback makes progress,
> > hence the system will sit at the threshold. This is normal.
> > Writeback is just slow because of the freespace fragmentation in the
> > filesystem.
>
> Does running xfs_fsr periodically alleviate this problem?
> And is it advisable to run xfs_fsr regularly to reduce the
> fragmentation in XFS filesystems?
>

I think xfs_fsr is more likely to contribute to this problem than
alleviate it. xfs_fsr defragments files whereas the problem here is
fragmentation of free space.

Could you determine whether Dave's patch helps with performance at all?

Also, would you be able to share a metadump of this filesystem?

Brian

> [...]

* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete

From: Mao Cheng @ 2018-10-25 4:01 UTC
To: bfoster; +Cc: david, linux-xfs

Brian Foster <bfoster@redhat.com> wrote on Wednesday, Oct 24, 2018 at 8:11 PM:
> On Wed, Oct 24, 2018 at 05:02:11PM +0800, Mao Cheng wrote:
> [...]
>
> I think xfs_fsr is more likely to contribute to this problem than
> alleviate it. xfs_fsr defragments files whereas the problem here is
> fragmentation of free space.
>
> Could you determine whether Dave's patch helps with performance at all?

I will test the patch later.

> Also, would you be able to share a metadump of this filesystem?

The metadump has been uploaded to Google Drive; the link is as follows:
https://drive.google.com/open?id=1RLekC-BnbAujXDl-xZ-vteMudrl2xC9D

Thanks,

Mao

> Brian
> [...]
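
For reference, a shareable image like that is typically produced with
xfs_metadump, which copies only metadata (no file data) and obfuscates
filenames by default; a sketch (target paths are placeholders):

    xfs_metadump -g /dev/sdk sdk.metadump   # -g shows progress
    xfs_mdrestore sdk.metadump sdk.img      # restore into a sparse image
    mount -o loop sdk.img /mnt/img          # mount the image for analysis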

* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete

From: Brian Foster @ 2018-10-25 14:55 UTC
To: Mao Cheng; +Cc: david, linux-xfs

On Thu, Oct 25, 2018 at 12:01:06PM +0800, Mao Cheng wrote:
> [...]
> > Also, would you be able to share a metadump of this filesystem?
>
> The metadump has been uploaded to Google Drive; the link is as follows:
> https://drive.google.com/open?id=1RLekC-BnbAujXDl-xZ-vteMudrl2xC9D
>

Thanks. xfs_alloc_ag_vextent_near() shows up as a hot path if I restore
this and throw an fs_mark small file workload at it. Some observations..

A trace of xfs_alloc_near* events over a 5 minute period shows a
breakdown like the following:

    513 xfs_alloc_near_first:
   8102 xfs_alloc_near_greater:
    180 xfs_alloc_near_lesser:

If I re-mkfs the restored image and run the same workload, I end up with
(to no real surprise):

  61561 xfs_alloc_near_first:

So clearly we are falling back to that second algorithm most of the
time. Most of these lesser/greater allocs have minlen == maxlen == 38
blocks and occur mostly split between AG 0 and AG 2. Looking at the
(initial) per-ag free space summary:

# for i in $(seq 0 3); do xfs_db -c "freesp -a $i" /mnt/img ; done
from to extents blocks pct
1 1 9 9 0.00
2 3 14243 28983 0.06
4 7 10595 42615 0.09
8 15 123 1468 0.00
16 31 862232 17244531 37.66
32 63 126968 7364144 16.08
64 127 1 88 0.00
16384 32767 1 30640 0.07
131072 262143 1 131092 0.29
524288 1048575 3 1835043 4.01
1048576 2097151 2 2883584 6.30
2097152 4194303 3 10456093 22.84
4194304 8388607 1 5767201 12.60
from to extents blocks pct
1 1 8 8 0.00
2 3 6557 13115 0.08
4 7 7710 30844 0.18
8 15 26 320 0.00
16 31 393039 7859395 47.08
32 63 250 9568 0.06
8388608 16777215 1 8780040 52.60
from to extents blocks pct
1 1 2418 2418 0.01
2 3 6126 16025 0.06
4 7 1052 4263 0.02
8 15 84 998 0.00
16 31 873095 17461168 62.84
32 63 35224 2042992 7.35
4194304 8388607 1 8259469 29.72
from to extents blocks pct
1 1 258 258 0.00
2 3 7951 16550 0.06
4 7 10484 42007 0.14
8 15 68 827 0.00
16 31 864835 17296959 57.85
32 63 58 2700 0.01
1048576 2097151 1 1310993 4.38
4194304 8388607 2 11228101 37.55

We can see that both AGs 0 and 2 have many likely >= 38 block extents,
but each also has a significant number of < 32 block extents. The first
bit can contribute to skipping the cntbt algorithm, the second bit
leaves a proverbial minefield of too small extents that the second
algorithm may have to sift through. AGs 1 and 3 look like they have
decent (< 32 block) fragmentation, but both at least start with a much
smaller number of 32+ block extents and thus increased odds of finding
an extent in the cntbt. That said, I'm not seeing any (38 block, near)
allocation requests in these AGs at all for some reason. Perhaps my
trace window was too small to catch any..

Brian

> [...]
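
A breakdown like the above can be reproduced by counting tracepoint
hits; one hedged sketch (the grep pattern assumes event names appear
with a trailing colon in the report output, as in the counts shown
above):

    trace-cmd record -e 'xfs:xfs_alloc_near*' sleep 300
    trace-cmd report | grep -o 'xfs_alloc_near[a-z_]*:' | sort | uniq -c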
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-24 4:34 ` Dave Chinner
2018-10-24 9:02 ` Mao Cheng
@ 2018-10-24 12:09 ` Brian Foster
2018-10-24 22:35 ` Dave Chinner
1 sibling, 1 reply; 17+ messages in thread
From: Brian Foster @ 2018-10-24 12:09 UTC (permalink / raw)
To: Dave Chinner; +Cc: Mao Cheng, linux-xfs

On Wed, Oct 24, 2018 at 03:34:16PM +1100, Dave Chinner wrote:
> On Wed, Oct 24, 2018 at 11:01:13AM +0800, Mao Cheng wrote:
> > Hi Brian,
> > Thanks for your response.
> > Brian Foster <bfoster@redhat.com> wrote on Tue, Oct 23, 2018 at 10:53 PM:
> > >
> > > On Tue, Oct 23, 2018 at 03:56:51PM +0800, Mao Cheng wrote:
> > > > [xfs_info output and per-AG free space histograms snipped; see
> > > > the original report. The first AG's histogram ends with:]
> > > > 16 31 754657 15093035 99.65
>
> 750,000 fragmented free extents means something like 1600 btree
> leaf blocks to hold them all.....
>
> > > xfs_alloc_ag_vextent_near() is one of the several block allocation
> > > algorithms in XFS. That function itself includes a couple different
> > > algorithms for the "near" allocation based on the state of the AG. One
> > > looks like an intra-block search of the by-size free space btree (if not
> > > many suitably sized extents are available) and the second looks like an
> > > outward sweep of the by-block free space btree to find a suitably sized
> > > extent.
>
> Yup, just like the free inode allocation search, which is capped
> at about 10 btree blocks left and right to prevent searching the
> entire tree for the one free inode that remains in it.
>
> The problem here is that the first algorithm fails immediately
> because there isn't a contiguous free space large enough for the
> allocation being requested, and so it finds the largest block whose
> /location/ is less than the target block as the start point for the
> nearest largest freespace.
>

Hmm, not sure I follow what you're saying here wrt why we end up in
the second algorithm. I was thinking the most likely condition is that
there are actually plenty of suitably sized extents, but as shown by
the free space data, they're amidst a huge number of too small extents.
The first algorithm is only active if a lookup in the cntbt lands in
the last block (the far right leaf) of the btree, so unless I'm missing
something this would mean we'd skip right past it to the second
algorithm if the last N blocks (where N > 1) of the cntbt have large
enough extents.

IOW, the first algo seems like an optimization for when we know there
are only a small number of minimum sized extents available and the
second (location based) algorithm would mostly churn. Regardless, we
end up in the same place in the end...

> IOW, we do an expanding radius size search based on physical
> locality rather than finding a free space based on size. Once we
> find a good extent to either the left or right, we then stop that
> search and try to find a better extent in the other direction
> (xfs_alloc_find_best_extent()). That search is not bound, so it can
> search the entirety of the tree in the remaining direction without
> finding a better match.
>
> We can't cut the initial left/right search shorter - we've got to
> find a large enough free extent to succeed, but we can chop
> xfs_alloc_find_best_extent() short, similar to searchdistance in
> xfs_dialloc_ag_inobt(). The patch below does that.
>

This search looks like it goes as far in the opposite direction as the
current candidate extent. So I take it this could potentially go off
the rails if we find one suitable extent in one direction that is
relatively far off wrt startblock, and then the opposite direction
happens to be populated with a ton of too small extents before we
extend to the range that breaks the search.

I'm curious whether this contributes to the reporter's problem at all,
but this sounds like a reasonable change to me either way.

> Really, though, I think what we need is a better size based search
> before falling back to a locality based search. This is more
> complex, so not a few minutes work, and requires a lot more thought
> and testing.
>

Indeed. As noted above, the current size based search strikes me as an
optimization that only executes under particular conditions. Since the
purpose of this function is locality allocation, I'm wondering if we
could implement a smarter location based search using information
available in the by-size tree. For example, suppose we could identify
the closest minimally sized extents to agbno in order to better seed
the left/right starting points of the location based search. This of
course would require careful heuristics/tradeoffs to make sure we
don't just replace a bnobt scan with a cntbt scan.

Or perhaps do something like locate the range of cntbt extents within
the min/max size, then make an intelligent decision over whether to
scan that set of cntbt records, perform a smarter bnobt scan or resort
to the current algorithm. Just thinking out loud...

Brian

> > We share an xfs filesystem to Windows via the SMB protocol.
> > There are about 5 Windows clients copying small files to the samba
> > share at the same time.
> > The main problem is the throughput degraded from 30MB/s to around
> > 10KB/s periodically and recovered about 5s later.
> > The kworker consumes 100% of one CPU when the throughput degrades,
> > and the kworker task is writeback.
> > /proc/vmstat shows nr_dirty is very close to nr_dirty_threshold
> > and nr_writeback is too small (does that mean there are too many
> > dirty pages in the page cache that can't be written out to disk?)
>
> incoming writes are throttled at the rate writeback makes progress,
> hence the system will sit at the threshold. This is normal.
> Writeback is just slow because of the freespace fragmentation in the
> filesystem.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
>
> xfs: cap search distance in xfs_alloc_ag_vextent_near()
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Don't waste too much CPU time finding the perfect free extent when
> we don't have a large enough contiguous free space and there are
> many, many small free spaces that we'd do a linear search through.
> Modelled on searchdistance in xfs_dialloc_ag_inobt(), which solved
> the same problem with the cost of finding the last free inodes in
> the inode allocation btree.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index e1c0c0d2f1b0..c0c0a018e3bb 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -886,8 +886,14 @@ xfs_alloc_ag_vextent_exact(
>  }
>
>  /*
> - * Search the btree in a given direction via the search cursor and compare
> - * the records found against the good extent we've already found.
> + * Search the btree in a given direction via the search cursor and compare the
> + * records found against the good extent we've already found.
> + *
> + * We cap this search to a number of records to prevent searching hundreds of
> + * thousands of records in a potentially futile search for a larger freespace
> + * when free space is really badly fragmented. Spending more CPU time than the
> + * IO cost of a sub-optimal allocation is a bad tradeoff - cap it at searching
> + * a full btree block (~500 records on a 4k block size fs).
>   */
>  STATIC int
>  xfs_alloc_find_best_extent(
> @@ -906,6 +912,7 @@ xfs_alloc_find_best_extent(
>  	int			error;
>  	int			i;
>  	unsigned		busy_gen;
> +	int			searchdistance = args->mp->m_alloc_mxr[0];
>
>  	/* The good extent is perfect, no need to search. */
>  	if (!gdiff)
> @@ -963,7 +970,7 @@ xfs_alloc_find_best_extent(
>  		error = xfs_btree_decrement(*scur, 0, &i);
>  		if (error)
>  			goto error0;
> -	} while (i);
> +	} while (i && searchdistance-- > 0);
>
>  out_use_good:
>  	xfs_btree_del_cursor(*scur, XFS_BTREE_NOERROR);

^ permalink raw reply	[flat|nested] 17+ messages in thread
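The "~500 records on a 4k block size fs" figure in the patch comment can
be cross-checked from the free space btree record geometry. A
back-of-envelope sketch (the 56-byte header is my assumption of the v5
short-form btree block header size for a crc=1 filesystem; it is not
stated in the thread):

#include <stdio.h>

int main(void)
{
	/*
	 * Rough reconstruction of m_alloc_mxr[0] for this fs: a free
	 * space btree leaf record is a (startblock, blockcount) pair
	 * of 32-bit values, and the header size below is an assumed
	 * v5 short-form btree block header.
	 */
	int blocksize = 4096;
	int hdrsize = 56;	/* assumption: v5 short-form header */
	int recsize = 2 * 4;	/* two __be32 fields per record */

	/* prints 505, i.e. the "~500 record" cap in the patch */
	printf("records per leaf: %d\n", (blocksize - hdrsize) / recsize);
	return 0;
}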
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-24 12:09 ` Brian Foster
@ 2018-10-24 22:35 ` Dave Chinner
2018-10-25 13:21 ` Brian Foster
0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2018-10-24 22:35 UTC (permalink / raw)
To: Brian Foster; +Cc: Mao Cheng, linux-xfs

On Wed, Oct 24, 2018 at 08:09:27AM -0400, Brian Foster wrote:
> On Wed, Oct 24, 2018 at 03:34:16PM +1100, Dave Chinner wrote:
> > [heavily quoted context snipped]
> >
> > The problem here is that the first algorithm fails immediately
> > because there isn't a contiguous free space large enough for the
> > allocation being requested, and so it finds the largest block whose
> > /location/ is less than the target block as the start point for the
> > nearest largest freespace.
> >
>
> Hmm, not sure I follow what you're saying here wrt why we end up in
> the second algorithm.

I probably didn't explain it well. I wrote it quickly and didn't
proofread it. What I meant was "...because there isn't a contiguous
free space large enough for the allocation requested to land in the
last btree block, and so...."

> I was thinking the most likely condition is that
> there are actually plenty of suitably sized extents, but as shown by
> the free space data, they're amidst a huge number of too small
> extents. The

Yup, if we do a by-size lookup for >= 22 blocks and there are a
1000 free 22 block extents, the lookup doesn't land in the last
block. Straight to physical locality search.

> first algorithm is only active if a lookup in the cntbt lands in the
> last block (the far right leaf) of the btree, so unless I'm missing
> something this would mean we'd skip right past it to the second
> algorithm if the last N blocks (where N > 1) of the cntbt have large
> enough extents.

*nod*.

> IOW, the first algo seems like an optimization for when we know there
> are only a small number of minimum sized extents available and the
> second (location based) algorithm would mostly churn. Regardless, we
> end up in the same place in the end...
>
> > IOW, we do an expanding radius size search based on physical
> > locality rather than finding a free space based on size. Once we
> > find a good extent to either the left or right, we then stop that
> > search and try to find a better extent in the other direction
> > (xfs_alloc_find_best_extent()). That search is not bound, so it can
> > search the entirety of the tree in the remaining direction without
> > finding a better match.
> >
> > We can't cut the initial left/right search shorter - we've got to
> > find a large enough free extent to succeed, but we can chop
> > xfs_alloc_find_best_extent() short, similar to searchdistance in
> > xfs_dialloc_ag_inobt(). The patch below does that.
> >
>
> This search looks like it goes as far in the opposite direction as
> the current candidate extent. So I take it this could potentially go
> off the rails if we find one suitable extent in one direction that is
> relatively far off wrt startblock, and then the opposite direction
> happens to be populated with a ton of too small extents before we
> extend to the range that breaks the search.

Precisely.

> I'm curious whether this contributes to the reporter's problem at
> all, but this sounds like a reasonable change to me either way.

So am I. It's the low hanging fruit - we have to search until we
find the first candidate block (no choice in that) but we can choose
to terminate the "is there a better choice" search.

> > Really, though, I think what we need is a better size based search
> > before falling back to a locality based search. This is more
> > complex, so not a few minutes work, and requires a lot more thought
> > and testing.
> >
>
> Indeed. As noted above, the current size based search strikes me as
> an optimization that only executes under particular conditions.

It's the common condition in a typical filesystem - if there are
large, contiguous free spaces in the filesystem, then the lookup will
almost always land in the last block of the btree.

> Since the purpose of this function is locality allocation,

Well, locality is the /second/ consideration - the first algorithm
prioritises maxlen for contiguous allocation, then selects the best
candidate by locality. The second algorithm prioritises locality
over allocation length.

> I'm wondering
> if we could implement a smarter location based search using
> information available in the by-size tree. For example, suppose we
> could identify the closest minimally sized extents to agbno in order
> to better seed the left/right starting points of the location based
> search. This of course would require careful heuristics/tradeoffs to
> make sure we don't just replace a bnobt scan with a cntbt scan.

I wouldn't bother. I'd just take the "last block" algorithm and make
it search all the >= contiguous free space extents for best locality
before dropping back to the minlen search.

Really, that's what the first algorithm should be. Looking at the
last block and selecting the best free space by size and then
locality is really just a degenerate case of the more general
algorithm.

Back when this algorithm was designed, AGs could only be 4GB in
size, so searching only the last block by size made sense - the
total number of free space extents is fairly well bound by the
AG size. That bound essentially went away with expanding AGs to 1TB,
but the algorithm wasn't changed to reflect that even a small amount
of free space fragmentation could result in almost never hitting the
last block of the btree....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
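Dave's 22-block example can be made concrete with a toy model of the
cntbt (a stand-alone sketch, not kernel code; it assumes fully packed
505-record leaves per the estimate above, and a linear scan stands in
for the btree lookup):

#include <stdio.h>

#define RECS_PER_LEAF 505	/* ~(4096 - 56) / 8, per the estimate above */

/*
 * Toy model of the cntbt: sizes[] holds the free extent lengths in
 * sorted (by-size) record order, packed into full leaves. The "last
 * block" pass of the near allocator only runs when a size >= maxlen
 * lookup lands in the rightmost leaf.
 */
static int ge_lands_in_last_leaf(const int *sizes, int nrecs, int maxlen)
{
	int i;

	for (i = 0; i < nrecs; i++)
		if (sizes[i] >= maxlen)
			break;
	return i < nrecs && nrecs - i <= RECS_PER_LEAF;
}

int main(void)
{
	int sizes[1003];
	int i;

	/* Dave's example: 1000 free extents of exactly 22 blocks... */
	for (i = 0; i < 1000; i++)
		sizes[i] = 22;
	/* ...plus, for contrast, a few large contiguous free spaces */
	for (i = 1000; i < 1003; i++)
		sizes[i] = 200;

	/* 1000 matches span >1 leaf: prints "no", i.e. straight to
	 * the physical locality search */
	printf("maxlen 22 lands in last leaf: %s\n",
	       ge_lands_in_last_leaf(sizes, 1003, 22) ? "yes" : "no");
	/* only 3 matches: prints "yes", the first algorithm runs */
	printf("maxlen 100 lands in last leaf: %s\n",
	       ge_lands_in_last_leaf(sizes, 1003, 100) ? "yes" : "no");
	return 0;
}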
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-24 22:35 ` Dave Chinner
@ 2018-10-25 13:21 ` Brian Foster
2018-10-26 1:03 ` Dave Chinner
0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2018-10-25 13:21 UTC (permalink / raw)
To: Dave Chinner; +Cc: Mao Cheng, linux-xfs

On Thu, Oct 25, 2018 at 09:35:23AM +1100, Dave Chinner wrote:
> On Wed, Oct 24, 2018 at 08:09:27AM -0400, Brian Foster wrote:
> > [context snipped]
> >
> > Indeed. As noted above, the current size based search strikes me as
> > an optimization that only executes under particular conditions.
>
> It's the common condition in a typical filesystem - if there are
> large, contiguous free spaces in the filesystem, then the lookup will
> almost always land in the last block of the btree.
>

I guess that makes sense for a clean fs. I wonder how long that state
persists in practice as usage increases, however. The more small free
extents that are created, the more that decision comes to depend on
the size (or at least maxlen) of the request.

> > Since the purpose of this function is locality allocation,
>
> Well, locality is the /second/ consideration - the first algorithm
> prioritises maxlen for contiguous allocation, then selects the best
> candidate by locality. The second algorithm prioritises locality
> over allocation length.
>

Ah yes, good point. I missed the maxlen/minlen dynamic on my first
read through. So if we have some small number of extents that satisfy
maxlen (as opposed to the [minlen, maxlen] range), we prioritize those
extents and apply locality to that subset. If we have no such extents,
or too many, we move on to the bno-based fan out search. There, we
find the closest extent that satisfies the [minlen, maxlen] request.

> > I'm wondering
> > if we could implement a smarter location based search using
> > information available in the by-size tree. [...] This of course
> > would require careful heuristics/tradeoffs to make sure we don't
> > just replace a bnobt scan with a cntbt scan.
>
> I wouldn't bother. I'd just take the "last block" algorithm and make
> it search all the >= contiguous free space extents for best locality
> before dropping back to the minlen search.
>

Ok, that makes sense. The caveat seems to be, though, that the "last
block" algorithm searches all of the applicable records to discover
the best locality. We could open up this search as such, but if free
space happens to be completely fragmented into >= requested extents,
that could mean every allocation falls into a full cntbt scan where a
bnobt lookup would result in a much faster allocation.

So ISTM that we still need some kind of intelligent heuristic to fall
back to the second algorithm to cover the "too many" case. What
exactly that is may take some more thought, experimentation and
testing.

> Really, that's what the first algorithm should be. Looking at the
> last block and selecting the best free space by size and then
> locality is really just a degenerate case of the more general
> algorithm.
>
> Back when this algorithm was designed, AGs could only be 4GB in
> size, so searching only the last block by size made sense - the
> total number of free space extents is fairly well bound by the
> AG size. That bound essentially went away with expanding AGs to 1TB,
> but the algorithm wasn't changed to reflect that even a small amount
> of free space fragmentation could result in almost never hitting the
> last block of the btree....
>

Ok, thanks for the historical context.

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
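As a concrete picture of that bno-based fan out, here is a stand-alone
toy of the second algorithm (illustrative only: a sorted array stands
in for the bnobt, and the real code drives two btree cursors and then
runs xfs_alloc_find_best_extent() in the remaining direction):

#include <stdio.h>
#include <stdlib.h>

struct ext { int bno; int len; };	/* stand-in for a bnobt record */

/*
 * Toy of the locality search: from the bnobt position closest to the
 * target agbno, sweep left and right until each side finds an extent
 * of at least minlen, then keep the physically closer of the two.
 * A sea of too-small extents around the target is exactly what makes
 * this sweep walk so many records.
 */
static int bno_sweep(const struct ext *e, int n, int agbno, int minlen)
{
	int l, r, best = -1;

	/* start both cursors around the first extent >= agbno */
	for (r = 0; r < n && e[r].bno < agbno; r++)
		;
	l = r - 1;

	while (l >= 0 && e[l].len < minlen)	/* sweep left */
		l--;
	while (r < n && e[r].len < minlen)	/* sweep right */
		r++;

	if (l >= 0)
		best = l;
	if (r < n && (best < 0 ||
	    abs(e[r].bno - agbno) < abs(e[best].bno - agbno)))
		best = r;
	return best;				/* index, or -1 */
}

int main(void)
{
	/* sorted by bno: small extents surround one large extent */
	struct ext e[] = {
		{ 100, 20 }, { 200, 25 }, { 300, 18 }, { 900, 4000 },
	};
	int i = bno_sweep(e, 4, 250, 100);

	if (i >= 0)
		printf("chose extent at bno %d, len %d\n",
		       e[i].bno, e[i].len);
	return 0;
}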
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-25 13:21 ` Brian Foster
@ 2018-10-26 1:03 ` Dave Chinner
2018-10-26 13:03 ` Brian Foster
0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2018-10-26 1:03 UTC (permalink / raw)
To: Brian Foster; +Cc: Mao Cheng, linux-xfs

On Thu, Oct 25, 2018 at 09:21:30AM -0400, Brian Foster wrote:
> On Thu, Oct 25, 2018 at 09:35:23AM +1100, Dave Chinner wrote:
> > On Wed, Oct 24, 2018 at 08:09:27AM -0400, Brian Foster wrote:
> > > I'm wondering
> > > if we could implement a smarter location based search using
> > > information available in the by-size tree. [...]
> >
> > I wouldn't bother. I'd just take the "last block" algorithm and make
> > it search all the >= contiguous free space extents for best locality
> > before dropping back to the minlen search.
> >
>
> Ok, that makes sense. The caveat seems to be, though, that the "last
> block" algorithm searches all of the applicable records to discover
> the best locality. We could open up this search as such, but if free
> space happens to be completely fragmented into >= requested extents,
> that could mean every allocation falls into a full cntbt scan where
> a bnobt lookup would result in a much faster allocation.

Yup, we'll need to bound it so it doesn't do stupid things. :P

> So ISTM that we still need some kind of intelligent heuristic to
> fall back to the second algorithm to cover the "too many" case.
> What exactly that is may take some more thought, experimentation
> and testing.

Yeah, that's the difficulty with making core allocator algorithm
changes - how to characterise the effect of the change. I'm not sure
that's a huge problem in this case, though, because selecting a
matching contig freespace is almost always going to be better for
filesystem longevity and freespace fragmentation resistance than
selecting a shorter freespace and doing lots more small allocations.
It's the 'lots of small allocations' that really makes the freespace
fragmentation spiral out of control, so if we can avoid that until
we've used all the matching contig free spaces we'll be better off
in the long run.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-26 1:03 ` Dave Chinner
@ 2018-10-26 13:03 ` Brian Foster
2018-10-27 3:16 ` Dave Chinner
0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2018-10-26 13:03 UTC (permalink / raw)
To: Dave Chinner; +Cc: Mao Cheng, linux-xfs

On Fri, Oct 26, 2018 at 12:03:44PM +1100, Dave Chinner wrote:
> On Thu, Oct 25, 2018 at 09:21:30AM -0400, Brian Foster wrote:
> > [context snipped]
> >
> > Ok, that makes sense. The caveat seems to be, though, that the
> > "last block" algorithm searches all of the applicable records to
> > discover the best locality. We could open up this search as such,
> > but if free space happens to be completely fragmented into >=
> > requested extents, that could mean every allocation falls into a
> > full cntbt scan where a bnobt lookup would result in a much faster
> > allocation.
>
> Yup, we'll need to bound it so it doesn't do stupid things. :P
>

Yep.

> > So ISTM that we still need some kind of intelligent heuristic to
> > fall back to the second algorithm to cover the "too many" case.
> > What exactly that is may take some more thought, experimentation
> > and testing.
>
> Yeah, that's the difficulty with making core allocator algorithm
> changes - how to characterise the effect of the change. I'm not sure
> that's a huge problem in this case, though, because selecting a
> matching contig freespace is almost always going to be better for
> filesystem longevity and freespace fragmentation resistance than
> selecting a shorter freespace and doing lots more small allocations.
> It's the 'lots of small allocations' that really makes the freespace
> fragmentation spiral out of control, so if we can avoid that until
> we've used all the matching contig free spaces we'll be better off
> in the long run.
>

Ok, so I ran fs_mark against the metadump with your patch and a quick
hack to unconditionally scan the cntbt if maxlen extents are available
(up to mxr[0] records, similar to your patch, to avoid excessive
scans). The xfs_alloc_find_best_extent() patch alone didn't have much
of a noticeable effect, but that is an isolated optimization and I'm
only doing coarse measurements atm that probably hide it in the noise.

The write workload improves quite a bit with the addition of the cntbt
change. Both throughput (via iostat 60s intervals) and fs_mark
files/sec change from a slow high/low sweeping behavior to much more
consistent and faster results. I see a sweep between 3-30 MB/s and
~30-250 f/sec change to a much more consistent 27-39 MB/s and ~200-300
f/s.

A 5 minute tracepoint sample consists of 100% xfs_alloc_near_first
events, which means we never fell back to the bnobt based search. I'm
not sure the mxr thing is the right approach necessarily, I just
wanted something quick that would demonstrate the potential upside
gains without going off the rails.

One related concern I have with restricting the locality of the search
too much, for example, is that we use NEAR_BNO allocs for other things
like inode allocation locality that might not be represented in this
simple write only workload.

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
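Brian doesn't post his hack, but the shape of such a bounded by-size
scan is easy to sketch (hypothetical code: the helper name, the linear
walk standing in for btree cursor movement, and the mxr[0]-style cap
are all stand-ins, not the patch he actually ran):

#include <stdlib.h>

struct frec { int bno; int len; };	/* by-size (cntbt) record order */

/*
 * Walk records that satisfy maxlen, keep the candidate closest to
 * the target agbno, and give up after 'limit' records -- the cap
 * Brian describes was m_alloc_mxr[0], ~505 on this filesystem.
 */
static int bounded_cntbt_scan(const struct frec *r, int n, int start,
			      int maxlen, int agbno, int limit)
{
	int best = -1, bestdiff = 0;

	for (int i = start; i < n && limit-- > 0; i++) {
		int diff;

		if (r[i].len < maxlen)
			continue;	/* too small, keep walking */
		diff = abs(r[i].bno - agbno);
		if (best < 0 || diff < bestdiff) {
			best = i;
			bestdiff = diff;
		}
	}
	return best;	/* index of closest maxlen candidate, or -1 */
}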
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-26 13:03 ` Brian Foster
@ 2018-10-27 3:16 ` Dave Chinner
2018-10-28 14:09 ` Brian Foster
0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2018-10-27 3:16 UTC (permalink / raw)
To: Brian Foster; +Cc: Mao Cheng, linux-xfs

On Fri, Oct 26, 2018 at 09:03:35AM -0400, Brian Foster wrote:
> On Fri, Oct 26, 2018 at 12:03:44PM +1100, Dave Chinner wrote:
> > [context snipped]
>
> Ok, so I ran fs_mark against the metadump with your patch and a quick
> hack to unconditionally scan the cntbt if maxlen extents are
> available (up to mxr[0] records, similar to your patch, to avoid
> excessive scans). The xfs_alloc_find_best_extent() patch alone didn't
> have much of a noticeable effect, but that is an isolated
> optimization and I'm only doing coarse measurements atm that probably
> hide it in the noise.
>
> The write workload improves quite a bit with the addition of the
> cntbt change. Both throughput (via iostat 60s intervals) and fs_mark
> files/sec change from a slow high/low sweeping behavior to much more
> consistent and faster results. I see a sweep between 3-30 MB/s and
> ~30-250 f/sec change to a much more consistent 27-39 MB/s and
> ~200-300 f/s.

That looks really promising. :)

> A 5 minute tracepoint sample consists of 100% xfs_alloc_near_first
> events, which means we never fell back to the bnobt based search. I'm
> not sure the mxr thing is the right approach necessarily, I just
> wanted something quick that would demonstrate the potential upside
> gains without going off the rails.

*nod*. it's a good first approximation, though. The inobt limits
search to ~10 inobt records left and right (~1200 nearest inodes)
and if there were none free it allocated a new chunk.

The records in the by-size tree have a secondary sort order of
by-bno, so we know that as we walk the records of the same size,
we'll get closer to the target we want.

Hmmm. I wonder.

xfs_cntbt_key_diff() discriminates first by size, and then if size
matches by startblock.

But xfs_alloc_ag_vextent_near() does this:

	if ((error = xfs_alloc_lookup_ge(cnt_cur, 0, args->maxlen, &i)))
		goto error0;

It sets the startblock of the target lookup to be 0, which means it
will always find the extent closest to the start of the AG of the
same size or larger. IOWs, it looks for size without considering
locality.

But the lookup can do both. If we change that to be:

	error = xfs_alloc_lookup_ge(cnt_cur, args->agbno, args->maxlen, &i);

it will actually find the first block of size >= args->maxlen and
>= args->agbno. IOWs, it does the majority of the locality search
for us. All we need to do is check that the extent on the left side
of the returned extent is closer to the target than the extent that
is returned.....

Which makes me wonder - can we just get rid of that first algorithm
(the lastblock search for largest) and replace it with a simple
lookup for args->agbno, args->maxlen + args->alignment?

That way we'll get an extent that will be big enough for alignment
to succeed if alignment is required, big enough to fit if alignment
is not required, or if nothing is found, we can then do a <= to find
the first extent smaller and closest to the target:

	error = xfs_alloc_lookup_le(cnt_cur, args->agbno, args->maxlen, &i);

If the record returned has a length = args->maxlen, then we've
got the physically closest exact match and we should use it.

If the record is shorter than args->maxlen but > args->minlen, then
there are no extents large enough for maxlen, and we should check
if the right side record is closer to the target and select between
the two.

And that can replace the entirety of xfs_alloc_ag_vextent_near.

The existing "nothing >= maxlen" case that goes to
xfs_alloc_ag_vextent_small() selects the largest free extent at the
highest block number (rightmost record) and so ignores locality.
The xfs_alloc_lookup_le() lookup does the same thing, except it
also picks the physically closest extent of the largest size. That's
an improvement.

The lastblock case currently selects the physically closest largest
free extent in the block that will fit the (aligned) length we
require. We can get really close to that with a >= (args->agbno,
args->maxlen + args->alignment) lookup

And the <= (args->agbno, args->maxlen) finds us the largest,
closest free extent that we can check against minlen and return
that...

IOWs, something like this, which is a whole lot simpler and faster
than the linear searches we do now and should return much better
candidate extents:

	check_left = true;

	/* look for size and closeness */
	error = xfs_alloc_lookup_ge(cnt_cur, args->agbno,
				    args->maxlen + alignment, &i);
	if (error)
		goto error0;

	if (!i) {
		/*
		 * nothing >= target. Search for best <= instead so
		 * we get the largest closest match to what we are
		 * asking for.
		 */
		error = xfs_alloc_lookup_le(cnt_cur, args->agbno,
					    args->maxlen + alignment, &i);
		if (error)
			goto error0;

		/* try off the free list */
		if (!i) {
			error = xfs_alloc_ag_vextent_small(args, cnt_cur,
							&ltbno, &ltlen, &i);
			......
			return ....;
		}
		check_left = false;
	}

	/* best candidate found, get the record and check adjacent */
	error = xfs_alloc_get_rec(cnt_cur, &bno, &len, &i)
	....

	if (check_left) {
		xfs_btree_increment(cnt_cur)
		xfs_alloc_get_rec(cnt_cur, &altbno, &altlen, &i)
		....
	} else {
		xfs_btree_decrement(cnt_cur)
		xfs_alloc_get_rec(cnt_cur, &altbno, &altlen, &i)
		....
	}

	/* align results if required */

	/* check against minlen */

	/* compute distance diff to target */

	/* select best extent, fixup trees */

	/* return best extent */

> One related concern I have with restricting
> the locality of the search too much, for example, is that we use
> NEAR_BNO allocs for other things like inode allocation locality that
> might not be represented in this simple write only workload.

Inode chunk allocation sets minlen == maxlen, in which case we
should run your new by-size search to exhaustion and never hit the
existing second algorithm. (i.e. don't bound it if minlen == maxlen)
i.e. your new algorithm should get the same result as the existing
code, but much faster because it's only searching extents we know
can satisfy the allocation requirements. The proposed algorithm
above would be even faster :P

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-27 3:16 ` Dave Chinner
@ 2018-10-28 14:09 ` Brian Foster
2018-10-29 0:17 ` Dave Chinner
0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2018-10-28 14:09 UTC (permalink / raw)
To: Dave Chinner; +Cc: Mao Cheng, linux-xfs

On Sat, Oct 27, 2018 at 02:16:07PM +1100, Dave Chinner wrote:
> On Fri, Oct 26, 2018 at 09:03:35AM -0400, Brian Foster wrote:
> > [context snipped]
> >
> > A 5 minute tracepoint sample consists of 100% xfs_alloc_near_first
> > events, which means we never fell back to the bnobt based search.
> > I'm not sure the mxr thing is the right approach necessarily, I
> > just wanted something quick that would demonstrate the potential
> > upside gains without going off the rails.
>
> *nod*. it's a good first approximation, though. The inobt limits
> search to ~10 inobt records left and right (~1200 nearest inodes)
> and if there were none free it allocated a new chunk.
>
> The records in the by-size tree have a secondary sort order of
> by-bno, so we know that as we walk the records of the same size,
> we'll get closer to the target we want.
>

Yeah, this occurred to me while poking at the state of the cntbt trees
of the metadump. I was thinking specifically about whether we could
use that to optimize the existing algorithm a bit. For example, if we
skip the lastblock logic and do find many maxlen extents, use the
agbno of the first record to avoid sifting through the entire set. If
the agbno was > the requested agbno, for example, we could probably
end the search right there...

> Hmmm. I wonder.
>
> xfs_cntbt_key_diff() discriminates first by size, and then if size
> matches by startblock.
>
> But xfs_alloc_ag_vextent_near() does this:
>
>	if ((error = xfs_alloc_lookup_ge(cnt_cur, 0, args->maxlen, &i)))
>		goto error0;
>
> It sets the startblock of the target lookup to be 0, which means it
> will always find the extent closest to the start of the AG of the
> same size or larger. IOWs, it looks for size without considering
> locality.
>

... but I wasn't aware we could do this. ;)

> But the lookup can do both. If we change that to be:
>
>	error = xfs_alloc_lookup_ge(cnt_cur, args->agbno, args->maxlen, &i);
>
> it will actually find the first block of size >= args->maxlen and
> >= args->agbno. IOWs, it does the majority of the locality search
> for us. All we need to do is check that the extent on the left side
> of the returned extent is closer to the target than the extent that
> is returned.....
>
> Which makes me wonder - can we just get rid of that first algorithm
> (the lastblock search for largest) and replace it with a simple
> lookup for args->agbno, args->maxlen + args->alignment?
>
> [...]
>
> And that can replace the entirety of xfs_alloc_ag_vextent_near.
>

My general sense to this point, from the code and your feedback about
the priorities of the algorithm, is that the fundamental problem here
is that the scope of the first algorithm is simply too narrow and the
second/fallback algorithm is too expensive.

At minimum, I think applying agbno lookups to the cntbt lookup as you
describe here allows us to incorporate more locality into the first
(maxlen) algorithm and widen its scope accordingly. This sounds like a
great starting point to me. The tradeoff may be that we don't get the
locality benefit of all > maxlen sized extents since the agbno part of
the lookup is secondary to the size, but that may be a fine tradeoff
if the benefit is that we can use the first/faster algorithm for a
whole lot more cases.

I'm actually curious if doing that as a first step to open up the
first algo to _all_ maxlen cases has a noticeable effect on this
workload. If so, that could be a nice intermediate step to avoid
paying the penalty for the "too many >= maxlen extents" case before
rewriting the broader algorithm. The remaining effort is then to
absorb the second algorithm into the first such that the second is
eventually no longer necessary.

> The existing "nothing >= maxlen" case that goes to
> xfs_alloc_ag_vextent_small() selects the largest free extent at the
> highest block number (rightmost record) and so ignores locality.
> The xfs_alloc_lookup_le() lookup does the same thing, except it
> also picks the physically closest extent of the largest size. That's
> an improvement.
>

That looks like another point where the first algorithm bails out too
quickly. We find !maxlen extents, decrement to get the next (highest
agbno) smallest record, then go on to check it for alignment and
whatnot without any consideration for any other records that may exist
of the same size. Unless I'm missing something, the fact that we
decrement in the _small() case and jump back to the incrementing
algorithm of the caller pretty much ensures we'll only consider one
such record before going into the fan out search.

> The lastblock case currently selects the physically closest largest
> free extent in the block that will fit the (aligned) length we
> require. We can get really close to that with a >= (args->agbno,
> args->maxlen + args->alignment) lookup
>
> And the <= (args->agbno, args->maxlen) finds us the largest,
> closest free extent that we can check against minlen and return
> that...
>
> IOWs, something like this, which is a whole lot simpler and faster
> than the linear searches we do now and should return much better
> candidate extents:
>
> [sketch snipped]
>

This all sounds pretty reasonable to me. I need to think more about
the details. I.e., whether we'd still want/need to fall back to a
worst case scan in certain cases, which may not be a problem if the
first algorithm is updated to find extents in almost all cases instead
of being limited to when there are a small number of maxlen extents.

I'm also wondering if we could enhance this further by repeating the
agbno lookup at the top for the next smallest extent size in the
min/max range when no suitable extent is found. Then perhaps the high
level algorithm is truly simplified to find the "largest available
extent with best locality for that size."

> > One related concern I have with restricting
> > the locality of the search too much, for example, is that we use
> > NEAR_BNO allocs for other things like inode allocation locality
> > that might not be represented in this simple write only workload.
>
> Inode chunk allocation sets minlen == maxlen, in which case we
> should run your new by-size search to exhaustion and never hit the
> existing second algorithm. (i.e. don't bound it if minlen == maxlen)
> i.e. your new algorithm should get the same result as the existing
> code, but much faster because it's only searching extents we know
> can satisfy the allocation requirements. The proposed algorithm
> above would be even faster :P
>

Yes, that makes sense. The search can potentially be made simpler in
that case. Also note that minlen == maxlen for most of the data extent
allocations in the test case I ran as well. It looks like the
xfs_bmap_btalloc*() path can use the longest free extent in the AG to
set a starting minlen, and then we loosen constraints over multiple
allocation attempts if the first happens to fail.

Anyways, I'll play around with this more next week. Thanks for all of
the thoughts.

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-28 14:09 ` Brian Foster
@ 2018-10-29 0:17 ` Dave Chinner
2018-10-29 9:53 ` Brian Foster
0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2018-10-29 0:17 UTC (permalink / raw)
To: Brian Foster; +Cc: Mao Cheng, linux-xfs

On Sun, Oct 28, 2018 at 10:09:08AM -0400, Brian Foster wrote:
> On Sat, Oct 27, 2018 at 02:16:07PM +1100, Dave Chinner wrote:
> > [context snipped]
> >
> > The records in the by-size tree have a secondary sort order of
> > by-bno, so we know that as we walk the records of the same size,
> > we'll get closer to the target we want.
> >
>
> Yeah, this occurred to me while poking at the state of the cntbt
> trees of the metadump. I was thinking specifically about whether we
> could use that to optimize the existing algorithm a bit. For example,
> if we skip the lastblock logic and do find many maxlen extents, use
> the agbno of the first record to avoid sifting through the entire
> set. If the agbno was > the requested agbno, for example, we could
> probably end the search right there...

I hadn't considered that, but yes, that would also shorten the second
algorithm significantly in that case.

> > Hmmm. I wonder.
> >
> > xfs_cntbt_key_diff() discriminates first by size, and then if size
> > matches by startblock.
> >
> > But xfs_alloc_ag_vextent_near() does this:
> >
> >	if ((error = xfs_alloc_lookup_ge(cnt_cur, 0, args->maxlen, &i)))
> >		goto error0;
> >
> > It sets the startblock of the target lookup to be 0, which means it
> > will always find the extent closest to the start of the AG of the
> > same size or larger. IOWs, it looks for size without considering
> > locality.
> >
>
> ... but I wasn't aware we could do this. ;)

It's a side effect of the way xfs_cntbt_key_diff() calculates the
distance. It will binary search on the size (by returning +/- diff
based on size), but when the size matches (i.e. diff == 0), it then
allows the binary search to continue by returning +/- based on the
startblock diff.

i.e. it will find either the first extent of the closest size match
(no locality) or the closest physical locality match of the desired
size....
> > > > That way we'll get an extent that will be big enough for alignment > > to succeed if alignment is required, big enough to fit if alignment > > is not required, or if nothing is found, we can then do a <= to find > > the first extent smaller and closest to the target: > > > > error = xfs_alloc_lookup_le(cnt_cur, args->agbno, args->maxlen, &i); > > > > If the record returned has a length = args->maxlen, then we've > > got the physically closest exact match and we should use it. > > > > if the record is shorter than args->maxlen but > args->minlen, then > > there are no extents large enough for maxlen, then we should check > > if the right side record is closer to target and select between the > > two. > > > > And that can replace the entirity of xfs_alloc_ag_vextent_near. > > My general sense to this point from the code and your feedback about the > priority of the algorithm is the fundamental problem here is that the > scope of the first algorithm is simply too narrow and the > second/fallback algorithm too expensive. A good summary :) > At minimum, I think applying > agbno lookups to the cntbt lookup as you describe here allows us to > incorporate more locality into the first (maxlen) algorithm and widen > its scope accordingly. This sounds like a great starting point to me. > The tradeoff may be that we don't get the locality benefit of all > > maxlen sized extents since the agbno part of the lookup is secondary to > the size, but that may be a fine tradeoff if the benefit is we can use > the first/faster algorithm for a whole lot more cases. > > I'm actually curious if doing that as a first step to open up the first > algo to _all_ maxlen cases has a noticeable effect on this workload. If > so, that could be a nice intermediate step to avoid paying the penalty > for the "too many >= maxlen extents" case before rewriting the broader > algorithm. The remaining effort is then to absorb the second algorithm > into the first such that the second is eventually no longer necessary. Yes, that seems like a sensible first experiment to perform. > > The existing "nothing >= maxlen" case that goes to > > xfs_alloc_ag_vextent_small() selects the largest free extent at the > > highest block number (right nost record) and so ignores locality. > > The xfs_alloc_lookup_le() lookup does they same thing, except it > > also picks the physically closest extent of the largest size. That's > > an improvement. > > > > That looks like another point where the first algorithm bails out too > quickly as well. We find !maxlen extents, decrement to get the next > (highest agbno) smallest record, then go on to check it for alignment > and whatnot without any consideration for any other records that may > exist of the same size. Not quite.... > Unless I'm missing something, the fact that we > decrement in the _small() case and jump back to the incrementing > algorithm of the caller pretty much ensures we'll only consider one such > record before going into the fan out search. In that case the _small() allocation finds a candidate it sets ltlen which triggers the "reset search from start of block" case in the first algorithm. This walks from the start of the last block to the first extent >= minlen in the last block and then begins the search from there. So it does consider all the candidate free extents in the last block large enough to be valid in the _small() case. But, yes, it may not be considering them all. 
> > The lastblock case currently selects the physically closest largest
> > free extent in the block that will fit the (aligned) length we
> > require. We can get really close to that with a >= (args->agbno,
> > args->maxlen + args->alignment) lookup.
> >
> > And the <= (args->agbno, args->maxlen) finds us the largest,
> > closest free extent that we can check against minlen and return
> > that...
> >
> > IOWs, something like this, which is a whole lot simpler and faster
> > than the linear searches we do now and should return much better
> > candidate extents:
> >
>
> This all sounds pretty reasonable to me. I need to think more about the
> details, i.e., whether we'd still want/need to fall back to a worst case
> scan in certain cases, which may not be a problem if the first algorithm
> is updated to find extents in almost all cases instead of being limited
> to when there are a small number of maxlen extents.

I think we can avoid the brute force by-bno search by being smart
with a minlen by-size search in the worst case.

> I'm also wondering if we could enhance this further by repeating the
> agbno lookup at the top for the next smallest extent size in the min/max
> range when no suitable extent is found. Then perhaps the high level
> algorithm is truly simplified to find the "largest available extent with
> best locality for that size."

We get that automatically from a <= lookup on the by-size tree. i.e:

> > 	check_left = true;
> >
> > 	/* look for size and closeness */
> > 	error = xfs_alloc_lookup_ge(cnt_cur, args->agbno,
> > 				args->maxlen + alignment, &i);
> > 	if (error)
> > 		goto error0;
> >
> > 	if (!i) {
> > 		/*
> > 		 * nothing >= target. Search for best <= instead so
> > 		 * we get the largest closest match to what
> > 		 * we are asking for.
> > 		 */
> > 		error = xfs_alloc_lookup_le(cnt_cur, args->agbno,
> > 				args->maxlen + alignment, &i);

This search here will find the largest extent size that is <=
maxlen + alignment. It will either match the size and be the closest
physical extent, or it will be the largest smaller extent size
available (w/ no physical locality implied).

If we then want the physically closest extent of that largest size,
then we have to grab the length out of the record and redo the
lookup with that exact size so the by-cnt diff matches and it falls
through to checking the block numbers in the records.

I suspect what we want here is an xfs_btree_lookup_from() style of
operation that doesn't completely restart the btree search. The
cursor already holds the path we searched to get to the current
block, so the optimal search method here is to walk back up the
current path to find the point where there might be ptrs to a
different block and rerun the search from there. i.e. look at the
parent block to see if the adjacent records indicate the search might
span multiple blocks at the current level, and restart the binary
search at the level where multiple ptrs to the same extent size are
found. This means we don't have to binary search from the top of the
tree down if we've already got a cursor that points to the first
extent of the candidate size...

> > Inode chunk allocation sets minlen == maxlen, in which case we
> > should run your new by-size search to exhaustion and never hit the
> > existing second algorithm. (i.e. don't bound it if minlen == maxlen)
> > i.e. your new algorithm should get the same result as the existing
> > code, but much faster because it's only searching extents we know
> > can satisfy the allocation requirements.
> > The proposed algorithm above would be even faster. :P
>
> Yes, that makes sense. The search can potentially be made simpler in
> that case. Also note that minlen == maxlen for most of the data extent
> allocations in the test case I ran as well. It looks like the
> xfs_bmap_btalloc*() path can use the longest free extent in the AG to
> set a starting minlen, and then we loosen constraints over multiple
> allocation attempts if the first attempts happen to fail.

*nod*

FWIW, xfs_alloc_ag_vextent_size() is effectively a minlen == maxlen
allocation. It will either return a maxlen freespace (aligned if
necessary) or fail.

However, now that I look, it does not consider physical locality at
all - it's just a "take the first extent of a size large enough to fit
the desired size" algorithm, and it also falls back to a linear
search of extents >= the size target until it finds an extent that
aligns correctly.

I think this follows from how xfs_alloc_ag_vextent_size() is used -
it's used for the XFS_ALLOCTYPE_THIS_AG policy, which means "find the
first extent of at least maxlen in this AG". So it's really just a
special case of a near allocation where the physical locality target
is the start of the AG. i.e.:
XFS_ALLOCTYPE_NEAR_BNO w/ {minlen == maxlen, args->agbno == 0}

IOWs, I suspect we need to step back for a moment and consider how
we should refactor this code, because the different algorithms are
currently a result of separation by high level policy rather than
low level allocation selection requirements, and as such we have
some semi-duplication of functionality in the algorithms that could
be removed. Right now it seems the policies fall into these
categories:

1. size >= maxlen, nearest bno 0 (_size)
2. exact bno, size >= minlen (_exact)
3. size >= maxlen, nearest bno target (_near, algorithm 1)
4. size >= minlen, nearest bno target (_near, algorithm 2)

Case 1 is the same as case 3 but with a nearest bno target of 0.
Case 2 is the same as case 4 except it fails if the nearest bno
target is not an exact match. Seems like a lot of scope for
factoring and simplification here - there's only two free extent
selection algorithms here, not 4....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
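(Pulling the snippets in the message above together, the proposed
lookup flow might look roughly like the fragment below. This is a
hedged sketch, not a patch from the thread: error handling is elided,
fbno/flen/i are assumed declared, the out_no_space label is invented,
and only xfs_alloc_lookup_ge(), xfs_alloc_lookup_le() and
xfs_alloc_get_rec() are the real helpers named in the discussion.)

	/* one lookup finds size and locality together */
	error = xfs_alloc_lookup_ge(cnt_cur, args->agbno,
				    args->maxlen + alignment, &i);
	if (!i) {
		/*
		 * Nothing >= the request. A <= lookup returns the
		 * largest smaller extent size, with no locality
		 * implied.
		 */
		error = xfs_alloc_lookup_le(cnt_cur, args->agbno,
					    args->maxlen + alignment, &i);
		if (!i)
			goto out_no_space;	/* no free extents at all */

		/*
		 * Redo the lookup with the exact size found so the
		 * by-cnt diff matches and the search falls through to
		 * the startblock comparison, i.e. it returns the
		 * physically closest extent of that size.
		 */
		error = xfs_alloc_get_rec(cnt_cur, &fbno, &flen, &i);
		error = xfs_alloc_lookup_le(cnt_cur, args->agbno, flen, &i);
	}

	/*
	 * Finally, check whether the record to the left of the cursor
	 * is physically closer to args->agbno than the one returned,
	 * and pick the better of the two (subject to minlen and
	 * alignment checks).
	 */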
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
  2018-10-29  0:17 ` Dave Chinner
@ 2018-10-29  9:53 ` Brian Foster
  0 siblings, 0 replies; 17+ messages in thread
From: Brian Foster @ 2018-10-29 9:53 UTC (permalink / raw)
To: Dave Chinner; +Cc: Mao Cheng, linux-xfs

On Mon, Oct 29, 2018 at 11:17:39AM +1100, Dave Chinner wrote:
> On Sun, Oct 28, 2018 at 10:09:08AM -0400, Brian Foster wrote:
> > On Sat, Oct 27, 2018 at 02:16:07PM +1100, Dave Chinner wrote:
> > > On Fri, Oct 26, 2018 at 09:03:35AM -0400, Brian Foster wrote:
> > > > On Fri, Oct 26, 2018 at 12:03:44PM +1100, Dave Chinner wrote:
> > > > > On Thu, Oct 25, 2018 at 09:21:30AM -0400, Brian Foster wrote:
> > > > > > On Thu, Oct 25, 2018 at 09:35:23AM +1100, Dave Chinner wrote:
> > > > > > > On Wed, Oct 24, 2018 at 08:09:27AM -0400, Brian Foster wrote:
...
> > > The records in the by-size tree have a secondary sort order of
> > > by-bno, so we know that as we walk the records of the same size,
> > > we'll get closer to the target we want.
> > >
> >
> > Yeah, this occurred to me while poking at the state of the cntbt trees
> > of the metadump. I was thinking specifically about whether we could use
> > that to optimize the existing algorithm a bit. For example, if we skip
> > the lastblock logic and do find many maxlen extents, use the agbno of
> > the first record to avoid sifting through the entire set. If the agbno
> > was > the requested agbno, for example, we could probably end the search
> > right there...
>
> I hadn't considered that, but yes, that would also shorten the
> second algorithm significantly in that case.
>
> > > Hmmm. I wonder.
> > >
> > > xfs_cntbt_key_diff() discriminates first by size, and then if size
> > > matches by startblock.
> > >
> > > But xfs_alloc_ag_vextent_near() does this:
> > >
> > > 	if ((error = xfs_alloc_lookup_ge(cnt_cur, 0, args->maxlen, &i)))
> > > 		goto error0;
> > >
> > > It sets the startblock of the target lookup to be 0, which means it
> > > will always find the extent closest to the start of the AG of the
> > > same size or larger. IOWs, it looks for size without considering
> > > locality.
> > >
> >
> > ... but I wasn't aware we could do this. ;)
>
> It's a side effect of the way xfs_cntbt_key_diff() calculates
> the distance. It will binary search on the size (by returning +/-
> diff based on size), but when the size matches (i.e. diff == 0), it
> then allows the binary search to continue by returning +/- based on
> the startblock diff.
>
> i.e. it will find either the first extent of a closest size match
> (no locality) or the closest physical locality match of the desired
> size....
>

Yep, makes sense.

> > > But the lookup can do both. If we change that to be:
> > >
> > > 	error = xfs_alloc_lookup_ge(cnt_cur, args->agbno, args->maxlen, &i);
> > >
> > > it will actually find the first block of size >= args->maxlen and
> > > >= args->agbno. IOWs, it does the majority of the locality search
> > > for us. All we need to do is check that the extent on the left side
> > > of the returned extent is closer to the target than the extent that is
> > > returned.....
> > >
> > > Which makes me wonder - can we just get rid of that first algorithm
> > > (the lastblock search for largest) and replace it with a simple
> > > lookup for args->agbno, args->maxlen + args->alignment?
> > >
> > > That way we'll get an extent that will be big enough for alignment
> > > to succeed if alignment is required, big enough to fit if alignment
> > > is not required, or if nothing is found, we can then do a <= to find
> > > the first extent smaller and closest to the target:
> > >
> > > 	error = xfs_alloc_lookup_le(cnt_cur, args->agbno, args->maxlen, &i);
> > >
> > > If the record returned has a length = args->maxlen, then we've
> > > got the physically closest exact match and we should use it.
> > >
> > > If the record is shorter than args->maxlen but > args->minlen, then
> > > there are no extents large enough for maxlen, and we should check
> > > if the right side record is closer to the target and select between
> > > the two.
> > >
> > > And that can replace the entirety of xfs_alloc_ag_vextent_near().
>
...
> > > The existing "nothing >= maxlen" case that goes to
> > > xfs_alloc_ag_vextent_small() selects the largest free extent at the
> > > highest block number (right most record) and so ignores locality.
> > > The xfs_alloc_lookup_le() lookup does the same thing, except it
> > > also picks the physically closest extent of the largest size. That's
> > > an improvement.
> > >
> >
> > That looks like another point where the first algorithm bails out too
> > quickly as well. We find !maxlen extents, decrement to get the next
> > (highest agbno) smallest record, then go on to check it for alignment
> > and whatnot without any consideration for any other records that may
> > exist of the same size.
>
> Not quite....
>
> > Unless I'm missing something, the fact that we
> > decrement in the _small() case and jump back to the incrementing
> > algorithm of the caller pretty much ensures we'll only consider one such
> > record before going into the fan out search.
>
> In that case, when the _small() allocation finds a candidate, it sets
> ltlen, which triggers the "reset search from start of block" case in
> the first algorithm. This walks from the start of the last block to
> the first extent >= minlen in the last block and then begins the
> search from there. So it does consider all the candidate free extents
> in the last block large enough to be valid in the _small() case.
>

Oops, right. I glossed over that hunk when looking at the _small()
alloc path. Never mind.

Side note: the ->bc_ptrs hacks in this code are pretty nasty.

> But, yes, it may not be considering them all.
>
> > > The lastblock case currently selects the physically closest largest
> > > free extent in the block that will fit the (aligned) length we
> > > require. We can get really close to that with a >= (args->agbno,
> > > args->maxlen + args->alignment) lookup.
> > >
> > > And the <= (args->agbno, args->maxlen) finds us the largest,
> > > closest free extent that we can check against minlen and return
> > > that...
> > >
> > > IOWs, something like this, which is a whole lot simpler and faster
> > > than the linear searches we do now and should return much better
> > > candidate extents:
> > >
> >
> > This all sounds pretty reasonable to me. I need to think more about the
> > details, i.e., whether we'd still want/need to fall back to a worst case
> > scan in certain cases, which may not be a problem if the first algorithm
> > is updated to find extents in almost all cases instead of being limited
> > to when there are a small number of maxlen extents.
>
> I think we can avoid the brute force by-bno search by being smart
> with a minlen by-size search in the worst case.
>

Ok.
The thought below does seem to imply that we could reuse the same
algorithm to find progressively smaller extents in the !maxlen case
rather than fall back to the purely locality based search, which is
somewhat of an incoherent logic/priority transition.

> > I'm also wondering if we could enhance this further by repeating the
> > agbno lookup at the top for the next smallest extent size in the min/max
> > range when no suitable extent is found. Then perhaps the high level
> > algorithm is truly simplified to find the "largest available extent with
> > best locality for that size."
>
> We get that automatically from a <= lookup on the by-size tree. i.e:
>

Yep..

> > > 	check_left = true;
> > >
> > > 	/* look for size and closeness */
> > > 	error = xfs_alloc_lookup_ge(cnt_cur, args->agbno,
> > > 				args->maxlen + alignment, &i);
> > > 	if (error)
> > > 		goto error0;
> > >
> > > 	if (!i) {
> > > 		/*
> > > 		 * nothing >= target. Search for best <= instead so
> > > 		 * we get the largest closest match to what
> > > 		 * we are asking for.
> > > 		 */
> > > 		error = xfs_alloc_lookup_le(cnt_cur, args->agbno,
> > > 				args->maxlen + alignment, &i);
>
> This search here will find the largest extent size that is <=
> maxlen + alignment. It will either match the size and be the closest
> physical extent, or it will be the largest smaller extent size
> available (w/ no physical locality implied).
>
> If we then want the physically closest extent of that largest size,
> then we have to grab the length out of the record and redo the
> lookup with that exact size so the by-cnt diff matches and it falls
> through to checking the block numbers in the records.
>

Pretty much the same high-level idea, either way, given the current
lookup implementation...

> I suspect what we want here is an xfs_btree_lookup_from() style of
> operation that doesn't completely restart the btree search. The
> cursor already holds the path we searched to get to the current
> block, so the optimal search method here is to walk back up the
> current path to find the point where there might be ptrs to a
> different block and rerun the search from there. i.e. look at the
> parent block to see if the adjacent records indicate the search might
> span multiple blocks at the current level, and restart the binary
> search at the level where multiple ptrs to the same extent size are
> found. This means we don't have to binary search from the top of the
> tree down if we've already got a cursor that points to the first
> extent of the candidate size...
>

... but this sounds like a nice idea too.

> > > Inode chunk allocation sets minlen == maxlen, in which case we
> > > should run your new by-size search to exhaustion and never hit the
> > > existing second algorithm. (i.e. don't bound it if minlen == maxlen)
> > > i.e. your new algorithm should get the same result as the existing
> > > code, but much faster because it's only searching extents we know
> > > can satisfy the allocation requirements. The proposed algorithm
> > > above would be even faster :P
> >
> > Yes, that makes sense. The search can potentially be made simpler in
> > that case. Also note that minlen == maxlen for most of the data extent
> > allocations in the test case I ran as well. It looks like the
> > xfs_bmap_btalloc*() path can use the longest free extent in the AG to
> > set a starting minlen, and then we loosen constraints over multiple
> > allocation attempts if the first attempts happen to fail.
>
> *nod*
>
> FWIW, xfs_alloc_ag_vextent_size() is effectively a minlen == maxlen
> allocation.
> It will either return a maxlen freespace (aligned if
> necessary) or fail.
>
> However, now that I look, it does not consider physical locality at
> all - it's just a "take the first extent of a size large enough to fit
> the desired size" algorithm, and it also falls back to a linear
> search of extents >= the size target until it finds an extent that
> aligns correctly.
>
> I think this follows from how xfs_alloc_ag_vextent_size() is used -
> it's used for the XFS_ALLOCTYPE_THIS_AG policy, which means "find the
> first extent of at least maxlen in this AG". So it's really just a
> special case of a near allocation where the physical locality target
> is the start of the AG. i.e.:
> XFS_ALLOCTYPE_NEAR_BNO w/ {minlen == maxlen, args->agbno == 0}
>
> IOWs, I suspect we need to step back for a moment and consider how
> we should refactor this code, because the different algorithms are
> currently a result of separation by high level policy rather than
> low level allocation selection requirements, and as such we have
> some semi-duplication of functionality in the algorithms that could
> be removed. Right now it seems the policies fall into these
> categories:
>
> 1. size >= maxlen, nearest bno 0 (_size)
> 2. exact bno, size >= minlen (_exact)
> 3. size >= maxlen, nearest bno target (_near, algorithm 1)
> 4. size >= minlen, nearest bno target (_near, algorithm 2)
>
> Case 1 is the same as case 3 but with a nearest bno target of 0.
> Case 2 is the same as case 4 except it fails if the nearest bno
> target is not an exact match. Seems like a lot of scope for
> factoring and simplification here - there's only two free extent
> selection algorithms here, not 4....
>

That's a very interesting point. Perhaps the right longer term approach
is not necessarily to rewrite xfs_alloc_ag_vextent_near(), but to start
on a new generic allocator implementation that can ultimately absorb
the functionality of each independent algorithm and reduce the amount
of code required in the process.

Thanks.

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
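(To make the two-algorithm observation concrete: the four policies
above could be expressed as a single hypothetical parameter block.
This is an illustration only - the structure and field names are
invented, not an existing kernel interface:)

/* hypothetical unified policy - not an existing XFS structure */
struct alloc_policy {
	xfs_agblock_t	target_bno;	/* locality target within the AG */
	xfs_extlen_t	minlen;		/* smallest acceptable extent */
	xfs_extlen_t	maxlen;		/* preferred extent size */
	bool		exact_bno;	/* fail unless target_bno matches */
};

/*
 * Mapping of the four cases onto the one structure:
 *
 *   1. _size:           target_bno = 0,   size floor = maxlen
 *   2. _exact:          target_bno = bno, size floor = minlen, exact_bno
 *   3. _near (algo 1):  target_bno = bno, size floor = maxlen
 *   4. _near (algo 2):  target_bno = bno, size floor = minlen
 *
 * Cases 1 and 3 differ only in the locality target; cases 2 and 4
 * differ only in whether a locality miss is fatal. That leaves two
 * selection algorithms: "best locality among size >= maxlen" and
 * "best locality among size >= minlen".
 */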
end of thread, other threads: [~2018-10-29 18:41 UTC | newest]

Thread overview: 17+ messages -- links below jump to the message on this page:
2018-10-23  7:56 xfs_alloc_ag_vextent_near() takes about 30ms to complete Mao Cheng
2018-10-23 14:53 ` Brian Foster
2018-10-24  3:01   ` Mao Cheng
2018-10-24  4:34     ` Dave Chinner
2018-10-24  9:02       ` Mao Cheng
2018-10-24 12:11         ` Brian Foster
2018-10-25  4:01           ` Mao Cheng
2018-10-25 14:55             ` Brian Foster
2018-10-24 12:09   ` Brian Foster
2018-10-24 22:35     ` Dave Chinner
2018-10-25 13:21       ` Brian Foster
2018-10-26  1:03         ` Dave Chinner
2018-10-26 13:03           ` Brian Foster
2018-10-27  3:16             ` Dave Chinner
2018-10-28 14:09               ` Brian Foster
2018-10-29  0:17                 ` Dave Chinner
2018-10-29  9:53                   ` Brian Foster