* xfs_alloc_ag_vextent_near() takes about 30ms to complete
@ 2018-10-23 7:56 Mao Cheng
From: Mao Cheng @ 2018-10-23 7:56 UTC
To: linux-xfs
Sorry for the trouble again. I wrote the wrong function name in my
previous message, so I am resending it. If you received the previous
email, please ignore it. Thanks.
We have an XFS filesystem created with the mkfs "-k" option and mounted
with the default options (rw,relatime,attr2,inode64,noquota). It is
about 2.2TB in size and exported via Samba.
[root@test1 home]# xfs_info /dev/sdk
meta-data=/dev/sdk isize=512 agcount=4, agsize=131072000 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=0 spinodes=0
data = bsize=4096 blocks=524288000, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal bsize=4096 blocks=256000, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Free space for each allocation group (AG 0 through AG 3, in order):

AG 0:
from to extents blocks pct
1 1 9 9 0.00
2 3 14291 29124 0.19
4 7 5689 22981 0.15
8 15 119 1422 0.01
16 31 754657 15093035 99.65
32 63 1 33 0.00
total free extents 774766
total free blocks 15146604
average free extent size 19.5499

AG 1:
from to extents blocks pct
1 1 253 253 0.00
2 3 7706 16266 0.21
4 7 7718 30882 0.39
8 15 24 296 0.00
16 31 381976 7638130 96.71
32 63 753 38744 0.49
131072 262143 1 173552 2.20
total free extents 398431
total free blocks 7898123
average free extent size 19.8231

AG 2:
from to extents blocks pct
1 1 370 370 0.00
2 3 2704 5775 0.01
4 7 1016 4070 0.01
8 15 24 254 0.00
16 31 546614 10931743 20.26
32 63 19191 1112600 2.06
64 127 2 184 0.00
131072 262143 1 163713 0.30
524288 1048575 2 1438626 2.67
1048576 2097151 4 5654463 10.48
2097152 4194303 1 3489060 6.47
4194304 8388607 2 12656637 23.46
16777216 33554431 1 18502975 34.29
total free extents 569932
total free blocks 53960470
average free extent size 94.6788

AG 3:
from to extents blocks pct
1 1 8 8 0.00
2 3 5566 11229 0.06
4 7 9622 38537 0.21
8 15 57 686 0.00
16 31 747242 14944852 80.31
32 63 570 32236 0.17
2097152 4194303 1 3582074 19.25
total free extents 763066
total free blocks 18609622
average free extent size 24.38
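
These per-AG histograms match the output of xfs_db's "freesp" command;
a sketch of how they can be regenerated read-only (same device as above):

    # print a free space histogram for each of the four allocation groups
    for ag in 0 1 2 3; do xfs_db -r -c "freesp -a $ag" /dev/sdk; done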
We copy small files (about 150KB each) from Windows to the XFS
filesystem via the SMB protocol. Sometimes a kworker process consumes
100% of one CPU, and "perf top" shows xfs_extent_busy_trim() and
xfs_btree_increment() consuming too much CPU; ftrace also shows
xfs_alloc_ag_vextent_near() taking about 30ms to complete.
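
For reference, a per-call latency like that 30ms figure can be captured
with the function_graph tracer; a minimal sketch (assuming debugfs is
mounted at /sys/kernel/debug and the symbol is traceable on this kernel):

    cd /sys/kernel/debug/tracing
    echo xfs_alloc_ag_vextent_near > set_graph_function
    echo function_graph > current_tracer
    echo 1 > tracing_on; sleep 10; echo 0 > tracing_on
    less trace   # closing-brace lines show the per-call duration in us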
In addition, all tests were performed on CentOS 7.4
(3.10.0-693.el7.x86_64).
Any suggestions are welcome.

* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete

From: Brian Foster @ 2018-10-23 14:53 UTC
To: Mao Cheng; +Cc: linux-xfs

On Tue, Oct 23, 2018 at 03:56:51PM +0800, Mao Cheng wrote:
> We have an XFS filesystem created with the mkfs "-k" option and mounted
> with the default options (rw,relatime,attr2,inode64,noquota). It is
> about 2.2TB in size and exported via Samba.
>
> [...]
>
> Free space for each allocation group:
> [...]

So it looks like free space in 3 out of 4 AGs is mostly fragmented to
16-31 block extents. Those same AGs appear to have a much higher number
(~15k-20k) of even smaller extents.

> We copy small files (about 150KB each) from Windows to the XFS
> filesystem via the SMB protocol. Sometimes a kworker process consumes
> 100% of one CPU, and "perf top" shows xfs_extent_busy_trim() and
> xfs_btree_increment() consuming too much CPU; ftrace also shows
> xfs_alloc_ag_vextent_near() taking about 30ms to complete.
>

This is kind of a vague performance report. Some process consumes a full
CPU and this is a problem for some (??) reason given unknown CPU and
unknown storage (with unknown amount of RAM). I assume that kworker task
is writeback, but you haven't really specified that either.

xfs_alloc_ag_vextent_near() is one of the several block allocation
algorithms in XFS. That function itself includes a couple of different
algorithms for the "near" allocation based on the state of the AG. One
looks like an intra-block search of the by-size free space btree (if not
many suitably sized extents are available) and the second looks like an
outward sweep of the by-block free space btree to find a suitably sized
extent. I could certainly see the latter taking some time for certain
sized allocation requests under fragmented free space conditions. If you
wanted more detail on what's going on here, I'd suggest capturing a
sample of the xfs_alloc* (and perhaps xfs_extent_busy*) tracepoints
during the workload.

That aside, it's probably best to step back and describe for the list
the overall environment, workload and performance problem you observed
that caused this level of digging in the first place. For example, has
throughput degraded over time? Has latency increased? How many writers
are active at once? Is preallocation involved (I thought Samba/Windows
triggered it in certain cases, but I don't recall)?

Brian

> In addition, all tests were performed on CentOS 7.4
> (3.10.0-693.el7.x86_64).
>
> Any suggestions are welcome.
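
As a hedged illustration of that suggestion, a tracepoint sample could be
gathered with trace-cmd along these lines (the event globs match the
xfs_alloc*/xfs_extent_busy* tracepoints named above; the 30-second window
is arbitrary):

    # record XFS allocation and busy-extent tracepoints while the copy runs
    trace-cmd record -e 'xfs:xfs_alloc*' -e 'xfs:xfs_extent_busy*' sleep 30
    trace-cmd report > alloc-trace.txt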

* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete

From: Mao Cheng @ 2018-10-24 3:01 UTC
To: bfoster; +Cc: linux-xfs

Hi Brian,
Thanks for your response.

Brian Foster <bfoster@redhat.com> wrote on Tuesday, Oct 23, 2018 at 10:53 PM:
> On Tue, Oct 23, 2018 at 03:56:51PM +0800, Mao Cheng wrote:
> [...]
>
> This is kind of a vague performance report. Some process consumes a full
> CPU and this is a problem for some (??) reason given unknown CPU and
> unknown storage (with unknown amount of RAM). I assume that kworker task
> is writeback, but you haven't really specified that either.

Yes, the kworker task is writeback, and the storage is the XFS-formatted
disk that is also the target we copy files to.

> [...]
>
> That aside, it's probably best to step back and describe for the list
> the overall environment, workload and performance problem you observed
> that caused this level of digging in the first place. For example, has
> throughput degraded over time? Has latency increased? How many writers
> are active at once? Is preallocation involved (I thought Samba/Windows
> triggered it in certain cases, but I don't recall)?

We share an XFS filesystem to Windows via the SMB protocol.
There are about 5 Windows clients copying small files to the Samba share
at the same time.
The main problem is that the throughput degrades from 30MB/s to around
10KB/s periodically and recovers about 5s later.
The kworker consumes 100% of one CPU when the throughput degrades, and
the kworker task is writeback.
/proc/vmstat shows nr_dirty is very close to nr_dirty_threshold and
nr_writeback is too small (does that mean there are too many dirty pages
in the page cache that can't be written out to disk?).

Mao

> Brian
> [...]
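
The counters described above can be sampled directly while a stall is in
progress; a minimal sketch (counter names as they appear in /proc/vmstat
on this kernel):

    # watch the dirty/writeback counters once per second during a stall
    watch -n 1 'grep -E "nr_dirty|nr_writeback" /proc/vmstat'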

* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete

From: Dave Chinner @ 2018-10-24 4:34 UTC
To: Mao Cheng; +Cc: bfoster, linux-xfs

On Wed, Oct 24, 2018 at 11:01:13AM +0800, Mao Cheng wrote:
> [...]
> > > AG 0:
> > > from to extents blocks pct
> > > 1 1 9 9 0.00
> > > 2 3 14291 29124 0.19
> > > 4 7 5689 22981 0.15
> > > 8 15 119 1422 0.01
> > > 16 31 754657 15093035 99.65

750,000 fragmented free extents means something like 1600 btree
leaf blocks to hold them all.....

> > xfs_alloc_ag_vextent_near() is one of the several block allocation
> > algorithms in XFS. That function itself includes a couple of different
> > algorithms for the "near" allocation based on the state of the AG. One
> > looks like an intra-block search of the by-size free space btree (if not
> > many suitably sized extents are available) and the second looks like an
> > outward sweep of the by-block free space btree to find a suitably sized
> > extent.

Yup, just like the free inode allocation search, which is capped
at about 10 btree blocks left and right to prevent searching the
entire tree for the one free inode that remains in it.

The problem here is that the first algorithm fails immediately
because there isn't a contiguous free space large enough for the
allocation being requested, and so it finds the largest block whose
/location/ is less than the target block as the start point for the
nearest largest freespace.

IOW, we do an expanding radius size search based on physical
locality rather than finding a free space based on size. Once we
find a good extent to either the left or right, we then stop that
search and try to find a better extent in the other direction
(xfs_alloc_find_best_extent()). That search is not bounded, so it can
search the rest of the tree in that remaining direction without
finding a better match.

We can't cut the initial left/right search shorter - we've got to
find a large enough free extent to succeed, but we can chop
xfs_alloc_find_best_extent() short, similar to searchdistance in
xfs_dialloc_ag_inobt(). The patch below does that.

Really, though, I think what we need is a better size based search
before falling back to a locality based search. This is more
complex, so not a few minutes work, and requires a lot more thought
and testing.

> We share an XFS filesystem to Windows via the SMB protocol.
> There are about 5 Windows clients copying small files to the Samba
> share at the same time.
> The main problem is that the throughput degrades from 30MB/s to around
> 10KB/s periodically and recovers about 5s later.
> The kworker consumes 100% of one CPU when the throughput degrades, and
> the kworker task is writeback.
> /proc/vmstat shows nr_dirty is very close to nr_dirty_threshold and
> nr_writeback is too small (does that mean there are too many dirty
> pages in the page cache that can't be written out to disk?).

Incoming writes are throttled at the rate writeback makes progress,
hence the system will sit at the threshold. This is normal.
Writeback is just slow because of the freespace fragmentation in the
filesystem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


xfs: cap search distance in xfs_alloc_ag_vextent_near()

From: Dave Chinner <dchinner@redhat.com>

Don't waste too much CPU time finding the perfect free extent when
we don't have a large enough contiguous free space and there are
many, many small free spaces that we'd do a linear search through.
Modelled on searchdistance in xfs_dialloc_ag_inobt() which solved
the same problem with the cost of finding the last free inodes in
the inode allocation btree.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index e1c0c0d2f1b0..c0c0a018e3bb 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -886,8 +886,14 @@ xfs_alloc_ag_vextent_exact(
 }
 
 /*
- * Search the btree in a given direction via the search cursor and compare
- * the records found against the good extent we've already found.
+ * Search the btree in a given direction via the search cursor and compare the
+ * records found against the good extent we've already found.
+ *
+ * We cap this search to a number of records to prevent searching hundreds of
+ * thousands of records in a potentially futile search for a larger freespace
+ * when free space is really badly fragmented. Spending more CPU time than the
+ * IO cost of a sub-optimal allocation is a bad tradeoff - cap it at searching
+ * a full btree block (~500 records on a 4k block size fs).
 */
STATIC int
xfs_alloc_find_best_extent(
@@ -906,6 +912,7 @@ xfs_alloc_find_best_extent(
 	int			error;
 	int			i;
 	unsigned		busy_gen;
+	int			searchdistance = args->mp->m_alloc_mxr[0];
 
 	/* The good extent is perfect, no need to search. */
 	if (!gdiff)
@@ -963,7 +970,7 @@ xfs_alloc_find_best_extent(
 		error = xfs_btree_decrement(*scur, 0, &i);
 		if (error)
 			goto error0;
-	} while (i);
+	} while (i && searchdistance-- > 0);
 
 out_use_good:
 	xfs_btree_del_cursor(*scur, XFS_BTREE_NOERROR);
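
For context on the "~500 records" number in that comment: each free space
btree record is 8 bytes (a 4-byte startblock plus a 4-byte blockcount),
and on a v5 (CRC) filesystem like the one reported here a short-form
btree block carries a 56-byte header (the header length is taken from the
usual on-disk format constants), so on a 4k block size filesystem:

    m_alloc_mxr[0] = (4096 - 56) / 8 = 505 records per leaf block

A non-CRC filesystem uses a 16-byte header instead, giving 510 records;
either way the cap amounts to roughly one full leaf block of records.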

* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete

From: Mao Cheng @ 2018-10-24 9:02 UTC
To: david; +Cc: bfoster, linux-xfs

Hi,

Dave Chinner <david@fromorbit.com> wrote on Wednesday, Oct 24, 2018 at 12:34 PM:
> On Wed, Oct 24, 2018 at 11:01:13AM +0800, Mao Cheng wrote:
> [...]
>
> Incoming writes are throttled at the rate writeback makes progress,
> hence the system will sit at the threshold. This is normal.
> Writeback is just slow because of the freespace fragmentation in the
> filesystem.

Does running xfs_fsr periodically alleviate this problem?
And is it advisable to run xfs_fsr regularly to reduce the
fragmentation in XFS filesystems?

Regards,

Mao.

> Cheers,
>
> Dave.
> [...]
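
As a side note on telling the two kinds of fragmentation apart, both can
be inspected read-only with xfs_db; a sketch using the device from
earlier in the thread:

    xfs_db -r -c frag /dev/sdk          # file (extent) fragmentation factor
    xfs_db -r -c 'freesp -s' /dev/sdk   # free space histogram and summary

xfs_fsr addresses the first number; the stalls in this thread come from
the second.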

* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete

From: Brian Foster @ 2018-10-24 12:11 UTC
To: Mao Cheng; +Cc: david, linux-xfs

On Wed, Oct 24, 2018 at 05:02:11PM +0800, Mao Cheng wrote:
> [...]
> > Incoming writes are throttled at the rate writeback makes progress,
> > hence the system will sit at the threshold. This is normal.
> > Writeback is just slow because of the freespace fragmentation in the
> > filesystem.
>
> Does running xfs_fsr periodically alleviate this problem?
> And is it advisable to run xfs_fsr regularly to reduce the
> fragmentation in XFS filesystems?
>

I think xfs_fsr is more likely to contribute to this problem than
alleviate it. xfs_fsr defragments files whereas the problem here is
fragmentation of free space.

Could you determine whether Dave's patch helps with performance at all?

Also, would you be able to share a metadump of this filesystem?

Brian

> [...]

* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete

From: Mao Cheng @ 2018-10-25 4:01 UTC
To: bfoster; +Cc: david, linux-xfs

Brian Foster <bfoster@redhat.com> wrote on Wednesday, Oct 24, 2018 at 8:11 PM:
> On Wed, Oct 24, 2018 at 05:02:11PM +0800, Mao Cheng wrote:
> [...]
>
> I think xfs_fsr is more likely to contribute to this problem than
> alleviate it. xfs_fsr defragments files whereas the problem here is
> fragmentation of free space.
>
> Could you determine whether Dave's patch helps with performance at all?

I will test the patch later.

> Also, would you be able to share a metadump of this filesystem?

The metadump has been uploaded to Google Drive; the link is as follows:
https://drive.google.com/open?id=1RLekC-BnbAujXDl-xZ-vteMudrl2xC9D

Thanks,

Mao

> Brian
> [...]
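
For reference, a shareable image like that is typically produced with
xfs_metadump, which copies only metadata (no file data) and obfuscates
filenames by default; a sketch (target paths are placeholders):

    xfs_metadump -g /dev/sdk sdk.metadump   # -g shows progress
    xfs_mdrestore sdk.metadump sdk.img      # restore into a sparse image
    mount -o loop sdk.img /mnt/img          # mount the image for analysis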

* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete

From: Brian Foster @ 2018-10-25 14:55 UTC
To: Mao Cheng; +Cc: david, linux-xfs

On Thu, Oct 25, 2018 at 12:01:06PM +0800, Mao Cheng wrote:
> [...]
> > Also, would you be able to share a metadump of this filesystem?
>
> The metadump has been uploaded to Google Drive; the link is as follows:
> https://drive.google.com/open?id=1RLekC-BnbAujXDl-xZ-vteMudrl2xC9D
>

Thanks. xfs_alloc_ag_vextent_near() shows up as a hot path if I restore
this and throw an fs_mark small file workload at it. Some observations..

A trace of xfs_alloc_near* events over a 5 minute period shows a
breakdown like the following:

    513 xfs_alloc_near_first:
   8102 xfs_alloc_near_greater:
    180 xfs_alloc_near_lesser:

If I re-mkfs the restored image and run the same workload, I end up with
(to no real surprise):

  61561 xfs_alloc_near_first:

So clearly we are falling back to that second algorithm most of the
time. Most of these lesser/greater allocs have minlen == maxlen == 38
blocks and occur mostly split between AG 0 and AG 2. Looking at the
(initial) per-ag free space summary:

# for i in $(seq 0 3); do xfs_db -c "freesp -a $i" /mnt/img ; done
from to extents blocks pct
1 1 9 9 0.00
2 3 14243 28983 0.06
4 7 10595 42615 0.09
8 15 123 1468 0.00
16 31 862232 17244531 37.66
32 63 126968 7364144 16.08
64 127 1 88 0.00
16384 32767 1 30640 0.07
131072 262143 1 131092 0.29
524288 1048575 3 1835043 4.01
1048576 2097151 2 2883584 6.30
2097152 4194303 3 10456093 22.84
4194304 8388607 1 5767201 12.60
from to extents blocks pct
1 1 8 8 0.00
2 3 6557 13115 0.08
4 7 7710 30844 0.18
8 15 26 320 0.00
16 31 393039 7859395 47.08
32 63 250 9568 0.06
8388608 16777215 1 8780040 52.60
from to extents blocks pct
1 1 2418 2418 0.01
2 3 6126 16025 0.06
4 7 1052 4263 0.02
8 15 84 998 0.00
16 31 873095 17461168 62.84
32 63 35224 2042992 7.35
4194304 8388607 1 8259469 29.72
from to extents blocks pct
1 1 258 258 0.00
2 3 7951 16550 0.06
4 7 10484 42007 0.14
8 15 68 827 0.00
16 31 864835 17296959 57.85
32 63 58 2700 0.01
1048576 2097151 1 1310993 4.38
4194304 8388607 2 11228101 37.55

We can see that both AGs 0 and 2 have many likely >= 38 block extents,
but each also has a significant number of < 32 block extents. The first
bit can contribute to skipping the cntbt algorithm, the second bit
leaves a proverbial minefield of too small extents that the second
algorithm may have to sift through. AGs 1 and 3 look like they have
decent (< 32 block) fragmentation, but both at least start with a much
smaller number of 32+ block extents and thus increased odds of finding
an extent in the cntbt. That said, I'm not seeing any (38 block, near)
allocation requests in these AGs at all for some reason. Perhaps my
trace window was too small to catch any..

Brian

> [...]
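
A breakdown like the above can be reproduced by counting tracepoint
hits; one hedged sketch (the grep pattern assumes event names appear
with a trailing colon in the report output, as in the counts shown
above):

    trace-cmd record -e 'xfs:xfs_alloc_near*' sleep 300
    trace-cmd report | grep -o 'xfs_alloc_near[a-z_]*:' | sort | uniq -c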
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-24 4:34 ` Dave Chinner
2018-10-24 9:02 ` Mao Cheng
@ 2018-10-24 12:09 ` Brian Foster
2018-10-24 22:35 ` Dave Chinner
1 sibling, 1 reply; 17+ messages in thread
From: Brian Foster @ 2018-10-24 12:09 UTC (permalink / raw)
To: Dave Chinner; +Cc: Mao Cheng, linux-xfs

On Wed, Oct 24, 2018 at 03:34:16PM +1100, Dave Chinner wrote:
> On Wed, Oct 24, 2018 at 11:01:13AM +0800, Mao Cheng wrote:
> > Hi Brian,
> > Thanks for your response.
> > Brian Foster <bfoster@redhat.com> wrote on Tue, Oct 23, 2018 at 10:53 PM:
> > >
> > > On Tue, Oct 23, 2018 at 03:56:51PM +0800, Mao Cheng wrote:
> > > > [xfs_info output and per-AG free space histograms snipped; see
> > > > the original report. The first AG's histogram ends with:]
> > > > 16 31 754657 15093035 99.65
>
> 750,000 fragmented free extents means something like 1600 btree
> leaf blocks to hold them all.....
>
> > > xfs_alloc_ag_vextent_near() is one of the several block allocation
> > > algorithms in XFS. That function itself includes a couple different
> > > algorithms for the "near" allocation based on the state of the AG. One
> > > looks like an intra-block search of the by-size free space btree (if not
> > > many suitably sized extents are available) and the second looks like an
> > > outward sweep of the by-block free space btree to find a suitably sized
> > > extent.
>
> Yup, just like the free inode allocation search, which is capped
> at about 10 btree blocks left and right to prevent searching the
> entire tree for the one free inode that remains in it.
>
> The problem here is that the first algorithm fails immediately
> because there isn't a contiguous free space large enough for the
> allocation being requested, and so it finds the largest block whose
> /location/ is less than the target block as the start point for the
> nearest largest freespace.
>

Hmm, not sure I follow what you're saying here wrt why we end up in
the second algorithm. I was thinking the most likely condition is that
there are actually plenty of suitably sized extents, but as shown by
the free space data, they're amidst a huge number of too small extents.
The first algorithm is only active if a lookup in the cntbt lands in
the last block (the far right leaf) of the btree, so unless I'm missing
something this would mean we'd skip right past it to the second
algorithm if the last N blocks (where N > 1) of the cntbt have large
enough extents.

IOW, the first algo seems like an optimization for when we know there
are only a small number of minimum sized extents available and the
second (location based) algorithm would mostly churn. Regardless, we
end up in the same place in the end...

> IOW, we do an expanding radius size search based on physical
> locality rather than finding a free space based on size. Once we
> find a good extent to either the left or right, we then stop that
> search and try to find a better extent in the other direction
> (xfs_alloc_find_best_extent()). That search is not bound, so it can
> search the entirety of the tree in the remaining direction without
> finding a better match.
>
> We can't cut the initial left/right search shorter - we've got to
> find a large enough free extent to succeed, but we can chop
> xfs_alloc_find_best_extent() short, similar to searchdistance in
> xfs_dialloc_ag_inobt(). The patch below does that.
>

This search looks like it goes as far in the opposite direction as the
current candidate extent. So I take it this could potentially go off
the rails if we find one suitable extent in one direction that is
relatively far off wrt startblock, and then the opposite direction
happens to be populated with a ton of too small extents before we
extend to the range that breaks the search.

I'm curious whether this contributes to the reporter's problem at all,
but this sounds like a reasonable change to me either way.

> Really, though, I think what we need is a better size based search
> before falling back to a locality based search. This is more
> complex, so not a few minutes work, and requires a lot more thought
> and testing.
>

Indeed. As noted above, the current size based search strikes me as an
optimization that only executes under particular conditions. Since the
purpose of this function is locality allocation, I'm wondering if we
could implement a smarter location based search using information
available in the by-size tree. For example, suppose we could identify
the closest minimally sized extents to agbno in order to better seed
the left/right starting points of the location based search. This of
course would require careful heuristics/tradeoffs to make sure we
don't just replace a bnobt scan with a cntbt scan.

Or perhaps do something like locate the range of cntbt extents within
the min/max size, then make an intelligent decision over whether to
scan that set of cntbt records, perform a smarter bnobt scan or resort
to the current algorithm. Just thinking out loud...

Brian

> > We share an xfs filesystem to Windows via the SMB protocol.
> > There are about 5 Windows clients copying small files to the samba
> > share at the same time.
> > The main problem is the throughput degraded from 30MB/s to around
> > 10KB/s periodically and recovered about 5s later.
> > The kworker consumes 100% of one CPU when the throughput degrades,
> > and the kworker task is writeback.
> > /proc/vmstat shows nr_dirty is very close to nr_dirty_threshold
> > and nr_writeback is too small (does that mean there are too many
> > dirty pages in the page cache that can't be written out to disk?)
>
> incoming writes are throttled at the rate writeback makes progress,
> hence the system will sit at the threshold. This is normal.
> Writeback is just slow because of the freespace fragmentation in the
> filesystem.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
>
> xfs: cap search distance in xfs_alloc_ag_vextent_near()
>
> From: Dave Chinner <dchinner@redhat.com>
>
> Don't waste too much CPU time finding the perfect free extent when
> we don't have a large enough contiguous free space and there are
> many, many small free spaces that we'd do a linear search through.
> Modelled on searchdistance in xfs_dialloc_ag_inobt(), which solved
> the same problem with the cost of finding the last free inodes in
> the inode allocation btree.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index e1c0c0d2f1b0..c0c0a018e3bb 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -886,8 +886,14 @@ xfs_alloc_ag_vextent_exact(
>  }
>
>  /*
> - * Search the btree in a given direction via the search cursor and compare
> - * the records found against the good extent we've already found.
> + * Search the btree in a given direction via the search cursor and compare the
> + * records found against the good extent we've already found.
> + *
> + * We cap this search to a number of records to prevent searching hundreds of
> + * thousands of records in a potentially futile search for a larger freespace
> + * when free space is really badly fragmented. Spending more CPU time than the
> + * IO cost of a sub-optimal allocation is a bad tradeoff - cap it at searching
> + * a full btree block (~500 records on a 4k block size fs).
>   */
>  STATIC int
>  xfs_alloc_find_best_extent(
> @@ -906,6 +912,7 @@ xfs_alloc_find_best_extent(
>  	int			error;
>  	int			i;
>  	unsigned		busy_gen;
> +	int			searchdistance = args->mp->m_alloc_mxr[0];
>
>  	/* The good extent is perfect, no need to search. */
>  	if (!gdiff)
> @@ -963,7 +970,7 @@ xfs_alloc_find_best_extent(
>  		error = xfs_btree_decrement(*scur, 0, &i);
>  		if (error)
>  			goto error0;
> -	} while (i);
> +	} while (i && searchdistance-- > 0);
>
>  out_use_good:
>  	xfs_btree_del_cursor(*scur, XFS_BTREE_NOERROR);

^ permalink raw reply	[flat|nested] 17+ messages in thread
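The "~500 records on a 4k block size fs" figure in the patch comment can
be cross-checked from the free space btree record geometry. A
back-of-envelope sketch (the 56-byte header is my assumption of the v5
short-form btree block header size for a crc=1 filesystem; it is not
stated in the thread):

#include <stdio.h>

int main(void)
{
	/*
	 * Rough reconstruction of m_alloc_mxr[0] for this fs: a free
	 * space btree leaf record is a (startblock, blockcount) pair
	 * of 32-bit values, and the header size below is an assumed
	 * v5 short-form btree block header.
	 */
	int blocksize = 4096;
	int hdrsize = 56;	/* assumption: v5 short-form header */
	int recsize = 2 * 4;	/* two __be32 fields per record */

	/* prints 505, i.e. the "~500 record" cap in the patch */
	printf("records per leaf: %d\n", (blocksize - hdrsize) / recsize);
	return 0;
}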
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-24 12:09 ` Brian Foster
@ 2018-10-24 22:35 ` Dave Chinner
2018-10-25 13:21 ` Brian Foster
0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2018-10-24 22:35 UTC (permalink / raw)
To: Brian Foster; +Cc: Mao Cheng, linux-xfs

On Wed, Oct 24, 2018 at 08:09:27AM -0400, Brian Foster wrote:
> On Wed, Oct 24, 2018 at 03:34:16PM +1100, Dave Chinner wrote:
> > [heavily quoted context snipped]
> >
> > The problem here is that the first algorithm fails immediately
> > because there isn't a contiguous free space large enough for the
> > allocation being requested, and so it finds the largest block whose
> > /location/ is less than the target block as the start point for the
> > nearest largest freespace.
> >
>
> Hmm, not sure I follow what you're saying here wrt why we end up in
> the second algorithm.

I probably didn't explain it well. I wrote it quickly and didn't
proofread it. What I meant was "...because there isn't a contiguous
free space large enough for the allocation requested to land in the
last btree block, and so...."

> I was thinking the most likely condition is that
> there are actually plenty of suitably sized extents, but as shown by
> the free space data, they're amidst a huge number of too small
> extents. The

Yup, if we do a by-size lookup for >= 22 blocks and there are a
1000 free 22 block extents, the lookup doesn't land in the last
block. Straight to physical locality search.

> first algorithm is only active if a lookup in the cntbt lands in the
> last block (the far right leaf) of the btree, so unless I'm missing
> something this would mean we'd skip right past it to the second
> algorithm if the last N blocks (where N > 1) of the cntbt have large
> enough extents.

*nod*.

> IOW, the first algo seems like an optimization for when we know there
> are only a small number of minimum sized extents available and the
> second (location based) algorithm would mostly churn. Regardless, we
> end up in the same place in the end...
>
> > IOW, we do an expanding radius size search based on physical
> > locality rather than finding a free space based on size. Once we
> > find a good extent to either the left or right, we then stop that
> > search and try to find a better extent in the other direction
> > (xfs_alloc_find_best_extent()). That search is not bound, so it can
> > search the entirety of the tree in the remaining direction without
> > finding a better match.
> >
> > We can't cut the initial left/right search shorter - we've got to
> > find a large enough free extent to succeed, but we can chop
> > xfs_alloc_find_best_extent() short, similar to searchdistance in
> > xfs_dialloc_ag_inobt(). The patch below does that.
> >
>
> This search looks like it goes as far in the opposite direction as
> the current candidate extent. So I take it this could potentially go
> off the rails if we find one suitable extent in one direction that is
> relatively far off wrt startblock, and then the opposite direction
> happens to be populated with a ton of too small extents before we
> extend to the range that breaks the search.

Precisely.

> I'm curious whether this contributes to the reporter's problem at
> all, but this sounds like a reasonable change to me either way.

So am I. It's the low hanging fruit - we have to search until we
find the first candidate block (no choice in that) but we can choose
to terminate the "is there a better choice" search.

> > Really, though, I think what we need is a better size based search
> > before falling back to a locality based search. This is more
> > complex, so not a few minutes work, and requires a lot more thought
> > and testing.
> >
>
> Indeed. As noted above, the current size based search strikes me as
> an optimization that only executes under particular conditions.

It's the common condition in a typical filesystem - if there are
large, contiguous free spaces in the filesystem, then the lookup will
almost always land in the last block of the btree.

> Since the purpose of this function is locality allocation,

Well, locality is the /second/ consideration - the first algorithm
prioritises maxlen for contiguous allocation, then selects the best
candidate by locality. The second algorithm prioritises locality
over allocation length.

> I'm wondering
> if we could implement a smarter location based search using
> information available in the by-size tree. For example, suppose we
> could identify the closest minimally sized extents to agbno in order
> to better seed the left/right starting points of the location based
> search. This of course would require careful heuristics/tradeoffs to
> make sure we don't just replace a bnobt scan with a cntbt scan.

I wouldn't bother. I'd just take the "last block" algorithm and make
it search all the >= contiguous free space extents for best locality
before dropping back to the minlen search.

Really, that's what the first algorithm should be. Looking at the
last block and selecting the best free space by size and then
locality is really just a degenerate case of the more general
algorithm.

Back when this algorithm was designed, AGs could only be 4GB in
size, so searching only the last block by size made sense - the
total number of free space extents is fairly well bound by the
AG size. That bound essentially went away with expanding AGs to 1TB,
but the algorithm wasn't changed to reflect that even a small amount
of free space fragmentation could result in almost never hitting the
last block of the btree....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
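Dave's 22-block example can be made concrete with a toy model of the
cntbt (a stand-alone sketch, not kernel code; it assumes fully packed
505-record leaves per the estimate above, and a linear scan stands in
for the btree lookup):

#include <stdio.h>

#define RECS_PER_LEAF 505	/* ~(4096 - 56) / 8, per the estimate above */

/*
 * Toy model of the cntbt: sizes[] holds the free extent lengths in
 * sorted (by-size) record order, packed into full leaves. The "last
 * block" pass of the near allocator only runs when a size >= maxlen
 * lookup lands in the rightmost leaf.
 */
static int ge_lands_in_last_leaf(const int *sizes, int nrecs, int maxlen)
{
	int i;

	for (i = 0; i < nrecs; i++)
		if (sizes[i] >= maxlen)
			break;
	return i < nrecs && nrecs - i <= RECS_PER_LEAF;
}

int main(void)
{
	int sizes[1003];
	int i;

	/* Dave's example: 1000 free extents of exactly 22 blocks... */
	for (i = 0; i < 1000; i++)
		sizes[i] = 22;
	/* ...plus, for contrast, a few large contiguous free spaces */
	for (i = 1000; i < 1003; i++)
		sizes[i] = 200;

	/* 1000 matches span >1 leaf: prints "no", i.e. straight to
	 * the physical locality search */
	printf("maxlen 22 lands in last leaf: %s\n",
	       ge_lands_in_last_leaf(sizes, 1003, 22) ? "yes" : "no");
	/* only 3 matches: prints "yes", the first algorithm runs */
	printf("maxlen 100 lands in last leaf: %s\n",
	       ge_lands_in_last_leaf(sizes, 1003, 100) ? "yes" : "no");
	return 0;
}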
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-24 22:35 ` Dave Chinner
@ 2018-10-25 13:21 ` Brian Foster
2018-10-26 1:03 ` Dave Chinner
0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2018-10-25 13:21 UTC (permalink / raw)
To: Dave Chinner; +Cc: Mao Cheng, linux-xfs

On Thu, Oct 25, 2018 at 09:35:23AM +1100, Dave Chinner wrote:
> On Wed, Oct 24, 2018 at 08:09:27AM -0400, Brian Foster wrote:
> > [context snipped]
> >
> > Indeed. As noted above, the current size based search strikes me as
> > an optimization that only executes under particular conditions.
>
> It's the common condition in a typical filesystem - if there are
> large, contiguous free spaces in the filesystem, then the lookup will
> almost always land in the last block of the btree.
>

I guess that makes sense for a clean fs. I wonder how long that state
persists in practice as usage increases, however. The more small free
extents that are created, the more that decision comes to depend on
the size (or at least maxlen) of the request.

> > Since the purpose of this function is locality allocation,
>
> Well, locality is the /second/ consideration - the first algorithm
> prioritises maxlen for contiguous allocation, then selects the best
> candidate by locality. The second algorithm prioritises locality
> over allocation length.
>

Ah yes, good point. I missed the maxlen/minlen dynamic on my first
read through. So if we have some small number of extents that satisfy
maxlen (as opposed to the [minlen, maxlen] range), we prioritize those
extents and apply locality to that subset. If we have no such extents,
or too many, we move on to the bno-based fan out search. There, we
find the closest extent that satisfies the [minlen, maxlen] request.

> > I'm wondering
> > if we could implement a smarter location based search using
> > information available in the by-size tree. [...] This of course
> > would require careful heuristics/tradeoffs to make sure we don't
> > just replace a bnobt scan with a cntbt scan.
>
> I wouldn't bother. I'd just take the "last block" algorithm and make
> it search all the >= contiguous free space extents for best locality
> before dropping back to the minlen search.
>

Ok, that makes sense. The caveat seems to be, though, that the "last
block" algorithm searches all of the applicable records to discover
the best locality. We could open up this search as such, but if free
space happens to be completely fragmented into >= requested extents,
that could mean every allocation falls into a full cntbt scan where a
bnobt lookup would result in a much faster allocation.

So ISTM that we still need some kind of intelligent heuristic to fall
back to the second algorithm to cover the "too many" case. What
exactly that is may take some more thought, experimentation and
testing.

> Really, that's what the first algorithm should be. Looking at the
> last block and selecting the best free space by size and then
> locality is really just a degenerate case of the more general
> algorithm.
>
> Back when this algorithm was designed, AGs could only be 4GB in
> size, so searching only the last block by size made sense - the
> total number of free space extents is fairly well bound by the
> AG size. That bound essentially went away with expanding AGs to 1TB,
> but the algorithm wasn't changed to reflect that even a small amount
> of free space fragmentation could result in almost never hitting the
> last block of the btree....
>

Ok, thanks for the historical context.

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
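As a concrete picture of that bno-based fan out, here is a stand-alone
toy of the second algorithm (illustrative only: a sorted array stands
in for the bnobt, and the real code drives two btree cursors and then
runs xfs_alloc_find_best_extent() in the remaining direction):

#include <stdio.h>
#include <stdlib.h>

struct ext { int bno; int len; };	/* stand-in for a bnobt record */

/*
 * Toy of the locality search: from the bnobt position closest to the
 * target agbno, sweep left and right until each side finds an extent
 * of at least minlen, then keep the physically closer of the two.
 * A sea of too-small extents around the target is exactly what makes
 * this sweep walk so many records.
 */
static int bno_sweep(const struct ext *e, int n, int agbno, int minlen)
{
	int l, r, best = -1;

	/* start both cursors around the first extent >= agbno */
	for (r = 0; r < n && e[r].bno < agbno; r++)
		;
	l = r - 1;

	while (l >= 0 && e[l].len < minlen)	/* sweep left */
		l--;
	while (r < n && e[r].len < minlen)	/* sweep right */
		r++;

	if (l >= 0)
		best = l;
	if (r < n && (best < 0 ||
	    abs(e[r].bno - agbno) < abs(e[best].bno - agbno)))
		best = r;
	return best;				/* index, or -1 */
}

int main(void)
{
	/* sorted by bno: small extents surround one large extent */
	struct ext e[] = {
		{ 100, 20 }, { 200, 25 }, { 300, 18 }, { 900, 4000 },
	};
	int i = bno_sweep(e, 4, 250, 100);

	if (i >= 0)
		printf("chose extent at bno %d, len %d\n",
		       e[i].bno, e[i].len);
	return 0;
}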
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-25 13:21 ` Brian Foster
@ 2018-10-26 1:03 ` Dave Chinner
2018-10-26 13:03 ` Brian Foster
0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2018-10-26 1:03 UTC (permalink / raw)
To: Brian Foster; +Cc: Mao Cheng, linux-xfs

On Thu, Oct 25, 2018 at 09:21:30AM -0400, Brian Foster wrote:
> On Thu, Oct 25, 2018 at 09:35:23AM +1100, Dave Chinner wrote:
> > On Wed, Oct 24, 2018 at 08:09:27AM -0400, Brian Foster wrote:
> > > I'm wondering
> > > if we could implement a smarter location based search using
> > > information available in the by-size tree. [...]
> >
> > I wouldn't bother. I'd just take the "last block" algorithm and make
> > it search all the >= contiguous free space extents for best locality
> > before dropping back to the minlen search.
> >
>
> Ok, that makes sense. The caveat seems to be, though, that the "last
> block" algorithm searches all of the applicable records to discover
> the best locality. We could open up this search as such, but if free
> space happens to be completely fragmented into >= requested extents,
> that could mean every allocation falls into a full cntbt scan where
> a bnobt lookup would result in a much faster allocation.

Yup, we'll need to bound it so it doesn't do stupid things. :P

> So ISTM that we still need some kind of intelligent heuristic to
> fall back to the second algorithm to cover the "too many" case.
> What exactly that is may take some more thought, experimentation
> and testing.

Yeah, that's the difficulty with making core allocator algorithm
changes - how to characterise the effect of the change. I'm not sure
that's a huge problem in this case, though, because selecting a
matching contig freespace is almost always going to be better for
filesystem longevity and freespace fragmentation resistance than
selecting a shorter freespace and doing lots more small allocations.
It's the 'lots of small allocations' that really makes the freespace
fragmentation spiral out of control, so if we can avoid that until
we've used all the matching contig free spaces we'll be better off
in the long run.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-26 1:03 ` Dave Chinner
@ 2018-10-26 13:03 ` Brian Foster
2018-10-27 3:16 ` Dave Chinner
0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2018-10-26 13:03 UTC (permalink / raw)
To: Dave Chinner; +Cc: Mao Cheng, linux-xfs

On Fri, Oct 26, 2018 at 12:03:44PM +1100, Dave Chinner wrote:
> On Thu, Oct 25, 2018 at 09:21:30AM -0400, Brian Foster wrote:
> > [context snipped]
> >
> > Ok, that makes sense. The caveat seems to be, though, that the
> > "last block" algorithm searches all of the applicable records to
> > discover the best locality. We could open up this search as such,
> > but if free space happens to be completely fragmented into >=
> > requested extents, that could mean every allocation falls into a
> > full cntbt scan where a bnobt lookup would result in a much faster
> > allocation.
>
> Yup, we'll need to bound it so it doesn't do stupid things. :P
>

Yep.

> > So ISTM that we still need some kind of intelligent heuristic to
> > fall back to the second algorithm to cover the "too many" case.
> > What exactly that is may take some more thought, experimentation
> > and testing.
>
> Yeah, that's the difficulty with making core allocator algorithm
> changes - how to characterise the effect of the change. I'm not sure
> that's a huge problem in this case, though, because selecting a
> matching contig freespace is almost always going to be better for
> filesystem longevity and freespace fragmentation resistance than
> selecting a shorter freespace and doing lots more small allocations.
> It's the 'lots of small allocations' that really makes the freespace
> fragmentation spiral out of control, so if we can avoid that until
> we've used all the matching contig free spaces we'll be better off
> in the long run.
>

Ok, so I ran fs_mark against the metadump with your patch and a quick
hack to unconditionally scan the cntbt if maxlen extents are available
(up to mxr[0] records, similar to your patch, to avoid excessive
scans). The xfs_alloc_find_best_extent() patch alone didn't have much
of a noticeable effect, but that is an isolated optimization and I'm
only doing coarse measurements atm that probably hide it in the noise.

The write workload improves quite a bit with the addition of the cntbt
change. Both throughput (via iostat 60s intervals) and fs_mark
files/sec change from a slow high/low sweeping behavior to much more
consistent and faster results. I see a sweep between 3-30 MB/s and
~30-250 f/sec change to a much more consistent 27-39 MB/s and ~200-300
f/s.

A 5 minute tracepoint sample consists of 100% xfs_alloc_near_first
events, which means we never fell back to the bnobt based search. I'm
not sure the mxr thing is the right approach necessarily, I just
wanted something quick that would demonstrate the potential upside
gains without going off the rails.

One related concern I have with restricting the locality of the search
too much, for example, is that we use NEAR_BNO allocs for other things
like inode allocation locality that might not be represented in this
simple write only workload.

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
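Brian doesn't post his hack, but the shape of such a bounded by-size
scan is easy to sketch (hypothetical code: the helper name, the linear
walk standing in for btree cursor movement, and the mxr[0]-style cap
are all stand-ins, not the patch he actually ran):

#include <stdlib.h>

struct frec { int bno; int len; };	/* by-size (cntbt) record order */

/*
 * Walk records that satisfy maxlen, keep the candidate closest to
 * the target agbno, and give up after 'limit' records -- the cap
 * Brian describes was m_alloc_mxr[0], ~505 on this filesystem.
 */
static int bounded_cntbt_scan(const struct frec *r, int n, int start,
			      int maxlen, int agbno, int limit)
{
	int best = -1, bestdiff = 0;

	for (int i = start; i < n && limit-- > 0; i++) {
		int diff;

		if (r[i].len < maxlen)
			continue;	/* too small, keep walking */
		diff = abs(r[i].bno - agbno);
		if (best < 0 || diff < bestdiff) {
			best = i;
			bestdiff = diff;
		}
	}
	return best;	/* index of closest maxlen candidate, or -1 */
}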
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-26 13:03 ` Brian Foster
@ 2018-10-27 3:16 ` Dave Chinner
2018-10-28 14:09 ` Brian Foster
0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2018-10-27 3:16 UTC (permalink / raw)
To: Brian Foster; +Cc: Mao Cheng, linux-xfs

On Fri, Oct 26, 2018 at 09:03:35AM -0400, Brian Foster wrote:
> On Fri, Oct 26, 2018 at 12:03:44PM +1100, Dave Chinner wrote:
> > [context snipped]
>
> Ok, so I ran fs_mark against the metadump with your patch and a quick
> hack to unconditionally scan the cntbt if maxlen extents are
> available (up to mxr[0] records, similar to your patch, to avoid
> excessive scans). The xfs_alloc_find_best_extent() patch alone didn't
> have much of a noticeable effect, but that is an isolated
> optimization and I'm only doing coarse measurements atm that probably
> hide it in the noise.
>
> The write workload improves quite a bit with the addition of the
> cntbt change. Both throughput (via iostat 60s intervals) and fs_mark
> files/sec change from a slow high/low sweeping behavior to much more
> consistent and faster results. I see a sweep between 3-30 MB/s and
> ~30-250 f/sec change to a much more consistent 27-39 MB/s and
> ~200-300 f/s.

That looks really promising. :)

> A 5 minute tracepoint sample consists of 100% xfs_alloc_near_first
> events, which means we never fell back to the bnobt based search. I'm
> not sure the mxr thing is the right approach necessarily, I just
> wanted something quick that would demonstrate the potential upside
> gains without going off the rails.

*nod*. it's a good first approximation, though. The inobt limits
search to ~10 inobt records left and right (~1200 nearest inodes)
and if there were none free it allocated a new chunk.

The records in the by-size tree have a secondary sort order of
by-bno, so we know that as we walk the records of the same size,
we'll get closer to the target we want.

Hmmm. I wonder.

xfs_cntbt_key_diff() discriminates first by size, and then if size
matches by startblock.

But xfs_alloc_ag_vextent_near() does this:

	if ((error = xfs_alloc_lookup_ge(cnt_cur, 0, args->maxlen, &i)))
		goto error0;

It sets the startblock of the target lookup to be 0, which means it
will always find the extent closest to the start of the AG of the
same size or larger. IOWs, it looks for size without considering
locality.

But the lookup can do both. If we change that to be:

	error = xfs_alloc_lookup_ge(cnt_cur, args->agbno, args->maxlen, &i);

it will actually find the first block of size >= args->maxlen and
>= args->agbno. IOWs, it does the majority of the locality search
for us. All we need to do is check that the extent on the left side
of the returned extent is closer to the target than the extent that
is returned.....

Which makes me wonder - can we just get rid of that first algorithm
(the lastblock search for largest) and replace it with a simple
lookup for args->agbno, args->maxlen + args->alignment?

That way we'll get an extent that will be big enough for alignment
to succeed if alignment is required, big enough to fit if alignment
is not required, or if nothing is found, we can then do a <= to find
the first extent smaller and closest to the target:

	error = xfs_alloc_lookup_le(cnt_cur, args->agbno, args->maxlen, &i);

If the record returned has a length = args->maxlen, then we've
got the physically closest exact match and we should use it.

If the record is shorter than args->maxlen but > args->minlen, then
there are no extents large enough for maxlen, and we should check
if the right side record is closer to the target and select between
the two.

And that can replace the entirety of xfs_alloc_ag_vextent_near.

The existing "nothing >= maxlen" case that goes to
xfs_alloc_ag_vextent_small() selects the largest free extent at the
highest block number (rightmost record) and so ignores locality.
The xfs_alloc_lookup_le() lookup does the same thing, except it
also picks the physically closest extent of the largest size. That's
an improvement.

The lastblock case currently selects the physically closest largest
free extent in the block that will fit the (aligned) length we
require. We can get really close to that with a >= (args->agbno,
args->maxlen + args->alignment) lookup

And the <= (args->agbno, args->maxlen) finds us the largest,
closest free extent that we can check against minlen and return
that...

IOWs, something like this, which is a whole lot simpler and faster
than the linear searches we do now and should return much better
candidate extents:

	check_left = true;

	/* look for size and closeness */
	error = xfs_alloc_lookup_ge(cnt_cur, args->agbno,
				    args->maxlen + alignment, &i);
	if (error)
		goto error0;

	if (!i) {
		/*
		 * nothing >= target. Search for best <= instead so
		 * we get the largest closest match to what we are
		 * asking for.
		 */
		error = xfs_alloc_lookup_le(cnt_cur, args->agbno,
					    args->maxlen + alignment, &i);
		if (error)
			goto error0;

		/* try off the free list */
		if (!i) {
			error = xfs_alloc_ag_vextent_small(args, cnt_cur,
							&ltbno, &ltlen, &i);
			......
			return ....;
		}
		check_left = false;
	}

	/* best candidate found, get the record and check adjacent */
	error = xfs_alloc_get_rec(cnt_cur, &bno, &len, &i)
	....

	if (check_left) {
		xfs_btree_increment(cnt_cur)
		xfs_alloc_get_rec(cnt_cur, &altbno, &altlen, &i)
		....
	} else {
		xfs_btree_decrement(cnt_cur)
		xfs_alloc_get_rec(cnt_cur, &altbno, &altlen, &i)
		....
	}

	/* align results if required */

	/* check against minlen */

	/* compute distance diff to target */

	/* select best extent, fixup trees */

	/* return best extent */

> One related concern I have with restricting
> the locality of the search too much, for example, is that we use
> NEAR_BNO allocs for other things like inode allocation locality that
> might not be represented in this simple write only workload.

Inode chunk allocation sets minlen == maxlen, in which case we
should run your new by-size search to exhaustion and never hit the
existing second algorithm. (i.e. don't bound it if minlen == maxlen)
i.e. your new algorithm should get the same result as the existing
code, but much faster because it's only searching extents we know
can satisfy the allocation requirements. The proposed algorithm
above would be even faster :P

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-27 3:16 ` Dave Chinner
@ 2018-10-28 14:09 ` Brian Foster
2018-10-29 0:17 ` Dave Chinner
0 siblings, 1 reply; 17+ messages in thread
From: Brian Foster @ 2018-10-28 14:09 UTC (permalink / raw)
To: Dave Chinner; +Cc: Mao Cheng, linux-xfs

On Sat, Oct 27, 2018 at 02:16:07PM +1100, Dave Chinner wrote:
> On Fri, Oct 26, 2018 at 09:03:35AM -0400, Brian Foster wrote:
> > [context snipped]
> >
> > A 5 minute tracepoint sample consists of 100% xfs_alloc_near_first
> > events, which means we never fell back to the bnobt based search.
> > I'm not sure the mxr thing is the right approach necessarily, I
> > just wanted something quick that would demonstrate the potential
> > upside gains without going off the rails.
>
> *nod*. it's a good first approximation, though. The inobt limits
> search to ~10 inobt records left and right (~1200 nearest inodes)
> and if there were none free it allocated a new chunk.
>
> The records in the by-size tree have a secondary sort order of
> by-bno, so we know that as we walk the records of the same size,
> we'll get closer to the target we want.
>

Yeah, this occurred to me while poking at the state of the cntbt trees
of the metadump. I was thinking specifically about whether we could
use that to optimize the existing algorithm a bit. For example, if we
skip the lastblock logic and do find many maxlen extents, use the
agbno of the first record to avoid sifting through the entire set. If
the agbno was > the requested agbno, for example, we could probably
end the search right there...

> Hmmm. I wonder.
>
> xfs_cntbt_key_diff() discriminates first by size, and then if size
> matches by startblock.
>
> But xfs_alloc_ag_vextent_near() does this:
>
>	if ((error = xfs_alloc_lookup_ge(cnt_cur, 0, args->maxlen, &i)))
>		goto error0;
>
> It sets the startblock of the target lookup to be 0, which means it
> will always find the extent closest to the start of the AG of the
> same size or larger. IOWs, it looks for size without considering
> locality.
>

... but I wasn't aware we could do this. ;)

> But the lookup can do both. If we change that to be:
>
>	error = xfs_alloc_lookup_ge(cnt_cur, args->agbno, args->maxlen, &i);
>
> it will actually find the first block of size >= args->maxlen and
> >= args->agbno. IOWs, it does the majority of the locality search
> for us. All we need to do is check that the extent on the left side
> of the returned extent is closer to the target than the extent that
> is returned.....
>
> Which makes me wonder - can we just get rid of that first algorithm
> (the lastblock search for largest) and replace it with a simple
> lookup for args->agbno, args->maxlen + args->alignment?
>
> [...]
>
> And that can replace the entirety of xfs_alloc_ag_vextent_near.
>

My general sense to this point, from the code and your feedback about
the priorities of the algorithm, is that the fundamental problem here
is that the scope of the first algorithm is simply too narrow and the
second/fallback algorithm is too expensive.

At minimum, I think applying agbno lookups to the cntbt lookup as you
describe here allows us to incorporate more locality into the first
(maxlen) algorithm and widen its scope accordingly. This sounds like a
great starting point to me. The tradeoff may be that we don't get the
locality benefit of all > maxlen sized extents since the agbno part of
the lookup is secondary to the size, but that may be a fine tradeoff
if the benefit is that we can use the first/faster algorithm for a
whole lot more cases.

I'm actually curious if doing that as a first step to open up the
first algo to _all_ maxlen cases has a noticeable effect on this
workload. If so, that could be a nice intermediate step to avoid
paying the penalty for the "too many >= maxlen extents" case before
rewriting the broader algorithm. The remaining effort is then to
absorb the second algorithm into the first such that the second is
eventually no longer necessary.

> The existing "nothing >= maxlen" case that goes to
> xfs_alloc_ag_vextent_small() selects the largest free extent at the
> highest block number (rightmost record) and so ignores locality.
> The xfs_alloc_lookup_le() lookup does the same thing, except it
> also picks the physically closest extent of the largest size. That's
> an improvement.
>

That looks like another point where the first algorithm bails out too
quickly. We find !maxlen extents, decrement to get the next (highest
agbno) smallest record, then go on to check it for alignment and
whatnot without any consideration for any other records that may exist
of the same size. Unless I'm missing something, the fact that we
decrement in the _small() case and jump back to the incrementing
algorithm of the caller pretty much ensures we'll only consider one
such record before going into the fan out search.

> The lastblock case currently selects the physically closest largest
> free extent in the block that will fit the (aligned) length we
> require. We can get really close to that with a >= (args->agbno,
> args->maxlen + args->alignment) lookup
>
> And the <= (args->agbno, args->maxlen) finds us the largest,
> closest free extent that we can check against minlen and return
> that...
>
> IOWs, something like this, which is a whole lot simpler and faster
> than the linear searches we do now and should return much better
> candidate extents:
>
> [sketch snipped]
>

This all sounds pretty reasonable to me. I need to think more about
the details. I.e., whether we'd still want/need to fall back to a
worst case scan in certain cases, which may not be a problem if the
first algorithm is updated to find extents in almost all cases instead
of being limited to when there are a small number of maxlen extents.

I'm also wondering if we could enhance this further by repeating the
agbno lookup at the top for the next smallest extent size in the
min/max range when no suitable extent is found. Then perhaps the high
level algorithm is truly simplified to find the "largest available
extent with best locality for that size."

> > One related concern I have with restricting
> > the locality of the search too much, for example, is that we use
> > NEAR_BNO allocs for other things like inode allocation locality
> > that might not be represented in this simple write only workload.
>
> Inode chunk allocation sets minlen == maxlen, in which case we
> should run your new by-size search to exhaustion and never hit the
> existing second algorithm. (i.e. don't bound it if minlen == maxlen)
> i.e. your new algorithm should get the same result as the existing
> code, but much faster because it's only searching extents we know
> can satisfy the allocation requirements. The proposed algorithm
> above would be even faster :P
>

Yes, that makes sense. The search can potentially be made simpler in
that case. Also note that minlen == maxlen for most of the data extent
allocations in the test case I ran as well. It looks like the
xfs_bmap_btalloc*() path can use the longest free extent in the AG to
set a starting minlen, and then we loosen constraints over multiple
allocation attempts if the first happens to fail.

Anyways, I'll play around with this more next week. Thanks for all of
the thoughts.

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
2018-10-28 14:09 ` Brian Foster
@ 2018-10-29 0:17 ` Dave Chinner
2018-10-29 9:53 ` Brian Foster
0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2018-10-29 0:17 UTC (permalink / raw)
To: Brian Foster; +Cc: Mao Cheng, linux-xfs

On Sun, Oct 28, 2018 at 10:09:08AM -0400, Brian Foster wrote:
> On Sat, Oct 27, 2018 at 02:16:07PM +1100, Dave Chinner wrote:
> > [context snipped]
> >
> > The records in the by-size tree have a secondary sort order of
> > by-bno, so we know that as we walk the records of the same size,
> > we'll get closer to the target we want.
> >
>
> Yeah, this occurred to me while poking at the state of the cntbt
> trees of the metadump. I was thinking specifically about whether we
> could use that to optimize the existing algorithm a bit. For example,
> if we skip the lastblock logic and do find many maxlen extents, use
> the agbno of the first record to avoid sifting through the entire
> set. If the agbno was > the requested agbno, for example, we could
> probably end the search right there...

I hadn't considered that, but yes, that would also shorten the second
algorithm significantly in that case.

> > Hmmm. I wonder.
> >
> > xfs_cntbt_key_diff() discriminates first by size, and then if size
> > matches by startblock.
> >
> > But xfs_alloc_ag_vextent_near() does this:
> >
> >	if ((error = xfs_alloc_lookup_ge(cnt_cur, 0, args->maxlen, &i)))
> >		goto error0;
> >
> > It sets the startblock of the target lookup to be 0, which means it
> > will always find the extent closest to the start of the AG of the
> > same size or larger. IOWs, it looks for size without considering
> > locality.
> >
>
> ... but I wasn't aware we could do this. ;)

It's a side effect of the way xfs_cntbt_key_diff() calculates the
distance. It will binary search on the size (by returning +/- diff
based on size), but when the size matches (i.e. diff == 0), it then
allows the binary search to continue by returning +/- based on the
startblock diff.

i.e. it will find either the first extent of the closest size match
(no locality) or the closest physical locality match of the desired
size....
> > > > That way we'll get an extent that will be big enough for alignment > > to succeed if alignment is required, big enough to fit if alignment > > is not required, or if nothing is found, we can then do a <= to find > > the first extent smaller and closest to the target: > > > > error = xfs_alloc_lookup_le(cnt_cur, args->agbno, args->maxlen, &i); > > > > If the record returned has a length = args->maxlen, then we've > > got the physically closest exact match and we should use it. > > > > if the record is shorter than args->maxlen but > args->minlen, then > > there are no extents large enough for maxlen, then we should check > > if the right side record is closer to target and select between the > > two. > > > > And that can replace the entirity of xfs_alloc_ag_vextent_near. > > My general sense to this point from the code and your feedback about the > priority of the algorithm is the fundamental problem here is that the > scope of the first algorithm is simply too narrow and the > second/fallback algorithm too expensive. A good summary :) > At minimum, I think applying > agbno lookups to the cntbt lookup as you describe here allows us to > incorporate more locality into the first (maxlen) algorithm and widen > its scope accordingly. This sounds like a great starting point to me. > The tradeoff may be that we don't get the locality benefit of all > > maxlen sized extents since the agbno part of the lookup is secondary to > the size, but that may be a fine tradeoff if the benefit is we can use > the first/faster algorithm for a whole lot more cases. > > I'm actually curious if doing that as a first step to open up the first > algo to _all_ maxlen cases has a noticeable effect on this workload. If > so, that could be a nice intermediate step to avoid paying the penalty > for the "too many >= maxlen extents" case before rewriting the broader > algorithm. The remaining effort is then to absorb the second algorithm > into the first such that the second is eventually no longer necessary. Yes, that seems like a sensible first experiment to perform. > > The existing "nothing >= maxlen" case that goes to > > xfs_alloc_ag_vextent_small() selects the largest free extent at the > > highest block number (right nost record) and so ignores locality. > > The xfs_alloc_lookup_le() lookup does they same thing, except it > > also picks the physically closest extent of the largest size. That's > > an improvement. > > > > That looks like another point where the first algorithm bails out too > quickly as well. We find !maxlen extents, decrement to get the next > (highest agbno) smallest record, then go on to check it for alignment > and whatnot without any consideration for any other records that may > exist of the same size. Not quite.... > Unless I'm missing something, the fact that we > decrement in the _small() case and jump back to the incrementing > algorithm of the caller pretty much ensures we'll only consider one such > record before going into the fan out search. In that case the _small() allocation finds a candidate it sets ltlen which triggers the "reset search from start of block" case in the first algorithm. This walks from the start of the last block to the first extent >= minlen in the last block and then begins the search from there. So it does consider all the candidate free extents in the last block large enough to be valid in the _small() case. But, yes, it may not be considering them all. 
> > The lastblock case currently selects the physically closest largest
> > free extent in the block that will fit the (aligned) length we
> > require. We can get really close to that with a >= (args->agbno,
> > args->maxlen + args->alignment) lookup.
> >
> > And the <= (args->agbno, args->maxlen) finds us the largest,
> > closest free extent that we can check against minlen and return
> > that...
> >
> > IOWs, something like this, which is a whole lot simpler and faster
> > than the linear searches we do now and should return much better
> > candidate extents:
> >
>
> This all sounds pretty reasonable to me. I need to think more about the
> details, i.e., whether we'd still want/need to fall back to a worst case
> scan in certain cases, which may not be a problem if the first algorithm
> is updated to find extents in almost all cases instead of being limited
> to when there are a small number of maxlen extents.

I think we can avoid the brute force by-bno search by being smart
with a minlen by-size search in the worst case.

> I'm also wondering if we could enhance this further by repeating the
> agbno lookup at the top for the next smallest extent size in the min/max
> range when no suitable extent is found. Then perhaps the high level
> algorithm is truly simplified to find the "largest available extent with
> best locality for that size."

We get that automatically from a <= lookup on the by-size tree. i.e:

> > 	check_left = true;
> >
> > 	/* look for size and closeness */
> > 	error = xfs_alloc_lookup_ge(cnt_cur, args->agbno,
> > 				args->maxlen + alignment, &i);
> > 	if (error)
> > 		goto error0;
> >
> > 	if (!i) {
> > 		/*
> > 		 * nothing >= target. Search for best <= instead so
> > 		 * we get the largest closest match to what
> > 		 * we are asking for.
> > 		 */
> > 		error = xfs_alloc_lookup_le(cnt_cur, args->agbno,
> > 				args->maxlen + alignment, &i);

This search here will find the largest extent size that is <=
maxlen + alignment. It will either match the size and be the closest
physical extent, or it will be the largest smaller extent size
available (w/ no physical locality implied).

If we then want the physically closest extent of that largest size,
then we have to grab the length out of the record and redo the
lookup with that exact size so the by-cnt diff matches and it falls
through to checking the block numbers in the records.

I suspect what we want here is an xfs_btree_lookup_from() style of
operation that doesn't completely restart the btree search. The
cursor already holds the path we searched to get to the current
block, so the optimal search method here is to walk back up the
current path to find the point where there might be ptrs to a
different block and rerun the search from there. i.e. look at the
parent block to see if the adjacent records indicate the search might
span multiple blocks at the current level, and restart the binary
search at the level where multiple ptrs to the same extent size are
found. This means we don't have to binary search from the top of the
tree down if we've already got a cursor that points to the first
extent of the candidate size...

> > Inode chunk allocation sets minlen == maxlen, in which case we
> > should run your new by-size search to exhaustion and never hit the
> > existing second algorithm. (i.e. don't bound it if minlen == maxlen)
> > i.e. your new algorithm should get the same result as the existing
> > code, but much faster because it's only searching extents we know
> > can satisfy the allocation requirements.
> > The proposed algorithm above would be even faster. :P
>
> Yes, that makes sense. The search can potentially be made simpler in
> that case. Also note that minlen == maxlen for most of the data extent
> allocations in the test case I ran as well. It looks like the
> xfs_bmap_btalloc*() path can use the longest free extent in the AG to
> set a starting minlen, and then we loosen constraints over multiple
> allocation attempts if the first attempts happen to fail.

*nod*

FWIW, xfs_alloc_ag_vextent_size() is effectively a minlen == maxlen
allocation. It will either return a maxlen freespace (aligned if
necessary) or fail.

However, now that I look, it does not consider physical locality at
all - it's just a "take the first extent of a size large enough to fit
the desired size" algorithm, and it also falls back to a linear
search of extents >= the size target until it finds an extent that
aligns correctly.

I think this follows from how xfs_alloc_ag_vextent_size() is used -
it's used for the XFS_ALLOCTYPE_THIS_AG policy, which means "find the
first extent of at least maxlen in this AG". So it's really just a
special case of a near allocation where the physical locality target
is the start of the AG. i.e.:
XFS_ALLOCTYPE_NEAR_BNO w/ {minlen == maxlen, args->agbno == 0}

IOWs, I suspect we need to step back for a moment and consider how
we should refactor this code, because the different algorithms are
currently a result of separation by high level policy rather than
low level allocation selection requirements, and as such we have
some semi-duplication of functionality in the algorithms that could
be removed. Right now it seems the policies fall into these
categories:

1. size >= maxlen, nearest bno 0 (_size)
2. exact bno, size >= minlen (_exact)
3. size >= maxlen, nearest bno target (_near, algorithm 1)
4. size >= minlen, nearest bno target (_near, algorithm 2)

Case 1 is the same as case 3 but with a nearest bno target of 0.
Case 2 is the same as case 4 except it fails if the nearest bno
target is not an exact match. Seems like a lot of scope for
factoring and simplification here - there's only two free extent
selection algorithms here, not 4....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
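(Pulling the snippets in the message above together, the proposed
lookup flow might look roughly like the fragment below. This is a
hedged sketch, not a patch from the thread: error handling is elided,
fbno/flen/i are assumed declared, the out_no_space label is invented,
and only xfs_alloc_lookup_ge(), xfs_alloc_lookup_le() and
xfs_alloc_get_rec() are the real helpers named in the discussion.)

	/* one lookup finds size and locality together */
	error = xfs_alloc_lookup_ge(cnt_cur, args->agbno,
				    args->maxlen + alignment, &i);
	if (!i) {
		/*
		 * Nothing >= the request. A <= lookup returns the
		 * largest smaller extent size, with no locality
		 * implied.
		 */
		error = xfs_alloc_lookup_le(cnt_cur, args->agbno,
					    args->maxlen + alignment, &i);
		if (!i)
			goto out_no_space;	/* no free extents at all */

		/*
		 * Redo the lookup with the exact size found so the
		 * by-cnt diff matches and the search falls through to
		 * the startblock comparison, i.e. it returns the
		 * physically closest extent of that size.
		 */
		error = xfs_alloc_get_rec(cnt_cur, &fbno, &flen, &i);
		error = xfs_alloc_lookup_le(cnt_cur, args->agbno, flen, &i);
	}

	/*
	 * Finally, check whether the record to the left of the cursor
	 * is physically closer to args->agbno than the one returned,
	 * and pick the better of the two (subject to minlen and
	 * alignment checks).
	 */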
* Re: xfs_alloc_ag_vextent_near() takes about 30ms to complete
  2018-10-29  0:17 ` Dave Chinner
@ 2018-10-29  9:53 ` Brian Foster
  0 siblings, 0 replies; 17+ messages in thread
From: Brian Foster @ 2018-10-29 9:53 UTC (permalink / raw)
To: Dave Chinner; +Cc: Mao Cheng, linux-xfs

On Mon, Oct 29, 2018 at 11:17:39AM +1100, Dave Chinner wrote:
> On Sun, Oct 28, 2018 at 10:09:08AM -0400, Brian Foster wrote:
> > On Sat, Oct 27, 2018 at 02:16:07PM +1100, Dave Chinner wrote:
> > > On Fri, Oct 26, 2018 at 09:03:35AM -0400, Brian Foster wrote:
> > > > On Fri, Oct 26, 2018 at 12:03:44PM +1100, Dave Chinner wrote:
> > > > > On Thu, Oct 25, 2018 at 09:21:30AM -0400, Brian Foster wrote:
> > > > > > On Thu, Oct 25, 2018 at 09:35:23AM +1100, Dave Chinner wrote:
> > > > > > > On Wed, Oct 24, 2018 at 08:09:27AM -0400, Brian Foster wrote:
...
> > > The records in the by-size tree have a secondary sort order of
> > > by-bno, so we know that as we walk the records of the same size,
> > > we'll get closer to the target we want.
> > >
> >
> > Yeah, this occurred to me while poking at the state of the cntbt trees
> > of the metadump. I was thinking specifically about whether we could use
> > that to optimize the existing algorithm a bit. For example, if we skip
> > the lastblock logic and do find many maxlen extents, use the agbno of
> > the first record to avoid sifting through the entire set. If the agbno
> > was > the requested agbno, for example, we could probably end the search
> > right there...
>
> I hadn't considered that, but yes, that would also shorten the
> second algorithm significantly in that case.
>
> > > Hmmm. I wonder.
> > >
> > > xfs_cntbt_key_diff() discriminates first by size, and then if size
> > > matches by startblock.
> > >
> > > But xfs_alloc_ag_vextent_near() does this:
> > >
> > > 	if ((error = xfs_alloc_lookup_ge(cnt_cur, 0, args->maxlen, &i)))
> > > 		goto error0;
> > >
> > > It sets the startblock of the target lookup to be 0, which means it
> > > will always find the extent closest to the start of the AG of the
> > > same size or larger. IOWs, it looks for size without considering
> > > locality.
> > >
> >
> > ... but I wasn't aware we could do this. ;)
>
> It's a side effect of the way xfs_cntbt_key_diff() calculates
> the distance. It will binary search on the size (by returning +/-
> diff based on size), but when the size matches (i.e. diff == 0), it
> then allows the binary search to continue by returning +/- based on
> the startblock diff.
>
> i.e. it will find either the first extent of a closest size match
> (no locality) or the closest physical locality match of the desired
> size....
>

Yep, makes sense.

> > > But the lookup can do both. If we change that to be:
> > >
> > > 	error = xfs_alloc_lookup_ge(cnt_cur, args->agbno, args->maxlen, &i);
> > >
> > > it will actually find the first block of size >= args->maxlen and
> > > >= args->agbno. IOWs, it does the majority of the locality search
> > > for us. All we need to do is check that the extent on the left side
> > > of the returned extent is closer to the target than the extent that is
> > > returned.....
> > >
> > > Which makes me wonder - can we just get rid of that first algorithm
> > > (the lastblock search for largest) and replace it with a simple
> > > lookup for args->agbno, args->maxlen + args->alignment?
> > >
> > > That way we'll get an extent that will be big enough for alignment
> > > to succeed if alignment is required, big enough to fit if alignment
> > > is not required, or if nothing is found, we can then do a <= to find
> > > the first extent smaller and closest to the target:
> > >
> > > 	error = xfs_alloc_lookup_le(cnt_cur, args->agbno, args->maxlen, &i);
> > >
> > > If the record returned has a length = args->maxlen, then we've
> > > got the physically closest exact match and we should use it.
> > >
> > > If the record is shorter than args->maxlen but > args->minlen, then
> > > there are no extents large enough for maxlen, and we should check
> > > if the right side record is closer to the target and select between
> > > the two.
> > >
> > > And that can replace the entirety of xfs_alloc_ag_vextent_near().
>
...
> > > The existing "nothing >= maxlen" case that goes to
> > > xfs_alloc_ag_vextent_small() selects the largest free extent at the
> > > highest block number (right most record) and so ignores locality.
> > > The xfs_alloc_lookup_le() lookup does the same thing, except it
> > > also picks the physically closest extent of the largest size. That's
> > > an improvement.
> > >
> >
> > That looks like another point where the first algorithm bails out too
> > quickly as well. We find !maxlen extents, decrement to get the next
> > (highest agbno) smallest record, then go on to check it for alignment
> > and whatnot without any consideration for any other records that may
> > exist of the same size.
>
> Not quite....
>
> > Unless I'm missing something, the fact that we
> > decrement in the _small() case and jump back to the incrementing
> > algorithm of the caller pretty much ensures we'll only consider one such
> > record before going into the fan out search.
>
> In that case, when the _small() allocation finds a candidate, it sets
> ltlen, which triggers the "reset search from start of block" case in
> the first algorithm. This walks from the start of the last block to
> the first extent >= minlen in the last block and then begins the
> search from there. So it does consider all the candidate free extents
> in the last block large enough to be valid in the _small() case.
>

Oops, right. I glossed over that hunk when looking at the _small()
alloc path. Never mind.

Side note: the ->bc_ptrs hacks in this code are pretty nasty.

> But, yes, it may not be considering them all.
>
> > > The lastblock case currently selects the physically closest largest
> > > free extent in the block that will fit the (aligned) length we
> > > require. We can get really close to that with a >= (args->agbno,
> > > args->maxlen + args->alignment) lookup.
> > >
> > > And the <= (args->agbno, args->maxlen) finds us the largest,
> > > closest free extent that we can check against minlen and return
> > > that...
> > >
> > > IOWs, something like this, which is a whole lot simpler and faster
> > > than the linear searches we do now and should return much better
> > > candidate extents:
> > >
> >
> > This all sounds pretty reasonable to me. I need to think more about the
> > details, i.e., whether we'd still want/need to fall back to a worst case
> > scan in certain cases, which may not be a problem if the first algorithm
> > is updated to find extents in almost all cases instead of being limited
> > to when there are a small number of maxlen extents.
>
> I think we can avoid the brute force by-bno search by being smart
> with a minlen by-size search in the worst case.
>

Ok.
The thought below does seem to imply that we could reuse the same
algorithm to find progressively smaller extents in the !maxlen case
rather than fall back to the purely locality based search, which is
somewhat of an incoherent logic/priority transition.

> > I'm also wondering if we could enhance this further by repeating the
> > agbno lookup at the top for the next smallest extent size in the min/max
> > range when no suitable extent is found. Then perhaps the high level
> > algorithm is truly simplified to find the "largest available extent with
> > best locality for that size."
>
> We get that automatically from a <= lookup on the by-size tree. i.e:
>

Yep..

> > > 	check_left = true;
> > >
> > > 	/* look for size and closeness */
> > > 	error = xfs_alloc_lookup_ge(cnt_cur, args->agbno,
> > > 				args->maxlen + alignment, &i);
> > > 	if (error)
> > > 		goto error0;
> > >
> > > 	if (!i) {
> > > 		/*
> > > 		 * nothing >= target. Search for best <= instead so
> > > 		 * we get the largest closest match to what
> > > 		 * we are asking for.
> > > 		 */
> > > 		error = xfs_alloc_lookup_le(cnt_cur, args->agbno,
> > > 				args->maxlen + alignment, &i);
>
> This search here will find the largest extent size that is <=
> maxlen + alignment. It will either match the size and be the closest
> physical extent, or it will be the largest smaller extent size
> available (w/ no physical locality implied).
>
> If we then want the physically closest extent of that largest size,
> then we have to grab the length out of the record and redo the
> lookup with that exact size so the by-cnt diff matches and it falls
> through to checking the block numbers in the records.
>

Pretty much the same high-level idea, either way, given the current
lookup implementation...

> I suspect what we want here is an xfs_btree_lookup_from() style of
> operation that doesn't completely restart the btree search. The
> cursor already holds the path we searched to get to the current
> block, so the optimal search method here is to walk back up the
> current path to find the point where there might be ptrs to a
> different block and rerun the search from there. i.e. look at the
> parent block to see if the adjacent records indicate the search might
> span multiple blocks at the current level, and restart the binary
> search at the level where multiple ptrs to the same extent size are
> found. This means we don't have to binary search from the top of the
> tree down if we've already got a cursor that points to the first
> extent of the candidate size...
>

... but this sounds like a nice idea too.

> > > Inode chunk allocation sets minlen == maxlen, in which case we
> > > should run your new by-size search to exhaustion and never hit the
> > > existing second algorithm. (i.e. don't bound it if minlen == maxlen)
> > > i.e. your new algorithm should get the same result as the existing
> > > code, but much faster because it's only searching extents we know
> > > can satisfy the allocation requirements. The proposed algorithm
> > > above would be even faster :P
> >
> > Yes, that makes sense. The search can potentially be made simpler in
> > that case. Also note that minlen == maxlen for most of the data extent
> > allocations in the test case I ran as well. It looks like the
> > xfs_bmap_btalloc*() path can use the longest free extent in the AG to
> > set a starting minlen, and then we loosen constraints over multiple
> > allocation attempts if the first attempts happen to fail.
>
> *nod*
>
> FWIW, xfs_alloc_ag_vextent_size() is effectively a minlen == maxlen
> allocation.
> It will either return a maxlen freespace (aligned if
> necessary) or fail.
>
> However, now that I look, it does not consider physical locality at
> all - it's just a "take the first extent of a size large enough to fit
> the desired size" algorithm, and it also falls back to a linear
> search of extents >= the size target until it finds an extent that
> aligns correctly.
>
> I think this follows from how xfs_alloc_ag_vextent_size() is used -
> it's used for the XFS_ALLOCTYPE_THIS_AG policy, which means "find the
> first extent of at least maxlen in this AG". So it's really just a
> special case of a near allocation where the physical locality target
> is the start of the AG. i.e.:
> XFS_ALLOCTYPE_NEAR_BNO w/ {minlen == maxlen, args->agbno == 0}
>
> IOWs, I suspect we need to step back for a moment and consider how
> we should refactor this code, because the different algorithms are
> currently a result of separation by high level policy rather than
> low level allocation selection requirements, and as such we have
> some semi-duplication of functionality in the algorithms that could
> be removed. Right now it seems the policies fall into these
> categories:
>
> 1. size >= maxlen, nearest bno 0 (_size)
> 2. exact bno, size >= minlen (_exact)
> 3. size >= maxlen, nearest bno target (_near, algorithm 1)
> 4. size >= minlen, nearest bno target (_near, algorithm 2)
>
> Case 1 is the same as case 3 but with a nearest bno target of 0.
> Case 2 is the same as case 4 except it fails if the nearest bno
> target is not an exact match. Seems like a lot of scope for
> factoring and simplification here - there's only two free extent
> selection algorithms here, not 4....
>

That's a very interesting point. Perhaps the right longer term approach
is not necessarily to rewrite xfs_alloc_ag_vextent_near(), but to start
on a new generic allocator implementation that can ultimately absorb
the functionality of each independent algorithm and reduce the amount
of code required in the process.

Thanks.

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread
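(To make the two-algorithm observation concrete: the four policies
above could be expressed as a single hypothetical parameter block.
This is an illustration only - the structure and field names are
invented, not an existing kernel interface:)

/* hypothetical unified policy - not an existing XFS structure */
struct alloc_policy {
	xfs_agblock_t	target_bno;	/* locality target within the AG */
	xfs_extlen_t	minlen;		/* smallest acceptable extent */
	xfs_extlen_t	maxlen;		/* preferred extent size */
	bool		exact_bno;	/* fail unless target_bno matches */
};

/*
 * Mapping of the four cases onto the one structure:
 *
 *   1. _size:           target_bno = 0,   size floor = maxlen
 *   2. _exact:          target_bno = bno, size floor = minlen, exact_bno
 *   3. _near (algo 1):  target_bno = bno, size floor = maxlen
 *   4. _near (algo 2):  target_bno = bno, size floor = minlen
 *
 * Cases 1 and 3 differ only in the locality target; cases 2 and 4
 * differ only in whether a locality miss is fatal. That leaves two
 * selection algorithms: "best locality among size >= maxlen" and
 * "best locality among size >= minlen".
 */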
end of thread, other threads: [~2018-10-29 18:41 UTC | newest]

Thread overview: 17+ messages -- links below jump to the message on this page:
2018-10-23  7:56 xfs_alloc_ag_vextent_near() takes about 30ms to complete Mao Cheng
2018-10-23 14:53 ` Brian Foster
2018-10-24  3:01   ` Mao Cheng
2018-10-24  4:34     ` Dave Chinner
2018-10-24  9:02       ` Mao Cheng
2018-10-24 12:11         ` Brian Foster
2018-10-25  4:01           ` Mao Cheng
2018-10-25 14:55             ` Brian Foster
2018-10-24 12:09   ` Brian Foster
2018-10-24 22:35     ` Dave Chinner
2018-10-25 13:21       ` Brian Foster
2018-10-26  1:03         ` Dave Chinner
2018-10-26 13:03           ` Brian Foster
2018-10-27  3:16             ` Dave Chinner
2018-10-28 14:09               ` Brian Foster
2018-10-29  0:17                 ` Dave Chinner
2018-10-29  9:53                   ` Brian Foster