* Pathological allocation pattern with direct IO
@ 2013-03-06 20:22 Jan Kara
2013-03-06 22:01 ` Ben Myers
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Jan Kara @ 2013-03-06 20:22 UTC (permalink / raw)
To: xfs
Hello,
one of our customers has application that write large (tens of GB) files
using direct IO done in 16 MB chunks. They keep the fs around 80% full
deleting oldest files when they need to store new ones. Usually the file
can be stored in under 10 extents but from time to time a pathological case
is triggered and the file has few thousands extents (which naturally has
impact on performance). The customer actually uses 2.6.32-based kernel but
I reproduced the issue with 3.8.2 kernel as well.
I was analyzing why this happens and the filefrag for the file looks like:
Filesystem type is: 58465342
File size of /raw_data/ex.20130302T121135/ov.s1a1.wb is 186294206464
(45481984 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 13 4550656
1 4550656 188136807 4550668 12562432
2 17113088 200699240 200699238 622592
3 17735680 182046055 201321831 4096
4 17739776 182041959 182050150 4096
5 17743872 182037863 182046054 4096
6 17747968 182033767 182041958 4096
7 17752064 182029671 182037862 4096
...
6757 45400064 154381644 154389835 4096
6758 45404160 154377548 154385739 4096
6759 45408256 252951571 154381643 73728 eof
/raw_data/ex.20130302T121135/ov.s1a1.wb: 6760 extents found
So we see that at one moment, the allocator starts giving us 16 MB chunks
backwards. This seems to be caused by XFS_ALLOCTYPE_NEAR_BNO allocation. For
two cases I was able to track down the logic:
1) We start allocating blocks for file. We want to allocate in the same AG
as the inode is. First we try exact allocation which fails so we try
XFS_ALLOCTYPE_NEAR_BNO allocation which finds large enough free extent
before the inode. So we start allocating 16 MB chunks from the end of that
free extent. From this moment on we are basically bound to continue
allocating backwards using XFS_ALLOCTYPE_NEAR_BNO allocation until we
exhaust the whole free extent.
2) Similar situation happens when we cannot further grow current extent but
there is large free space somewhere before this extent in the AG.
So I was wondering is this known? Is XFS_ALLOCTYPE_NEAR_BNO so beneficial
it outweights pathological cases like the above? Or shouldn't it maybe be
disabled for larger files or for direct IO?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 9+ messages in thread* Re: Pathological allocation pattern with direct IO 2013-03-06 20:22 Pathological allocation pattern with direct IO Jan Kara @ 2013-03-06 22:01 ` Ben Myers 2013-03-06 22:31 ` Peter Grandi 2013-03-07 5:03 ` Dave Chinner 2 siblings, 0 replies; 9+ messages in thread From: Ben Myers @ 2013-03-06 22:01 UTC (permalink / raw) To: Jan Kara; +Cc: xfs Hi Jan, On Wed, Mar 06, 2013 at 09:22:10PM +0100, Jan Kara wrote: > one of our customers has application that write large (tens of GB) files > using direct IO done in 16 MB chunks. They keep the fs around 80% full > deleting oldest files when they need to store new ones. Usually the file > can be stored in under 10 extents but from time to time a pathological case > is triggered and the file has few thousands extents (which naturally has > impact on performance). The customer actually uses 2.6.32-based kernel but > I reproduced the issue with 3.8.2 kernel as well. > > I was analyzing why this happens and the filefrag for the file looks like: > Filesystem type is: 58465342 > File size of /raw_data/ex.20130302T121135/ov.s1a1.wb is 186294206464 > (45481984 blocks, blocksize 4096) > ext logical physical expected length flags > 0 0 13 4550656 > 1 4550656 188136807 4550668 12562432 > 2 17113088 200699240 200699238 622592 > 3 17735680 182046055 201321831 4096 > 4 17739776 182041959 182050150 4096 > 5 17743872 182037863 182046054 4096 > 6 17747968 182033767 182041958 4096 > 7 17752064 182029671 182037862 4096 > ... > 6757 45400064 154381644 154389835 4096 > 6758 45404160 154377548 154385739 4096 > 6759 45408256 252951571 154381643 73728 eof > /raw_data/ex.20130302T121135/ov.s1a1.wb: 6760 extents found > > So we see that at one moment, the allocator starts giving us 16 MB chunks > backwards. This seems to be caused by XFS_ALLOCTYPE_NEAR_BNO allocation. For > two cases I was able to track down the logic: > > 1) We start allocating blocks for file. We want to allocate in the same AG > as the inode is. First we try exact allocation which fails so we try > XFS_ALLOCTYPE_NEAR_BNO allocation which finds large enough free extent > before the inode. So we start allocating 16 MB chunks from the end of that > free extent. From this moment on we are basically bound to continue > allocating backwards using XFS_ALLOCTYPE_NEAR_BNO allocation until we > exhaust the whole free extent. > > 2) Similar situation happens when we cannot further grow current extent but > there is large free space somewhere before this extent in the AG. > > So I was wondering is this known? Is XFS_ALLOCTYPE_NEAR_BNO so beneficial > it outweights pathological cases like the above? Or shouldn't it maybe be > disabled for larger files or for direct IO? I believe we've seen something similar to #2 before: # xfs_bmap -v /data/dbench.dat /data/dbench.dat: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..150994943]: 2343559168..2494554111 5 (2048..150996991) 150994944 00011 1: [150994944..468582399]: 2494556160..2812143615 5 (150999040..468586495) 317587456 00011 2: [468582400..670957567]: 3078479872..3280855039 6 (266211328..468586495) 202375168 00011 3: [670957568..671088639]: 3078346752..3078477823 6 (266078208..266209279) 131072 00011 4: [671088640..671219711]: 3078215680..3078346751 6 (265947136..266078207) 131072 00011 5: [671219712..671350783]: 3078084608..3078215679 6 (265816064..265947135) 131072 00011 6: [671350784..671481855]: 3077953536..3078084607 6 (265684992..265816063) 131072 00011 7: [671481856..671612927]: 3077822464..3077953535 6 (265553920..265684991) 131072 00011 8: [671612928..671743999]: 3077691392..3077822463 6 (265422848..265553919) 131072 00011 9: [671744000..671875071]: 3077560320..3077691391 6 (265291776..265422847) 131072 00011 ... 2040: [4216979456..4502192127]: 6562093056..6847305727 14 (133120..285345791) 285212672 00011 2041: [4502192128..4685430783]: 6847307776..7030546431 14 (285347840..468586495) 183238656 00011 2042: [4685430784..4876402687]: 9183129600..9374101503 19 (277612544..468584447) 190971904 00011 2043: [4876402688..5344985087]: 9374230528..9842812927 20 (2048..468584447) 468582400 00011 2044: [5344985088..5813567487]: 9842941952..10311524351 21 (2048..468584447) 468582400 00011 2045: [5813567488..6282149887]: 10311653376..10780235775 22 (2048..468584447) 468582400 00011 2046: [6282149888..6750732287]: 10780364800..11248947199 23 (2048..468584447) 468582400 00011 2047: [6750732288..6767501311]: 11249076224..11265845247 24 (2048..16771071) 16769024 00011 2048: [6767501312..7219314687]: 11265845248..11717658623 24 (16771072..468584447) 451813376 00011 2049: [7219314688..7687766015]: 11717918720..12186370047 25 (133120..468584447) 468451328 2050: [7687766016..8156348415]: 12186499072..12655081471 26 (2048..468584447) 468582400 00011 2051: [8156348416..8449425407]: 12655210496..12948287487 27 (2048..293079039) 293076992 00011 In this case, the allocation in AG 6 starts near the middle of the AG and runs through the end. At that point we began to march backward through the AG until it was exhausted. Not ideal. Maybe it would be better if XFS_ALLOCTYPE_NEAR_BNO would move on to the next AG if it reached the end of the current one. We need to be careful though. What is good for this workload may have unintended consequences for another. Could you post geometry information for the filesystem in question? xfs_growfs -n /dev/sda Thanks, Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Pathological allocation pattern with direct IO 2013-03-06 20:22 Pathological allocation pattern with direct IO Jan Kara 2013-03-06 22:01 ` Ben Myers @ 2013-03-06 22:31 ` Peter Grandi 2013-03-07 5:03 ` Dave Chinner 2 siblings, 0 replies; 9+ messages in thread From: Peter Grandi @ 2013-03-06 22:31 UTC (permalink / raw) To: Linux fs XFS > [ ... ] application that write large (tens of GB) files using > direct IO done in 16 MB chunks. They keep the fs around 80% > full deleting oldest files when they need to store new ones. > Usually the file can be stored in under 10 extents but from > time to time a pathological case is triggered and the file has > few thousands extents (which naturally has impact on > performance). [ ... ] They just want *guaranteed* (not just nearly-always) contiguity, without preallocating, with incremental writes with direct IO, while keeping the filesystem full at 80%. And a pony! :-) _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Pathological allocation pattern with direct IO 2013-03-06 20:22 Pathological allocation pattern with direct IO Jan Kara 2013-03-06 22:01 ` Ben Myers 2013-03-06 22:31 ` Peter Grandi @ 2013-03-07 5:03 ` Dave Chinner 2013-03-07 10:24 ` Jan Kara 2 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2013-03-07 5:03 UTC (permalink / raw) To: Jan Kara; +Cc: xfs On Wed, Mar 06, 2013 at 09:22:10PM +0100, Jan Kara wrote: > Hello, > > one of our customers has application that write large (tens of GB) files > using direct IO done in 16 MB chunks. They keep the fs around 80% full > deleting oldest files when they need to store new ones. Usually the file > can be stored in under 10 extents but from time to time a pathological case > is triggered and the file has few thousands extents (which naturally has > impact on performance). The customer actually uses 2.6.32-based kernel but > I reproduced the issue with 3.8.2 kernel as well. > > I was analyzing why this happens and the filefrag for the file looks like: > Filesystem type is: 58465342 > File size of /raw_data/ex.20130302T121135/ov.s1a1.wb is 186294206464 > (45481984 blocks, blocksize 4096) > ext logical physical expected length flags > 0 0 13 4550656 > 1 4550656 188136807 4550668 12562432 > 2 17113088 200699240 200699238 622592 > 3 17735680 182046055 201321831 4096 > 4 17739776 182041959 182050150 4096 > 5 17743872 182037863 182046054 4096 > 6 17747968 182033767 182041958 4096 > 7 17752064 182029671 182037862 4096 > ... > 6757 45400064 154381644 154389835 4096 > 6758 45404160 154377548 154385739 4096 > 6759 45408256 252951571 154381643 73728 eof > /raw_data/ex.20130302T121135/ov.s1a1.wb: 6760 extents found > > So we see that at one moment, the allocator starts giving us 16 MB chunks > backwards. This seems to be caused by XFS_ALLOCTYPE_NEAR_BNO allocation. For > two cases I was able to track down the logic: > > 1) We start allocating blocks for file. We want to allocate in the same AG > as the inode is. First we try exact allocation which fails so we try > XFS_ALLOCTYPE_NEAR_BNO allocation which finds large enough free extent > before the inode. So we start allocating 16 MB chunks from the end of that > free extent. From this moment on we are basically bound to continue > allocating backwards using XFS_ALLOCTYPE_NEAR_BNO allocation until we > exhaust the whole free extent. > > 2) Similar situation happens when we cannot further grow current extent but > there is large free space somewhere before this extent in the AG. > > So I was wondering is this known? Is XFS_ALLOCTYPE_NEAR_BNO so beneficial > it outweights pathological cases like the above? Or shouldn't it maybe be > disabled for larger files or for direct IO? Well known issue, first diagnosed about 15 years ago, IIRC. Simple solution: use extent size hints. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Pathological allocation pattern with direct IO 2013-03-07 5:03 ` Dave Chinner @ 2013-03-07 10:24 ` Jan Kara 2013-03-07 13:58 ` Mark Tinguely 2013-03-08 1:35 ` Dave Chinner 0 siblings, 2 replies; 9+ messages in thread From: Jan Kara @ 2013-03-07 10:24 UTC (permalink / raw) To: Dave Chinner; +Cc: Jan Kara, xfs On Thu 07-03-13 16:03:25, Dave Chinner wrote: > On Wed, Mar 06, 2013 at 09:22:10PM +0100, Jan Kara wrote: > > Hello, > > > > one of our customers has application that write large (tens of GB) files > > using direct IO done in 16 MB chunks. They keep the fs around 80% full > > deleting oldest files when they need to store new ones. Usually the file > > can be stored in under 10 extents but from time to time a pathological case > > is triggered and the file has few thousands extents (which naturally has > > impact on performance). The customer actually uses 2.6.32-based kernel but > > I reproduced the issue with 3.8.2 kernel as well. > > > > I was analyzing why this happens and the filefrag for the file looks like: > > Filesystem type is: 58465342 > > File size of /raw_data/ex.20130302T121135/ov.s1a1.wb is 186294206464 > > (45481984 blocks, blocksize 4096) > > ext logical physical expected length flags > > 0 0 13 4550656 > > 1 4550656 188136807 4550668 12562432 > > 2 17113088 200699240 200699238 622592 > > 3 17735680 182046055 201321831 4096 > > 4 17739776 182041959 182050150 4096 > > 5 17743872 182037863 182046054 4096 > > 6 17747968 182033767 182041958 4096 > > 7 17752064 182029671 182037862 4096 > > ... > > 6757 45400064 154381644 154389835 4096 > > 6758 45404160 154377548 154385739 4096 > > 6759 45408256 252951571 154381643 73728 eof > > /raw_data/ex.20130302T121135/ov.s1a1.wb: 6760 extents found > > > > So we see that at one moment, the allocator starts giving us 16 MB chunks > > backwards. This seems to be caused by XFS_ALLOCTYPE_NEAR_BNO allocation. For > > two cases I was able to track down the logic: > > > > 1) We start allocating blocks for file. We want to allocate in the same AG > > as the inode is. First we try exact allocation which fails so we try > > XFS_ALLOCTYPE_NEAR_BNO allocation which finds large enough free extent > > before the inode. So we start allocating 16 MB chunks from the end of that > > free extent. From this moment on we are basically bound to continue > > allocating backwards using XFS_ALLOCTYPE_NEAR_BNO allocation until we > > exhaust the whole free extent. > > > > 2) Similar situation happens when we cannot further grow current extent but > > there is large free space somewhere before this extent in the AG. > > > > So I was wondering is this known? Is XFS_ALLOCTYPE_NEAR_BNO so beneficial > > it outweights pathological cases like the above? Or shouldn't it maybe be > > disabled for larger files or for direct IO? > > Well known issue, first diagnosed about 15 years ago, IIRC. Simple > solution: use extent size hints. I thought someone must have hit it before. But I wasn't successful in googling... I suggested using fallocate to the customer since they have a good idea of the final file size in advance and in testing it gave better results than extent size hints (plus it works for other filesystems as well). But really I was wondering about usefulness of XFS_ALLOCTYPE_NEAR_BNO heuristic. Sure the seek time depends on the distance so if we are speaking about allocating single extent then XFS_ALLOCTYPE_NEAR_BNO is useful but once that strategy would allocate two or three consecutive extents you've lost all the benefit and you would be better off if you started allocating from the start of the free space. Obviously we don't know the future in advance but this resembles a classical problem from approximations algorithms theory (rent-or-buy problem where renting corresponds to allocating from the end of free space and taking the smaller cost while buying corresponds to allocation from the beginning, taking the higher cost, but expecting you won't have to pay anything in future). And the theory of approximation algorithms tells us that once we pay for renting as much as buying will cost us, then at that moment it is advantageous to buy and that gives you 2-approximation algorithm (you can do even better - factor 1.58 approximation - if you use randomization but I don't think we want to go that way). So from this I'd say that switching off XFS_ALLOCTYPE_NEAR_BNO allocation once you've allocated 2-3 extents backwards would work of better on average. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Pathological allocation pattern with direct IO 2013-03-07 10:24 ` Jan Kara @ 2013-03-07 13:58 ` Mark Tinguely 2013-03-08 1:35 ` Dave Chinner 1 sibling, 0 replies; 9+ messages in thread From: Mark Tinguely @ 2013-03-07 13:58 UTC (permalink / raw) To: Jan Kara; +Cc: xfs On 03/07/13 04:24, Jan Kara wrote: > On Thu 07-03-13 16:03:25, Dave Chinner wrote: >> On Wed, Mar 06, 2013 at 09:22:10PM +0100, Jan Kara wrote: >>> Hello, >>> >>> one of our customers has application that write large (tens of GB) files >>> using direct IO done in 16 MB chunks. They keep the fs around 80% full >>> deleting oldest files when they need to store new ones. Usually the file >>> can be stored in under 10 extents but from time to time a pathological case >>> is triggered and the file has few thousands extents (which naturally has >>> impact on performance). The customer actually uses 2.6.32-based kernel but >>> I reproduced the issue with 3.8.2 kernel as well. >>> >>> I was analyzing why this happens and the filefrag for the file looks like: >>> Filesystem type is: 58465342 >>> File size of /raw_data/ex.20130302T121135/ov.s1a1.wb is 186294206464 >>> (45481984 blocks, blocksize 4096) >>> ext logical physical expected length flags >>> 0 0 13 4550656 >>> 1 4550656 188136807 4550668 12562432 >>> 2 17113088 200699240 200699238 622592 >>> 3 17735680 182046055 201321831 4096 >>> 4 17739776 182041959 182050150 4096 >>> 5 17743872 182037863 182046054 4096 >>> 6 17747968 182033767 182041958 4096 >>> 7 17752064 182029671 182037862 4096 >>> ... >>> 6757 45400064 154381644 154389835 4096 >>> 6758 45404160 154377548 154385739 4096 >>> 6759 45408256 252951571 154381643 73728 eof >>> /raw_data/ex.20130302T121135/ov.s1a1.wb: 6760 extents found >>> >>> So we see that at one moment, the allocator starts giving us 16 MB chunks >>> backwards. This seems to be caused by XFS_ALLOCTYPE_NEAR_BNO allocation. For >>> two cases I was able to track down the logic: >>> >>> 1) We start allocating blocks for file. We want to allocate in the same AG >>> as the inode is. First we try exact allocation which fails so we try >>> XFS_ALLOCTYPE_NEAR_BNO allocation which finds large enough free extent >>> before the inode. So we start allocating 16 MB chunks from the end of that >>> free extent. From this moment on we are basically bound to continue >>> allocating backwards using XFS_ALLOCTYPE_NEAR_BNO allocation until we >>> exhaust the whole free extent. >>> >>> 2) Similar situation happens when we cannot further grow current extent but >>> there is large free space somewhere before this extent in the AG. >>> >>> So I was wondering is this known? Is XFS_ALLOCTYPE_NEAR_BNO so beneficial >>> it outweights pathological cases like the above? Or shouldn't it maybe be >>> disabled for larger files or for direct IO? >> >> Well known issue, first diagnosed about 15 years ago, IIRC. Simple >> solution: use extent size hints. > I thought someone must have hit it before. But I wasn't successful in > googling... I suggested using fallocate to the customer since they have a > good idea of the final file size in advance and in testing it gave better > results than extent size hints (plus it works for other filesystems as > well). > > But really I was wondering about usefulness of XFS_ALLOCTYPE_NEAR_BNO > heuristic. Sure the seek time depends on the distance so if we are speaking > about allocating single extent then XFS_ALLOCTYPE_NEAR_BNO is useful but > once that strategy would allocate two or three consecutive extents you've > lost all the benefit and you would be better off if you started allocating > from the start of the free space. Obviously we don't know the future in > advance but this resembles a classical problem from approximations > algorithms theory (rent-or-buy problem where renting corresponds to > allocating from the end of free space and taking the smaller cost while > buying corresponds to allocation from the beginning, taking the higher > cost, but expecting you won't have to pay anything in future). And the > theory of approximation algorithms tells us that once we pay for renting as > much as buying will cost us, then at that moment it is advantageous to buy > and that gives you 2-approximation algorithm (you can do even better - > factor 1.58 approximation - if you use randomization but I don't think we > want to go that way). So from this I'd say that switching off > XFS_ALLOCTYPE_NEAR_BNO allocation once you've allocated 2-3 extents > backwards would work of better on average. > > Honza Sounds like a candidate for a dynamic allocation policy, http://oss.sgi.com/archives/xfs/2013-01/msg00611.html --Mark. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Pathological allocation pattern with direct IO 2013-03-07 10:24 ` Jan Kara 2013-03-07 13:58 ` Mark Tinguely @ 2013-03-08 1:35 ` Dave Chinner 2013-03-12 11:01 ` Jan Kara 1 sibling, 1 reply; 9+ messages in thread From: Dave Chinner @ 2013-03-08 1:35 UTC (permalink / raw) To: Jan Kara; +Cc: xfs On Thu, Mar 07, 2013 at 11:24:06AM +0100, Jan Kara wrote: > On Thu 07-03-13 16:03:25, Dave Chinner wrote: > > On Wed, Mar 06, 2013 at 09:22:10PM +0100, Jan Kara wrote: > > > Hello, > > > > > > one of our customers has application that write large (tens of GB) files > > > using direct IO done in 16 MB chunks. They keep the fs around 80% full > > > deleting oldest files when they need to store new ones. Usually the file > > > can be stored in under 10 extents but from time to time a pathological case > > > is triggered and the file has few thousands extents (which naturally has > > > impact on performance). The customer actually uses 2.6.32-based kernel but > > > I reproduced the issue with 3.8.2 kernel as well. ..... > > > 1) We start allocating blocks for file. We want to allocate in the same AG > > > as the inode is. First we try exact allocation which fails so we try > > > XFS_ALLOCTYPE_NEAR_BNO allocation which finds large enough free extent > > > before the inode. So we start allocating 16 MB chunks from the end of that > > > free extent. From this moment on we are basically bound to continue > > > allocating backwards using XFS_ALLOCTYPE_NEAR_BNO allocation until we > > > exhaust the whole free extent. > > > > > > 2) Similar situation happens when we cannot further grow current extent but > > > there is large free space somewhere before this extent in the AG. > > > > > > So I was wondering is this known? Is XFS_ALLOCTYPE_NEAR_BNO so beneficial > > > it outweights pathological cases like the above? Or shouldn't it maybe be > > > disabled for larger files or for direct IO? > > > > Well known issue, first diagnosed about 15 years ago, IIRC. Simple > > solution: use extent size hints. > I thought someone must have hit it before. But I wasn't successful in > googling... I suggested using fallocate to the customer since they have a > good idea of the final file size in advance and in testing it gave better > results than extent size hints (plus it works for other filesystems as > well). Extent size hints have the advantage of being effective when the file size is not known ahead of time, or the application cannot be modified to do preallocation. And they can be applied immediately to any existing filesystem without needing downtime or code changes, so the problem can be mitigated immediately without impacting ongoing production.... > But really I was wondering about usefulness of XFS_ALLOCTYPE_NEAR_BNO > heuristic. Sure the seek time depends on the distance so if we are speaking > about allocating single extent then XFS_ALLOCTYPE_NEAR_BNO is useful but > once that strategy would allocate two or three consecutive extents you've > lost all the benefit and you would be better off if you started allocating > from the start of the free space. Sorry, I gave quick answer because I didn't have time yesterday to answer in depth. The thing is, this backwards allocation is a low level allocator primitive, and it's behaviour is supposed to find the nearest free extent to a target block number. What you are seeing is exactly the expected beahviour of XFS_ALLOCTYPE_NEAR_BNO. An because it is a low level primitive, it is not designed to take into account high level allocation requirements or do pattern detection. It just finds the nearest free extent to the target block number, and that happens to be at the end of a free extent when it is at a lower block number than the target being asked for. This allocator behaviour is desirable for metadata in btree format, because it means that the btree clusters locally around both sides of the parent object rather than spreading out over an entire AG with blocks getting further and further apart as free space on the ascending side of the inode gets used up. In this case, it's far better to keep seek distances short by maintaining locality to the parent object by allocating on the descending side of the object. And backwards seeks are handled pretty well by the XFS metadata code - it uses readahead in the btree/dirent traversals to allow the elevator to optimise the resultant IO when walking the metadata, and hence backwards seeks are not really a problem at all. For data, random write workloads into sparse files really want this behaviour as well as it results in physical locality of extents. The backwards allocation simply doesn't matter because the extents are being randomly accessed anyway, because physical locality ultimately determines random IO performance. Further, high end storage arrays are capable of detecting backwards read patterns and issue readahead for them, so this backwards allocation doesn't always result in measurable performance problems... Hence repeated backwards allocation is, in many cases, desirable behaviour, and in others it doesn't make any difference. And with it being easy to work around for the few cases where it is tripped over on aging filesystems, there hasn't been a great deal of need to change the behaviour. > Obviously we don't know the future in > advance but this resembles a classical problem from approximations > algorithms theory (rent-or-buy problem where renting corresponds to > allocating from the end of free space and taking the smaller cost while > buying corresponds to allocation from the beginning, taking the higher > cost, but expecting you won't have to pay anything in future). And the > theory of approximation algorithms tells us that once we pay for renting as > much as buying will cost us, then at that moment it is advantageous to buy > and that gives you 2-approximation algorithm (you can do even better - > factor 1.58 approximation - if you use randomization but I don't think we > want to go that way). Way too mathematical for me, because... > So from this I'd say that switching off > XFS_ALLOCTYPE_NEAR_BNO allocation once you've allocated 2-3 extents > backwards would work of better on average. .... common sense says that this is true. I haven't looked to solve this problem in the past 5 years because extent size hints are such a simple way of mitigating the problem. However, given that I know how this allocation code works whole lot better than I did 5 years ago, I think I can see an easy way to avoid this repeated backwards allocation pattern when extending files. What we have right now is this used space/free space layout with the allocation context prev/want when extending the file: free extent free extent +-----------------+ prev ...... +--------------+ +------+ +------+ want This is what we'll ask for as a XFS_ALLOCTYPE_THIS_BNO allocation, but that is going to fail because there is no free space at wantbno. This then falls back to XFS_ALLOCTYPE_NEAR_BNO with the same wantbno target, and it selects the free extent before prev. If we then we look at what is happening xfs_alloc_compute_diff(), we have wantbno > freebno, and wantbno > freeend, so we end up in either of these two cases: } else if (alignment > 1) { newbno1 = roundup(freeend - wantlen, alignment); if (newbno1 > freeend - wantlen && newbno1 - alignment >= freebno) newbno1 -= alignment; else if (newbno1 >= freeend) newbno1 = NULLAGBLOCK; } else newbno1 = freeend - wantlen; Either way, the resultant extent that is cut out of the free extent is this: free extent +-----------------+ prev +------+ +------+ +------+ want got And hence we end up in a backwards allocation pattern. After repeated iterations of this, we end up this a pattern like this: free extent +-----------------+ prev prev-1 prev-2 +------+------+------+ +------+ +------+ want got and so is likely to lead to an extended run of backwards allocation for sequential ascending offsets. So, we would need to detect this pattern in xfs_bmap_adjacent, where the wantbno is determined for EOF allocation. That's a bit complex - we'll need to do extent tree lookups because we need more information than we've got from the initial lookup. I do not want to get this information from the initial lookup and pass it up to xfs_bmap_adjacent() because we are already using lots of stack through this path and adding another 70-odd bytes for another couple of extent maps is not going to fly. So to detect backwards allocation, we'd have to take the CPU hit of doing another lookup and then analysing the results for backwards allocation. That can be done, but I'm wondering if we even need to detect this case... I suspect we don't even need to detect the backwards allocation pattern at all - if we are falling back from an XFS_ALLOCTYPE_THIS_BNO allocation, we know that the target cannot be allocated and so we *might* end up doing a backwards allocation. Hence I don't think it will not hurt at all to simply say "select the start of the next candidate free extent" for a file data allocations in this case. That would completely avoid needing to detect repeated backwards allocations and ensure that subsequent allocations after the backwards jump to the start of the large free extent then go forwards as intended... And because we only do XFS_ALLOCTYPE_THIS_BNO allocations on data allocations when extending file, adding such a flag won't affect random write or metadata allocation patterns.... So, once we've decided that backward allocation is possible, all we need to do is inform XFS_ALLOCTYPE_NEAR_BNO/xfs_alloc_compute_diff() to select the start of the free extent it finds rather than select the tail of it. That will mean that rather than consume the entire free extent in small chunks via backwards allocation, we'll do one large backwards step and then consume the free extent in the forwards direction instead. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Pathological allocation pattern with direct IO 2013-03-08 1:35 ` Dave Chinner @ 2013-03-12 11:01 ` Jan Kara 2013-03-14 23:36 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Jan Kara @ 2013-03-12 11:01 UTC (permalink / raw) To: Dave Chinner; +Cc: Jan Kara, xfs On Fri 08-03-13 12:35:25, Dave Chinner wrote: > On Thu, Mar 07, 2013 at 11:24:06AM +0100, Jan Kara wrote: > > But really I was wondering about usefulness of XFS_ALLOCTYPE_NEAR_BNO > > heuristic. Sure the seek time depends on the distance so if we are speaking > > about allocating single extent then XFS_ALLOCTYPE_NEAR_BNO is useful but > > once that strategy would allocate two or three consecutive extents you've > > lost all the benefit and you would be better off if you started allocating > > from the start of the free space. <snip> > Hence repeated backwards allocation is, in many cases, desirable > behaviour, and in others it doesn't make any difference. And with it > being easy to work around for the few cases where it is tripped over > on aging filesystems, there hasn't been a great deal of need to > change the behaviour. Thanks for explanation! > > Obviously we don't know the future in > > advance but this resembles a classical problem from approximations > > algorithms theory (rent-or-buy problem where renting corresponds to > > allocating from the end of free space and taking the smaller cost while > > buying corresponds to allocation from the beginning, taking the higher > > cost, but expecting you won't have to pay anything in future). And the > > theory of approximation algorithms tells us that once we pay for renting as > > much as buying will cost us, then at that moment it is advantageous to buy > > and that gives you 2-approximation algorithm (you can do even better - > > factor 1.58 approximation - if you use randomization but I don't think we > > want to go that way). > > Way too mathematical for me, because... > > > So from this I'd say that switching off > > XFS_ALLOCTYPE_NEAR_BNO allocation once you've allocated 2-3 extents > > backwards would work of better on average. > > .... common sense says that this is true. > > I haven't looked to solve this problem in the past 5 years because > extent size hints are such a simple way of mitigating the problem. > However, given that I know how this allocation code works whole lot > better than I did 5 years ago, I think I can see an easy way to > avoid this repeated backwards allocation pattern when extending > files. > > What we have right now is this used space/free space layout > with the allocation context prev/want when extending the file: > > > free extent free extent > +-----------------+ prev ...... +--------------+ > +------+ > +------+ > want > > This is what we'll ask for as a XFS_ALLOCTYPE_THIS_BNO allocation, > but that is going to fail because there is no free space at wantbno. > This then falls back to XFS_ALLOCTYPE_NEAR_BNO with the same wantbno > target, and it selects the free extent before prev. > > If we then we look at what is happening xfs_alloc_compute_diff(), we > have wantbno > freebno, and wantbno > freeend, so we end up in > either of these two cases: > > } else if (alignment > 1) { > newbno1 = roundup(freeend - wantlen, alignment); > if (newbno1 > freeend - wantlen && > newbno1 - alignment >= freebno) > newbno1 -= alignment; > else if (newbno1 >= freeend) > newbno1 = NULLAGBLOCK; > } else > newbno1 = freeend - wantlen; > > > Either way, the resultant extent that is cut out of the free extent > is this: > > free extent > +-----------------+ prev > +------+ > +------+ > +------+ want > got > > And hence we end up in a backwards allocation pattern. After > repeated iterations of this, we end up this a pattern like this: > > free extent > +-----------------+ prev prev-1 prev-2 > +------+------+------+ > +------+ > +------+ want > got > > and so is likely to lead to an extended run of backwards allocation > for sequential ascending offsets. So, we would need to detect this Yup, that's what I understood in the code and saw in my experiments... > pattern in xfs_bmap_adjacent, where the wantbno is determined for > EOF allocation. That's a bit complex - we'll need to do extent tree > lookups because we need more information than we've got from the > initial lookup. > > I do not want to get this information from the initial lookup and > pass it up to xfs_bmap_adjacent() because we are already using lots > of stack through this path and adding another 70-odd bytes for > another couple of extent maps is not going to fly. So to detect > backwards allocation, we'd have to take the CPU hit of doing another > lookup and then analysing the results for backwards allocation. That > can be done, but I'm wondering if we even need to detect this > case... > > I suspect we don't even need to detect the backwards allocation > pattern at all - if we are falling back from an > XFS_ALLOCTYPE_THIS_BNO allocation, we know that the target cannot be > allocated and so we *might* end up doing a backwards allocation. > Hence I don't think it will not hurt at all to simply say "select > the start of the next candidate free extent" for a file data > allocations in this case. That would completely avoid needing to > detect repeated backwards allocations and ensure that subsequent > allocations after the backwards jump to the start of the large free > extent then go forwards as intended... Yes, I was wondering about whether this strategy won't be good enough as well. But there's one case I'm afraid of: Suppose you have a small file so that it gets written out in one request and fits in a single extent of blocks. Then it is somewhat advantageous to allocate from the end of the free extent because that is closer to the inode and thus reduces inode-to-data seek time. That's why I was suggesting maybe we could use this logic only for larger files or something like that. Hum, maybe if we didn't use this strategy for initial file extent, it would be enough. That's already handled specially in the allocator so it shouldn't be a problem to detect. What do you think? > And because we only do XFS_ALLOCTYPE_THIS_BNO allocations on data > allocations when extending file, adding such a flag won't affect > random write or metadata allocation patterns.... > > So, once we've decided that backward allocation is possible, all we > need to do is inform XFS_ALLOCTYPE_NEAR_BNO/xfs_alloc_compute_diff() > to select the start of the free extent it finds rather than select > the tail of it. That will mean that rather than consume the entire > free extent in small chunks via backwards allocation, we'll do one > large backwards step and then consume the free extent in the > forwards direction instead. Yes, this would work great for me. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Pathological allocation pattern with direct IO 2013-03-12 11:01 ` Jan Kara @ 2013-03-14 23:36 ` Dave Chinner 0 siblings, 0 replies; 9+ messages in thread From: Dave Chinner @ 2013-03-14 23:36 UTC (permalink / raw) To: Jan Kara; +Cc: xfs On Tue, Mar 12, 2013 at 12:01:00PM +0100, Jan Kara wrote: > On Fri 08-03-13 12:35:25, Dave Chinner wrote: > > On Thu, Mar 07, 2013 at 11:24:06AM +0100, Jan Kara wrote: > > > But really I was wondering about usefulness of XFS_ALLOCTYPE_NEAR_BNO > > > heuristic. Sure the seek time depends on the distance so if we are speaking > > > about allocating single extent then XFS_ALLOCTYPE_NEAR_BNO is useful but > > > once that strategy would allocate two or three consecutive extents you've > > > lost all the benefit and you would be better off if you started allocating > > > from the start of the free space. > <snip> > > Hence repeated backwards allocation is, in many cases, desirable > > behaviour, and in others it doesn't make any difference. And with it > > being easy to work around for the few cases where it is tripped over > > on aging filesystems, there hasn't been a great deal of need to > > change the behaviour. > Thanks for explanation! No problems - there's a lot of stuff in my head that simply isn't documented anywhere, and it would take months of work to document it so I've never written it down... > > I suspect we don't even need to detect the backwards allocation > > pattern at all - if we are falling back from an > > XFS_ALLOCTYPE_THIS_BNO allocation, we know that the target cannot be > > allocated and so we *might* end up doing a backwards allocation. > > Hence I don't think it will not hurt at all to simply say "select > > the start of the next candidate free extent" for a file data > > allocations in this case. That would completely avoid needing to > > detect repeated backwards allocations and ensure that subsequent > > allocations after the backwards jump to the start of the large free > > extent then go forwards as intended... > Yes, I was wondering about whether this strategy won't be good enough as > well. But there's one case I'm afraid of: Suppose you have a small file so > that it gets written out in one request and fits in a single extent of > blocks. Then it is somewhat advantageous to allocate from the end of the > free extent because that is closer to the inode and thus reduces > inode-to-data seek time. That's why I was suggesting maybe we could use > this logic only for larger files or something like that. Hum, maybe if we > didn't use this strategy for initial file extent, it would be enough. > That's already handled specially in the allocator so it shouldn't be a > problem to detect. What do you think? It took me a couple of readings to work out what you meant here. You are wondering if selecting the start of a free extent for the initial data for small files is the right thing to do as it may not be "closest", right? The thing is, we don't know if the inode is going to be written to again, and so there is no context to be able to determine if we need to be able to do a subsequent THIS_BNO exact allocation in subsequent allocations. Because XFS is designed for large files and high bandwidth IO, it is assumed that contiguous allocation is the most important factor in determining where to allocate a block. Hence even for a small allocation, the assumption is that there will be another contiguous allocation request and another and another.... Hence XFS always tries to place new files at the start of a large freespace extent, not at the end, because it is assumed that files will always grow in the short term. The thing is, this is all *very* complicated, and there's a lot of context that needs to be understood that, as I said above, isn't really documented anywhere. So, more context: We are searching left/right because the extent size we want to allocate resulted in a by-size btree lookup returning a btree block that is not at the end of the tree. i.e. we've hit a btree block that contains free space extents of roughly the same size as the extent we want to allocate rather than free space extents mcuh larger than the size we need. If we hit the last block - we have large contiguous free space extents and so we do a linear search to select the freespace extent that is *closest* to the target block, and allocate from the front of it. Hence subsequent THIS_BNO allocations will work just fine as there's plenty of contiguous free space at the end of the allocation. However, if we don't hit the last block it means there is some amount of free space fragmentation present in the filesystem. This means we can't use a by-size lookup to determine the closest free extent that would fit the allocation. If freespace fragementation is bad, there might be tens of thousands of free extents of that size in the tree, and selecting the closest would require a linear search through all the blocks. It's prohibitively expensive from both a CPU and Io point of view, so that isn't done. [ Freespace fragmentation of this style will occur from having an application that does concurrent, identically sized allocations that are small compared to the eventual file size. e.g. using direct IO in 16MB chunks to write files that are tens of GB in length. Over time, extents will get interleaved and then when you remove some files you end up with smaller freespace extents, which then get interleaved into more files, and when some of them get removed you get even smaller freespace extents. ] It is when we see this sort of freespace fragmentation that the NEAR_BNO allocation falls back to the second algorithm and search left and right from the target block in the by-block freespace tree. This *may* find a large extent that is in the last block of by-size tree and this is the trigger for backwards allocation to occur. If we were to have found the last by-size btree block in the initial by-size lookup, this is the extent that would have been selected by that algorithm, and we would have selected the front of the freespace extent, not the end of it. So for data allocation, it is quite clear that there is a significant difference between the way the free space extent is treated when we carve a chunk out of it when we compare the two algorithms. Hence if we simply change the way we select the best part of the extent to use for data allocation to match the method used when we select a large free space from the by-size tree, we'll stop this repeated backwards allocation problem from occurring and have similar extent selection behaviour for data allocation not matter which algorithm finds a the closest large freespace extent.... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2013-03-14 23:36 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-03-06 20:22 Pathological allocation pattern with direct IO Jan Kara 2013-03-06 22:01 ` Ben Myers 2013-03-06 22:31 ` Peter Grandi 2013-03-07 5:03 ` Dave Chinner 2013-03-07 10:24 ` Jan Kara 2013-03-07 13:58 ` Mark Tinguely 2013-03-08 1:35 ` Dave Chinner 2013-03-12 11:01 ` Jan Kara 2013-03-14 23:36 ` Dave Chinner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox