From: Mark Tinguely <tinguely@sgi.com>
To: Jan Kara <jack@suse.cz>
Cc: xfs@oss.sgi.com
Subject: Re: Pathological allocation pattern with direct IO
Date: Thu, 07 Mar 2013 07:58:35 -0600
Message-ID: <51389D0B.4020000@sgi.com>
In-Reply-To: <20130307102406.GA6723@quack.suse.cz>
On 03/07/13 04:24, Jan Kara wrote:
> On Thu 07-03-13 16:03:25, Dave Chinner wrote:
>> On Wed, Mar 06, 2013 at 09:22:10PM +0100, Jan Kara wrote:
>>> Hello,
>>>
>>> one of our customers has an application that writes large (tens of GB)
>>> files using direct IO done in 16 MB chunks. They keep the fs around 80%
>>> full, deleting the oldest files when they need to store new ones. Usually
>>> the file can be stored in under 10 extents, but from time to time a
>>> pathological case is triggered and the file ends up with a few thousand
>>> extents (which naturally has an impact on performance). The customer
>>> actually uses a 2.6.32-based kernel, but I reproduced the issue with a
>>> 3.8.2 kernel as well.
>>>
>>> I was analyzing why this happens, and the filefrag output for the file
>>> looks like this:
>>> Filesystem type is: 58465342
>>> File size of /raw_data/ex.20130302T121135/ov.s1a1.wb is 186294206464
>>> (45481984 blocks, blocksize 4096)
>>>  ext    logical    physical    expected     length  flags
>>>    0          0          13                 4550656
>>>    1    4550656   188136807     4550668    12562432
>>>    2   17113088   200699240   200699238      622592
>>>    3   17735680   182046055   201321831        4096
>>>    4   17739776   182041959   182050150        4096
>>>    5   17743872   182037863   182046054        4096
>>>    6   17747968   182033767   182041958        4096
>>>    7   17752064   182029671   182037862        4096
>>>  ...
>>> 6757   45400064   154381644   154389835        4096
>>> 6758   45404160   154377548   154385739        4096
>>> 6759   45408256   252951571   154381643       73728  eof
>>> /raw_data/ex.20130302T121135/ov.s1a1.wb: 6760 extents found
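As an aside, a layout like the one quoted above can be checked
programmatically via the FIEMAP ioctl, which is the same interface filefrag
itself uses. A minimal sketch (single ioctl call capped at an arbitrary 512
extents, abbreviated error handling; a real tool would loop, advancing
fm_start, until all extents are fetched):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	unsigned int i, max = 512;	/* extents per call; arbitrary */
	unsigned long long prev_phys = 0;
	unsigned int backwards = 0;
	int fd;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
		perror("open");
		return 1;
	}

	fm = calloc(1, sizeof(*fm) + max * sizeof(struct fiemap_extent));
	fm->fm_start = 0;
	fm->fm_length = ~0ULL;		/* map the whole file */
	fm->fm_extent_count = max;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FIEMAP");
		return 1;
	}

	for (i = 0; i < fm->fm_mapped_extents; i++) {
		/* an extent that starts physically before its logical
		 * predecessor is one of the "backwards" allocations */
		if (i && fm->fm_extents[i].fe_physical < prev_phys)
			backwards++;
		prev_phys = fm->fm_extents[i].fe_physical;
	}
	printf("%u extents mapped, %u allocated backwards\n",
	       fm->fm_mapped_extents, backwards);
	return 0;
}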
>>>
>>> So we see that at some point the allocator starts handing out 16 MB
>>> chunks backwards. This seems to be caused by XFS_ALLOCTYPE_NEAR_BNO
>>> allocation. I was able to track down the logic for two cases:
>>>
>>> 1) We start allocating blocks for a file. We want to allocate in the same
>>> AG as the inode. First we try an exact allocation, which fails, so we try
>>> an XFS_ALLOCTYPE_NEAR_BNO allocation, which finds a large enough free
>>> extent before the inode. So we start allocating 16 MB chunks from the end
>>> of that free extent. From this moment on we are basically bound to keep
>>> allocating backwards using XFS_ALLOCTYPE_NEAR_BNO allocation until we
>>> exhaust the whole free extent.
>>>
>>> 2) A similar situation arises when we cannot grow the current extent any
>>> further but there is a large free space somewhere before this extent in
>>> the AG.
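Both cases boil down to the same feedback loop. A toy model reproduces the
backwards march seen in the filefrag listing (illustrative only; the
constants and names are made up and this is not the real xfs_alloc code):

#include <stdio.h>

int main(void)
{
	/* one big free extent sitting *before* the data just written */
	unsigned long long free_start = 1000000;	/* in 4k blocks */
	unsigned long long free_len = 40960;		/* 160 MB free  */
	unsigned long long want = 4096;			/* 16 MB chunks */
	/* the block we would like next, i.e. right after current EOF */
	unsigned long long target = free_start + free_len + 100;

	while (free_len >= want) {
		/* exact allocation at 'target' fails (space is in use),
		 * so a NEAR_BNO-style search picks the free space closest
		 * to it: the *end* of the free extent before it */
		unsigned long long got = free_start + free_len - want;

		free_len -= want;
		printf("extent at block %llu (%llu blocks before target)\n",
		       got, target - got);
		/* the next chunk wants to follow the one just written, so
		 * the nearest free space is again just before it... */
		target = got + want;
	}
	return 0;
}

Each pass hands back a chunk exactly 'want' blocks before the previous one,
which is the steady 4096-block backwards stride visible in extents 3
through 6758 above.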
>>>
>>> So I was wondering: is this known? Is XFS_ALLOCTYPE_NEAR_BNO so
>>> beneficial that it outweighs pathological cases like the above? Or should
>>> it maybe be disabled for larger files or for direct IO?
>>
>> Well known issue, first diagnosed about 15 years ago, IIRC. Simple
>> solution: use extent size hints.
>   I thought someone must have hit it before, but I wasn't successful in
> googling... I suggested fallocate to the customer, since they have a good
> idea of the final file size in advance, and in testing it gave better
> results than extent size hints (plus it works for other filesystems as
> well).
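Both workarounds mentioned here are ordinary userspace calls. A rough
sketch of what the application could do (the path and sizes are just the
ones from this thread; the extent size hint ioctl needs the xfsprogs
development headers, and in practice you would pick one of the two options,
not both):

#define _GNU_SOURCE		/* O_DIRECT */
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <xfs/xfs_fs.h>		/* struct fsxattr, XFS_IOC_FSSETXATTR */

int main(void)
{
	struct fsxattr fsx;
	int err;
	int fd = open("/raw_data/newfile",
		      O_CREAT | O_WRONLY | O_DIRECT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* option 1 (Dave): a 16 MB extent size hint; must be set before
	 * the file gets any extents */
	if (ioctl(fd, XFS_IOC_FSGETXATTR, &fsx) == 0) {
		fsx.fsx_extsize = 16 << 20;
		fsx.fsx_xflags |= XFS_XFLAG_EXTSIZE;
		if (ioctl(fd, XFS_IOC_FSSETXATTR, &fsx) < 0)
			perror("XFS_IOC_FSSETXATTR");
	}

	/* option 2 (Jan): preallocate the roughly known final size up
	 * front so the allocator can reserve one large contiguous range */
	err = posix_fallocate(fd, 0, 186294206464ULL);
	if (err)
		fprintf(stderr, "posix_fallocate: %s\n", strerror(err));

	/* ... the 16 MB O_DIRECT writes follow ... */
	return 0;
}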
>
> But really I was wondering about the usefulness of the
> XFS_ALLOCTYPE_NEAR_BNO heuristic. Sure, seek time depends on distance, so
> if we are speaking about allocating a single extent then
> XFS_ALLOCTYPE_NEAR_BNO is useful, but once that strategy has allocated two
> or three consecutive extents you've lost all the benefit, and you would
> have been better off allocating from the start of the free space.
> Obviously we don't know the future in advance, but this resembles a
> classical problem from the theory of approximation algorithms: the
> rent-or-buy problem, where renting corresponds to allocating from the end
> of the free space at the smaller cost, while buying corresponds to
> allocating from the beginning at the higher cost, but in the expectation
> that you won't have to pay anything in the future. The theory tells us
> that once we have paid as much in rent as buying would cost, it is
> advantageous to buy, and that gives a 2-approximation algorithm (you can
> do even better - a factor 1.58 approximation - if you use randomization,
> but I don't think we want to go that way). So from this I'd say that
> switching off XFS_ALLOCTYPE_NEAR_BNO allocation once you've allocated 2-3
> extents backwards would work out better on average.
>
> Honza
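To make the rent-or-buy analogy concrete, here is a sketch of the heuristic
Jan proposes - not actual XFS code, all names invented - where the
allocator "rents" by staying near the previous block and "buys" once a
couple of backwards extents have already been paid for:

#include <stdio.h>

enum strategy { ALLOC_NEAR_BNO, ALLOC_FROM_FREE_START };

/* 'backwards' counts consecutive extents handed out at decreasing
 * block numbers for this file; once the accumulated "rent" (short
 * backwards seeks) matches the one-off "buy" (one long seek to the
 * start of the free space), switch - the classic deterministic
 * 2-approximation for rent-or-buy */
static enum strategy pick_strategy(unsigned int backwards)
{
	return backwards >= 2 ? ALLOC_FROM_FREE_START : ALLOC_NEAR_BNO;
}

int main(void)
{
	unsigned int b;

	for (b = 0; b < 5; b++)
		printf("backwards=%u -> %s\n", b,
		       pick_strategy(b) == ALLOC_NEAR_BNO ?
		       "stay near previous block" :
		       "jump to start of free space");
	return 0;
}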
Sounds like a candidate for a dynamic allocation policy:
http://oss.sgi.com/archives/xfs/2013-01/msg00611.html
--Mark.