linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Benjamin Coddington" <bcodding@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Eric Sandeen <sandeen@sandeen.net>, linux-xfs@vger.kernel.org
Subject: Re: SEEK_HOLE second hole problem
Date: Wed, 12 Oct 2016 20:50:14 -0400	[thread overview]
Message-ID: <CF090CA4-3E5C-48EA-B392-4BFC19F6890C@redhat.com> (raw)
In-Reply-To: <20161012205405.GE27872@dastard>


On 12 Oct 2016, at 16:54, Dave Chinner wrote:

> On Wed, Oct 12, 2016 at 02:40:06PM -0400, Benjamin Coddington wrote:
>> On 12 Oct 2016, at 13:06, Eric Sandeen wrote:
>>
>>> On 10/12/16 11:15 AM, Benjamin Coddington wrote:
>>>> While investigating generic/285 failure on NFS on an XFS export I 
>>>> think I
>>>> found a seek hole bug.
>>>>
>>>> For a file with a hole/data pattern of hole, data, hole, data; it 
>>>> appears
>>>> that SEEK_HOLE for pattern chunks larger than 65536 will be 
>>>> incorrect for
>>>> seeking the start of the next hole after the first hole.
>>>
>>> [sandeen@sandeen ~]$ ./bcodding testfile
>>> SEEK_HOLE found second hole at 196608, expecting 139264
>>> [sandeen@sandeen ~]$ xfs_bmap testfile
>>> testfile:
>>> 	0: [0..135]: hole
>>> 	1: [136..383]: 134432656..134432903
>>> 	2: [384..407]: hole
>>> 	3: [408..543]: 134432392..134432527
>>>
>>> the numbers in brackets are sector numbers, so there is a hole
>>> at 0, blocks at 69632, hole at 196608, and more blocks at 208896.
>>>
>>> As bfoster mentioned on IRC, I think you are seeing xfs's 
>>> speculative
>>> preallocation at work; more data got written than you asked for,
>>> but there's no guarantee about how a filesystem will allocate
>>> blocks based on an IO pattern.
>>>
>>> The /data/ is correct even if a zero-filled block ended up somewhere
>>> you didn't expect:
>>
>> OK, this makes sense.  It's clear my understanding of a "hole" was 
>> off --
>> we're really looking for the next unallocated range, not the next 
>> unwritten
>> or zero range.
>
> Which, quite frankly, is a major red flag. Filesystems control
> allocation, not applications. Yes, you can /guide/ allocation with
> fallocate() and things like extent size hints, but you cannot
> /directly control/ how any filesystem allocates the blocks
> underlying a file from userspace.
>
> i.e. the moment an application makes an assumption that the
> filesystem "must leave a hole" when doing some operation, or that
> "we need to find the next unallocated region" in a file, the design
> should be thrown away and shoul dbe started again without those
> assumptions. That's because filesystem allocation behaviour is
> completely undefined by any standard, not guaranteed by any
> filesystem, and change between different filesystems and even
> different versions of the same filesystem.
>
> Remember - fallocate() defines a set of user visible behaviours, but
> it does not dictate how a filesystem should implement them. e.g.
> preallocation needs to guarantee that the next write to that region
> does not ENOSPC. That can be done in several different ways - write
> zeroes, allocate unwritten extents, accounting tricks to reserve
> blocks, etc. Every one of these is going to give different output
> when seek hole/data passes over those regions, but they will still
> /all be correct/.
>
>> Like me, generic/285 seems to have gotten this wrong too, but
>
> The test isn't wrong - it's just a demonstration of the fact we
> can't easily cater for every different allocation strategy that
> filesystems and storage uses to optimise IO patterns and pervent
> fragmentation.
>
> e.g. the hack in generic/285 to turn off ext4's "zero-around"
> functionality, which allocates and zeros small regions between data
> rather than leaving a hole. That's a different style of
> anti-fragmentation optimisation to what XFS uses, but the result is
> the same - there is data (all zeroes) on ext4 where other
> filesystems leave a hole.
>
> IOWs, by changing a sysfs value we make ext4 return different
> information from seek hole/data for exactly the same user IO
> pattern. Yet both are correct....
>
>> the short allocation sizes aren't triggering this preallocation when 
>> used
>> directly on XFS.  For NFS the larger st_blksize means we see the
>
>                                       ^ NFS client side
>> preallocation happen.
>
> NFS client write IO patterns require aggressive preallocation
> strategies when XFS is used on the server to prevent excessive
> fragmentation of larger files. What is being returned from seek
> hole/data in this case is still correct and valid - it's just not
> what you (or the test) were expecting.

What happened here is that I used generic/285 to guide how I thought
seek_hole should work rather than think about it from first principles 
(and
careful reading of the man pages), and fired off this presumptive bug
report.  That's just laziness and I am ashamed.  Thanks for taking time 
to take
me to school.

Ben

  reply	other threads:[~2016-10-13  0:50 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-10-12 16:15 SEEK_HOLE second hole problem Benjamin Coddington
2016-10-12 17:06 ` Eric Sandeen
2016-10-12 18:40   ` Benjamin Coddington
2016-10-12 20:54     ` Dave Chinner
2016-10-13  0:50       ` Benjamin Coddington [this message]
2016-10-14 10:24       ` Benjamin Coddington

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CF090CA4-3E5C-48EA-B392-4BFC19F6890C@redhat.com \
    --to=bcodding@redhat.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    --cc=sandeen@sandeen.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).