From: "Benjamin Coddington" <bcodding@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Eric Sandeen <sandeen@sandeen.net>, linux-xfs@vger.kernel.org
Subject: Re: SEEK_HOLE second hole problem
Date: Wed, 12 Oct 2016 20:50:14 -0400
Message-ID: <CF090CA4-3E5C-48EA-B392-4BFC19F6890C@redhat.com>
In-Reply-To: <20161012205405.GE27872@dastard>

On 12 Oct 2016, at 16:54, Dave Chinner wrote:
> On Wed, Oct 12, 2016 at 02:40:06PM -0400, Benjamin Coddington wrote:
>> On 12 Oct 2016, at 13:06, Eric Sandeen wrote:
>>
>>> On 10/12/16 11:15 AM, Benjamin Coddington wrote:
>>>> While investigating generic/285 failure on NFS on an XFS export I think I
>>>> found a seek hole bug.
>>>>
>>>> For a file with a hole/data pattern of hole, data, hole, data; it appears
>>>> that SEEK_HOLE for pattern chunks larger than 65536 will be incorrect for
>>>> seeking the start of the next hole after the first hole.
>>>
>>> [sandeen@sandeen ~]$ ./bcodding testfile
>>> SEEK_HOLE found second hole at 196608, expecting 139264
>>> [sandeen@sandeen ~]$ xfs_bmap testfile
>>> testfile:
>>> 0: [0..135]: hole
>>> 1: [136..383]: 134432656..134432903
>>> 2: [384..407]: hole
>>> 3: [408..543]: 134432392..134432527
>>>
>>> the numbers in brackets are sector numbers, so there is a hole
>>> at 0, blocks at 69632, hole at 196608, and more blocks at 208896.
>>>
>>> As bfoster mentioned on IRC, I think you are seeing xfs's speculative
>>> preallocation at work; more data got written than you asked for,
>>> but there's no guarantee about how a filesystem will allocate
>>> blocks based on an IO pattern.
>>>
>>> The /data/ is correct even if a zero-filled block ended up somewhere
>>> you didn't expect:
>>
>> OK, this makes sense. It's clear my understanding of a "hole" was off --
>> we're really looking for the next unallocated range, not the next unwritten
>> or zero range.
>
> Which, quite frankly, is a major red flag. Filesystems control
> allocation, not applications. Yes, you can /guide/ allocation with
> fallocate() and things like extent size hints, but you cannot
> /directly control/ how any filesystem allocates the blocks
> underlying a file from userspace.
>
> i.e. the moment an application makes an assumption that the
> filesystem "must leave a hole" when doing some operation, or that
> "we need to find the next unallocated region" in a file, the design
> should be thrown away and should be started again without those
> assumptions. That's because filesystem allocation behaviour is
> completely undefined by any standard, not guaranteed by any
> filesystem, and changes between different filesystems and even
> different versions of the same filesystem.
>
> Remember - fallocate() defines a set of user visible behaviours, but
> it does not dictate how a filesystem should implement them. e.g.
> preallocation needs to guarantee that the next write to that region
> does not ENOSPC. That can be done in several different ways - write
> zeroes, allocate unwritten extents, accounting tricks to reserve
> blocks, etc. Every one of these is going to give different output
> when seek hole/data passes over those regions, but they will still
> /all be correct/.
>
>> Like me, generic/285 seems to have gotten this wrong too, but
>
> The test isn't wrong - it's just a demonstration of the fact we
> can't easily cater for every different allocation strategy that
> filesystems and storage use to optimise IO patterns and prevent
> fragmentation.
>
> e.g. the hack in generic/285 to turn off ext4's "zero-around"
> functionality, which allocates and zeros small regions between data
> rather than leaving a hole. That's a different style of
> anti-fragmentation optimisation to what XFS uses, but the result is
> the same - there is data (all zeroes) on ext4 where other
> filesystems leave a hole.
>
> IOWs, by changing a sysfs value we make ext4 return different
> information from seek hole/data for exactly the same user IO
> pattern. Yet both are correct....
>
>> the short allocation sizes aren't triggering this preallocation when used
>> directly on XFS. For NFS the larger st_blksize means we see the
>
> ^ NFS client side
>> preallocation happen.
>
> NFS client write IO patterns require aggressive preallocation
> strategies when XFS is used on the server to prevent excessive
> fragmentation of larger files. What is being returned from seek
> hole/data in this case is still correct and valid - it's just not
> what you (or the test) were expecting.
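
A minimal sketch (not part of the original thread) of a reader that
treats seek hole/data purely as a hint, in the spirit of the point above.
The helper names segments() and read_via_segments() are hypothetical.
The code takes whatever layout the filesystem reports, zero-fills holes,
and never assumes a hole sits at any particular offset, so it should give
the same bytes whether or not the filesystem preallocated or zeroed around
the data. It assumes Linux and a Python build that exposes os.SEEK_DATA
and os.SEEK_HOLE.

```python
import errno
import os


def segments(fd):
    """Yield (kind, start, end) for each range the filesystem reports.

    kind is "data" or "hole". The layout is whatever the filesystem
    chooses to report; callers must not expect holes at fixed offsets.
    """
    size = os.fstat(fd).st_size
    pos = 0
    while pos < size:
        try:
            data = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno != errno.ENXIO:
                raise
            # ENXIO: no data at or after pos, so the rest is a trailing hole.
            yield ("hole", pos, size)
            return
        if data > pos:
            yield ("hole", pos, data)
        # There is an implicit hole at end of file, so this finds an offset.
        hole = os.lseek(fd, data, os.SEEK_HOLE)
        yield ("data", data, hole)
        pos = hole


def read_via_segments(fd):
    """Rebuild the file contents: read data ranges, leave holes as zeros."""
    out = bytearray(os.fstat(fd).st_size)
    for kind, start, end in segments(fd):
        if kind == "data":
            chunk = os.pread(fd, end - start, start)
            # Assign by the length actually read so the buffer never resizes.
            out[start:start + len(chunk)] = chunk
    return bytes(out)
```

Comparing read_via_segments(fd) against a plain read of the same file is a
check that holds on any filesystem, because it depends only on the data
and not on where the allocator happened to leave holes.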

What happened here is that I used generic/285 to guide how I thought
seek_hole should work rather than thinking about it from first principles
(and careful reading of the man pages), and fired off this presumptive bug
report. That's just laziness and I am ashamed. Thanks for taking time to
take me to school.

Ben
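
For reference, a small sketch (added here, not part of the original
message) of the arithmetic behind the xfs_bmap output quoted above. It
converts the 512-byte sector ranges to byte offsets and looks up the next
hole the way SEEK_HOLE would. The names EXTENTS, to_bytes() and
next_hole() are hypothetical, and treating the end of the last extent as
end of file is an assumption.

```python
SECTOR_SIZE = 512  # xfs_bmap reports extents in 512-byte units by default

# (first_sector, last_sector, kind), copied from the quoted xfs_bmap output.
EXTENTS = [
    (0, 135, "hole"),
    (136, 383, "data"),
    (384, 407, "hole"),
    (408, 543, "data"),
]


def to_bytes(extents, sector_size=SECTOR_SIZE):
    """Convert inclusive sector ranges to half-open byte ranges [start, end)."""
    return [(first * sector_size, (last + 1) * sector_size, kind)
            for first, last, kind in extents]


def next_hole(byte_ranges, offset):
    """Return the first hole offset at or after `offset`.

    Mirrors SEEK_HOLE semantics on this static map: an offset already
    inside a hole is returned unchanged, and the end of the last extent
    stands in for the implicit hole at end of file (an assumption here).
    """
    for start, end, kind in byte_ranges:
        if kind == "hole" and offset < end:
            return max(start, offset)
    return byte_ranges[-1][1]


ranges = to_bytes(EXTENTS)
print(ranges[1])                 # (69632, 196608, 'data')
print(next_hole(ranges, 69632))  # 196608
```

The first data extent covers bytes 69632 to 196608, so a SEEK_HOLE issued
from the start of the data lands at 196608 rather than the 139264 the test
program expected. The difference is 57344 bytes, or 112 sectors.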