* SEEK_HOLE second hole problem
From: Benjamin Coddington @ 2016-10-12 16:15 UTC
To: linux-xfs

While investigating a generic/285 failure on NFS over an XFS export, I
think I found a SEEK_HOLE bug.

For a file with a hole/data pattern of hole, data, hole, data, it appears
that SEEK_HOLE for pattern chunks larger than 65536 bytes returns the
wrong offset when seeking the start of the next hole after the first hole.

Here's my little test program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

//#define CHUNK 65536
#define CHUNK 69632

int main(int argc, char **argv)
{
	int fd;
	off_t pos;
	char *buf = NULL;

	buf = malloc(4096);
	memset(buf, 'a', 4096);

	fd = open(argv[1], O_RDWR|O_CREAT|O_TRUNC, 0644);

	/* data in [CHUNK, CHUNK*2), leaving a hole in [0, CHUNK) */
	for (pos = CHUNK; pos < CHUNK*2; pos += 4096)
		pwrite(fd, buf, 4096, pos);

	/* data in [CHUNK*3, CHUNK*4), leaving a hole in [CHUNK*2, CHUNK*3) */
	for (pos = CHUNK*3; pos < CHUNK*4; pos += 4096)
		pwrite(fd, buf, 4096, pos);

	pos = lseek(fd, CHUNK, SEEK_HOLE);
	printf("SEEK_HOLE found second hole at %llu, expecting %llu\n",
	       (unsigned long long)pos, (unsigned long long)(CHUNK*2));

	close(fd);
	return 0;
}

[root@fedora ~]# ./test /exports/scratch/foo
SEEK_HOLE found second hole at 196608, expecting 139264

This is on 4.8. I did a bit of digging through XFS to see if I could find
the problem, and it looks like XFS is interesting enough that I could
spend all my time reading it... but I couldn't make serious progress and
I have to go back to my actual job.

Let me know how I can help, or if I am just doing something exceedingly
stupid..

Ben

* Re: SEEK_HOLE second hole problem
From: Eric Sandeen @ 2016-10-12 17:06 UTC
To: Benjamin Coddington, linux-xfs

On 10/12/16 11:15 AM, Benjamin Coddington wrote:
> While investigating a generic/285 failure on NFS over an XFS export, I
> think I found a SEEK_HOLE bug.
>
> For a file with a hole/data pattern of hole, data, hole, data, it appears
> that SEEK_HOLE for pattern chunks larger than 65536 bytes returns the
> wrong offset when seeking the start of the next hole after the first hole.

[sandeen@sandeen ~]$ ./bcodding testfile
SEEK_HOLE found second hole at 196608, expecting 139264
[sandeen@sandeen ~]$ xfs_bmap testfile
testfile:
        0: [0..135]: hole
        1: [136..383]: 134432656..134432903
        2: [384..407]: hole
        3: [408..543]: 134432392..134432527

The numbers in brackets are 512-byte sector numbers, so there is a hole at
byte 0, blocks at 69632 (136 * 512), a hole at 196608 (384 * 512), and
more blocks at 208896 (408 * 512).

As bfoster mentioned on IRC, I think you are seeing XFS's speculative
preallocation at work; more data got written than you asked for, but
there's no guarantee about how a filesystem will allocate blocks based on
an IO pattern.

The /data/ is correct even if a zero-filled block ended up somewhere you
didn't expect:

[sandeen@sandeen ~]$ hexdump -C testfile
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00011000  61 61 61 61 61 61 61 61  61 61 61 61 61 61 61 61  |aaaaaaaaaaaaaaaa|
*
00022000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00033000  61 61 61 61 61 61 61 61  61 61 61 61 61 61 61 61  |aaaaaaaaaaaaaaaa|
*
00044000

(0x22000 is 139264 bytes)

-Eric
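
The seek interface itself can reproduce the layout xfs_bmap shows above.
Here is a small standalone sketch (not part of the original thread) that
walks a file and prints the data/hole segments as the kernel reports them;
on glibc, SEEK_DATA and SEEK_HOLE need _GNU_SOURCE:

/* seekmap.c: print a file's data/hole segments via SEEK_DATA/SEEK_HOLE.
 * Build: cc -o seekmap seekmap.c    Run: ./seekmap <file>
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct stat st;
	off_t pos = 0, data, hole;
	int fd = open(argv[1], O_RDONLY);

	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	while (pos < st.st_size) {
		/* next data at or after pos; fails with ENXIO if none */
		data = lseek(fd, pos, SEEK_DATA);
		if (data < 0)
			break;
		/* every data segment ends at a hole (or the implicit EOF hole) */
		hole = lseek(fd, data, SEEK_HOLE);
		if (data > pos)
			printf("hole: [%lld..%lld)\n",
			       (long long)pos, (long long)data);
		printf("data: [%lld..%lld)\n",
		       (long long)data, (long long)hole);
		pos = hole;
	}
	if (pos < st.st_size)        /* trailing hole after the last data */
		printf("hole: [%lld..%lld)\n",
		       (long long)pos, (long long)st.st_size);
	close(fd);
	return 0;
}

Run against the test file, this should report the same boundaries the
xfs_bmap output above implies, in bytes rather than sectors.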

* Re: SEEK_HOLE second hole problem
From: Benjamin Coddington @ 2016-10-12 18:40 UTC
To: Eric Sandeen; +Cc: linux-xfs

On 12 Oct 2016, at 13:06, Eric Sandeen wrote:

> [...]
> As bfoster mentioned on IRC, I think you are seeing XFS's speculative
> preallocation at work; more data got written than you asked for, but
> there's no guarantee about how a filesystem will allocate blocks based
> on an IO pattern.
>
> The /data/ is correct even if a zero-filled block ended up somewhere you
> didn't expect:

OK, this makes sense. It's clear my understanding of a "hole" was off --
we're really looking for the next unallocated range, not the next
unwritten or zero range.

Like me, generic/285 seems to have gotten this wrong too, but the short
allocation sizes aren't triggering this preallocation when used directly
on XFS. For NFS, the larger st_blksize means we see the preallocation
happen.

Thanks Eric,
Ben
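
To see the allocated-versus-unwritten distinction directly, the FIEMAP
ioctl reports a flag on each extent. This is a hedged sketch, assuming a
kernel with FS_IOC_FIEMAP; the fixed 32-extent buffer and minimal error
handling are illustrative shortcuts:

/* fiemap-dump.c: list extents and mark allocated-but-unwritten ones. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	unsigned int i, n = 32;
	struct fiemap *fm;
	int fd = open(argv[1], O_RDONLY);

	fm = calloc(1, sizeof(*fm) + n * sizeof(struct fiemap_extent));
	fm->fm_start = 0;
	fm->fm_length = ~0ULL;               /* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;     /* flush delalloc first */
	fm->fm_extent_count = n;

	if (fd < 0 || ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
		return 1;

	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *e = &fm->fm_extents[i];
		printf("extent: logical %llu len %llu%s\n",
		       (unsigned long long)e->fe_logical,
		       (unsigned long long)e->fe_length,
		       (e->fe_flags & FIEMAP_EXTENT_UNWRITTEN) ?
				" (allocated, unwritten)" : "");
	}
	free(fm);
	close(fd);
	return 0;
}

Preallocated space shows up here as an extent carrying
FIEMAP_EXTENT_UNWRITTEN, while seek hole/data may report such a region as
either data or hole depending on the filesystem.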

* Re: SEEK_HOLE second hole problem
From: Dave Chinner @ 2016-10-12 20:54 UTC
To: Benjamin Coddington; +Cc: Eric Sandeen, linux-xfs

On Wed, Oct 12, 2016 at 02:40:06PM -0400, Benjamin Coddington wrote:
> OK, this makes sense. It's clear my understanding of a "hole" was off --
> we're really looking for the next unallocated range, not the next
> unwritten or zero range.

Which, quite frankly, is a major red flag. Filesystems control
allocation, not applications. Yes, you can /guide/ allocation with
fallocate() and things like extent size hints, but you cannot /directly
control/ how any filesystem allocates the blocks underlying a file from
userspace.

i.e. the moment an application makes an assumption that the filesystem
"must leave a hole" when doing some operation, or that "we need to find
the next unallocated region" in a file, the design should be thrown away
and started again without those assumptions. That's because filesystem
allocation behaviour is completely undefined by any standard, is not
guaranteed by any filesystem, and changes between different filesystems
and even between different versions of the same filesystem.

Remember - fallocate() defines a set of user-visible behaviours, but it
does not dictate how a filesystem should implement them. e.g.
preallocation needs to guarantee that the next write to that region does
not ENOSPC. That can be done in several different ways - writing zeroes,
allocating unwritten extents, accounting tricks to reserve blocks, etc.
Every one of these is going to give different output when seek hole/data
passes over those regions, but they will still /all be correct/.

> Like me, generic/285 seems to have gotten this wrong too, but

The test isn't wrong - it's just a demonstration of the fact that we
can't easily cater for every different allocation strategy that
filesystems and storage use to optimise IO patterns and prevent
fragmentation.

e.g. the hack in generic/285 that turns off ext4's "zero-around"
functionality, which allocates and zeroes small regions between data
rather than leaving a hole. That's a different style of
anti-fragmentation optimisation to what XFS uses, but the result is the
same - there is data (all zeroes) on ext4 where other filesystems leave
a hole.

IOWs, by changing a sysfs value we make ext4 return different information
from seek hole/data for exactly the same user IO pattern. Yet both are
correct....

> the short allocation sizes aren't triggering this preallocation when
> used directly on XFS. For NFS, the larger st_blksize means we see the
                            ^ NFS client side
> preallocation happen.

NFS client write IO patterns require aggressive preallocation strategies
when XFS is used on the server to prevent excessive fragmentation of
larger files. What is being returned from seek hole/data in this case is
still correct and valid - it's just not what you (or the test) were
expecting.

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com
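
As one concrete form of the extent size hints mentioned above, the generic
fsxattr ioctls can attach a per-file hint. This sketch is not from the
thread; the 1 MiB value is arbitrary, on XFS the hint must be set while
the file is still empty, and the filesystem remains free to allocate
however it sees fit:

/* extsize-hint.c: set a 1 MiB extent size hint on a newly created file. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	struct fsxattr fsx;
	/* O_EXCL: the hint has to go on a brand-new, empty file */
	int fd = open(argv[1], O_RDWR | O_CREAT | O_EXCL, 0644);

	if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return 1;

	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;   /* honour fsx_extsize */
	fsx.fsx_extsize = 1024 * 1024;        /* hint, in bytes */

	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSSETXATTR");
		return 1;
	}
	close(fd);
	return 0;
}

Even with the hint applied, seek hole/data output over the resulting file
is filesystem-defined, which is exactly Dave's point.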

* Re: SEEK_HOLE second hole problem
From: Benjamin Coddington @ 2016-10-13 0:50 UTC
To: Dave Chinner; +Cc: Eric Sandeen, linux-xfs

On 12 Oct 2016, at 16:54, Dave Chinner wrote:

> [...]
> NFS client write IO patterns require aggressive preallocation strategies
> when XFS is used on the server to prevent excessive fragmentation of
> larger files. What is being returned from seek hole/data in this case is
> still correct and valid - it's just not what you (or the test) were
> expecting.

What happened here is that I used generic/285 to guide how I thought
SEEK_HOLE should work rather than thinking about it from first principles
(and a careful reading of the man pages), and fired off this presumptive
bug report. That's just laziness, and I am ashamed.

Thanks for taking the time to take me to school.

Ben

* Re: SEEK_HOLE second hole problem
From: Benjamin Coddington @ 2016-10-14 10:24 UTC
To: Dave Chinner; +Cc: Eric Sandeen, linux-xfs

On 12 Oct 2016, at 16:54, Dave Chinner wrote:

>> Like me, generic/285 seems to have gotten this wrong too, but
>
> The test isn't wrong - it's just a demonstration of the fact that we
> can't easily cater for every different allocation strategy that
> filesystems and storage use to optimise IO patterns and prevent
> fragmentation.

Regarding generic/285 -- I'd like to have it work for NFS. Maybe the test
could use SEEK_DATA to discover an appropriate allocation size for
filesystems exported over NFS: write to increasing offsets until a seek
for data from the beginning of the file reveals the second allocation.
That seems to work for xfs, ext4, btrfs, and tmpfs exports. I'll send a
patch with that to fstests.

Ben
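
A rough sketch of that discovery loop (illustrative only; the actual
fstests patch may differ). It doubles a candidate gap until SEEK_DATA past
the first block lands exactly on the second write, meaning the region
between the two writes is reported as a hole:

/* allocsize-probe.c: find a gap the filesystem reports as a hole. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define BLK 4096

int main(int argc, char **argv)
{
	char buf[BLK];
	off_t gap, off;
	int fd;

	memset(buf, 'a', BLK);
	for (gap = BLK; gap <= (off_t)BLK << 16; gap *= 2) {
		fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
		if (fd < 0)
			return 1;
		pwrite(fd, buf, BLK, 0);     /* first allocation */
		pwrite(fd, buf, BLK, gap);   /* candidate second allocation */
		fsync(fd);                   /* settle delayed allocation */

		/* a hole in [BLK, gap) makes SEEK_DATA jump to gap */
		off = lseek(fd, BLK, SEEK_DATA);
		close(fd);
		if (off == gap) {
			printf("allocation unit for the test: %lld\n",
			       (long long)gap);
			return 0;
		}
	}
	fprintf(stderr, "no hole gap found up to %lld\n",
		(long long)(gap / 2));
	return 1;
}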