linux-fsdevel.vger.kernel.org archive mirror
* fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent"
@ 2014-03-15 21:02 Al Viro
  2014-03-16  2:21 ` Al Viro
  2014-03-17  0:11 ` Dave Chinner
  0 siblings, 2 replies; 12+ messages in thread
From: Al Viro @ 2014-03-15 21:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs, Dave Chinner, linux-fsdevel

	Regression in xfstests generic/263 is quite real - what happens is
that e.g.
ltp/fsx -F -H -N 10000  -o 128000 -l 500000 -r 4096 -t 512 -w 512 -Z /mnt/junk
where /mnt is on xfs ends up with a very odd file.  mmap() of its last
page has garbage in the page tail when observed on *any* kernel.  Copying
that file (with cp -a) yields a copy that doesn't trigger that behaviour.
What's more, xfs_repair doesn't notice anything fishy with that sucker.

This had been introduced (or, more likely, exposed) by the commit in
question.  As far as I can see, it's an on-disk corruption of some sort;
it *might* be triggered by some kind of dio-related race, but I would be
rather surprised if that had been the case - fsx is single-threaded,
after all, and making it fsync() *and* mmap/msync/munmap after each write
still produces such a file.  The file contents per se are fine; it's the
page tail on mmap() that is bogus.

Filesystem image after that crap is on ftp.linux.org.uk/pub/people/viro/img.bz2;
with that image mounted on /mnt we have
; ls -l /mnt/junk
-rw-r--r-- 1 root root 444928 Mar 15 16:26 /mnt/junk
; echo $((0x6ca00))
444928
; cat >foo.c <<'EOF'
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
int main(int argc, char **argv)
{
        int fd = open(argv[1], O_RDONLY);
        /* map the tail page of the file; EOF (0x6ca00) falls inside this page */
        char *p = (char *)mmap(0, 0xa00, PROT_READ, MAP_SHARED, fd, (off_t)0x6c000);
        if (p != (char *)-1)
                /* dump the part of the mapped page that lies beyond EOF */
                write(1, p + 0xa00, 4096 - 0xa00);
}
EOF
; gcc foo.c
; ./a.out /mnt/junk | od -c
<lots of garbage>
; cp -a /mnt/junk /mnt/junk1
; ./a.out /mnt/junk1 | od -c
0000000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0003000

And that's essentially what makes generic/263 complain.  Note, BTW, that
fallocate and hole-punching are irrelevant - the test in generic/263 steps into
those, but the same thing happens with these operations disabled (by -F -H).

I've found the thread from last June where you've mentioned generic/263
regression; AFAICS, Dave's comments there had been wrong...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent"
  2014-03-15 21:02 fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent" Al Viro
@ 2014-03-16  2:21 ` Al Viro
  2014-03-16  2:39   ` Al Viro
  2014-03-17  0:11 ` Dave Chinner
  1 sibling, 1 reply; 12+ messages in thread
From: Al Viro @ 2014-03-16  2:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-fsdevel, Dave Chinner, xfs

On Sat, Mar 15, 2014 at 09:02:16PM +0000, Al Viro wrote:

> And that's essentially what makes generic/263 complain.  Note, BTW, that
> fallocate and hole-punching are irrelevant - the test in generic/263 steps into
> those, but the same thing happens with these operations disabled (by -F -H).
> 
> I've found the thread from last June where you've mentioned generic/263
> regression; AFAICS, Dave's comments there had been wrong...

BTW, experimenting with that thing shows that junk in the tail of the page
actually comes from some unused sectors on the same device.  So it's an
information leak at the very least - I have seen it pick bits and pieces of
previously removed files that way.

While we are at it, the following creates such a buggered file in about
a half of runs:

#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/mman.h>
#define O_DIRECT 00040000	/* x86 value; fcntl.h only defines it with _GNU_SOURCE */

int main(void)
{
        int n = 0x5cf2e - 0x47000;
        int fd = open("/mnt/junk", O_RDWR|O_CREAT|O_TRUNC|O_DIRECT, 0666);
        char *p;

        /* extend the empty file to 0x5cf2e, leaving a partial last block */
        ftruncate(fd, 0x5cf2e);
        /* dirty the tail of the file via mmap, then write it back */
        p = mmap(NULL, n, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0x47000);
        memset(p, 'x', n);
        msync(p, n, MS_SYNC);
        munmap(p, n);
        /* O_DIRECT write crossing the old EOF, extending the file to 0x6ca00 */
        lseek(fd, 0x59000, SEEK_SET);
        p = malloc(0x13a00 + 512);
        memset(p, 'z', 0x13a00 + 512);
        /* round the buffer up to a 512-byte boundary, as O_DIRECT requires */
        write(fd, p + 512 - ((unsigned long)p & 511), 0x13a00);
        return 0;
}

The frequency depends on the fraction of unused sectors with non-zero
contents - for all I know it might hit that bug in 100% of runs, but
I can only detect that if the junk it picks contains non-zero data.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent"
  2014-03-16  2:21 ` Al Viro
@ 2014-03-16  2:39   ` Al Viro
  2014-03-16 20:56     ` Al Viro
  0 siblings, 1 reply; 12+ messages in thread
From: Al Viro @ 2014-03-16  2:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs, Dave Chinner, linux-fsdevel

On Sun, Mar 16, 2014 at 02:21:05AM +0000, Al Viro wrote:
> On Sat, Mar 15, 2014 at 09:02:16PM +0000, Al Viro wrote:
> 
> > And that's essentially what makes generic/263 complain.  Note, BTW, that
> > fallocate and hole-punching are irrelevant - the test in generic/263 steps into
> > those, but the same thing happens with these operations disabled (by -F -H).
> > 
> > I've found the thread from last June where you've mentioned generic/263
> > regression; AFAICS, Dave's comments there had been wrong...
> 
> BTW, experimenting with that thing shows that junk in the tail of the page
> actually comes from some unused sectors on the same device.  So it's an
> information leak at the very least - I have seen it pick bits and pieces of
> previously removed files that way.

Hrm...  s/unused/not zeroed out/, actually - block size is 4K.  So we have
an empty file extended by ftruncate(), then mmap+msync+munmap in its tail,
then O_DIRECT write starting from a couple of blocks prior to EOF and
extending it by ~15 blocks.  New EOF is 2.5Kb off the beginning of the
(new) last block.  Then it's closed.  Remaining 1.5Kb of that last
block is _not_ zeroed out; moreover, pagefault on that page ends up
reading the entire block, the junk in the tail not getting zeroed out
in in-core copy either.  Interesting...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent"
  2014-03-16  2:39   ` Al Viro
@ 2014-03-16 20:56     ` Al Viro
  2014-03-17  1:36       ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Al Viro @ 2014-03-16 20:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs, Dave Chinner, linux-fsdevel

On Sun, Mar 16, 2014 at 02:39:31AM +0000, Al Viro wrote:

> Hrm...  s/unused/not zeroed out/, actually - block size is 4K.  So we have
> an empty file extended by ftruncate(), then mmap+msync+munmap in its tail,
> then O_DIRECT write starting from a couple of blocks prior to EOF and
> extending it by ~15 blocks.  New EOF is 2.5Kb off the beginning of the
> (new) last block.  Then it's closed.  Remaining 1.5Kb of that last
> block is _not_ zeroed out; moreover, pagefault on that page ends up
> reading the entire block, the junk in the tail not getting zeroed out
> in in-core copy either.  Interesting...

AFAICS, what happens is that we hit this
        /*
         * If this is O_DIRECT or the mpage code calling tell them how large
         * the mapping is, so that we can avoid repeated get_blocks calls.
         */
        if (direct || size > (1 << inode->i_blkbits)) {
                xfs_off_t               mapping_size;

                mapping_size = imap.br_startoff + imap.br_blockcount - iblock;
                mapping_size <<= inode->i_blkbits;

                ASSERT(mapping_size > 0);
                if (mapping_size > size)
                        mapping_size = size;
                if (mapping_size > LONG_MAX)
                        mapping_size = LONG_MAX;

                bh_result->b_size = mapping_size;
        }
and while the caller (do_direct_IO()) is quite happy to skip subsequent calls
of get_block, buffer_new() is *NOT* set by that one.  Fair enough, since the
_first_ block of that run (the one we'd called __xfs_get_blocks() for) isn't
new, but dio_zero_block() for the partial block in the end of the area gets
confused by that.

Basically, with direct-io.c as it is, get_block may report more than one
block if they are contiguous on disk *AND* are all old or all new.  Returning
several old blocks + several freshly allocated is broken, and "preallocated"
is the same as "freshly allocated" in that respect - they need to be zeroed.

Looks like __xfs_get_blocks() is broken in that respect - I'm definitely
seeing O_DIRECT write() crossing the EOF calling it *once*, getting
->b_size set to a lot more than what remains until EOF and buffer_head
not getting BH_New on it.  And once that has happened, we are SOL - the
tail of the last block isn't zeroed.  Increase of prealloc size made that
more likely to happen (unsurprisingly, since it can only happen when blocks
adjacent to the last block of file are not taken by anything else).

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent"
  2014-03-15 21:02 fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent" Al Viro
  2014-03-16  2:21 ` Al Viro
@ 2014-03-17  0:11 ` Dave Chinner
  2014-03-17  0:29   ` Al Viro
  1 sibling, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2014-03-17  0:11 UTC (permalink / raw)
  To: Al Viro; +Cc: Brian Foster, linux-fsdevel, Dave Chinner, xfs

On Sat, Mar 15, 2014 at 09:02:16PM +0000, Al Viro wrote:
> 	Regression in xfstests generic/263 is quite real - what happens is
> that e.g.
> ltp/fsx -F -H -N 10000  -o 128000 -l 500000 -r 4096 -t 512 -w 512 -Z /mnt/junk
> where /mnt is on xfs ends up with a very odd file.  mmap() of its last
> page has garbage in the page tail when observed on *any* kernel.

Yes, we've known about this since 2011.  Right - that's a long
standing problem, and one I've never been able to isolate and so
reproduce with any luck. It can only be reproduced when you use mmap
and direct IO on the same file, and every time I've added debug to
find out where the tail block corruption was being introduced, the
data corruption goes away. It behaves just like a race condition....

>
> Copying
> that file (with cp -a) yields a copy that doesn't trigger that behaviour.
> What's more, xfs_repair doesn't notice anything fishy with that sucker.

xfs_repair does not detect data corruption - it does not care what is in
the data blocks and it shouldn't need to care because the kernel code
should never, ever expose stale data beyond EOF....

> This had been introduced (or, more likely, exposed) by the commit in
> question.  As far as I can see, it's an on-disk corruption of some sort;
> it *might* be triggered by some kind of dio-related race, but I would be
> rather surprised if that had been the case - fsx is single-threaded,
> after all, and making it fsync() *and* mmap/msync/munmap after each write
> still produces such a file.  The file contents per se are fine; it's the
> page tail on mmap() that is bogus.

Right.....


> Filesystem image after that crap is on ftp.linux.org.uk/pub/people/viro/img.bz2;
> with that image mounted on /mnt we have
> ; ls -l /mnt/junk
> -rw-r--r-- 1 root root 444928 Mar 15 16:26 /mnt/junk
> ; echo $((0x6ca00))
> 444928
> ; cat >foo.c <<'EOF'
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/mman.h>
> int main(int argc, char **argv)
> {
>         int fd = open(argv[1], O_RDONLY);
>         /* map the tail page of the file; EOF (0x6ca00) falls inside this page */
>         char *p = (char *)mmap(0, 0xa00, PROT_READ, MAP_SHARED, fd, (off_t)0x6c000);
>         if (p != (char *)-1)
>                 /* dump the part of the mapped page that lies beyond EOF */
>                 write(1, p + 0xa00, 4096 - 0xa00);
> }

... you're reading data beyond EOF from a page faulted page that
spans EOF. That page is filled by mpage_readpages(), not XFS. And
mpage_readpages() does not zero the region beyond EOF, which means that if
there is some issue where a filesystem doesn't zero a tail block
properly it will expose stale data to userspace....

> EOF
> ; gcc foo.c
> ; ./a.out /mnt/junk | od -c
> <lots of garbage>

Right, but the file is already bad, so this tells us nothing as to
how the problem arose in the first place. Indeed, create the partial
tail block with truncate:

$ rm /mnt/test/foo
$ xfs_io -f -c "truncate 0x6ca00" /mnt/test/foo
$ sudo umount /mnt/test
$ sudo mount /dev/vdb /mnt/test
$ ls -l /mnt/test/foo
-rw------- 1 root root 444928 Mar 17 09:45 /mnt/test/foo
$ ./mmap-eof-test /mnt/test/foo |od -c
0000000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0003000
$

With buffered IO:

$ rm /mnt/test/foo
$ xfs_io -f -c "pwrite 0x6c9ff 1" -c fsync /mnt/test/foo
wrote 1/1 bytes at offset 444927
1.000000 bytes, 1 ops; 0.0000 sec (75.120 KiB/sec and 76923.0769 ops/sec)
$ sudo umount /mnt/test
$ sudo mount /dev/vdb /mnt/test
$ ls -l /mnt/test/foo
-rw------- 1 root root 444928 Mar 17 10:23 /mnt/test/foo
$ ./mmap-eof-test /mnt/test/foo |od -c
0000000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0003000

Or with direct IO:

$ rm /mnt/test/foo
$ xfs_io -fd -c "pwrite 0x6c800 0x200" /mnt/test/foo
wrote 512/512 bytes at offset 444416
512.000000 bytes, 1 ops; 0.0000 sec (410.509 KiB/sec and 821.0181 ops/sec)
$ sudo umount /mnt/test
$ sudo mount /dev/vdb /mnt/test
$ ls -l /mnt/test/foo
-rw------- 1 root root 444928 Mar 17 10:25 /mnt/test/foo
$ ./mmap-eof-test /mnt/test/foo |od -c
write returned 1536/0
0000000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0003000

And you don't see a failure from your test program because the
partial tail block is correctly zeroed on disk.  So there's
something else going on here - the XFS write paths do zero the tail
of the block on disk, and so mmap should not *ever* expose stale
data.

AFAICT, the only way we can get garbage beyond EOF is to have
application mmap() map the page covering EOF, write beyond EOF, then
have it written back. However, we actually handle that case in
xfs_vm_writepage() by calling zero_user_segment() on the range of
the page beyond EOF. So we can't get garbage written beyond EOF that
way, either.

So I've got no real idea of how we are getting garbage on disk, what
role DIO plays in it, or even how to reproduce it reliably in a
manner that doesn't result in instrumentation causing the data
corruption to magically disappear...

> ; cp -a /mnt/junk /mnt/junk1
> ; ./a.out /mnt/junk1 | od -c
> 0000000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
> *
> 0003000

cp is showing the correct data because the __block_write_begin()
path zeroes the bytes beyond EOF in the tail page - it doesn't copy
them from the source....

> And that's essentially what makes generic/263 complain.

Yup, but it is merely the messenger....

> I've found the thread from last June where you've mentioned generic/263
> regression; AFAICS, Dave's comments there had been wrong...

generic/263 has failed since it was first split out of generic/091
in commit 0d69e10ed15b01397e8c6fd7833fa3c2970ec024 back in 2011. We
discovered this problem by accident when we busted fsx option
parsing and it ran the wrong options, resulting in an mmap+dio workload
being run. We fixed that up and added the workload to generic/091 (commit
c00bad1c4c348e45d00982d06fc40522fb8cb035), then split it out into a
separate test to isolate the failure case and leave generic/091 as a
reliable regression test.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent"
  2014-03-17  0:11 ` Dave Chinner
@ 2014-03-17  0:29   ` Al Viro
  2014-03-17  1:28     ` Al Viro
  0 siblings, 1 reply; 12+ messages in thread
From: Al Viro @ 2014-03-17  0:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-fsdevel, Dave Chinner, xfs

On Mon, Mar 17, 2014 at 11:11:30AM +1100, Dave Chinner wrote:

> Yes, we've known about this since 2011.  Right - that's a long
> standing problem, and one I've never been able to isolate and so
> reproduce with any luck. It can only be reproduced when you use mmap
> and direct IO on the same file, and every time I've added debug to
> find out where the tail block corruption was being introduced, the
> data corruption goes away. It behaves just like a race condition....

See downthread.  And I would be *very* surprised if it was a race -
don't forget the msync() done before that write().

I think I know what's going on - O_DIRECT write starting a bit before
EOF on a file with the last extent that can be grown.  It fills
a buffer_head with b_size extending quite a bit past the EOF; the
blocks are really allocated.  What causes the problem is that we
have the flags set for the *first* block.  IOW, buffer_new(bh) is
false - the first block has already been allocated.  And for
direct-io.c it means "no zeroing the tail of the last block".

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent"
  2014-03-17  0:29   ` Al Viro
@ 2014-03-17  1:28     ` Al Viro
  2014-03-17  1:38       ` Al Viro
  2014-03-17  1:41       ` Dave Chinner
  0 siblings, 2 replies; 12+ messages in thread
From: Al Viro @ 2014-03-17  1:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, Brian Foster, xfs, Dave Chinner

On Mon, Mar 17, 2014 at 12:29:18AM +0000, Al Viro wrote:

> I think I know what's going on - O_DIRECT write starting a bit before
> EOF on a file with the last extent that can be grown.  It fills
> a buffer_head with b_size extending quite a bit past the EOF; the
> blocks are really allocated.  What causes the problem is that we
> have the flags set for the *first* block.  IOW, buffer_new(bh) is
> false - the first block has already been allocated.  And for
> direct-io.c it means "no zeroing the tail of the last block".

BTW, that's something I have directly observed - xfs_get_blocks_direct()
called with iblock corresponding to a bit under 16Kb below EOF and
returning with ->b_size equal to 700K and ->b_flags not containing BH_New.

Once that has happened, the rest is inevitable - do_direct_IO() will
shove the userland data (~80Kb) into the blocks starting from one
in ->b_blocknr and dio_zero_block() called after it will do nothing,
since as far as it knows these blocks were not new - BH_New isn't there.

We may or may not get visible junk (after all, the block tail might have
contained zeros), but the situation above is easily reproduced and when
it happens, there will be no zeroing out.  Actually, I'm not sure if
the comment in direct-io.c (just before get_more_blocks()) is correct; it
says
 * If the fs has mapped a lot of blocks, it should populate bh->b_size to  
 * indicate how much contiguous disk space has been made available at
 * bh->b_blocknr.  
 *
 * If *any* of the mapped blocks are new, then the fs must set buffer_new().
 * This isn't very efficient...
but that is unsafe - consider a situation when you are writing 20Kb from e.g.
0.5Kb offset from the beginning of last (4Kb) block.  You have 6 blocks
affected, right?  One old, five new.  And you want the last half-kilobyte
of the new last block zeroed out.  What you do _not_ want is zeroing out the
half-kilobyte just before the affected area - the data there should stay.

IOW, we really can't mix new and old blocks in that interface - not enough
information is passed back to caller to be able to decide what does and
what does not need zeroing out.  It should be either all-new or all-old.

And it's not just the EOF, of course - the beginning of a hole in a sparse
file isn't any different from the end of file in that respect.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent"
  2014-03-16 20:56     ` Al Viro
@ 2014-03-17  1:36       ` Dave Chinner
  2014-03-17  2:43         ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2014-03-17  1:36 UTC (permalink / raw)
  To: Al Viro; +Cc: Brian Foster, xfs, Dave Chinner, linux-fsdevel

On Sun, Mar 16, 2014 at 08:56:24PM +0000, Al Viro wrote:
> On Sun, Mar 16, 2014 at 02:39:31AM +0000, Al Viro wrote:
> 
> > Hrm...  s/unused/not zeroed out/, actually - block size is 4K.  So we have
> > an empty file extended by ftruncate(), then mmap+msync+munmap in its tail,
> > then O_DIRECT write starting from a couple of blocks prior to EOF and
> > extending it by ~15 blocks.  New EOF is 2.5Kb off the beginning of the
> > (new) last block.  Then it's closed.  Remaining 1.5Kb of that last
> > block is _not_ zeroed out; moreover, pagefault on that page ends up
> > reading the entire block, the junk in the tail not getting zeroed out
> > in in-core copy either.  Interesting...
> 
> AFAICS, what happens is that we hit this
>         /*
>          * If this is O_DIRECT or the mpage code calling tell them how large
>          * the mapping is, so that we can avoid repeated get_blocks calls.
>          */
>         if (direct || size > (1 << inode->i_blkbits)) {
>                 xfs_off_t               mapping_size;
> 
>                 mapping_size = imap.br_startoff + imap.br_blockcount - iblock;
>                 mapping_size <<= inode->i_blkbits;
> 
>                 ASSERT(mapping_size > 0);
>                 if (mapping_size > size)
>                         mapping_size = size;
>                 if (mapping_size > LONG_MAX)
>                         mapping_size = LONG_MAX;
> 
>                 bh_result->b_size = mapping_size;
>         }
> and while the caller (do_direct_IO()) is quite happy to skip subsequent calls
> of get_block, buffer_new() is *NOT* set by that one.  Fair enough, since the
> _first_ block of that run (the one we'd called __xfs_get_blocks() for) isn't
> new, but dio_zero_block() for the partial block in the end of the area gets
> confused by that.

It's not obvious from your description what the issue is here. If
I run the obvious reproducer from your description:

$ xfs_io -fd -c "truncate $((262144 - 1536))" \
> -c "mmap 0 260k" -c "mwrite 253k 1k" -c msync -c munmap \
> -c "pwrite -b 32k $((262144 - 8192 - 2560)) 32k" \
> -c "mmap 276k 4k" -c "mread -v 276k 4k" /mnt/test/blah

I see the correct behaviour in the block mapping calls out of
direct IO:

  xfs_io-7113  [000]  9135.446574: xfs_file_direct_write: dev 253:16 ino 0x83 size 0x3fa00 offset 0x3d600 count 0x8000 ioflags 
  xfs_io-7113  [000]  9135.446597: xfs_get_blocks_alloc: dev 253:16 ino 0x83 size 0x3fa00 offset 0x3d000 count 36864 type  startoff 0x3d startblock 13 blockcount 0x2
  xfs_io-7113  [000]  9135.446603: xfs_get_blocks_found: dev 253:16 ino 0x83 size 0x3fa00 offset 0x3f000 count 28672 type  startoff 0x3f startblock 12 blockcount 0x1
  xfs_io-7113  [000]  9135.447037: xfs_get_blocks_alloc: dev 253:16 ino 0x83 size 0x3fa00 offset 0x40000 count 24576 type  startoff 0x40 startblock 15 blockcount 0x6

The first call is to allocate into the blocks before the EOF block,
the second maps the EOF block, and the third allocates the remainder
beyond EOF. And buffer_new() is set on both of the allocation ranges
that are mapped, and the tail block has zeroes written into it
correctly, so the mmap read gives zeroes in the region beyond EOF.

Can you give me a script that reproduces the problem for you? The
above trace output came from:

# ~/trace-cmd/trace-cmd record -e xfs_get_block\* -e xfs_map_block\* -e xfs_file\*
....
[run command in different shell]
^C
# ~/trace-cmd/trace-cmd report

> Basically, with direct-io.c as it is, get_block may report more than one
> block if they are contiguous on disk *AND* are all old or all new.  Returning
> several old blocks + several freshly allocated is broken, and "preallocated"
> is the same as "freshly allocated" in that respect - they need to be zeroed.
>
> Looks like __xfs_get_blocks() is broken in that respect - I'm definitely
> seeing O_DIRECT write() crossing the EOF calling it *once*, getting
> ->b_size set to a lot more than what remains until EOF and buffer_head
> not getting BH_New on it.

XFS should never do that. It does not mix existing block mappings
with newly allocated mappings, and so the newly allocated region
beyond EOF should always give a "xfs_get_blocks_alloc" trace, and
when that is emitted we always set buffer_new()....

> And once that has happened, we are SOL - the
> tail of the last block isn't zeroed.  Increase of prealloc size made that
> more likely to happen (unsurprisingly, since it can only happen when blocks
> adjacent to the last block of file are not taken by anything else).

Right, but I still don't see what userspace IO pattern is triggering
this block mapping weirdness.  A script that replicates it without
involving mmap() at all would be nice...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent"
  2014-03-17  1:28     ` Al Viro
@ 2014-03-17  1:38       ` Al Viro
  2014-03-17  1:41       ` Dave Chinner
  1 sibling, 0 replies; 12+ messages in thread
From: Al Viro @ 2014-03-17  1:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, Brian Foster, xfs, Dave Chinner

On Mon, Mar 17, 2014 at 01:28:04AM +0000, Al Viro wrote:
> but that is unsafe - consider a situation when you are writing 20Kb from e.g.
> 0.5Kb offset from the beginning of last (4Kb) block.  You have 6 blocks
> affected, right?  One old, five new.  And you want the last half-kilobyte

s/the last/all but the first/, sorry.  Basically, we want to get from
OOOOOOOO  (8 sectors of old data)
to
ONNNNNNN|NNNNNNNN|NNNNNNNN|NNNNNNNN|NNNNNNNN|NZZZZZZZ (1 sector of old data,
40 sectors of new data, 7 sectors of zeroes).  We want the last 7 sectors
zeroed out, but we do *not* want that to happen to the one sector of old data.
OTOH, if the file had been 4K shorter (and all blocks had been new) we would
want
ZNNNNNNN|NNNNNNNN|NNNNNNNN|NNNNNNNN|NNNNNNNN|NZZZZZZZ.  IOW, it's not just
the last block we want to know about.  There's simply not enough bandwidth
in that interface to pass the information we would need for such mixed
block runs...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent"
  2014-03-17  1:28     ` Al Viro
  2014-03-17  1:38       ` Al Viro
@ 2014-03-17  1:41       ` Dave Chinner
  1 sibling, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2014-03-17  1:41 UTC (permalink / raw)
  To: Al Viro; +Cc: Brian Foster, linux-fsdevel, Dave Chinner, xfs

On Mon, Mar 17, 2014 at 01:28:04AM +0000, Al Viro wrote:
> On Mon, Mar 17, 2014 at 12:29:18AM +0000, Al Viro wrote:
> 
> > I think I know what's going on - O_DIRECT write starting a bit before
> > EOF on a file with the last extent that can be grown.  It fills
> > a buffer_head with b_size extending quite a bit past the EOF; the
> > blocks are really allocated.  What causes the problem is that we
> > have the flags set for the *first* block.  IOW, buffer_new(bh) is
> > false - the first block has already been allocated.  And for
> > direct-io.c it means "no zeroing the tail of the last block".
> 
> BTW, that's something I have directly observed - xfs_get_blocks_direct()
> called with iblock corresponding to a bit under 16Kb below EOF and
> returning with ->b_size equal to 700K and ->b_flags not containing BH_New.

What's the userspace IO pattern that triggers this?

> IOW, we really can't mix new and old blocks in that interface - not enough
> information is passed back to caller to be able to decide what does and
> what does not need zeroing out.  It should be either all-new or all-old.

Right, and XFS should not be mixing old and new in the way you are
describing, and that's what I can't reproduce. See my reply on the
other thread. Probably best to continue there...

> And it's not just the EOF, of course - the beginning of a hole in a sparse
> file isn't any different from the end of file in that respect.

Except that XFS treats that differently - it does allocation as
unwritten extents there, and any mapping that covers an unwritten
block will always result in buffer_new() getting set...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent"
  2014-03-17  1:36       ` Dave Chinner
@ 2014-03-17  2:43         ` Dave Chinner
  2014-03-18  1:16           ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2014-03-17  2:43 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Brian Foster, xfs, Dave Chinner

On Mon, Mar 17, 2014 at 12:36:39PM +1100, Dave Chinner wrote:
> On Sun, Mar 16, 2014 at 08:56:24PM +0000, Al Viro wrote:
> > On Sun, Mar 16, 2014 at 02:39:31AM +0000, Al Viro wrote:
> > 
> > > Hrm...  s/unused/not zeroed out/, actually - block size is 4K.  So we have
> > > an empty file extended by ftruncate(), then mmap+msync+munmap in its tail,
> > > then O_DIRECT write starting from a couple of blocks prior to EOF and
> > > extending it by ~15 blocks.  New EOF is 2.5Kb off the beginning of the
> > > (new) last block.  Then it's closed.  Remaining 1.5Kb of that last
> > > block is _not_ zeroed out; moreover, pagefault on that page ends up
> > > reading the entire block, the junk in the tail not getting zeroed out
> > > in in-core copy either.  Interesting...
> > 
> > AFAICS, what happens is that we hit this
> >         /*
> >          * If this is O_DIRECT or the mpage code calling tell them how large
> >          * the mapping is, so that we can avoid repeated get_blocks calls.
> >          */
> >         if (direct || size > (1 << inode->i_blkbits)) {
> >                 xfs_off_t               mapping_size;
> > 
> >                 mapping_size = imap.br_startoff + imap.br_blockcount - iblock;
> >                 mapping_size <<= inode->i_blkbits;
> > 
> >                 ASSERT(mapping_size > 0);
> >                 if (mapping_size > size)
> >                         mapping_size = size;
> >                 if (mapping_size > LONG_MAX)
> >                         mapping_size = LONG_MAX;
> > 
> >                 bh_result->b_size = mapping_size;
> >         }
> > and while the caller (do_direct_IO()) is quite happy to skip subsequent calls
> > of get_block, buffer_new() is *NOT* set by that one.  Fair enough, since the
> > _first_ block of that run (the one we'd called __xfs_get_blocks() for) isn't
> > new, but dio_zero_block() for the partial block in the end of the area gets
> > confused by that.
> 
> It's not obvious from your description what the issue is here. If
> I run the obvious reproducer from your description:
> 
> $ xfs_io -fd -c "truncate $((262144 - 1536))" \
> > -c "mmap 0 260k" -c "mwrite 253k 1k" -c msync -c munmap \
> > -c "pwrite -b 32k $((262144 - 8192 - 2560)) 32k" \
> > -c "mmap 276k 4k" -c "mread -v 276k 4k" /mnt/test/blah

OK, so there's a precondition required here - it's definitely not an
empty file - there has to be an active delayed allocation mapping
beyond EOF for this test case to cause trouble.  So, here's the
precondition:

$ rm -rf /mnt/test/blah
$ xfs_io -f -c "pwrite 0 32k" /mnt/test/blah
$ xfs_io -f -c "pwrite -b 64k 0 240k" /mnt/test/blah

The first write sets the I_DIRTY_RELEASE flag on close, then the second
sees that flag and does not remove the delayed allocation beyond EOF
on the second close. So now there's a delalloc extent out to roughly
450k (block 112). Then run the above test case, and the msync causes
writeback to occur, which turns the entire delalloc extent
into allocated blocks - a single extent that extends beyond EOF.
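
Putting the two fragments together (purely for convenience - the paths and
sizes are the ones already used above, on a freshly made XFS filesystem
mounted at /mnt/test), the full sequence being described is:

rm -f /mnt/test/blah
xfs_io -f -c "pwrite 0 32k" /mnt/test/blah         # first close sets the I_DIRTY_RELEASE hint
xfs_io -f -c "pwrite -b 64k 0 240k" /mnt/test/blah # second close keeps the delalloc extent beyond EOF
xfs_io -fd -c "truncate $((262144 - 1536))" \
       -c "mmap 0 260k" -c "mwrite 253k 1k" -c msync -c munmap \
       -c "pwrite -b 32k $((262144 - 8192 - 2560)) 32k" \
       -c "mmap 276k 4k" -c "mread -v 276k 4k" /mnt/test/blah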

> > Looks like __xfs_get_blocks() is broken in that respect - I'm definitely
> > seeing O_DIRECT write() crossing the EOF calling it *once*, getting
> > ->b_size set to a lot more than what remains until EOF and buffer_head
> > not getting BH_New on it.
> 
> XFS should never do that. It does not mix existing block mappings
> with newly allocated mappings, and so the newly allocated region
> beyond EOF should always give a "xfs_get_blocks_alloc" trace, and
> when that is emitted we always set buffer_new()....

And so XFS is not mixing extent types, nor is it seeing newly
allocated blocks beyond EOF, and so it sees it as a single
extent full of valid data....

OK, now I understand where the bad mapping problem is coming from,
and why a change in speculative prealloc size might be seen as the
"cause" by a bisect. And it is a pre-existing condition that has
been there for years - I'd think that testing with
allocsize=<something large> would expose it easily on old kernels.

Al, you're right that the code that calculates the mapping size
needs to split mappings that extend across EOF for direct IO.  I'll
cook up a patch to fix it after lunch.
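
For illustration, here is a minimal sketch of that kind of clamp, building on
the __xfs_get_blocks() excerpt quoted earlier in the thread ("offset" is
assumed to be the byte offset corresponding to iblock; this is only a sketch
of the direction described above, not the actual patch):

        if (direct || size > (1 << inode->i_blkbits)) {
                xfs_off_t               mapping_size;

                mapping_size = imap.br_startoff + imap.br_blockcount - iblock;
                mapping_size <<= inode->i_blkbits;

                ASSERT(mapping_size > 0);
                if (mapping_size > size)
                        mapping_size = size;

                /*
                 * For direct IO, never let the mapping extend past the block
                 * that contains EOF: the allocation beyond it then has to be
                 * mapped by a later get_blocks call, which XFS can report as
                 * new so that the partial tail block gets zeroed.
                 */
                if (direct && offset < i_size_read(inode) &&
                    offset + mapping_size >= i_size_read(inode))
                        mapping_size = roundup_64(i_size_read(inode) - offset,
                                                  1 << inode->i_blkbits);

                if (mapping_size > LONG_MAX)
                        mapping_size = LONG_MAX;

                bh_result->b_size = mapping_size;
        }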

Thanks for digging in to this and finding what I couldn't.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent"
  2014-03-17  2:43         ` Dave Chinner
@ 2014-03-18  1:16           ` Dave Chinner
  0 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2014-03-18  1:16 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Brian Foster, Dave Chinner, xfs

On Mon, Mar 17, 2014 at 01:43:05PM +1100, Dave Chinner wrote:
> On Mon, Mar 17, 2014 at 12:36:39PM +1100, Dave Chinner wrote:
> > On Sun, Mar 16, 2014 at 08:56:24PM +0000, Al Viro wrote:
> > > Looks like __xfs_get_blocks() is broken in that respect - I'm definitely
> > > seeing O_DIRECT write() crossing the EOF calling it *once*, getting
> > > ->b_size set to a lot more than what remains until EOF and buffer_head
> > > not getting BH_New on it.
> > 
> > XFS should never do that. It does not mix existing block mappings
> > with newly allocated mappings, and so the newly allocated region
> > beyond EOF should always give a "xfs_get_blocks_alloc" trace, and
> > when that is emitted we always set buffer_new()....
> 
> And so XFS is not mixing extent types, nor is it seeing newly
> allocated blocks beyond EOF, and so it sees it as a single
> extent full of valid data....
> 
> OK, now I understand where the bad mapping problem is coming from,
> and why a change in speculative prealloc size might be seen as the
> "cause" by a bisect. And it is a pre-existing condition that has
> been there for years - I'd think that testing with
> allocsize=<something large> would expose it easily on old kernels.
> 
> Al, you're right that the code that calculates the mapping size
> needs to split mappings that extend across EOF for direct IO.  I'll
> cook up a patch to fix it after lunch.

Except that this now leaves stale, unreferenced delalloc extents on
the inode, so there's some other problem being masked by this
behaviour and so generic/270 now assert fails. I'll have to look
further.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2014-03-18  1:16 UTC | newest]

Thread overview: 12+ messages
2014-03-15 21:02 fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent" Al Viro
2014-03-16  2:21 ` Al Viro
2014-03-16  2:39   ` Al Viro
2014-03-16 20:56     ` Al Viro
2014-03-17  1:36       ` Dave Chinner
2014-03-17  2:43         ` Dave Chinner
2014-03-18  1:16           ` Dave Chinner
2014-03-17  0:11 ` Dave Chinner
2014-03-17  0:29   ` Al Viro
2014-03-17  1:28     ` Al Viro
2014-03-17  1:38       ` Al Viro
2014-03-17  1:41       ` Dave Chinner
