Re: fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent"

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Al Viro <viro@ZenIV.linux.org.uk>
To: Brian Foster <bfoster@redhat.com>
Cc: xfs@oss.sgi.com, Dave Chinner <dchinner@redhat.com>,
	linux-fsdevel@vger.kernel.org
Subject: Re: fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent"
Date: Sun, 16 Mar 2014 20:56:24 +0000	[thread overview]
Message-ID: <20140316205624.GS18016@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20140316023931.GR18016@ZenIV.linux.org.uk>

On Sun, Mar 16, 2014 at 02:39:31AM +0000, Al Viro wrote:

> Hrm...  s/unused/not zeroed out/, actually - block size is 4K.  So we have
> an empty file extended by ftruncate(), then mmap+msync+munmap in its tail,
> then O_DIRECT write starting from a couple of blocks prior to EOF and
> extending it by ~15 blocks.  New EOF is 2.5Kb off the beginning of the
> (new) last block.  Then it's closed.  Remaining 1.5Kb of that last
> block is _not_ zeroed out; moreover, pagefault on that page ends up
> reading the entire block, the junk in the tail not getting zeroed out
> in in-core copy either.  Interesting...

AFAICS, what happens is that we hit this
        /*
         * If this is O_DIRECT or the mpage code calling tell them how large
         * the mapping is, so that we can avoid repeated get_blocks calls.
         */
        if (direct || size > (1 << inode->i_blkbits)) {
                xfs_off_t               mapping_size;

                mapping_size = imap.br_startoff + imap.br_blockcount - iblock;
                mapping_size <<= inode->i_blkbits;

                ASSERT(mapping_size > 0);
                if (mapping_size > size)
                        mapping_size = size;
                if (mapping_size > LONG_MAX)
                        mapping_size = LONG_MAX;

                bh_result->b_size = mapping_size;
        }
and while the caller (do_direct_IO()) is quite happy to skip subsequent calls
of get_block, buffer_new() is *NOT* set by that one.  Fair enough, since the
_first_ block of that run (the one we'd called __xfs_get_blocks() for) isn't
new, but dio_zero_block() for the partial block in the end of the area gets
confused by that.

Basically, with direct-io.c as it is, get_block may report more than one
block if they are contiguous on disk *AND* are all old or all new.  Returning
several old blocks + several freshly allocated is broken, and "preallocated"
is the same as "freshly allocated" in that respect - they need to be zeroed.

Looks like __xfs_get_blocks() is broken in that respect - I'm definitely
seeing O_DIRECT write() crossing the EOF calling it *once*, getting
->b_size set to a lot more than what remains until EOF and buffer_head
not getting BH_New on it.  And once that has happened, we are SOL - the
tail of the last block isn't zeroed.  Increase of prealloc size made that
more likely to happen (unsurprisingly, since it can only happen when blocks
adjacent to the last block of file are not taken by anything else).

next prev parent reply	other threads:[~2014-03-16 20:56 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-03-15 21:02 fs corruption exposed by "xfs: increase prealloc size to double that of the previous extent" Al Viro
2014-03-16  2:21 ` Al Viro
2014-03-16  2:39   ` Al Viro
2014-03-16 20:56     ` Al Viro [this message]
2014-03-17  1:36       ` Dave Chinner
2014-03-17  2:43         ` Dave Chinner
2014-03-18  1:16           ` Dave Chinner
2014-03-17  0:11 ` Dave Chinner
2014-03-17  0:29   ` Al Viro
2014-03-17  1:28     ` Al Viro
2014-03-17  1:38       ` Al Viro
2014-03-17  1:41       ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140316205624.GS18016@ZenIV.linux.org.uk \
    --to=viro@zeniv.linux.org.uk \
    --cc=bfoster@redhat.com \
    --cc=dchinner@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).