linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* mmap writes vs truncate causing data corruption
@ 2014-09-17  9:28 Dave Chinner
  2014-09-23 12:27 ` Jan Kara
  0 siblings, 1 reply; 3+ messages in thread
From: Dave Chinner @ 2014-09-17  9:28 UTC (permalink / raw)
  To: linux-fsdevel

Hi folks,

Brian, Eric and I have been tracking down a set of data corruption
problems on XFS over the past couple of days. The one that is
important to the wider developer community is the truncate/mmap
write issue that Eric isolated from a real-world application that
was triggering it.

The corruption only affects block size smaller than page size
configurations and is caused by mmapped writes to the EOF page
which has been partially truncated. If we then extend the file
again, the region of the page that was truncated and had blocks
punched out of it can be written to via mapped writes without blocks
being allocated for the hole. Hence while the page is in the page
cache, the contents of the file look OK. Unmount/mount the
filesystem, then re-read the page from disk and it will contain
zeros because there is a hole rather than data blocks.

In the XFS case, the bug was that the filesystem truncate code is
not cleaning the partial page fully during the truncate down or up,
and hence the pte remains mapped dirty in the TLB. Hence when new
data is written to the page, it doesn't trigger a write fault,
->page_mkwrite is not called and hence blocks are not allocated over
the hole. I chose to fix it on the truncate up as it was the lesser
of two evils - we can't actually fix the problem entirely because we
can't serialise page faults against truncate.

Initially I couldn't reproduce the data corruptions on ext4, but
Eric came to my rescue and provided me with an updated mremap test
that triggered corruptions. I also added another variant to the
plain truncate/mwrite test and so now that itest also reliably
produces data corruptions on ext4. I suspect the ext4 issue is
similar to the XFS case (i.e. no page_mkwrite call), but I can't
follow the ext4 code with any level of cluefulness....

And so: practise what I preach and post a heads-up to -fsdevel.

That is, if two filesystems that support block size smaller than
page size have similar data corruptions when exercising the same
generic code paths in similar ways, then it is likely that other
filesystems have similar problems and need to be checked.

While the tests I packaged for xfstests are not yet reviewed, they
do work and expose the corruptions on both XFS and ext4. Hence I've
pushed them to a git tree branch so that everyone can test their
filesystems against the reproducers. The tests in question are
generic/029 and generic/030 and can be found here:

git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git mmap-truncate

FWIW, any filesystem that supports FALLOC_FL_COLLAPSE_RANGE should
also have generic/031 run against it. This is the test case that
Brian isolated from an fsx failure that exposed a different partial
page truncation data corruption issue in XFS with block size smaller
than page size. However, it's a similar situation with ext4: the
exact same underlying partial page writeback bug was found in ext4
back in May and fixed in 3.16....

Most importantly, all the credit must go to Eric and Brian for doing
the hard work of turning application failures into simple,
reproducable test cases.  Finding bugs is easy when you are provided
with a 100% reliable reproducer and a bunch of analysis about where
the bug most likely lies. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: mmap writes vs truncate causing data corruption
  2014-09-17  9:28 mmap writes vs truncate causing data corruption Dave Chinner
@ 2014-09-23 12:27 ` Jan Kara
  2014-09-23 13:18   ` Dave Chinner
  0 siblings, 1 reply; 3+ messages in thread
From: Jan Kara @ 2014-09-23 12:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel

  Hi,

On Wed 17-09-14 19:28:05, Dave Chinner wrote:
> Brian, Eric and I have been tracking down a set of data corruption
> problems on XFS over the past couple of days. The one that is
> important to the wider developer community is the truncate/mmap
> write issue that Eric isolated from a real-world application that
> was triggering it.
> 
> The corruption only affects block size smaller than page size
> configurations and is caused by mmapped writes to the EOF page
> which has been partially truncated. If we then extend the file
> again, the region of the page that was truncated and had blocks
> punched out of it can be written to via mapped writes without blocks
> being allocated for the hole. Hence while the page is in the page
> cache, the contents of the file look OK. Unmount/mount the
> filesystem, then re-read the page from disk and it will contain
> zeros because there is a hole rather than data blocks.
  Hum, this is what we already discussed in
http://lists.openwall.net/linux-ext4/2014/03/13/23, isn't it? I never
thought about using mremap() in the test cases. That makes it even a POSIX
valid test case... Nasty.

> In the XFS case, the bug was that the filesystem truncate code is
> not cleaning the partial page fully during the truncate down or up,
> and hence the pte remains mapped dirty in the TLB. Hence when new
> data is written to the page, it doesn't trigger a write fault,
> ->page_mkwrite is not called and hence blocks are not allocated over
> the hole. I chose to fix it on the truncate up as it was the lesser
> of two evils - we can't actually fix the problem entirely because we
> can't serialise page faults against truncate.
  Actually, as I mentioned in the above email, exactly the same problem
happens when file gets extended because of a write beyond EOF (just change
truncate up for pwrite in your test cases). You didn't handle that case in
your XFS patch AFAICS.

> Initially I couldn't reproduce the data corruptions on ext4, but
> Eric came to my rescue and provided me with an updated mremap test
> that triggered corruptions. I also added another variant to the
> plain truncate/mwrite test and so now that itest also reliably
> produces data corruptions on ext4. I suspect the ext4 issue is
> similar to the XFS case (i.e. no page_mkwrite call), but I can't
> follow the ext4 code with any level of cluefulness....
  I'm surprised ext4 is vulnerable. When I was checking a few years back
(2009 or so) it was not because if we found dirty buffers not marked
delalloc we just bit the bullet and tried allocating blocks. But probably
this got broken... checking the code... yeah, I broke that when rewriting
ext4 writeback path :-|

> That is, if two filesystems that support block size smaller than
> page size have similar data corruptions when exercising the same
> generic code paths in similar ways, then it is likely that other
> filesystems have similar problems and need to be checked.
  Frankly, I'd like to handle the problem in the generic code rather than
having hacks in various filesystems. I have a patch back from 2009 which
implements a helper function which gets called when creating a hole (either
from ->setattr or ->write_end) and which handles this. It also has various
optimizations built in - it doesn't do anything when blocksize == pagesize
or when no hole block is actually created. Also it doesn't do any IO as you
do in XFS - it only writeprotects the page.  I'll port the patch and try it
out with ext4.

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: mmap writes vs truncate causing data corruption
  2014-09-23 12:27 ` Jan Kara
@ 2014-09-23 13:18   ` Dave Chinner
  0 siblings, 0 replies; 3+ messages in thread
From: Dave Chinner @ 2014-09-23 13:18 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Tue, Sep 23, 2014 at 02:27:54PM +0200, Jan Kara wrote:
>   Hi,
> 
> On Wed 17-09-14 19:28:05, Dave Chinner wrote:
> > Brian, Eric and I have been tracking down a set of data corruption
> > problems on XFS over the past couple of days. The one that is
> > important to the wider developer community is the truncate/mmap
> > write issue that Eric isolated from a real-world application that
> > was triggering it.
> > 
> > The corruption only affects block size smaller than page size
> > configurations and is caused by mmapped writes to the EOF page
> > which has been partially truncated. If we then extend the file
> > again, the region of the page that was truncated and had blocks
> > punched out of it can be written to via mapped writes without blocks
> > being allocated for the hole. Hence while the page is in the page
> > cache, the contents of the file look OK. Unmount/mount the
> > filesystem, then re-read the page from disk and it will contain
> > zeros because there is a hole rather than data blocks.
>   Hum, this is what we already discussed in
> http://lists.openwall.net/linux-ext4/2014/03/13/23, isn't it? I never
> thought about using mremap() in the test cases. That makes it even a POSIX
> valid test case... Nasty.

Yup, and the test case came from an application rather than being
something that was thought up in a drunken rampage of random
syscalls....

> > In the XFS case, the bug was that the filesystem truncate code is
> > not cleaning the partial page fully during the truncate down or up,
> > and hence the pte remains mapped dirty in the TLB. Hence when new
> > data is written to the page, it doesn't trigger a write fault,
> > ->page_mkwrite is not called and hence blocks are not allocated over
> > the hole. I chose to fix it on the truncate up as it was the lesser
> > of two evils - we can't actually fix the problem entirely because we
> > can't serialise page faults against truncate.
>   Actually, as I mentioned in the above email, exactly the same problem
> happens when file gets extended because of a write beyond EOF (just change
> truncate up for pwrite in your test cases). You didn't handle that case in
> your XFS patch AFAICS.

That's because XFS already does tail block zeroing on truncate up
earlier in the truncate code (the xfs_zero_eof() call). The patch I
wrote simply stabilises the page so that a new page fault occurs
and remaps it correctly.

> > That is, if two filesystems that support block size smaller than
> > page size have similar data corruptions when exercising the same
> > generic code paths in similar ways, then it is likely that other
> > filesystems have similar problems and need to be checked.
>   Frankly, I'd like to handle the problem in the generic code rather than
> having hacks in various filesystems. I have a patch back from 2009 which
> implements a helper function which gets called when creating a hole (either
> from ->setattr or ->write_end) and which handles this. It also has various
> optimizations built in - it doesn't do anything when blocksize == pagesize
> or when no hole block is actually created. Also it doesn't do any IO as you
> do in XFS - it only writeprotects the page.  I'll port the patch and try it
> out with ext4.

I'm not sure exactly how that helps - I'll understand better when I
see the code ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2014-09-23 13:18 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-09-17  9:28 mmap writes vs truncate causing data corruption Dave Chinner
2014-09-23 12:27 ` Jan Kara
2014-09-23 13:18   ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).