* [PATCH v5] xfs: rework zero range to prevent invalid i_size updates
@ 2014-10-20 17:06 Brian Foster
2014-10-29 0:22 ` Dave Chinner
0 siblings, 1 reply; 4+ messages in thread
From: Brian Foster @ 2014-10-20 17:06 UTC (permalink / raw)
To: xfs
The zero range operation is analogous to fallocate, with the exception
that it converts the range to zeroes. That is, it attempts to allocate
zeroed blocks over the range specified by the caller. The XFS implementation
kills all delalloc blocks currently over the aligned range, converts the
range to allocated zero blocks (unwritten extents) and handles the
partial pages at the ends of the range by sending writes through the
pagecache.
The current implementation suffers from several problems associated with
inode size. If the aligned range covers an extending I/O, said I/O is
discarded and an inode size update from a previous write never makes it
to disk. Further, if an unaligned zero range extends beyond eof, the
page write induced for the partial end page can itself increase the
inode size, even if the zero range request is not supposed to update
i_size (via KEEP_SIZE, similar to a fallocate beyond EOF).
The latter behavior not only incorrectly increases the inode size, but
can lead to stray delalloc blocks on the inode. Typically, post-eof
preallocation blocks are either truncated on release or inode eviction
or explicitly written to by xfs_zero_eof() on natural file size
extension. If the inode size increases due to zero range, however,
associated blocks leak into the address space having never been
converted or mapped to pagecache pages. A direct I/O to such an
uncovered range cannot convert the extent via writeback and will BUG().
For example:
$ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file>
...
$ xfs_io -d -c "pread 128k 128k" <file>
<BUG>
If the entire delalloc extent happens to have no page coverage
whatsoever (e.g., delalloc conversion couldn't find a large enough free
space extent), even a full file writeback won't convert what's left of
the extent and we'll assert on inode eviction.
Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
Use the existing hole punch and prealloc mechanisms as primitives for
zero range. This implementation is neither efficient nor ideal, as we
write back dirty data over the range and remove existing extents rather
than convert them to unwritten. The writeback, however, is currently
the only mechanism available to ensure consistency between pagecache and
extent state. Even a pagecache truncate/delalloc punch prior to hole
punch has led to inconsistencies due to racing with writeback.
This provides a consistent, correct implementation of zero range that
survives fsstress/fsx testing without assert failures. The
implementation can be optimized from this point forward once the
fundamental issue of pagecache and delalloc extent state consistency is
addressed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
v5:
- Further simplify to eliminate delalloc block punch.
v4: http://oss.sgi.com/archives/xfs/2014-10/msg00276.html
- Simplify the implementation to use hole punch.
v3: http://oss.sgi.com/archives/xfs/2014-10/msg00149.html
- Pass length to xfs_alloc_file_space() rather than end offset.
- Split up start/end page writeback branches.
- Fix up a bunch of comments.
v2: http://oss.sgi.com/archives/xfs/2014-10/msg00138.html
- Refactor the logic to punch out pagecache/delalloc first and do
allocation last to prevent stray delalloc on ENOSPC.
v1: http://oss.sgi.com/archives/xfs/2014-10/msg00052.html
fs/xfs/xfs_bmap_util.c | 72 ++++++++++++++------------------------------------
1 file changed, 20 insertions(+), 52 deletions(-)
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 92e8f99..2810026 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1338,7 +1338,10 @@ xfs_free_file_space(
goto out;
}
-
+/*
+ * Preallocate and zero a range of a file. This mechanism has the allocation
+ * semantics of fallocate and in addition converts data in the range to zeroes.
+ */
int
xfs_zero_file_space(
struct xfs_inode *ip,
@@ -1346,65 +1349,30 @@ xfs_zero_file_space(
xfs_off_t len)
{
struct xfs_mount *mp = ip->i_mount;
- uint granularity;
- xfs_off_t start_boundary;
- xfs_off_t end_boundary;
+ uint blksize;
int error;
trace_xfs_zero_file_space(ip);
- granularity = max_t(uint, 1 << mp->m_sb.sb_blocklog, PAGE_CACHE_SIZE);
+ blksize = 1 << mp->m_sb.sb_blocklog;
/*
- * Round the range of extents we are going to convert inwards. If the
- * offset is aligned, then it doesn't get changed so we zero from the
- * start of the block offset points to.
+ * Punch a hole and prealloc the range. We use hole punch rather than
+ * unwritten extent conversion for two reasons:
+ *
+ * 1.) Hole punch handles partial block zeroing for us.
+ *
+ * 2.) If prealloc returns ENOSPC, the file range is still zero-valued
+ * by virtue of the hole punch.
*/
- start_boundary = round_up(offset, granularity);
- end_boundary = round_down(offset + len, granularity);
-
- ASSERT(start_boundary >= offset);
- ASSERT(end_boundary <= offset + len);
-
- if (start_boundary < end_boundary - 1) {
- /*
- * Writeback the range to ensure any inode size updates due to
- * appending writes make it to disk (otherwise we could just
- * punch out the delalloc blocks).
- */
- error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
- start_boundary, end_boundary - 1);
- if (error)
- goto out;
- truncate_pagecache_range(VFS_I(ip), start_boundary,
- end_boundary - 1);
-
- /* convert the blocks */
- error = xfs_alloc_file_space(ip, start_boundary,
- end_boundary - start_boundary - 1,
- XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT);
- if (error)
- goto out;
-
- /* We've handled the interior of the range, now for the edges */
- if (start_boundary != offset) {
- error = xfs_iozero(ip, offset, start_boundary - offset);
- if (error)
- goto out;
- }
-
- if (end_boundary != offset + len)
- error = xfs_iozero(ip, end_boundary,
- offset + len - end_boundary);
-
- } else {
- /*
- * It's either a sub-granularity range or the range spanned lies
- * partially across two adjacent blocks.
- */
- error = xfs_iozero(ip, offset, len);
- }
+ error = xfs_free_file_space(ip, offset, len);
+ if (error)
+ goto out;
+ error = xfs_alloc_file_space(ip, round_down(offset, blksize),
+ round_up(offset + len, blksize) -
+ round_down(offset, blksize),
+ XFS_BMAPI_PREALLOC);
out:
return error;
--
1.8.3.1
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: [PATCH v5] xfs: rework zero range to prevent invalid i_size updates
From: Dave Chinner @ 2014-10-29 0:22 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Mon, Oct 20, 2014 at 01:06:21PM -0400, Brian Foster wrote:
> The zero range operation is analogous to fallocate with the exception of
> converting the range to zeroes. E.g., it attempts to allocate zeroed
> blocks over the range specified by the caller. The XFS implementation
> kills all delalloc blocks currently over the aligned range, converts the
> range to allocated zero blocks (unwritten extents) and handles the
> partial pages at the ends of the range by sending writes through the
> pagecache.
>
> The current implementation suffers from several problems associated with
> inode size. If the aligned range covers an extending I/O, said I/O is
> discarded and an inode size update from a previous write never makes it
> to disk. Further, if an unaligned zero range extends beyond eof, the
> page write induced for the partial end page can itself increase the
> inode size, even if the zero range request is not supposed to update
> i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).
>
> The latter behavior not only incorrectly increases the inode size, but
> can lead to stray delalloc blocks on the inode. Typically, post-eof
> preallocation blocks are either truncated on release or inode eviction
> or explicitly written to by xfs_zero_eof() on natural file size
> extension. If the inode size increases due to zero range, however,
> associated blocks leak into the address space having never been
> converted or mapped to pagecache pages. A direct I/O to such an
> uncovered range cannot convert the extent via writeback and will BUG().
> For example:
>
> $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file>
> ...
> $ xfs_io -d -c "pread 128k 128k" <file>
> <BUG>
>
> If the entire delalloc extent happens to not have page coverage
> whatsoever (e.g., delalloc conversion couldn't find a large enough free
> space extent), even a full file writeback won't convert what's left of
> the extent and we'll assert on inode eviction.
>
> Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
> Use the existing hole punch and prealloc mechanisms as primitives for
> zero range. This implementation is not efficient nor ideal as we
> writeback dirty data over the range and remove existing extents rather
> than convert to unwritten. The former writeback, however, is currently
> the only mechanism available to ensure consistency between pagecache and
> extent state. Even a pagecache truncate/delalloc punch prior to hole
> punch has led to inconsistencies due to racing with writeback.
>
> This provides a consistent, correct implementation of zero range that
> survives fsstress/fsx testing without assert failures. The
> implementation can be optimized from this point forward once the
> fundamental issue of pagecache and delalloc extent state consistency is
> addressed.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>
> v5:
> - Further simplify to eliminate delalloc block punch.
This now causes xfs/053 to fail, probably because it changes the
behaviour to actually write the file and hence change EOF. Can you
please check that this is working correctly, and if so submit
patches to change xfs/053 to expect the file size to change
and contain the correct data?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v5] xfs: rework zero range to prevent invalid i_size updates
From: Dave Chinner @ 2014-10-29 1:24 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Wed, Oct 29, 2014 at 11:22:53AM +1100, Dave Chinner wrote:
> On Mon, Oct 20, 2014 at 01:06:21PM -0400, Brian Foster wrote:
> > The zero range operation is analogous to fallocate with the exception of
> > converting the range to zeroes. E.g., it attempts to allocate zeroed
> > blocks over the range specified by the caller. The XFS implementation
> > kills all delalloc blocks currently over the aligned range, converts the
> > range to allocated zero blocks (unwritten extents) and handles the
> > partial pages at the ends of the range by sending writes through the
> > pagecache.
> >
> > The current implementation suffers from several problems associated with
> > inode size. If the aligned range covers an extending I/O, said I/O is
> > discarded and an inode size update from a previous write never makes it
> > to disk. Further, if an unaligned zero range extends beyond eof, the
> > page write induced for the partial end page can itself increase the
> > inode size, even if the zero range request is not supposed to update
> > i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).
> >
> > The latter behavior not only incorrectly increases the inode size, but
> > can lead to stray delalloc blocks on the inode. Typically, post-eof
> > preallocation blocks are either truncated on release or inode eviction
> > or explicitly written to by xfs_zero_eof() on natural file size
> > extension. If the inode size increases due to zero range, however,
> > associated blocks leak into the address space having never been
> > converted or mapped to pagecache pages. A direct I/O to such an
> > uncovered range cannot convert the extent via writeback and will BUG().
> > For example:
> >
> > $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file>
> > ...
> > $ xfs_io -d -c "pread 128k 128k" <file>
> > <BUG>
> >
> > If the entire delalloc extent happens to not have page coverage
> > whatsoever (e.g., delalloc conversion couldn't find a large enough free
> > space extent), even a full file writeback won't convert what's left of
> > the extent and we'll assert on inode eviction.
> >
> > Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
> > Use the existing hole punch and prealloc mechanisms as primitives for
> > zero range. This implementation is not efficient nor ideal as we
> > writeback dirty data over the range and remove existing extents rather
> > than convert to unwritten. The former writeback, however, is currently
> > the only mechanism available to ensure consistency between pagecache and
> > extent state. Even a pagecache truncate/delalloc punch prior to hole
> > punch has led to inconsistencies due to racing with writeback.
> >
> > This provides a consistent, correct implementation of zero range that
> > survives fsstress/fsx testing without assert failures. The
> > implementation can be optimized from this point forward once the
> > fundamental issue of pagecache and delalloc extent state consistency is
> > addressed.
> >
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >
> > v5:
> > - Further simplify to eliminate delalloc block punch.
>
> This now causes xfs/053 to fail, probably because it changes the
> behaviour to actually write the file and hence change EOF. Can you
> please check that this is working correctly, and if so submit
> patches to change xfs/053 to expect the file size to change
> and contain the correct data?
Hmm, it appears that I tested the wrong "old kernel" when checking
for regression. 3.18-rc2 also fails, so we need fixes for the
xfs/053 test, not this patch....
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v5] xfs: rework zero range to prevent invalid i_size updates
From: Brian Foster @ 2014-10-29 16:37 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Wed, Oct 29, 2014 at 12:24:11PM +1100, Dave Chinner wrote:
> On Wed, Oct 29, 2014 at 11:22:53AM +1100, Dave Chinner wrote:
> > On Mon, Oct 20, 2014 at 01:06:21PM -0400, Brian Foster wrote:
> > > The zero range operation is analogous to fallocate with the exception of
> > > converting the range to zeroes. E.g., it attempts to allocate zeroed
> > > blocks over the range specified by the caller. The XFS implementation
> > > kills all delalloc blocks currently over the aligned range, converts the
> > > range to allocated zero blocks (unwritten extents) and handles the
> > > partial pages at the ends of the range by sending writes through the
> > > pagecache.
> > >
> > > The current implementation suffers from several problems associated with
> > > inode size. If the aligned range covers an extending I/O, said I/O is
> > > discarded and an inode size update from a previous write never makes it
> > > to disk. Further, if an unaligned zero range extends beyond eof, the
> > > page write induced for the partial end page can itself increase the
> > > inode size, even if the zero range request is not supposed to update
> > > i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).
> > >
> > > The latter behavior not only incorrectly increases the inode size, but
> > > can lead to stray delalloc blocks on the inode. Typically, post-eof
> > > preallocation blocks are either truncated on release or inode eviction
> > > or explicitly written to by xfs_zero_eof() on natural file size
> > > extension. If the inode size increases due to zero range, however,
> > > associated blocks leak into the address space having never been
> > > converted or mapped to pagecache pages. A direct I/O to such an
> > > uncovered range cannot convert the extent via writeback and will BUG().
> > > For example:
> > >
> > > $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file>
> > > ...
> > > $ xfs_io -d -c "pread 128k 128k" <file>
> > > <BUG>
> > >
> > > If the entire delalloc extent happens to not have page coverage
> > > whatsoever (e.g., delalloc conversion couldn't find a large enough free
> > > space extent), even a full file writeback won't convert what's left of
> > > the extent and we'll assert on inode eviction.
> > >
> > > Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
> > > Use the existing hole punch and prealloc mechanisms as primitives for
> > > zero range. This implementation is not efficient nor ideal as we
> > > writeback dirty data over the range and remove existing extents rather
> > > than convert to unwritten. The former writeback, however, is currently
> > > the only mechanism available to ensure consistency between pagecache and
> > > extent state. Even a pagecache truncate/delalloc punch prior to hole
> > > punch has led to inconsistencies due to racing with writeback.
> > >
> > > This provides a consistent, correct implementation of zero range that
> > > survives fsstress/fsx testing without assert failures. The
> > > implementation can be optimized from this point forward once the
> > > fundamental issue of pagecache and delalloc extent state consistency is
> > > addressed.
> > >
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >
> > > v5:
> > > - Further simplify to eliminate delalloc block punch.
> >
> > This now causes xfs/053 to fail, probably because it changes the
> > behaviour to actually write the file and hence change EOF. Can you
> > please check that this is working correctly, and if so submit
> > patches to change xfs/053 to expect the file size to change
> > and contain the correct data?
>
> Hmm, it appears that I tested the wrong "old kernel" when checking
> for regression. 3.18-rc2 also fails, so we need fixes for the
> xfs/053 test, not this patch....
>
It's not clear to me what you mean by fixes for the xfs/053 test...
xfs/053 was intended to exercise the problem where partial writeback of
an extent leads to stale data exposure. We had a chat a while ago about
the possibility of stale data exposure after a crash at just the right
time between delalloc extent conversion and writeback. That test was
intended to expose that problem in sort of a hacky way, by virtue of the
partial writeback that might occur due to file allocation operations.
Hole punch happens to expose this problem; fzero is now implemented via
hole punch and thus exposes it as well. I think I called this out as a
side effect in one of the previous versions of this patch.
As far as xfs/053 goes, it just uses falloc operations to reproduce the
stale data problem. Now that I think of it, I could look into using
something like sync_file_range() to reproduce in xfs/053 instead. Either
way, I would expect xfs/053 to fail as such until we address the general
extent conversion/writeback problem. I sent the test at the time as I
simply happened to come across it and wanted to make sure the test case
wasn't lost (as opposed to it being any kind of regression test for
fzero).
Brian
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs