* [PATCH 0/2] xfs: fix eof zeroing i_size race
From: Brian Foster @ 2015-09-28 18:37 UTC (permalink / raw)
To: xfs

Hi all,

This addresses an EOF zeroing race that's been reproduced by a somewhat
obscure virtual machine workload on qcow2 vdisks with an increasing i_size.
Patch 1 contains the fix and patch 2 just adds a tracepoint that helped
identify the problem.

Thoughts, reviews, flames appreciated.

Brian

Brian Foster (2):
  xfs: always drain dio before extending aio write submission
  xfs: add an xfs_zero_eof() tracepoint

 fs/xfs/xfs_file.c  | 17 +++++++++++------
 fs/xfs/xfs_trace.h |  1 +
 2 files changed, 12 insertions(+), 6 deletions(-)

-- 
2.1.0

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* [PATCH 1/2] xfs: always drain dio before extending aio write submission
From: Brian Foster @ 2015-09-28 18:38 UTC (permalink / raw)
To: xfs

XFS supports and typically allows concurrent asynchronous direct I/O
submission to a single file. One exception to the rule is that file
extending dio writes that start beyond the current EOF (e.g., potentially
create a hole at EOF) require exclusive I/O access to the file. This is
because such writes must zero any pre-existing blocks beyond EOF that are
exposed by virtue of now residing within EOF as a result of the write
about to be submitted.

Before EOF zeroing can occur, the current file i_size must be stabilized
to avoid data corruption. In this scenario, XFS upgrades the iolock to
exclude any further I/O submission, waits on in-flight I/O to complete to
ensure i_size is up to date (i_size is updated on dio write completion)
and restarts the various checks against the state of the file. The
problem is that this protection sequence is triggered only when the
iolock is currently held shared. While this is true for async dio in most
cases, the caller may upgrade the lock in advance based on arbitrary
circumstances with respect to EOF zeroing. For example, the iolock is
always acquired exclusively if the start offset is not block aligned.
This means that even though the iolock is already held exclusive for such
I/Os, pending I/O is not drained and thus EOF zeroing can occur based on
an unstable i_size.

This problem has been reproduced as guest data corruption in virtual
machines with file-backed qcow2 virtual disks hosted on an XFS
filesystem. The virtual disks must be configured with aio=native mode and
must not be truncated out to the maximum file size (as some virt managers
will do).

Update xfs_file_aio_write_checks() to unconditionally drain in-flight dio
before EOF zeroing can occur. Rather than trigger the wait based on
iolock state, use a new flag and upgrade the iolock when necessary. Note
that this results in a full restart of the inode checks even when the
iolock was already held exclusive, when technically it is only required
to recheck i_size. This should be a rare enough occurrence that it is
preferable to keep the code simple rather than create an alternate
restart jump target.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_file.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e78feb4..347b3e0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -574,6 +574,7 @@ xfs_file_aio_write_checks(
 	struct xfs_inode	*ip = XFS_I(inode);
 	ssize_t			error = 0;
 	size_t			count = iov_iter_count(from);
+	bool			drained_dio = false;
 
 restart:
 	error = generic_write_checks(iocb, from);
@@ -611,12 +612,13 @@ restart:
 		bool	zero = false;
 
 		spin_unlock(&ip->i_flags_lock);
-		if (*iolock == XFS_IOLOCK_SHARED) {
-			xfs_rw_iunlock(ip, *iolock);
-			*iolock = XFS_IOLOCK_EXCL;
-			xfs_rw_ilock(ip, *iolock);
-			iov_iter_reexpand(from, count);
-
+		if (!drained_dio) {
+			if (*iolock == XFS_IOLOCK_SHARED) {
+				xfs_rw_iunlock(ip, *iolock);
+				*iolock = XFS_IOLOCK_EXCL;
+				xfs_rw_ilock(ip, *iolock);
+				iov_iter_reexpand(from, count);
+			}
 			/*
 			 * We now have an IO submission barrier in place, but
 			 * AIO can do EOF updates during IO completion and hence
@@ -626,6 +628,7 @@ restart:
 			 * no-op.
 			 */
 			inode_dio_wait(inode);
+			drained_dio = true;
 			goto restart;
 		}
 		error = xfs_zero_eof(ip, iocb->ki_pos, i_size_read(inode), &zero);
-- 
2.1.0
* Re: [PATCH 1/2] xfs: always drain dio before extending aio write submission
From: Eric Sandeen @ 2015-09-29 22:25 UTC (permalink / raw)
To: xfs

Looks good to me, and tested w/ a testcase I'll send in a bit.

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

On 9/28/15 1:38 PM, Brian Foster wrote:
> XFS supports and typically allows concurrent asynchronous direct I/O
> submission to a single file. One exception to the rule is that file
> extending dio writes that start beyond the current EOF (e.g.,
> potentially create a hole at EOF) require exclusive I/O access to the
> file.
[...]
* [PATCH 2/2] xfs: add an xfs_zero_eof() tracepoint
From: Brian Foster @ 2015-09-28 18:38 UTC (permalink / raw)
To: xfs

Add a tracepoint in xfs_zero_eof() to facilitate tracking and debugging
EOF zeroing events. This has proven useful in the context of other direct
I/O tracepoints to ensure EOF zeroing occurs within appropriate file
ranges.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_file.c  | 2 ++
 fs/xfs/xfs_trace.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 347b3e0..541dcfb 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -482,6 +482,8 @@ xfs_zero_eof(
 	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
 	ASSERT(offset > isize);
 
+	trace_xfs_zero_eof(ip, isize, offset - isize);
+
 	/*
 	 * First handle zeroing the block on which isize resides.
 	 *
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 5ed36b1..957f5cc 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1312,6 +1312,7 @@ DEFINE_SIMPLE_IO_EVENT(xfs_delalloc_enospc);
 DEFINE_SIMPLE_IO_EVENT(xfs_unwritten_convert);
 DEFINE_SIMPLE_IO_EVENT(xfs_get_blocks_notfound);
 DEFINE_SIMPLE_IO_EVENT(xfs_setfilesize);
+DEFINE_SIMPLE_IO_EVENT(xfs_zero_eof);
 
 DECLARE_EVENT_CLASS(xfs_itrunc_class,
 	TP_PROTO(struct xfs_inode *ip, xfs_fsize_t new_size),
-- 
2.1.0