* [PATCH 0/2] xfs: fix eof zeroing i_size race
@ 2015-09-28 18:37 Brian Foster
2015-09-28 18:38 ` [PATCH 1/2] xfs: always drain dio before extending aio write submission Brian Foster
2015-09-28 18:38 ` [PATCH 2/2] xfs: add an xfs_zero_eof() tracepoint Brian Foster
0 siblings, 2 replies; 4+ messages in thread
From: Brian Foster @ 2015-09-28 18:37 UTC
To: xfs
Hi all,

This addresses an EOF zeroing race that's been reproduced by a somewhat
obscure virtual machine workload on qcow2 vdisks with an increasing
i_size. Patch 1 contains the fix and patch 2 just adds a tracepoint that
helped identify the problem. Thoughts, reviews, flames appreciated.

Brian

Brian Foster (2):
  xfs: always drain dio before extending aio write submission
  xfs: add an xfs_zero_eof() tracepoint

 fs/xfs/xfs_file.c  | 17 +++++++++++------
 fs/xfs/xfs_trace.h |  1 +
 2 files changed, 12 insertions(+), 6 deletions(-)
--
2.1.0
* [PATCH 1/2] xfs: always drain dio before extending aio write submission
2015-09-28 18:37 [PATCH 0/2] xfs: fix eof zeroing i_size race Brian Foster
@ 2015-09-28 18:38 ` Brian Foster
2015-09-29 22:25 ` Eric Sandeen
2015-09-28 18:38 ` [PATCH 2/2] xfs: add an xfs_zero_eof() tracepoint Brian Foster
1 sibling, 1 reply; 4+ messages in thread
From: Brian Foster @ 2015-09-28 18:38 UTC
To: xfs

XFS supports and typically allows concurrent asynchronous direct I/O
submission to a single file. One exception to the rule is that file
extending dio writes that start beyond the current EOF (i.e., writes
that may create a hole at EOF) require exclusive I/O access to the
file. This is because such writes must zero any pre-existing blocks
beyond EOF that are exposed by virtue of now residing within EOF as a
result of the write about to be submitted. For example, if i_size is
100k and a write is submitted at offset 200k, any blocks already
allocated in the range [100k, 200k) now reside within EOF and must be
zeroed so stale contents are not exposed.

Before EOF zeroing can occur, the current file i_size must be stabilized
to avoid data corruption. In this scenario, XFS upgrades the iolock to
exclude any further I/O submission, waits on in-flight I/O to complete
to ensure i_size is up to date (i_size is updated on dio write
completion) and restarts the various checks against the state of the
file. The problem is that this protection sequence is triggered only
when the iolock is currently held shared. While this is true for async
dio in most cases, the caller may have acquired the lock exclusively up
front for reasons unrelated to EOF zeroing. For example, the iolock is
always acquired exclusively if the start offset is not block aligned.
This means that even though the iolock is already held exclusive for
such I/Os, pending I/O is not drained and thus EOF zeroing can occur
based on an unstable i_size.
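
To make the failure mode concrete, here is a condensed sketch of the
pre-patch flow in xfs_file_aio_write_checks() (paraphrased; error
handling and unrelated checks are omitted, so this is not the verbatim
kernel code):

restart:
	error = generic_write_checks(iocb, from);
	...
	if (iocb->ki_pos > i_size_read(inode)) {
		if (*iolock == XFS_IOLOCK_SHARED) {
			/* upgrade the iolock, then drain in-flight dio */
			xfs_rw_iunlock(ip, *iolock);
			*iolock = XFS_IOLOCK_EXCL;
			xfs_rw_ilock(ip, *iolock);
			iov_iter_reexpand(from, count);
			inode_dio_wait(inode);
			goto restart;
		}
		/*
		 * Reached directly when the iolock was taken exclusive up
		 * front (e.g., for an unaligned start offset): there is no
		 * inode_dio_wait() on this path, so i_size can still move
		 * underneath the zeroing below.
		 */
		error = xfs_zero_eof(ip, iocb->ki_pos, i_size_read(inode),
				     &zero);
	}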

This problem has been reproduced as guest data corruption in virtual
machines with file-backed qcow2 virtual disks hosted on an XFS
filesystem. The virtual disks must be configured with aio=native mode
and they must not be truncated out to the maximum file size (as some
virt managers will do).

Update xfs_file_aio_write_checks() to unconditionally drain in-flight
dio before EOF zeroing can occur. Rather than trigger the wait based on
iolock state, use a new flag and upgrade the iolock when necessary. Note
that this results in a full restart of the inode checks even when the
iolock was already held exclusive, although technically only i_size
needs to be rechecked. This should be a rare enough occurrence that it
is preferable to keep the code simple rather than create an alternate
restart jump target.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_file.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e78feb4..347b3e0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -574,6 +574,7 @@ xfs_file_aio_write_checks(
struct xfs_inode *ip = XFS_I(inode);
ssize_t error = 0;
size_t count = iov_iter_count(from);
+ bool drained_dio = false;
restart:
error = generic_write_checks(iocb, from);
@@ -611,12 +612,13 @@ restart:
bool zero = false;
spin_unlock(&ip->i_flags_lock);
- if (*iolock == XFS_IOLOCK_SHARED) {
- xfs_rw_iunlock(ip, *iolock);
- *iolock = XFS_IOLOCK_EXCL;
- xfs_rw_ilock(ip, *iolock);
- iov_iter_reexpand(from, count);
-
+ if (!drained_dio) {
+ if (*iolock == XFS_IOLOCK_SHARED) {
+ xfs_rw_iunlock(ip, *iolock);
+ *iolock = XFS_IOLOCK_EXCL;
+ xfs_rw_ilock(ip, *iolock);
+ iov_iter_reexpand(from, count);
+ }
/*
* We now have an IO submission barrier in place, but
* AIO can do EOF updates during IO completion and hence
@@ -626,6 +628,7 @@ restart:
* no-op.
*/
inode_dio_wait(inode);
+ drained_dio = true;
goto restart;
}
error = xfs_zero_eof(ip, iocb->ki_pos, i_size_read(inode), &zero);
--
2.1.0
* [PATCH 2/2] xfs: add an xfs_zero_eof() tracepoint
2015-09-28 18:37 [PATCH 0/2] xfs: fix eof zeroing i_size race Brian Foster
2015-09-28 18:38 ` [PATCH 1/2] xfs: always drain dio before extending aio write submission Brian Foster
@ 2015-09-28 18:38 ` Brian Foster
1 sibling, 0 replies; 4+ messages in thread
From: Brian Foster @ 2015-09-28 18:38 UTC
To: xfs

Add a tracepoint in xfs_zero_eof() to facilitate tracking and debugging
EOF zeroing events. This has proven useful in combination with other
direct I/O tracepoints to verify that EOF zeroing occurs within the
expected file ranges.
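
For reference, DEFINE_SIMPLE_IO_EVENT() instantiates the existing
xfs_simple_io_class rather than adding a new event class; roughly (the
field and print definitions are trimmed here, see fs/xfs/xfs_trace.h
for the full class):

DECLARE_EVENT_CLASS(xfs_simple_io_class,
	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count),
	TP_ARGS(ip, offset, count),
	/* ... entry assignment and print format elided ... */
)

So trace_xfs_zero_eof(ip, isize, offset - isize) records the inode, the
current in-core size as the starting offset, and the length of the
range being zeroed.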
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/xfs/xfs_file.c | 2 ++
fs/xfs/xfs_trace.h | 1 +
2 files changed, 3 insertions(+)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 347b3e0..541dcfb 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -482,6 +482,8 @@ xfs_zero_eof(
ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
ASSERT(offset > isize);
+ trace_xfs_zero_eof(ip, isize, offset - isize);
+
/*
* First handle zeroing the block on which isize resides.
*
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 5ed36b1..957f5cc 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1312,6 +1312,7 @@ DEFINE_SIMPLE_IO_EVENT(xfs_delalloc_enospc);
DEFINE_SIMPLE_IO_EVENT(xfs_unwritten_convert);
DEFINE_SIMPLE_IO_EVENT(xfs_get_blocks_notfound);
DEFINE_SIMPLE_IO_EVENT(xfs_setfilesize);
+DEFINE_SIMPLE_IO_EVENT(xfs_zero_eof);
DECLARE_EVENT_CLASS(xfs_itrunc_class,
TP_PROTO(struct xfs_inode *ip, xfs_fsize_t new_size),
--
2.1.0
* Re: [PATCH 1/2] xfs: always drain dio before extending aio write submission
2015-09-28 18:38 ` [PATCH 1/2] xfs: always drain dio before extending aio write submission Brian Foster
@ 2015-09-29 22:25 ` Eric Sandeen
0 siblings, 0 replies; 4+ messages in thread
From: Eric Sandeen @ 2015-09-29 22:25 UTC
To: xfs
Looks good to me, and tested w/ a testcase I'll send in a bit.
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
On 9/28/15 1:38 PM, Brian Foster wrote:
> [patch 1/2 quoted in full; snipped]