* [PATCH v2] xfs: always log the inode on unwritten extent conversion
@ 2015-04-23 16:42 Brian Foster
2015-04-23 22:32 ` Dave Chinner
0 siblings, 1 reply; 2+ messages in thread
From: Brian Foster @ 2015-04-23 16:42 UTC (permalink / raw)
To: xfs
The fsync() requirements for crash consistency on XFS are to flush file
data and force any in-core inode updates to the log. We currently check
whether the inode is pinned to identify whether the log needs to be
forced, since a non-zero pin count generally represents an inode that
has transactions awaiting a flush to the on-disk log.
This is not sufficient in all cases, however. Reports of xfstests test
generic/311 failures on ppc64/s390x hosts have identified failures to
fsync outstanding inode modifications due to the inode not being pinned
at the time of the fsync. This occurs because certain bmap updates can
complete by logging bmapbt buffers but without ever dirtying (and thus
pinning) the core inode. The following is a specific incarnation of this
problem:
$ mount $dev /mnt -o noatime,nobarrier
$ for i in $(seq 0 2 31); do \
xfs_io -f -c "falloc $((i * 32768)) 32k" -c fsync /mnt/file; \
done
$ xfs_io -c "pwrite -S 0 80k 16k" -c fsync -c "pwrite 76k 4k" -c fsync /mnt/file; \
hexdump /mnt/file; \
./xfstests-dev/src/godown /mnt
...
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0013000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0014000 0000 0000 0000 0000 0000 0000 0000 0000
*
00f8000
$ umount /mnt; mount ...
$ hexdump /mnt/file
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
00f8000
In short, the unwritten extent conversion for the last write is lost
despite the fact that an fsync executed before the filesystem was
shutdown. Note that this is impossible to reproduce on v5 supers due to
unconditional time callbacks for di_changecount and highly difficult to
reproduce on CONFIG_HZ=1000 kernels due to those same callbacks
frequently updating cmtime prior to the bmap update. CONFIG_HZ=100
reduces timer granularity enough to increase the odds that time updates
are skipped and allows this to reproduce within a handful of attempts.
To deal with this problem, make sure that the inode is logged in the
unwritten extent conversion path. Fix up the logflags, if necessary,
after the extent conversion to keep the extent update code consistent
with the other extent update helpers. This fixup is not necessary for
the other (hole, delay) extent helpers because they execute in the block
allocation codepath, which already logs the inode for other reasons
(e.g., for di_nblocks).
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
v2:
- Log inode unconditionally on unwritten extent conversion and retain
the fsync pincount check.
v1: http://oss.sgi.com/pipermail/xfs/2015-April/041468.html
fs/xfs/libxfs/xfs_bmap.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index aeffeaa..e74e42bf 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4417,6 +4417,21 @@ xfs_bmapi_convert_unwritten(
error = xfs_bmap_add_extent_unwritten_real(bma->tp, bma->ip, &bma->idx,
&bma->cur, mval, bma->firstblock, bma->flist,
&tmp_logflags);
+ /*
+ * Unwritten extent conversion might not have dirtied the inode
+ * depending on the extent state. Unlike block allocation (e.g.,
+ * di_nblocks), there may be no other reason to log the inode in the
+ * unwritten extent conversion path.
+ *
+ * We need to make sure the inode is dirty in the transaction for the
+ * sake of fsync(), which will not force the log for this transaction
+ * unless it sees the inode pinned. This can only happen for btree
+ * format inodes so use XFS_ILOG_CORE.
+ */
+ if (!error && !tmp_logflags) {
+ ASSERT(bma->cur);
+ tmp_logflags |= XFS_ILOG_CORE;
+ }
bma->logflags |= tmp_logflags;
if (error)
return error;
--
1.9.3
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply related [flat|nested] 2+ messages in thread* Re: [PATCH v2] xfs: always log the inode on unwritten extent conversion
2015-04-23 16:42 [PATCH v2] xfs: always log the inode on unwritten extent conversion Brian Foster
@ 2015-04-23 22:32 ` Dave Chinner
0 siblings, 0 replies; 2+ messages in thread
From: Dave Chinner @ 2015-04-23 22:32 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Thu, Apr 23, 2015 at 12:42:44PM -0400, Brian Foster wrote:
> The fsync() requirements for crash consistency on XFS are to flush file
> data and force any in-core inode updates to the log. We currently check
> whether the inode is pinned to identify whether the log needs to be
> forced, since a non-zero pin count generally represents an inode that
> has transactions awaiting a flush to the on-disk log.
>
> This is not sufficient in all cases, however. Reports of xfstests test
> generic/311 failures on ppc64/s390x hosts have identified failures to
> fsync outstanding inode modifications due to the inode not being pinned
> at the time of the fsync. This occurs because certain bmap updates can
> complete by logging bmapbt buffers but without ever dirtying (and thus
> pinning) the core inode. The following is a specific incarnation of this
> problem:
>
> $ mount $dev /mnt -o noatime,nobarrier
> $ for i in $(seq 0 2 31); do \
> xfs_io -f -c "falloc $((i * 32768)) 32k" -c fsync /mnt/file; \
> done
> $ xfs_io -c "pwrite -S 0 80k 16k" -c fsync -c "pwrite 76k 4k" -c fsync /mnt/file; \
> hexdump /mnt/file; \
> ./xfstests-dev/src/godown /mnt
> ...
> 0000000 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 0013000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> *
> 0014000 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 00f8000
> $ umount /mnt; mount ...
> $ hexdump /mnt/file
> 0000000 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 00f8000
>
> In short, the unwritten extent conversion for the last write is lost
> despite the fact that an fsync executed before the filesystem was
> shutdown. Note that this is impossible to reproduce on v5 supers due to
> unconditional time callbacks for di_changecount and highly difficult to
> reproduce on CONFIG_HZ=1000 kernels due to those same callbacks
> frequently updating cmtime prior to the bmap update. CONFIG_HZ=100
> reduces timer granularity enough to increase the odds that time updates
> are skipped and allows this to reproduce within a handful of attempts.
>
> To deal with this problem, make sure that the inode is logged in the
> unwritten extent conversion path. Fix up the logflags, if necessary,
> after the extent conversion to keep the extent update code consistent
> with the other extent update helpers. This fixup is not necessary for
> the other (hole, delay) extent helpers because they execute in the block
> allocation codepath, which already logs the inode for other reasons
> (e.g., for di_nblocks).
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>
> v2:
> - Log inode unconditionally on unwritten extent conversion and retain
> the fsync pincount check.
> v1: http://oss.sgi.com/pipermail/xfs/2015-April/041468.html
>
> fs/xfs/libxfs/xfs_bmap.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index aeffeaa..e74e42bf 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -4417,6 +4417,21 @@ xfs_bmapi_convert_unwritten(
> error = xfs_bmap_add_extent_unwritten_real(bma->tp, bma->ip, &bma->idx,
> &bma->cur, mval, bma->firstblock, bma->flist,
> &tmp_logflags);
> + /*
> + * Unwritten extent conversion might not have dirtied the inode
> + * depending on the extent state. Unlike block allocation (e.g.,
> + * di_nblocks), there may be no other reason to log the inode in the
> + * unwritten extent conversion path.
> + *
> + * We need to make sure the inode is dirty in the transaction for the
> + * sake of fsync(), which will not force the log for this transaction
> + * unless it sees the inode pinned. This can only happen for btree
> + * format inodes so use XFS_ILOG_CORE.
> + */
> + if (!error && !tmp_logflags) {
> + ASSERT(bma->cur);
> + tmp_logflags |= XFS_ILOG_CORE;
> + }
> bma->logflags |= tmp_logflags;
> if (error)
> return error;
I'd just do:
bma->logflags |= tmp_logflags | XFS_ILOG_CORE;
Because it really doesn't matter if we log an unchanged inode core
or not - it's likely already in the CIL or AIL given we are doing
unwritten extent conversion, so it is unlikely to introduce
significant new overhead from doing this....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2015-04-23 22:34 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-04-23 16:42 [PATCH v2] xfs: always log the inode on unwritten extent conversion Brian Foster
2015-04-23 22:32 ` Dave Chinner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox