From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 3/4] xfs: validate writeback mapping using data fork seq counter
Date: Mon, 14 Jan 2019 08:49:05 +1100
Message-ID: <20190113214905.GB4205@dastard>
In-Reply-To: <20190111123032.31538-4-bfoster@redhat.com>
On Fri, Jan 11, 2019 at 07:30:31AM -0500, Brian Foster wrote:
> The writeback code caches the current extent mapping across multiple
> xfs_do_writepage() calls to avoid repeated lookups for sequential
> pages backed by the same extent. This is known to be slightly racy
> with extent fork changes in certain difficult to reproduce
> scenarios. The cached extent is trimmed to within EOF to help avoid
> the most common vector for this problem via speculative
> preallocation management, but this is a band-aid that does not
> address the fundamental problem.
>
> Now that we have an xfs_ifork sequence counter mechanism used to
> facilitate COW writeback, we can use the same mechanism to validate
> consistency between the data fork and cached writeback mappings. On
> its face, this is somewhat of a big hammer approach because any
> change to the data fork invalidates any mapping currently cached by
> a writeback in progress regardless of whether the data fork change
> overlaps with the range under writeback. In practice, however, the
> impact of this approach is minimal in most cases.
>
> First, data fork changes (delayed allocations) caused by sustained
> sequential buffered writes are amortized across speculative
> preallocations. This means that a cached mapping won't be
> invalidated by each buffered write of a common file copy workload,
> but rather only on less frequent allocation events. Second, the
> extent tree is always entirely in-core so an additional lookup of a
> usable extent mostly costs a shared ilock cycle and in-memory tree
> lookup. This means that a cached mapping reval is relatively cheap
> compared to the I/O itself. Third, spurious invalidations don't
> impact ioend construction. This means that even if the same extent
> is revalidated multiple times across multiple writepage instances,
> we still construct and submit the same size ioend (and bio) if the
> blocks are physically contiguous.
>
> Update struct xfs_writepage_ctx with a new field to hold the
> sequence number of the data fork associated with the currently
> cached mapping. Check the wpc seqno against the data fork when the
> mapping is validated and reestablish the mapping whenever the fork
> has changed since the mapping was cached. This ensures that
> writeback always uses a valid extent mapping and thus prevents lost
> writebacks and stale delalloc block problems.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> fs/xfs/xfs_aops.c | 8 ++++++--
> fs/xfs/xfs_iomap.c | 4 ++--
> 2 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d9048bcea49c..33a1be5df99f 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -29,6 +29,7 @@
>  struct xfs_writepage_ctx {
>  	struct xfs_bmbt_irec	imap;
>  	unsigned int		io_type;
> +	unsigned int		data_seq;
>  	unsigned int		cow_seq;
>  	struct xfs_ioend	*ioend;
>  };
> @@ -347,7 +348,8 @@ xfs_map_blocks(
> * out that ensures that we always see the current value.
> */
>  	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
> -		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
> +		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount &&
> +		     wpc->data_seq == READ_ONCE(ip->i_df.if_seq);
>  	if (imap_valid &&
>  	    (!xfs_inode_has_cow_data(ip) ||
>  	     wpc->io_type == XFS_IO_COW ||
I suspect this next "if (imap_valid) ..." logic needs to be updated,
too. i.e. the next line is checking if the cow_seq has not changed.
i.e. I think wrapping this up in a helper (again!) might make more
sense:
static bool
xfs_imap_valid(
	struct xfs_inode	*ip,
	struct xfs_writepage_ctx *wpc,
	xfs_fileoff_t		offset_fsb)
{
	if (offset_fsb < wpc->imap.br_startoff)
		return false;
	if (offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
		return false;
	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
		return false;
	if (!xfs_inode_has_cow_data(ip))
		return true;
	if (wpc->io_type == XFS_IO_COW)
		return true;
	if (wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
		return false;
	return true;
}
and then put the shutdown check before we check the map for validity
(i.e. don't continue to write to the cached map after a shutdown has
been triggered):
	if (XFS_FORCED_SHUTDOWN(mp))
		return -EIO;

	if (xfs_imap_valid(ip, wpc, offset_fsb))
		return 0;
> @@ -417,6 +419,7 @@ xfs_map_blocks(
> */
> if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap))
> imap.br_startoff = end_fsb; /* fake a hole past EOF */
> + wpc->data_seq = READ_ONCE(ip->i_df.if_seq);
> xfs_iunlock(ip, XFS_ILOCK_SHARED);
>
> if (imap.br_startoff > offset_fsb) {
> @@ -454,7 +457,8 @@ xfs_map_blocks(
> return 0;
> allocate_blocks:
> error = xfs_iomap_write_allocate(ip, whichfork, offset, &imap,
> - &wpc->cow_seq);
> + whichfork == XFS_COW_FORK ?
> + &wpc->cow_seq : &wpc->data_seq);
> if (error)
> return error;
> ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 27c93b5f029d..0401e33d4e8f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -681,7 +681,7 @@ xfs_iomap_write_allocate(
> int whichfork,
> xfs_off_t offset,
> xfs_bmbt_irec_t *imap,
> - unsigned int *cow_seq)
> + unsigned int *seq)
> {
> xfs_mount_t *mp = ip->i_mount;
> struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
> @@ -798,7 +798,7 @@ xfs_iomap_write_allocate(
> goto error0;
>
> if (whichfork == XFS_COW_FORK)
> - *cow_seq = READ_ONCE(ifp->if_seq);
> + *seq = READ_ONCE(ifp->if_seq);
> xfs_iunlock(ip, XFS_ILOCK_EXCL);
> }
One of the things that limits xfs_iomap_write_allocate() efficiency
is the mitigation for races against truncate, i.e. the huge comment
that starts:
	/*
	 * it is possible that the extents have changed since
	 * we did the read call as we dropped the ilock for a
	 * while. We have to be careful about truncates or hole
	 * punchs here - we are not allowed to allocate
	 * non-delalloc blocks here.
	....
Now that we can detect that the extents have changed in the data
fork, we can go back to allocating multiple extents per
xfs_bmapi_write() call by doing a sequence number check after we
lock the inode. If the sequence number does not match what was
passed in or returned from the previous loop, we return -EAGAIN.
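To make the protocol concrete, here's a rough user-space model of the
sequence number check (the struct and function names here are made up
for illustration; this is not the kernel code):

```c
#include <errno.h>

/* Toy stand-in for an inode fork with a change sequence counter. */
struct ifork {
	unsigned int	if_seq;		/* bumped on every extent tree change */
};

/*
 * After relocking the inode, compare the fork's current sequence
 * number against the value sampled before the lock was dropped. A
 * mismatch means a truncate or hole punch may have run, so the
 * caller must back out and restart the lookup instead of allocating
 * over a range that may no longer be delalloc.
 */
static int
revalidate_fork(const struct ifork *ifp, unsigned int cached_seq)
{
	if (ifp->if_seq != cached_seq)
		return -EAGAIN;		/* extents changed: restart */
	return 0;
}
```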
Hmmm, looking at the existing -EAGAIN case, I suspect this isn't
handled correctly by xfs_map_blocks() anymore. i.e. it just returns
the error, which can lead to discarding the page rather than checking
to see if there was a valid map allocated. I think there's some
followup work here (another patch series). :/
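For the record, the sort of follow-up I mean is to have the caller
treat -EAGAIN as "restart the lookup" rather than a fatal error. A
rough user-space model (again, all names are illustrative, not the
kernel code):

```c
#include <errno.h>
#include <stdbool.h>

static int attempts;	/* counts calls into the pretend allocator */

/* Pretend allocator: the extent tree "changes" once, then we win. */
static int
write_allocate(bool *mapped)
{
	if (attempts++ == 0)
		return -EAGAIN;		/* seqno mismatch after relock */
	*mapped = true;
	return 0;
}

/*
 * The caller restarts on -EAGAIN instead of propagating it, so a
 * transient extent tree change never causes the page to be tossed.
 */
static int
map_blocks(bool *mapped)
{
	int error;

	do {
		error = write_allocate(mapped);
	} while (error == -EAGAIN);
	return error;
}
```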
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com