From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-xfs-owner@vger.kernel.org>
Received: from userp2120.oracle.com ([156.151.31.85]:49184 "EHLO
        userp2120.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1726442AbfARGMZ (ORCPT
        <rfc822;linux-xfs@vger.kernel.org>); Fri, 18 Jan 2019 01:12:25 -0500
Subject: Re: [PATCH v2 3/5] xfs: validate writeback mapping using data fork
 seq counter
References: <20190117192004.49346-1-bfoster@redhat.com>
 <20190117192004.49346-4-bfoster@redhat.com>
From: Allison Henderson <allison.henderson@oracle.com>
Message-ID: <d761eaea-4114-95f3-e21a-5ff74e8cfdae@oracle.com>
Date: Thu, 17 Jan 2019 23:12:19 -0700
MIME-Version: 1.0
In-Reply-To: <20190117192004.49346-4-bfoster@redhat.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: linux-xfs-owner@vger.kernel.org
List-ID: <linux-xfs.vger.kernel.org>
List-Id: xfs
To: Brian Foster <bfoster@redhat.com>, linux-xfs@vger.kernel.org


On 1/17/19 12:20 PM, Brian Foster wrote:
> The writeback code caches the current extent mapping across multiple
> xfs_do_writepage() calls to avoid repeated lookups for sequential
> pages backed by the same extent. This is known to be slightly racy
> with extent fork changes in certain difficult to reproduce
> scenarios. The cached extent is trimmed to within EOF to help avoid
> the most common vector for this problem via speculative
> preallocation management, but this is a band-aid that does not
> address the fundamental problem.
> 
> Now that we have an xfs_ifork sequence counter mechanism used to
> facilitate COW writeback, we can use the same mechanism to validate
> consistency between the data fork and cached writeback mappings. On
> its face, this is somewhat of a big hammer approach because any
> change to the data fork invalidates any mapping currently cached by
> a writeback in progress regardless of whether the data fork change
> overlaps with the range under writeback. In practice, however, the
> impact of this approach is minimal in most cases.
> 
> First, data fork changes (delayed allocations) caused by sustained
> sequential buffered writes are amortized across speculative
> preallocations. This means that a cached mapping won't be
> invalidated by each buffered write of a common file copy workload,
> but rather only on less frequent allocation events. Second, the
> extent tree is always entirely in-core so an additional lookup of a
> usable extent mostly costs a shared ilock cycle and in-memory tree
> lookup. This means that a cached mapping reval is relatively cheap
> compared to the I/O itself. Third, spurious invalidations don't
> impact ioend construction. This means that even if the same extent
> is revalidated multiple times across multiple writepage instances,
> we still construct and submit the same size ioend (and bio) if the
> blocks are physically contiguous.
> 
> Update struct xfs_writepage_ctx with a new field to hold the
> sequence number of the data fork associated with the currently
> cached mapping. Check the wpc seqno against the data fork when the
> mapping is validated and reestablish the mapping whenever the fork
> has changed since the mapping was cached. This ensures that
> writeback always uses a valid extent mapping and thus prevents lost
> writebacks and stale delalloc block problems.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>   fs/xfs/xfs_aops.c  | 60 ++++++++++++++++++++++++++++++++++++----------
>   fs/xfs/xfs_iomap.c |  5 ++--
>   2 files changed, 49 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d9048bcea49c..649e4ad76add 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -29,6 +29,7 @@
>   struct xfs_writepage_ctx {
>   	struct xfs_bmbt_irec    imap;
>   	unsigned int		io_type;
> +	unsigned int		data_seq;
>   	unsigned int		cow_seq;
>   	struct xfs_ioend	*ioend;
>   };
> @@ -301,6 +302,42 @@ xfs_end_bio(
>   		xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
>   }
>   
> +/*
> + * Fast revalidation of the cached writeback mapping. Return true if the current
> + * mapping is valid, false otherwise.
> + */
> +static bool
> +xfs_imap_valid(
> +	struct xfs_writepage_ctx	*wpc,
> +	struct xfs_inode		*ip,
> +	xfs_fileoff_t			offset_fsb)
> +{
> +	/*
> +	 * If this is a COW mapping, it is sufficient to check that the mapping
> +	 * covers the offset. Be careful to check this first because the caller
> +	 * can revalidate a COW mapping without updating the data seqno.
> +	 */
> +	if (offset_fsb < wpc->imap.br_startoff ||
> +	    offset_fsb >= wpc->imap.br_startoff + wpc->imap.br_blockcount)
> +		return false;
> +	if (wpc->io_type == XFS_IO_COW)
> +		return true;
> +
> +	/*
> +	 * This is not a COW mapping. Check the sequence number of the data fork
> +	 * because concurrent changes could have invalidated the extent. Check
> +	 * the COW fork because concurrent changes since the last time we
> +	 * checked (and found nothing at this offset) could have added
> +	 * overlapping blocks.
> +	 */
> +	if (wpc->data_seq != READ_ONCE(ip->i_df.if_seq))
> +		return false;
> +	if (xfs_inode_has_cow_data(ip) &&
> +	    wpc->cow_seq != READ_ONCE(ip->i_cowfp->if_seq))
> +		return false;
> +	return true;
> +}
> +
>   STATIC int
>   xfs_map_blocks(
>   	struct xfs_writepage_ctx *wpc,
> @@ -315,9 +352,11 @@ xfs_map_blocks(
>   	struct xfs_bmbt_irec	imap;
>   	int			whichfork = XFS_DATA_FORK;
>   	struct xfs_iext_cursor	icur;
> -	bool			imap_valid;
>   	int			error = 0;
>   
> +	if (XFS_FORCED_SHUTDOWN(mp))
> +		return -EIO;
> +
>   	/*
>   	 * We have to make sure the cached mapping is within EOF to protect
>   	 * against eofblocks trimming on file release leaving us with a stale
> @@ -346,17 +385,9 @@ xfs_map_blocks(
>   	 * against concurrent updates and provides a memory barrier on the way
>   	 * out that ensures that we always see the current value.
>   	 */
> -	imap_valid = offset_fsb >= wpc->imap.br_startoff &&
> -		     offset_fsb < wpc->imap.br_startoff + wpc->imap.br_blockcount;
> -	if (imap_valid &&
> -	    (!xfs_inode_has_cow_data(ip) ||
> -	     wpc->io_type == XFS_IO_COW ||
> -	     wpc->cow_seq == READ_ONCE(ip->i_cowfp->if_seq)))
> +	if (xfs_imap_valid(wpc, ip, offset_fsb))
>   		return 0;
>   
> -	if (XFS_FORCED_SHUTDOWN(mp))
> -		return -EIO;
> -
>   	/*
>   	 * If we don't have a valid map, now it's time to get a new one for this
>   	 * offset.  This will convert delayed allocations (including COW ones)
> @@ -403,9 +434,10 @@ xfs_map_blocks(
>   	}
>   
>   	/*
> -	 * Map valid and no COW extent in the way?  We're done.
> +	 * No COW extent overlap. Revalidate now that we may have updated
> +	 * ->cow_seq. If the data mapping is still valid, we're done.
>   	 */
> -	if (imap_valid) {
> +	if (xfs_imap_valid(wpc, ip, offset_fsb)) {
>   		xfs_iunlock(ip, XFS_ILOCK_SHARED);
>   		return 0;
>   	}
> @@ -417,6 +449,7 @@ xfs_map_blocks(
>   	 */
>   	if (!xfs_iext_lookup_extent(ip, &ip->i_df, offset_fsb, &icur, &imap))
>   		imap.br_startoff = end_fsb;	/* fake a hole past EOF */
> +	wpc->data_seq = READ_ONCE(ip->i_df.if_seq);
>   	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>   
>   	if (imap.br_startoff > offset_fsb) {
> @@ -454,7 +487,8 @@ xfs_map_blocks(
>   	return 0;
>   allocate_blocks:
>   	error = xfs_iomap_write_allocate(ip, whichfork, offset, &imap,
> -			&wpc->cow_seq);
> +			whichfork == XFS_COW_FORK ?
> +					 &wpc->cow_seq : &wpc->data_seq);
>   	if (error)
>   		return error;
>   	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 27c93b5f029d..ab69caa685b4 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -681,7 +681,7 @@ xfs_iomap_write_allocate(
>   	int		whichfork,
>   	xfs_off_t	offset,
>   	xfs_bmbt_irec_t *imap,
> -	unsigned int	*cow_seq)
> +	unsigned int	*seq)
>   {
>   	xfs_mount_t	*mp = ip->i_mount;
>   	struct xfs_ifork *ifp = XFS_IFORK_PTR(ip, whichfork);
> @@ -797,8 +797,7 @@ xfs_iomap_write_allocate(
>   			if (error)
>   				goto error0;
>   
> -			if (whichfork == XFS_COW_FORK)
> -				*cow_seq = READ_ONCE(ifp->if_seq);
> +			*seq = READ_ONCE(ifp->if_seq);
>   			xfs_iunlock(ip, XFS_ILOCK_EXCL);
>   		}
>   
> 

Looks good, I think the new version is a lot cleaner with the helper 
routine.  Thx!

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>