linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Eryu Guan <guaneryu@gmail.com>
Cc: linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	hch@lst.de, Eryu Guan <eguan@redhat.com>
Subject: Re: [PATCH] iomap: invalidate page caches should be after iomap_dio_complete() in direct write
Date: Tue, 28 Feb 2017 14:23:56 -0800	[thread overview]
Message-ID: <20170228222356.GJ5297@birch.djwong.org> (raw)
In-Reply-To: <20170228173626.27640-1-guaneryu@gmail.com>

On Wed, Mar 01, 2017 at 01:36:26AM +0800, Eryu Guan wrote:
> From: Eryu Guan <eguan@redhat.com>
> 
> After XFS switching to iomap based DIO (commit acdda3aae146 ("xfs:
> use iomap_dio_rw")), I started to notice dio29/dio30 tests failures
> from LTP run on ppc64 hosts, and they can be reproduced on x86_64
> hosts with 512B/1k block size XFS too.
> 
> dio29	diotest3 -b 65536 -n 100 -i 1000 -o 1024000
> dio30	diotest6 -b 65536 -n 100 -i 1000 -o 1024000
> 
> The failure message is like:
> bufcmp: offset 0: Expected: 0x62, got 0x0
> diotest03    1  TPASS  :  Read with Direct IO, Write without
> diotest03    2  TFAIL  :  diotest3.c:142: comparsion failed; child=98 offset=1425408
> diotest03    3  TFAIL  :  diotest3.c:194: Write Direct-child 98 failed
> 
> Direct write wrote 0x62 but buffer read got zero. This is because,
> when doing direct write to a hole or preallocated file, we
> invalidate the page caches before converting the extent from
> unwritten state to normal state, which is done by
> iomap_dio_complete(), thus leave a window for other buffer reader to
> cache the unwritten state extent.
> 
> Consider this case, with sub-page blocksize XFS, two processes are
> direct writing to different blocksize-aligned regions (say 512B) of
> the same preallocated file, and reading the region back via buffered
> I/O to compare contents.
> 
> process A, region [0,512]		process B, region [512,1024]
> xfs_file_write_iter
>  xfs_file_aio_dio_write
>   iomap_dio_rw
>    iomap_apply
>    invalidate_inode_pages2_range
>    					xfs_file_write_iter
> 				 	xfs_file_aio_dio_write
> 					  iomap_dio_rw
> 					   iomap_apply
> 					   invalidate_inode_pages2_range
> 					   iomap_dio_complete
> 					xfs_file_read_iter
> 					 xfs_file_buffered_aio_read
> 					  generic_file_read_iter
> 					   do_generic_file_read
> 					    <readahead fills pagecache with 0>
>    iomap_dio_complete
> xfs_file_read_iter
>  <read gets 0 from pagecache>
> 
> Process A first invalidates page caches, at this point the
> underlying extent is still in unwritten state (iomap_dio_complete
> not called yet), and process B finishs direct write and populates
> page caches via readahead, which caches zeros in page for region A,
> then process A reads zeros from page cache, instead of the actual
> data.
> 
> Fix it by invalidating page caches after converting unwritten extent
> to make sure we read content from disk after extent state changed,
> as what we did before switching to iomap based dio.

/me thinks this looks ok, though I'd rather have Christoph's r-v-b since
it's his code... :)

--D

> 
> Also introduce a new 'start' variable to save the original write
> offset (iomap_dio_complete() updates iocb->ki_pos), and a 'res'
> variable for invalidating caches result, cause we can't reuse 'ret'
> anymore.
> 
> Signed-off-by: Eryu Guan <eguan@redhat.com>
> ---
>  fs/iomap.c | 23 +++++++++++++----------
>  1 file changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 0f85f24..d5e4ea0 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -844,10 +844,12 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	struct address_space *mapping = iocb->ki_filp->f_mapping;
>  	struct inode *inode = file_inode(iocb->ki_filp);
>  	size_t count = iov_iter_count(iter);
> -	loff_t pos = iocb->ki_pos, end = iocb->ki_pos + count - 1, ret = 0;
> +	loff_t pos = iocb->ki_pos, start = pos;
> +	loff_t end = iocb->ki_pos + count - 1, ret = 0;
>  	unsigned int flags = IOMAP_DIRECT;
>  	struct blk_plug plug;
>  	struct iomap_dio *dio;
> +	int res;
>  
>  	lockdep_assert_held(&inode->i_rwsem);
>  
> @@ -885,14 +887,13 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	}
>  
>  	if (mapping->nrpages) {
> -		ret = filemap_write_and_wait_range(mapping, iocb->ki_pos, end);
> +		ret = filemap_write_and_wait_range(mapping, start, end);
>  		if (ret)
>  			goto out_free_dio;
>  
> -		ret = invalidate_inode_pages2_range(mapping,
> -				iocb->ki_pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -		WARN_ON_ONCE(ret);
> -		ret = 0;
> +		res = invalidate_inode_pages2_range(mapping,
> +				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> +		WARN_ON_ONCE(res);
>  	}
>  
>  	inode_dio_begin(inode);
> @@ -939,6 +940,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		__set_current_state(TASK_RUNNING);
>  	}
>  
> +	ret = iomap_dio_complete(dio);
> +
>  	/*
>  	 * Try again to invalidate clean pages which might have been cached by
>  	 * non-direct readahead, or faulted in by get_user_pages() if the source
> @@ -947,12 +950,12 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	 * this invalidation fails, tough, the write still worked...
>  	 */
>  	if (iov_iter_rw(iter) == WRITE && mapping->nrpages) {
> -		ret = invalidate_inode_pages2_range(mapping,
> -				iocb->ki_pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -		WARN_ON_ONCE(ret);
> +		res = invalidate_inode_pages2_range(mapping,
> +				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> +		WARN_ON_ONCE(res);
>  	}
>  
> -	return iomap_dio_complete(dio);
> +	return ret;
>  
>  out_free_dio:
>  	kfree(dio);
> -- 
> 2.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2017-02-28 23:35 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-02-28 17:36 [PATCH] iomap: invalidate page caches should be after iomap_dio_complete() in direct write Eryu Guan
2017-02-28 22:23 ` Darrick J. Wong [this message]
2017-03-01 15:02 ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170228222356.GJ5297@birch.djwong.org \
    --to=darrick.wong@oracle.com \
    --cc=eguan@redhat.com \
    --cc=guaneryu@gmail.com \
    --cc=hch@lst.de \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).