linux-fsdevel.vger.kernel.org archive mirror
From: Jeff Moyer <jmoyer@redhat.com>
To: Lukas Czerner <lczerner@redhat.com>
Cc: linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
	jack@suse.cz, david@fromorbit.com
Subject: Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
Date: Thu, 21 Sep 2017 09:44:11 -0400
Message-ID: <x498th8i2tw.fsf@segfault.boston.devel.redhat.com>
In-Reply-To: <1502803734-27706-1-git-send-email-lczerner@redhat.com> (Lukas Czerner's message of "Tue, 15 Aug 2017 15:28:54 +0200")

Lukas Czerner <lczerner@redhat.com> writes:

> Currently when mixing buffered reads and asynchronous direct writes it
> is possible to end up with the situation where we have stale data in the
> page cache while the new data is already written to disk. This is
> permanent until the affected pages are flushed away. Despite the fact
> that mixing buffered and direct IO is ill-advised, it does pose a threat
> to data integrity, is unexpected and should be fixed.
>
> Fix this by deferring completion of asynchronous direct writes to a
> process context in the case that there are mapped pages to be found in
> the inode. Later before the completion in dio_complete() invalidate
> the pages in question. This ensures that after the completion the pages
> in the written area are either unmapped, or populated with up-to-date
> data. Also do the same for the iomap case which uses
> iomap_dio_complete() instead.
>
> This has a side effect of deferring the completion to a process context
> for every AIO DIO that happens on an inode that has pages mapped. However,
> since the consensus is that this is an ill-advised practice, the
> performance implication should not be a problem.
>
> This was based on a proposal from Jeff Moyer, thanks!
>
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>

Is this still in limbo?

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
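
For readers skimming the thread, the deferral rule described in the
commit message can be modelled in user space. This is only a sketch of
the decision made in dio_bio_end_aio(), not kernel code: struct
dio_model and should_defer() are hypothetical names, and the fields are
stand-ins for dio->result, dio->defer_completion, dio->op and
i_mapping->nrpages.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space stand-in for the fields of struct dio that
 * the completion path looks at. Not the kernel structure.
 */
struct dio_model {
	long result;            /* models dio->result; 0 means nothing was transferred */
	bool defer_completion;  /* models dio->defer_completion (O_DSYNC / IS_SYNC case) */
	bool is_write;          /* models dio->op == REQ_OP_WRITE */
	unsigned long nrpages;  /* models dio->inode->i_mapping->nrpages */
};

/*
 * Returns true when completion should be pushed to process context:
 * either it was already requested, or this is a write to an inode that
 * still has pages in the page cache that may need invalidating.
 */
static bool should_defer(const struct dio_model *dio)
{
	if (!dio->result)
		return false;
	return dio->defer_completion ||
	       (dio->is_write && dio->nrpages > 0);
}
```

In this model an AIO write against an inode with cached pages is always
deferred, while reads and writes to inodes with an empty page cache
complete in the original context unless deferral was already requested.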

> ---
> v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
> v3: Do not invalidate in case of error. Add some comments
> v4: Remove unnecessary variable, remove unnecessary inner braces
> v5: Style changes
> v6: Remove redundant invalidatepage, add warning and comment
> v7: Run invalidation conditionally from generic_file_direct_write()
>
>  fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------
>  fs/iomap.c     | 29 ++++++++++++++++-------------
>  mm/filemap.c   | 10 ++++++++--
>  3 files changed, 67 insertions(+), 21 deletions(-)
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 08cf278..ffb9e19 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  {
>  	loff_t offset = dio->iocb->ki_pos;
>  	ssize_t transferred = 0;
> +	int err;
>  
>  	/*
>  	 * AIO submission can race with bio completion to get here while
> @@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  	if (ret == 0)
>  		ret = transferred;
>  
> +	/*
> +	 * Try again to invalidate clean pages which might have been cached by
> +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> +	 * of the write was an mmap'ed region of the file we're writing.  Either
> +	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> +	 * this invalidation fails, tough, the write still worked...
> +	 */
> +	if (ret > 0 && dio->op == REQ_OP_WRITE &&
> +	    dio->inode->i_mapping->nrpages) {
> +		err = invalidate_inode_pages2_range(dio->inode->i_mapping,
> +					offset >> PAGE_SHIFT,
> +					(offset + ret - 1) >> PAGE_SHIFT);
> +		WARN_ON_ONCE(err);
> +	}
> +
>  	if (dio->end_io) {
> -		int err;
>  
>  		// XXX: ki_pos??
>  		err = dio->end_io(dio->iocb, offset, ret, dio->private);
> @@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio)
>  	struct dio *dio = bio->bi_private;
>  	unsigned long remaining;
>  	unsigned long flags;
> +	bool defer_completion = false;
>  
>  	/* cleanup the bio */
>  	dio_bio_complete(dio, bio);
> @@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio)
>  	spin_unlock_irqrestore(&dio->bio_lock, flags);
>  
>  	if (remaining == 0) {
> -		if (dio->result && dio->defer_completion) {
> +		/*
> +		 * Defer completion when defer_completion is set or
> +		 * when the inode has pages mapped and this is AIO write.
> +		 * We need to invalidate those pages because there is a
> +		 * chance they contain stale data in the case buffered IO
> +		 * went in between AIO submission and completion into the
> +		 * same region.
> +		 */
> +		if (dio->result)
> +			defer_completion = dio->defer_completion ||
> +					   (dio->op == REQ_OP_WRITE &&
> +					    dio->inode->i_mapping->nrpages);
> +		if (defer_completion) {
>  			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>  			queue_work(dio->inode->i_sb->s_dio_done_wq,
>  				   &dio->complete_work);
> @@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
>  	 * so that we can call ->fsync.
>  	 */
> -	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
> -	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
> -	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
> -		retval = dio_set_defer_completion(dio);
> +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> +		retval = 0;
> +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> +			retval = dio_set_defer_completion(dio);
> +		else if (!dio->inode->i_sb->s_dio_done_wq) {
> +			/*
> +			 * In case of AIO write racing with buffered read we
> +			 * need to defer completion. We can't decide this now,
> +			 * however the workqueue needs to be initialized here.
> +			 */
> +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
> +		}
>  		if (retval) {
>  			/*
>  			 * We grab i_mutex only for reads so we don't have
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 0392661..c3e299a 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -713,8 +713,24 @@ struct iomap_dio {
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  {
>  	struct kiocb *iocb = dio->iocb;
> +	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>  
> +	/*
> +	 * Try again to invalidate clean pages which might have been cached by
> +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> +	 * of the write was an mmap'ed region of the file we're writing.  Either
> +	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> +	 * this invalidation fails, tough, the write still worked...
> +	 */
> +	if (!dio->error &&
> +	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) {
> +		ret = invalidate_inode_pages2_range(inode->i_mapping,
> +				iocb->ki_pos >> PAGE_SHIFT,
> +				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
> +		WARN_ON_ONCE(ret);
> +	}
> +
>  	if (dio->end_io) {
>  		ret = dio->end_io(iocb,
>  				dio->error ? dio->error : dio->size,
> @@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	ret = iomap_dio_complete(dio);
>  
> -	/*
> -	 * Try again to invalidate clean pages which might have been cached by
> -	 * non-direct readahead, or faulted in by get_user_pages() if the source
> -	 * of the write was an mmap'ed region of the file we're writing.  Either
> -	 * one is a pretty crazy thing to do, so we don't support it 100%.  If
> -	 * this invalidation fails, tough, the write still worked...
> -	 */
> -	if (iov_iter_rw(iter) == WRITE) {
> -		int err = invalidate_inode_pages2_range(mapping,
> -				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -		WARN_ON_ONCE(err);
> -	}
> -
>  	return ret;
>  
>  out_free_dio:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a497024..9440e02 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2885,9 +2885,15 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  	 * we're writing.  Either one is a pretty crazy thing to do,
>  	 * so we don't support it 100%.  If this invalidation
>  	 * fails, tough, the write still worked...
> +	 *
> +	 * Most of the time we do not need this since dio_complete() will do
> +	 * the invalidation for us. However there are some file systems that
> +	 * do not end up with dio_complete() being called, so let's not break
> +	 * them by removing it completely
>  	 */
> -	invalidate_inode_pages2_range(mapping,
> -				pos >> PAGE_SHIFT, end);
> +	if (mapping->nrpages)
> +		invalidate_inode_pages2_range(mapping,
> +					pos >> PAGE_SHIFT, end);
>  
>  	if (written > 0) {
>  		pos += written;
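
The page range passed to invalidate_inode_pages2_range() in the hunks
above is an inclusive pair of page indices derived from the byte offset
and the number of bytes written. A small user-space sketch of that
arithmetic follows. MODEL_PAGE_SHIFT (4 KiB pages), struct page_range
and written_page_range() are assumptions made for illustration, and the
helper expects bytes > 0, mirroring the ret > 0 check in dio_complete().

```c
#include <stdint.h>

/* Assumption for illustration: 4 KiB pages, i.e. a shift of 12. */
#define MODEL_PAGE_SHIFT 12

/* Inclusive range of page indices, first..last. */
struct page_range {
	uint64_t first;
	uint64_t last;
};

/*
 * Compute the pages covered by a write of 'bytes' bytes starting at
 * byte 'offset'. The caller must pass bytes > 0; the "- 1" makes the
 * end index refer to the page holding the last written byte.
 */
static struct page_range written_page_range(uint64_t offset, uint64_t bytes)
{
	struct page_range r;

	r.first = offset >> MODEL_PAGE_SHIFT;
	r.last = (offset + bytes - 1) >> MODEL_PAGE_SHIFT;
	return r;
}
```

With these assumptions, a 2-byte write at offset 4095 touches pages 0
and 1, while a page-aligned 4096-byte write at offset 4096 touches only
page 1.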
