From: Jeff Moyer <jmoyer@redhat.com>
To: Lukas Czerner <lczerner@redhat.com>
Cc: linux-fsdevel@vger.kernel.org, viro@zeniv.linux.org.uk,
jack@suse.cz, david@fromorbit.com
Subject: Re: [PATCH v7] fs: Fix page cache inconsistency when mixing buffered and AIO DIO
Date: Thu, 21 Sep 2017 09:44:11 -0400
Message-ID: <x498th8i2tw.fsf@segfault.boston.devel.redhat.com>
In-Reply-To: <1502803734-27706-1-git-send-email-lczerner@redhat.com> (Lukas Czerner's message of "Tue, 15 Aug 2017 15:28:54 +0200")
Lukas Czerner <lczerner@redhat.com> writes:
> Currently when mixing buffered reads and asynchronous direct writes it
> is possible to end up with the situation where we have stale data in the
> page cache while the new data is already written to disk. This is
> permanent until the affected pages are flushed away. Despite the fact
> that mixing buffered and direct IO is ill-advised, it does pose a threat
> to data integrity, is unexpected, and should be fixed.
>
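For readers following along, the race described above can be replayed with a toy user-space model. This is only a sketch under loud assumptions: `struct toy_file` and the `toy_*` helpers are hypothetical names, not kernel code, a single integer stands in for a page of data, and the write is assumed to reach disk exactly at completion time.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical): one page of a file, on disk and in the cache. */
struct toy_file {
	int disk;	/* data currently on disk */
	int cache;	/* data currently in the page cache */
	bool cached;	/* is the page present in the page cache? */
};

/* Buffered read: fill the cache from disk on a miss, then serve the cache. */
static int toy_buffered_read(struct toy_file *f)
{
	if (!f->cached) {
		f->cache = f->disk;
		f->cached = true;
	}
	return f->cache;
}

/* Direct write submission: the cached page is dropped, the IO is in flight. */
static void toy_dio_submit(struct toy_file *f)
{
	f->cached = false;
}

/* Direct write completion: data is on disk; optionally invalidate again. */
static void toy_dio_complete(struct toy_file *f, int data, bool invalidate)
{
	f->disk = data;
	if (invalidate)
		f->cached = false;
}

/* Replay the race and return what a later buffered read observes. */
static int toy_race(bool invalidate_on_completion)
{
	struct toy_file f = { .disk = 1, .cache = 0, .cached = false };

	toy_dio_submit(&f);
	(void)toy_buffered_read(&f);	/* sneaks in and caches the old data */
	toy_dio_complete(&f, 2, invalidate_on_completion);
	assert(f.disk == 2);		/* the write itself always lands */
	return toy_buffered_read(&f);
}
```

In this model `toy_race(false)` keeps serving the old value from the cache even though the disk holds the new one, while `toy_race(true)` re-reads from disk after the completion-time invalidation.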
> Fix this by deferring completion of asynchronous direct writes to a
> process context in the case that there are mapped pages to be found in
> the inode. Later before the completion in dio_complete() invalidate
> the pages in question. This ensures that after the completion the pages
> in the written area are either unmapped, or populated with up-to-date
> data. Also do the same for the iomap case which uses
> iomap_dio_complete() instead.
>
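The page range invalidated at completion follows from the byte range written: first index `offset >> PAGE_SHIFT`, last index `(offset + ret - 1) >> PAGE_SHIFT`. A minimal user-space sketch of that arithmetic, assuming 4 KiB pages; `TOY_PAGE_SHIFT`, `struct toy_page_range` and `toy_written_pages()` are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Inclusive range of page indices touched by a write (hypothetical type). */
struct toy_page_range {
	uint64_t first;
	uint64_t last;
};

/*
 * Map a completed write of 'written' bytes at byte 'offset' to the page
 * indices it covers. The "- 1" keeps a write that ends exactly on a page
 * boundary from spilling into the following page.
 */
static struct toy_page_range toy_written_pages(int64_t offset, int64_t written)
{
	struct toy_page_range r;

	assert(offset >= 0 && written > 0);
	r.first = (uint64_t)offset >> TOY_PAGE_SHIFT;
	r.last = (uint64_t)(offset + written - 1) >> TOY_PAGE_SHIFT;
	return r;
}
```

For example, a 4096-byte write at offset 4096 covers only page 1, whereas a 2-byte write at offset 4095 straddles pages 0 and 1.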
> This has a side effect of deferring the completion to a process context
> for every AIO DIO that happens on an inode that has pages mapped. However,
> since the consensus is that this is an ill-advised practice, the performance
> implication should not be a problem.
>
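The deferral rule can be written as a small predicate. The sketch below mirrors the condition added to dio_bio_end_aio() in the patch, but `struct toy_dio` and `toy_should_defer()` are hypothetical stand-ins for illustration, not the kernel's `struct dio`:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields of struct dio that the check reads. */
struct toy_dio {
	long result;		/* bytes transferred so far, 0 if none */
	bool defer_completion;	/* set at submission, e.g. for O_DSYNC */
	bool is_write;		/* models dio->op == REQ_OP_WRITE */
	unsigned long nrpages;	/* models dio->inode->i_mapping->nrpages */
};

/*
 * Complete in process context when deferral was already requested, or when
 * this is a write to an inode that still has pages in the page cache.
 * Nothing is deferred if no data was transferred.
 */
static bool toy_should_defer(const struct toy_dio *dio)
{
	assert(dio);
	if (!dio->result)
		return false;
	return dio->defer_completion || (dio->is_write && dio->nrpages > 0);
}
```

The side effect mentioned above shows up directly: any successful AIO write with `nrpages > 0` takes the workqueue path, whether or not a buffered read actually raced with it.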
> This was based on a proposal from Jeff Moyer, thanks!
>
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>
Is this still in limbo?
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
> ---
> v2: Remove leftover ret variable from invalidate call in iomap_dio_complete
> v3: Do not invalidate in case of error. Add some comments
> v4: Remove unnecessary variable, remove unnecessary inner braces
> v5: Style changes
> v6: Remove redundant invalidatepage, add warning and comment
> v7: Run invalidation conditionally from generic_file_direct_write()
>
>  fs/direct-io.c | 49 +++++++++++++++++++++++++++++++++++++++++++------
>  fs/iomap.c     | 29 ++++++++++++++++-------------
>  mm/filemap.c   | 10 ++++++++--
>  3 files changed, 67 insertions(+), 21 deletions(-)
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 08cf278..ffb9e19 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -229,6 +229,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  {
>  	loff_t offset = dio->iocb->ki_pos;
>  	ssize_t transferred = 0;
> +	int err;
>
>  	/*
>  	 * AIO submission can race with bio completion to get here while
> @@ -258,8 +259,22 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, bool is_async)
>  	if (ret == 0)
>  		ret = transferred;
>
> +	/*
> +	 * Try again to invalidate clean pages which might have been cached by
> +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> +	 * of the write was an mmap'ed region of the file we're writing. Either
> +	 * one is a pretty crazy thing to do, so we don't support it 100%. If
> +	 * this invalidation fails, tough, the write still worked...
> +	 */
> +	if (ret > 0 && dio->op == REQ_OP_WRITE &&
> +	    dio->inode->i_mapping->nrpages) {
> +		err = invalidate_inode_pages2_range(dio->inode->i_mapping,
> +					offset >> PAGE_SHIFT,
> +					(offset + ret - 1) >> PAGE_SHIFT);
> +		WARN_ON_ONCE(err);
> +	}
> +
>  	if (dio->end_io) {
> -		int err;
>
>  		// XXX: ki_pos??
>  		err = dio->end_io(dio->iocb, offset, ret, dio->private);
> @@ -304,6 +319,7 @@ static void dio_bio_end_aio(struct bio *bio)
>  	struct dio *dio = bio->bi_private;
>  	unsigned long remaining;
>  	unsigned long flags;
> +	bool defer_completion = false;
>
>  	/* cleanup the bio */
>  	dio_bio_complete(dio, bio);
> @@ -315,7 +331,19 @@ static void dio_bio_end_aio(struct bio *bio)
>  	spin_unlock_irqrestore(&dio->bio_lock, flags);
>
>  	if (remaining == 0) {
> -		if (dio->result && dio->defer_completion) {
> +		/*
> +		 * Defer completion when defer_completion is set or
> +		 * when the inode has pages mapped and this is AIO write.
> +		 * We need to invalidate those pages because there is a
> +		 * chance they contain stale data in the case buffered IO
> +		 * went in between AIO submission and completion into the
> +		 * same region.
> +		 */
> +		if (dio->result)
> +			defer_completion = dio->defer_completion ||
> +					   (dio->op == REQ_OP_WRITE &&
> +					    dio->inode->i_mapping->nrpages);
> +		if (defer_completion) {
>  			INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>  			queue_work(dio->inode->i_sb->s_dio_done_wq,
>  				   &dio->complete_work);
> @@ -1210,10 +1238,19 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
>  	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
>  	 * so that we can call ->fsync.
>  	 */
> -	if (dio->is_async && iov_iter_rw(iter) == WRITE &&
> -	    ((iocb->ki_filp->f_flags & O_DSYNC) ||
> -	     IS_SYNC(iocb->ki_filp->f_mapping->host))) {
> -		retval = dio_set_defer_completion(dio);
> +	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
> +		retval = 0;
> +		if ((iocb->ki_filp->f_flags & O_DSYNC) ||
> +		    IS_SYNC(iocb->ki_filp->f_mapping->host))
> +			retval = dio_set_defer_completion(dio);
> +		else if (!dio->inode->i_sb->s_dio_done_wq) {
> +			/*
> +			 * In case of AIO write racing with buffered read we
> +			 * need to defer completion. We can't decide this now,
> +			 * however the workqueue needs to be initialized here.
> +			 */
> +			retval = sb_init_dio_done_wq(dio->inode->i_sb);
> +		}
>  		if (retval) {
>  			/*
>  			 * We grab i_mutex only for reads so we don't have
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 0392661..c3e299a 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -713,8 +713,24 @@ struct iomap_dio {
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>  {
>  	struct kiocb *iocb = dio->iocb;
> +	struct inode *inode = file_inode(iocb->ki_filp);
>  	ssize_t ret;
>
> +	/*
> +	 * Try again to invalidate clean pages which might have been cached by
> +	 * non-direct readahead, or faulted in by get_user_pages() if the source
> +	 * of the write was an mmap'ed region of the file we're writing. Either
> +	 * one is a pretty crazy thing to do, so we don't support it 100%. If
> +	 * this invalidation fails, tough, the write still worked...
> +	 */
> +	if (!dio->error &&
> +	    (dio->flags & IOMAP_DIO_WRITE) && inode->i_mapping->nrpages) {
> +		ret = invalidate_inode_pages2_range(inode->i_mapping,
> +				iocb->ki_pos >> PAGE_SHIFT,
> +				(iocb->ki_pos + dio->size - 1) >> PAGE_SHIFT);
> +		WARN_ON_ONCE(ret);
> +	}
> +
>  	if (dio->end_io) {
>  		ret = dio->end_io(iocb,
>  				dio->error ? dio->error : dio->size,
> @@ -1042,19 +1058,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>
>  	ret = iomap_dio_complete(dio);
>
> -	/*
> -	 * Try again to invalidate clean pages which might have been cached by
> -	 * non-direct readahead, or faulted in by get_user_pages() if the source
> -	 * of the write was an mmap'ed region of the file we're writing. Either
> -	 * one is a pretty crazy thing to do, so we don't support it 100%. If
> -	 * this invalidation fails, tough, the write still worked...
> -	 */
> -	if (iov_iter_rw(iter) == WRITE) {
> -		int err = invalidate_inode_pages2_range(mapping,
> -				start >> PAGE_SHIFT, end >> PAGE_SHIFT);
> -		WARN_ON_ONCE(err);
> -	}
> -
>  	return ret;
>
>  out_free_dio:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a497024..9440e02 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2885,9 +2885,15 @@ generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
>  	 * we're writing. Either one is a pretty crazy thing to do,
>  	 * so we don't support it 100%. If this invalidation
>  	 * fails, tough, the write still worked...
> +	 *
> +	 * Most of the time we do not need this since dio_complete() will do
> +	 * the invalidation for us. However there are some file systems that
> +	 * do not end up with dio_complete() being called, so let's not break
> +	 * them by removing it completely
>  	 */
> -	invalidate_inode_pages2_range(mapping,
> -					pos >> PAGE_SHIFT, end);
> +	if (mapping->nrpages)
> +		invalidate_inode_pages2_range(mapping,
> +					pos >> PAGE_SHIFT, end);
>
>  	if (written > 0) {
>  		pos += written;
Thread overview: 40+ messages
2017-07-13 15:17 [PATCH] fs: Fix page cache inconsistency when mixing buffered and AIO DIO Lukas Czerner
2017-07-14 10:41 ` kbuild test robot
2017-07-14 13:40 ` Lukas Czerner
2017-07-14 15:40 ` [PATCH v2] " Lukas Czerner
2017-07-17 15:12 ` Jan Kara
2017-07-17 15:28 ` Lukas Czerner
2017-07-17 15:39 ` Jeff Moyer
2017-07-17 16:17 ` Jan Kara
2017-07-17 19:52 ` Jeff Moyer
2017-07-18 7:39 ` Lukas Czerner
2017-07-18 9:06 ` Jan Kara
2017-07-18 9:32 ` Lukas Czerner
2017-07-18 12:19 ` [PATCH v3] " Lukas Czerner
2017-07-18 13:44 ` Christoph Hellwig
2017-07-18 14:17 ` Jan Kara
2017-07-19 8:42 ` Lukas Czerner
2017-07-19 8:48 ` [PATCH v4] " Lukas Czerner
2017-07-19 9:26 ` Jan Kara
2017-07-19 11:01 ` Lukas Czerner
2017-07-19 11:28 ` [PATCH v5] " Lukas Czerner
2017-07-19 11:37 ` Jan Kara
2017-07-19 12:17 ` Jeff Moyer
2017-08-03 18:10 ` Jeff Moyer
2017-08-04 10:09 ` Dave Chinner
2017-08-07 15:52 ` Jeff Moyer
2017-08-08 8:41 ` Lukas Czerner
2017-08-10 12:59 ` [PATCH v6] " Lukas Czerner
2017-08-10 13:56 ` Jan Kara
2017-08-10 14:22 ` Jeff Moyer
2017-08-11 9:03 ` Lukas Czerner
2017-08-14 9:43 ` Jan Kara
2017-08-15 12:47 ` Lukas Czerner
2017-08-15 13:28 ` [PATCH v7] " Lukas Czerner
2017-08-16 13:15 ` Jan Kara
2017-08-16 16:01 ` Darrick J. Wong
2017-09-21 13:44 ` Jeff Moyer [this message]
2017-09-21 13:44 ` Lukas Czerner
2017-09-21 14:14 ` Jens Axboe
2017-10-10 14:34 ` David Sterba
2017-10-11 9:21 ` Lukas Czerner