linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: axboe@kernel.dk, tytso@mit.edu, jmoyer@redhat.com, bpm@sgi.com,
	viro@zeniv.linux.org.uk, jack@suse.cz,
	linux-fsdevel@vger.kernel.org, hch@infradead.org,
	linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
	xfs@oss.sgi.com, djwong+kernel@djwong.org
Subject: Re: [PATCH 4/9] xfs: honor the O_SYNC flag for aysnchronous direct I/O requests
Date: Tue, 20 Nov 2012 22:20:38 +1100	[thread overview]
Message-ID: <20121120112038.GC2591@dastard> (raw)
In-Reply-To: <20121120075114.25270.40680.stgit@blackbox.djwong.org>

On Mon, Nov 19, 2012 at 11:51:14PM -0800, Darrick J. Wong wrote:
> If a file is opened with O_SYNC|O_DIRECT, the drive cache does not get
> flushed after the write completion for AIOs.  This patch attempts to fix
> that problem by marking an I/O as requiring a cache flush in endio
> processing, and then issuing the cache flush after any unwritten extent
> conversion is done.
> 
> From: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
> [darrick.wong@oracle.com: Rework patch to use per-mount workqueues]
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_aops.c  |   52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_aops.h  |    1 +
>  fs/xfs/xfs_mount.h |    1 +
>  fs/xfs/xfs_super.c |    8 ++++++++
>  4 files changed, 61 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index e57e2da..9cebbb7 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -173,6 +173,24 @@ xfs_setfilesize(
>  }
>  
>  /*
> + * In the case of synchronous, AIO, O_DIRECT writes, we need to flush
> + * the disk cache when the I/O is complete.
> + */

That's not quite right - we need to fsync the metadata when the data
IO is complete. Block device/disk cache flushes are irrelevant at
this level as that is wholly encapsulated inside the metadata fsync
processing.

> +STATIC bool
> +xfs_ioend_needs_cache_flush(

xfs_ioend_need_fsync()

> +	struct xfs_ioend	*ioend)
> +{
> +	struct xfs_inode *ip = XFS_I(ioend->io_inode);
> +	struct xfs_mount *mp = ip->i_mount;
> +
> +	if (!(mp->m_flags & XFS_MOUNT_BARRIER))
> +		return false;

And regardless of whether we have barriers enabled or not, we need
to flush the dirty metadata to the log for O_SYNC/O_DSYNC+AIO+DIO
writes here. So there should be no check of the mount flags here.

> +	return IS_SYNC(ioend->io_inode) ||
> +	       (ioend->io_iocb->ki_filp->f_flags & O_DSYNC);
> +}
> +
> +/*
>   * Schedule IO completion handling on the final put of an ioend.
>   *
>   * If there is no work to do we might as well call it a day and free the
> @@ -189,11 +207,30 @@ xfs_finish_ioend(
>  			queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
>  		else if (ioend->io_append_trans)
>  			queue_work(mp->m_data_workqueue, &ioend->io_work);
> +		else if (ioend->io_needs_fsync)
> +			queue_work(mp->m_aio_blkdev_flush_wq, &ioend->io_work);
>  		else
>  			xfs_destroy_ioend(ioend);
>  	}
>  }
>  
> +STATIC int
> +xfs_ioend_force_cache_flush(

STATIC void
xfs_ioend_fsync()

> +	xfs_ioend_t	*ioend)
> +{
> +	struct xfs_inode *ip = XFS_I(ioend->io_inode);
> +	struct xfs_mount *mp = ip->i_mount;
> +	int		err = 0;
> +	int		datasync;

	trace_xfs_ioend_fsync(ip);

> +	datasync = !IS_SYNC(ioend->io_inode) &&
> +		!(ioend->io_iocb->ki_filp->f_flags & __O_SYNC);
> +	err = do_xfs_file_fsync(ip, mp, datasync);
> +	xfs_destroy_ioend(ioend);

I think this is wrong. The ioend is destroyed by the caller, so
putting it here turns all subsequent uses by the caller into
use-after-free memory corruption bugs.


> +	/* do_xfs_file_fsync returns -errno. our caller expects positive. */
> +	return -err;

This is why xfs_do_file_fsync() should be returning a positive
error - to avoid repeated juggling of error signs.

> +}
> +
>  /*
>   * IO write completion.
>   */
> @@ -250,12 +287,22 @@ xfs_end_io(
>  		error = xfs_setfilesize(ioend);
>  		if (error)
>  			ioend->io_error = -error;
> +	} else if (ioend->io_needs_fsync) {
> +		error = xfs_ioend_force_cache_flush(ioend);
> +		if (error && ioend->io_result > 0)
> +			ioend->io_error = -error;
> +		ioend->io_needs_fsync = 0;

So this is all use-after-free. Also, there's no need to clear
io_needs_fsync() as the ioend is about to be destroyed.

>  	} else {
>  		ASSERT(!xfs_ioend_is_append(ioend));
>  	}
>  
>  done:
> -	xfs_destroy_ioend(ioend);
> +	/* the honoring of O_SYNC has to be done last */
> +	if (ioend->io_needs_fsync) {
> +		atomic_inc(&ioend->io_remaining);
> +		xfs_finish_ioend(ioend);
> +	} else
> +		xfs_destroy_ioend(ioend);

And requeuing work from one workqueue to the next is something that
we can avoid.  We know at IO submission time (i.e.
xfs_vm_direct_io)) whether an fsync completion is going to be needed
during Io completion.  The ioend->io_needs_fsync flag can be set
then, and the first pass through xfs_finish_ioend() can queue it to
the correct workqueue.  i.e. it only needs to be queued if it's not
already an unwritten or append ioend and it needs an fsync.

As it is, all the data completion workqueues run the same completion
function so all you need to do is handle the fsync case at the end
of the existing processing - it's not an else case. i.e the end of
xfs_end_io() becomes:

	if (ioend->io_needs_fsync) {
		error = xfs_ioend_fsync(ioend);
		if (error)
			ioend->io_error = -error;
		goto done;
	}
done:
	xfs_destroy_ioend(ioend);

As it is, this code is going to change before these changes go in -
there's a nasty regression in the DIO code that I found this
afternoon that requires reworking this IO completion logic to
avoid. The patch will appear on the list soon....

> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -209,6 +209,7 @@ typedef struct xfs_mount {
>  	struct workqueue_struct	*m_data_workqueue;
>  	struct workqueue_struct	*m_unwritten_workqueue;
>  	struct workqueue_struct	*m_cil_workqueue;
> +	struct workqueue_struct *m_aio_blkdev_flush_wq;

	struct workqueue_struct *m_aio_fsync_wq;

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  parent reply	other threads:[~2012-11-20 11:20 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-11-20  7:41 [PATCH v1 0/9] fs: fix up AIO+DIO+O_SYNC to actually do the sync part Darrick J. Wong
2012-11-20  7:41 ` [PATCH 1/9] vfs: Handle O_SYNC AIO DIO in generic code properly Darrick J. Wong
2012-11-21 10:08   ` Christoph Hellwig
2012-11-21 16:58     ` Jeff Moyer
2012-11-21 18:29       ` Christoph Hellwig
2012-11-21 18:38         ` Jeff Moyer
2012-11-21 21:37         ` Jan Kara
2012-11-21 23:09           ` Jeffrey Ellis
2012-11-20  7:41 ` [PATCH 2/9] ext4: honor the O_SYNC flag for aysnchronous direct I/O requests Darrick J. Wong
2012-11-20 10:07   ` Jan Kara
2012-11-20 20:02     ` Jeff Moyer
2012-11-21  0:56       ` Jan Kara
2012-11-21 14:09         ` Jeff Moyer
2012-11-21 16:54           ` Jan Kara
2012-11-20  7:41 ` [PATCH 3/9] xfs: factor out everything but the filemap_write_and_wait from xfs_file_fsync Darrick J. Wong
2012-11-20 10:47   ` Dave Chinner
2012-11-21 10:09   ` Christoph Hellwig
2012-11-21 14:10     ` Jeff Moyer
2012-11-20  7:51 ` [PATCH 6/9] gfs2: Use generic handlers of O_SYNC AIO DIO Darrick J. Wong
2012-11-20  7:51 ` [PATCH 7/9] ocfs2: " Darrick J. Wong
2012-11-21 19:32   ` Joel Becker
2012-11-20  7:51 ` [PATCH 5/9] btrfs: " Darrick J. Wong
2012-11-20  7:51 ` [PATCH 4/9] xfs: honor the O_SYNC flag for aysnchronous direct I/O requests Darrick J. Wong
2012-11-20 10:24   ` Jan Kara
2012-11-20 11:20   ` Dave Chinner [this message]
2012-11-20 19:42     ` Jeff Moyer
2012-11-20 20:08       ` Dave Chinner
2012-11-20  7:51 ` [PATCH 9/9] blkdev: Fix up AIO+DIO+O_SYNC to do the sync part correctly Darrick J. Wong
2012-11-20 10:15   ` Jan Kara
2012-11-20 20:47     ` Jeff Moyer
2012-11-21  0:57       ` Jan Kara
2012-11-20  7:51 ` [PATCH 8/9] filemap: don't call generic_write_sync for -EIOCBQUEUED Darrick J. Wong
2012-11-20  8:38 ` [PATCH v1 0/9] fs: fix up AIO+DIO+O_SYNC to actually do the sync part Darrick J. Wong
2012-11-20 14:23 ` Jeff Moyer
2012-11-20 18:57   ` Darrick J. Wong
2012-11-20 19:05     ` Jeff Moyer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121120112038.GC2591@dastard \
    --to=david@fromorbit.com \
    --cc=axboe@kernel.dk \
    --cc=bpm@sgi.com \
    --cc=darrick.wong@oracle.com \
    --cc=djwong+kernel@djwong.org \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=jmoyer@redhat.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    --cc=xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).