Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jan Kara <jack@suse.cz>
To: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Cc: tytso@mit.edu, jack@suse.cz, adilger.kernel@dilger.ca,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	david@fromorbit.com, hch@infradead.org, darrick.wong@oracle.com
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure
Date: Mon, 23 Sep 2019 23:10:11 +0200	[thread overview]
Message-ID: <20190923211011.GH20367@quack2.suse.cz> (raw)
In-Reply-To: <db33705f9ba35ccbe20fc19b8ecbbf2078beff08.1568282664.git.mbobrowski@mbobrowski.org>

I'll try to comment just on top of refactoring Christoph has suggested...

On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> @@ -310,6 +341,120 @@ static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
>  	return 0;
>  }
>  
> +/*
> + * For a write that extends the inode size, ext4_dio_write_iter() will
> + * wait for the write to complete. Consequently, operations performed
> + * within this function are still covered by the inode_lock(). On
> + * success, this function returns 0.
> + */
> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
> +				 unsigned int flags)
> +{
> +	int ret;
> +	loff_t offset = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	if (error) {
> +		ret = ext4_handle_failed_inode_extension(inode, offset + size);
> +		return ret ? ret : error;
> +	}
> +
> +	if (flags & IOMAP_DIO_UNWRITTEN) {
> +		ret = ext4_convert_unwritten_extents(NULL, inode,
> +						     offset, size);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (offset + size > i_size_read(inode)) {
> +		ret = ext4_handle_inode_extension(inode, offset, size, 0);
> +		if (ret)
> +			return ret;
> +	}

With the suggestions I made to your patch 3/6 this could be simplified to:

	if (!error && flags & IOMAP_DIO_UNWRITTEN) {
		error = ext4_convert_unwritten_extents(NULL, inode, offset,
						       size);
	}
	return ext4_handle_inode_extension(inode, offset, error ? : size, size);


Note the change that when ext4_convert_unwritten_extents() fails (although
this should not really happen unless there's some corruption going on), we
do properly truncate possible extents beyond i_size.

> +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	ssize_t ret;
> +	size_t count;
> +	loff_t offset = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	bool extend = false, overwrite = false, unaligned_aio = false;
> +
> +	if (!inode_trylock(inode)) {
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			return -EAGAIN;
> +		inode_lock(inode);
> +	}
> +
> +	if (!ext4_dio_checks(inode)) {
> +		inode_unlock(inode);
> +		/*
> +		 * Fallback to buffered IO if the operation on the
> +		 * inode is not supported by direct IO.
> +		 */
> +		return ext4_buffered_write_iter(iocb, from);
> +	}
> +
> +	ret = ext4_write_checks(iocb, from);
> +	if (ret <= 0) {
> +		inode_unlock(inode);
> +		return ret;
> +	}
> +
> +	/*
> +	 * Unaligned direct AIO must be serialized among each other as
> +	 * the zeroing of partial blocks of two competing unaligned
> +	 * AIOs can result in data corruption.
> +	 */
> +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> +	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> +		unaligned_aio = true;
> +		inode_dio_wait(inode);
> +	}
> +
> +	/*
> +	 * Determine whether the IO operation will overwrite allocated
> +	 * and initialized blocks. If so, check to see whether it is
> +	 * possible to take the dioread_nolock path.
> +	 */
> +	count = iov_iter_count(from);
> +	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> +	    ext4_should_dioread_nolock(inode)) {
> +		overwrite = true;
> +		downgrade_write(&inode->i_rwsem);
> +	}
> +
> +	if (offset + count > i_size_read(inode) ||
> +	    offset + count > EXT4_I(inode)->i_disksize) {
> +		ext4_update_i_disksize(inode, inode->i_size);
> +		extend = true;
> +	}

This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
you need to also have transaction started and call ext4_mark_inode_dirty()
to actually journal the change of i_disksize (ext4_update_i_disksize()
updates only the in-memory copy of the entry). Also the direct IO code
needs to add the inode to the orphan list so that in case of crash, blocks
allocated beyond EOF get truncated on next mount. That is the whole point
of this excercise with i_disksize after all.

But I'm wondering if i_disksize update is needed. Truncate cannot be in
progress (we hold i_rwsem) and dirty pages will be flushed by
iomap_dio_rw() before we start to allocate any blocks. So it should be
enough to have here:

	if (offset + count > i_size_read(inode)) {
		/*
		 * Add inode to orphan list so that blocks allocated beyond
		 * EOF get properly truncated in case of crash.
		 */
		start transaction handle
		add inode to orphan list
		stop transaction handle
	}

And just leave i_disksize at whatever it currently is.

> +
> +	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
> +
> +	/*
> +	 * Unaligned direct AIO must be the only IO in flight or else
> +	 * any overlapping aligned IO after unaligned IO might result
> +	 * in data corruption. We also need to wait here in the case
> +	 * where the inode is being extended so that inode extension
> +	 * routines in ext4_dio_write_end_io() are covered by the
> +	 * inode_lock().
> +	 */
> +	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> +		inode_dio_wait(inode);
> +
> +	if (overwrite)
> +		inode_unlock_shared(inode);
> +	else
> +		inode_unlock(inode);
> +
> +	if (ret >= 0 && iov_iter_count(from))
> +		return ext4_buffered_write_iter(iocb, from);
> +	return ret;
> +}
> +
...
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index efb184928e51..f52ad3065236 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3513,11 +3513,13 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  			}
>  		}
>  	} else if (flags & IOMAP_WRITE) {
> -		int dio_credits;
>  		handle_t *handle;
> -		int retries = 0;
> +		int dio_credits, retries = 0, m_flags = 0;
>  
> -		/* Trim mapping request to maximum we can map at once for DIO */
> +		/*
> +		 * Trim mapping request to maximum we can map at once
> +		 * for DIO.
> +		 */
>  		if (map.m_len > DIO_MAX_BLOCKS)
>  			map.m_len = DIO_MAX_BLOCKS;
>  		dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
> @@ -3533,8 +3535,30 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  		if (IS_ERR(handle))
>  			return PTR_ERR(handle);
>  
> -		ret = ext4_map_blocks(handle, inode, &map,
> -				      EXT4_GET_BLOCKS_CREATE_ZERO);
> +		/*
> +		 * DAX and direct IO are the only two operations that
> +		 * are currently supported with IOMAP_WRITE.
> +		 */
> +		WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
> +		if (IS_DAX(inode))
> +			m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> +		else if (round_down(offset, i_blocksize(inode)) >=
> +			 i_size_read(inode))
> +			m_flags = EXT4_GET_BLOCKS_CREATE;
> +		else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> +			m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> +
> +		ret = ext4_map_blocks(handle, inode, &map, m_flags);
> +
> +		/*
> +		 * We cannot fill holes in indirect tree based inodes
> +		 * as that could expose stale data in the case of a
> +		 * crash. Use the magic error code to fallback to
> +		 * buffered IO.
> +		 */
> +		if (!m_flags && !ret)
> +			ret = -ENOTBLK;
> +
>  		if (ret < 0) {
>  			ext4_journal_stop(handle);
>  			if (ret == -ENOSPC &&
> @@ -3544,13 +3568,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  		}
>  
>  		/*
> -		 * If we added blocks beyond i_size, we need to make sure they
> -		 * will get truncated if we crash before updating i_size in
> -		 * ext4_iomap_end(). For faults we don't need to do that (and
> -		 * even cannot because for orphan list operations inode_lock is
> -		 * required) - if we happen to instantiate block beyond i_size,
> -		 * it is because we race with truncate which has already added
> -		 * the inode to the orphan list.
> +		 * If we added blocks beyond i_size, we need to make
> +		 * sure they will get truncated if we crash before
> +		 * updating the i_size. For faults we don't need to do
> +		 * that (and even cannot because for orphan list
> +		 * operations inode_lock is required) - if we happen
> +		 * to instantiate block beyond i_size, it is because
> +		 * we race with truncate which has already added the
> +		 * inode to the orphan list.
>  		 */

Just a nit but it would be nice to use full width of 80 columns when
formatting comments so that they don't get unnecessarily long.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

next prev parent reply	other threads:[~2019-09-23 21:10 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-12 11:03 [PATCH v3 0/6] ext4: port direct IO to iomap infrastructure Matthew Bobrowski
2019-09-12 11:03 ` [PATCH v3 1/6] ext4: introduce direct IO read path using " Matthew Bobrowski
2019-09-16 12:00   ` Christoph Hellwig
2019-09-16 13:07     ` Matthew Bobrowski
2019-09-12 11:04 ` [PATCH v3 2/6] ext4: move inode extension/truncate code out from ext4_iomap_end() Matthew Bobrowski
2019-09-23 16:21   ` Jan Kara
2019-09-24  9:50     ` Matthew Bobrowski
2019-09-24 13:13     ` Jan Kara
2019-09-12 11:04 ` [PATCH v3 3/6] iomap: split size and error for iomap_dio_rw ->end_io Matthew Bobrowski
2019-09-12 11:04 ` [PATCH v3 4/6] ext4: reorder map.m_flags checks in ext4_iomap_begin() Matthew Bobrowski
2019-09-16 12:05   ` Christoph Hellwig
2019-09-17 12:48     ` Matthew Bobrowski
2019-09-23 15:08   ` Jan Kara
2019-09-24  9:35     ` Matthew Bobrowski
2019-09-12 11:04 ` [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure Matthew Bobrowski
2019-09-16  4:37   ` Ritesh Harjani
2019-09-16 10:14     ` Matthew Bobrowski
2019-09-16 12:12   ` Christoph Hellwig
2019-09-16 22:37     ` Matthew Bobrowski
2019-09-17  9:00       ` Ritesh Harjani
2019-09-17  9:02         ` Christoph Hellwig
2019-09-17 10:12           ` Ritesh Harjani
2019-09-17 12:39           ` Matthew Bobrowski
2019-09-24 10:57         ` Jan Kara
2019-09-17  9:06       ` Christoph Hellwig
2019-09-17 11:31         ` Matthew Bobrowski
2019-09-20 13:24         ` Matthew Bobrowski
2019-09-23 21:10   ` Jan Kara [this message]
2019-09-24 10:29     ` Matthew Bobrowski
2019-09-24 14:13       ` Jan Kara
2019-09-25  7:14         ` Matthew Bobrowski
2019-09-25  8:40           ` Jan Kara
2019-09-12 11:05 ` [PATCH v3 6/6] ext4: cleanup legacy buffer_head direct IO code Matthew Bobrowski
2019-09-16 12:06   ` Christoph Hellwig
2019-09-16 12:53     ` Matthew Bobrowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190923211011.GH20367@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=adilger.kernel@dilger.ca \
    --cc=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=mbobrowski@mbobrowski.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).