linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Cc: Jan Kara <jack@suse.cz>,
	tytso@mit.edu, adilger.kernel@dilger.ca,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	david@fromorbit.com, hch@infradead.org, darrick.wong@oracle.com
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure
Date: Tue, 24 Sep 2019 16:13:21 +0200	[thread overview]
Message-ID: <20190924141321.GC11819@quack2.suse.cz> (raw)
In-Reply-To: <20190924102926.GC17526@bobrowski>

On Tue 24-09-19 20:29:26, Matthew Bobrowski wrote:
> On Mon, Sep 23, 2019 at 11:10:11PM +0200, Jan Kara wrote:
> > On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> > > +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > > +{
> > > +	ssize_t ret;
> > > +	size_t count;
> > > +	loff_t offset = iocb->ki_pos;
> > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > > +	bool extend = false, overwrite = false, unaligned_aio = false;
> > > +
> > > +	if (!inode_trylock(inode)) {
> > > +		if (iocb->ki_flags & IOCB_NOWAIT)
> > > +			return -EAGAIN;
> > > +		inode_lock(inode);
> > > +	}
> > > +
> > > +	if (!ext4_dio_checks(inode)) {
> > > +		inode_unlock(inode);
> > > +		/*
> > > +		 * Fallback to buffered IO if the operation on the
> > > +		 * inode is not supported by direct IO.
> > > +		 */
> > > +		return ext4_buffered_write_iter(iocb, from);
> > > +	}
> > > +
> > > +	ret = ext4_write_checks(iocb, from);
> > > +	if (ret <= 0) {
> > > +		inode_unlock(inode);
> > > +		return ret;
> > > +	}
> > > +
> > > +	/*
> > > +	 * Unaligned direct AIO must be serialized among each other as
> > > +	 * the zeroing of partial blocks of two competing unaligned
> > > +	 * AIOs can result in data corruption.
> > > +	 */
> > > +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> > > +	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> > > +		unaligned_aio = true;
> > > +		inode_dio_wait(inode);
> > > +	}
> > > +
> > > +	/*
> > > +	 * Determine whether the IO operation will overwrite allocated
> > > +	 * and initialized blocks. If so, check to see whether it is
> > > +	 * possible to take the dioread_nolock path.
> > > +	 */
> > > +	count = iov_iter_count(from);
> > > +	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> > > +	    ext4_should_dioread_nolock(inode)) {
> > > +		overwrite = true;
> > > +		downgrade_write(&inode->i_rwsem);
> > > +	}
> > > +
> > > +	if (offset + count > i_size_read(inode) ||
> > > +	    offset + count > EXT4_I(inode)->i_disksize) {
> > > +		ext4_update_i_disksize(inode, inode->i_size);
> > > +		extend = true;
> > > +	}
> > 
> > This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
> > you need to also have transaction started and call ext4_mark_inode_dirty()
> > to actually journal the change of i_disksize (ext4_update_i_disksize()
> > updates only the in-memory copy of the entry). Also the direct IO code
> > needs to add the inode to the orphan list so that in case of crash, blocks
> > allocated beyond EOF get truncated on next mount. That is the whole point
> > of this excercise with i_disksize after all.
> > 
> > But I'm wondering if i_disksize update is needed. Truncate cannot be in
> > progress (we hold i_rwsem) and dirty pages will be flushed by
> > iomap_dio_rw() before we start to allocate any blocks. So it should be
> > enough to have here:
> 
> Well, I initially thought the same, however doing some research shows that we
> have the following edge case:
>      - 45d8ec4d9fd54
>      and
>      - 73fdad00b208b
> 
> In fact you can reproduce the exact same i_size corruption issue by running
> the generic/475 xfstests mutitple times, as articulated within
> 45d8ec4d9fd54. So with that, I'm kind of confused and thinking that there may
> be a problem that resides elsewhere that may need addressing?

Right, I forgot about the special case explained in 45d8ec4d9fd54 where
there's unwritted delalloc write beyond range where DIO write happens.

> > 	if (offset + count > i_size_read(inode)) {
> > 		/*
> > 		 * Add inode to orphan list so that blocks allocated beyond
> > 		 * EOF get properly truncated in case of crash.
> > 		 */
> > 		start transaction handle
> > 		add inode to orphan list
> > 		stop transaction handle
> > 	}
> > 
> > And just leave i_disksize at whatever it currently is.
> 
> I originally had the code which added the inode to the orphan list here, but
> then I thought to myself that it'd make more sense to actually do this step
> closer to the point where we've managed to successfully allocate the required
> blocks for the write. This prevents the need to spray orphan list clean up
> code all over the place just to cover the case that a write which had intended
> to extend the inode beyond i_size had failed prematurely (i.e. before block
> allocation). So, hence the reason why I thought having it in
> ext4_iomap_begin() would make more sense, because at that point in the write
> path, there is enough/or more assurance to make the call around whether we
> will in fact be able to perform the write which will be extending beyond
> i_size, or not and consequently whether the inode should be placed onto the
> orphan list?
> 
> Ideally I'd like to turn this statement into:
> 
> 	if (offset + count > i_size_read(inode))
> 	        extend = true;
> 
> Maybe I'm missing something here and there's actually a really good reason for
> doing this nice and early? What are your thoughts about what I've mentioned
> above?

Well, the slight trouble with adding inode to orphan list in
ext4_iomap_begin() is that then it is somewhat difficult to tell whether
you need to remove it when IO is done because there's no way how to
propagate that information from ext4_iomap_begin() and checking against
i_disksize is unreliable because it can change (due to writeback of
delalloc pages) while direct IO is running. But I think we can overcome
that by splitting our end_io functions to two - ext4_dio_write_end_io() and
ext4_dio_extend_write_end_io(). So:

	WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
	/*
	 * Need to check against i_disksize as there may be dellalloc writes
	 * pending.
	 */
 	if (offset + count > EXT4_I(inode)->i_disksize)
		extend = true;

	...
	iomap_dio_rw(...,
		extend ? ext4_dio_extend_write_end_io : ext4_dio_write_end_io);

and ext4_dio_write_end_io() will just take care of conversion of unwritten
extents on successful IO completion, while ext4_dio_extend_write_end_io()
will take care of all the complex stuff with orphan handling, extension
of inode size, and truncation of blocks beyond EOF - and it can do that
because it is guaranteed to run under the protection of i_rwsem held in
ext4_dio_write_iter().

Alternatively, we could also just pass NULL instead of
ext4_dio_extend_write_end_io() and just do all the work explicitely in
ext4_dio_write_iter() in the 'extend' case. That might be actually the most
transparent option...

But at this point there are so many suggestions in flight that I need to
see current state of the code again to be able to tell anything useful :).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

  reply	other threads:[~2019-09-24 14:13 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-12 11:03 [PATCH v3 0/6] ext4: port direct IO to iomap infrastructure Matthew Bobrowski
2019-09-12 11:03 ` [PATCH v3 1/6] ext4: introduce direct IO read path using " Matthew Bobrowski
2019-09-16 12:00   ` Christoph Hellwig
2019-09-16 13:07     ` Matthew Bobrowski
2019-09-12 11:04 ` [PATCH v3 2/6] ext4: move inode extension/truncate code out from ext4_iomap_end() Matthew Bobrowski
2019-09-23 16:21   ` Jan Kara
2019-09-24  9:50     ` Matthew Bobrowski
2019-09-24 13:13     ` Jan Kara
2019-09-12 11:04 ` [PATCH v3 3/6] iomap: split size and error for iomap_dio_rw ->end_io Matthew Bobrowski
2019-09-12 11:04 ` [PATCH v3 4/6] ext4: reorder map.m_flags checks in ext4_iomap_begin() Matthew Bobrowski
2019-09-16 12:05   ` Christoph Hellwig
2019-09-17 12:48     ` Matthew Bobrowski
2019-09-23 15:08   ` Jan Kara
2019-09-24  9:35     ` Matthew Bobrowski
2019-09-12 11:04 ` [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure Matthew Bobrowski
2019-09-16  4:37   ` Ritesh Harjani
2019-09-16 10:14     ` Matthew Bobrowski
2019-09-16 12:12   ` Christoph Hellwig
2019-09-16 22:37     ` Matthew Bobrowski
2019-09-17  9:00       ` Ritesh Harjani
2019-09-17  9:02         ` Christoph Hellwig
2019-09-17 10:12           ` Ritesh Harjani
2019-09-17 12:39           ` Matthew Bobrowski
2019-09-24 10:57         ` Jan Kara
2019-09-17  9:06       ` Christoph Hellwig
2019-09-17 11:31         ` Matthew Bobrowski
2019-09-20 13:24         ` Matthew Bobrowski
2019-09-23 21:10   ` Jan Kara
2019-09-24 10:29     ` Matthew Bobrowski
2019-09-24 14:13       ` Jan Kara [this message]
2019-09-25  7:14         ` Matthew Bobrowski
2019-09-25  8:40           ` Jan Kara
2019-09-12 11:05 ` [PATCH v3 6/6] ext4: cleanup legacy buffer_head direct IO code Matthew Bobrowski
2019-09-16 12:06   ` Christoph Hellwig
2019-09-16 12:53     ` Matthew Bobrowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190924141321.GC11819@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=adilger.kernel@dilger.ca \
    --cc=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=mbobrowski@mbobrowski.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).