From: Jan Kara <jack@suse.cz>
To: Matthew Bobrowski <mbobrowski@mbobrowski.org>
Cc: Jan Kara <jack@suse.cz>,
tytso@mit.edu, adilger.kernel@dilger.ca,
linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
david@fromorbit.com, hch@infradead.org, darrick.wong@oracle.com
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure
Date: Wed, 25 Sep 2019 10:40:31 +0200 [thread overview]
Message-ID: <20190925084031.GA23277@quack2.suse.cz> (raw)
In-Reply-To: <20190925071429.GA27699@bobrowski>
On Wed 25-09-19 17:14:29, Matthew Bobrowski wrote:
> On Tue, Sep 24, 2019 at 04:13:21PM +0200, Jan Kara wrote:
> > On Tue 24-09-19 20:29:26, Matthew Bobrowski wrote:
> > > On Mon, Sep 23, 2019 at 11:10:11PM +0200, Jan Kara wrote:
> > > > On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> > > > > + if (offset + count > i_size_read(inode) ||
> > > > > + offset + count > EXT4_I(inode)->i_disksize) {
> > > > > + ext4_update_i_disksize(inode, inode->i_size);
> > > > > + extend = true;
> > > > > + }
> > > >
> > > > This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
> > > > you need to also have transaction started and call ext4_mark_inode_dirty()
> > > > to actually journal the change of i_disksize (ext4_update_i_disksize()
> > > > updates only the in-memory copy of the entry). Also the direct IO code
> > > > needs to add the inode to the orphan list so that in case of crash, blocks
> > > > allocated beyond EOF get truncated on next mount. That is the whole point
> > > > of this excercise with i_disksize after all.
> > > >
> > > > But I'm wondering if i_disksize update is needed. Truncate cannot be in
> > > > progress (we hold i_rwsem) and dirty pages will be flushed by
> > > > iomap_dio_rw() before we start to allocate any blocks. So it should be
> > > > enough to have here:
> > >
> > > Well, I initially thought the same, however doing some research shows that we
> > > have the following edge case:
> > > - 45d8ec4d9fd54
> > > and
> > > - 73fdad00b208b
> > >
> > > In fact you can reproduce the exact same i_size corruption issue by running
> > > the generic/475 xfstests mutitple times, as articulated within
> > > 45d8ec4d9fd54. So with that, I'm kind of confused and thinking that there may
> > > be a problem that resides elsewhere that may need addressing?
> >
> > Right, I forgot about the special case explained in 45d8ec4d9fd54 where
> > there's unwritted delalloc write beyond range where DIO write happens.
> >
> > > > if (offset + count > i_size_read(inode)) {
> > > > /*
> > > > * Add inode to orphan list so that blocks allocated beyond
> > > > * EOF get properly truncated in case of crash.
> > > > */
> > > > start transaction handle
> > > > add inode to orphan list
> > > > stop transaction handle
> > > > }
> > > >
> > > > And just leave i_disksize at whatever it currently is.
> > >
> > > I originally had the code which added the inode to the orphan list here, but
> > > then I thought to myself that it'd make more sense to actually do this step
> > > closer to the point where we've managed to successfully allocate the required
> > > blocks for the write. This prevents the need to spray orphan list clean up
> > > code all over the place just to cover the case that a write which had intended
> > > to extend the inode beyond i_size had failed prematurely (i.e. before block
> > > allocation). So, hence the reason why I thought having it in
> > > ext4_iomap_begin() would make more sense, because at that point in the write
> > > path, there is enough/or more assurance to make the call around whether we
> > > will in fact be able to perform the write which will be extending beyond
> > > i_size, or not and consequently whether the inode should be placed onto the
> > > orphan list?
> > >
> > > Ideally I'd like to turn this statement into:
> > >
> > > if (offset + count > i_size_read(inode))
> > > extend = true;
> > >
> > > Maybe I'm missing something here and there's actually a really good reason for
> > > doing this nice and early? What are your thoughts about what I've mentioned
> > > above?
> >
> > Well, the slight trouble with adding inode to orphan list in
> > ext4_iomap_begin() is that then it is somewhat difficult to tell whether
> > you need to remove it when IO is done because there's no way how to
> > propagate that information from ext4_iomap_begin() and checking against
> > i_disksize is unreliable because it can change (due to writeback of
> > delalloc pages) while direct IO is running. But I think we can overcome
> > that by splitting our end_io functions to two - ext4_dio_write_end_io() and
> > ext4_dio_extend_write_end_io(). So:
> >
> > WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
> > /*
> > * Need to check against i_disksize as there may be dellalloc writes
> > * pending.
> > */
> > if (offset + count > EXT4_I(inode)->i_disksize)
> > extend = true;
>
> Hm... I'm not entirely convinced that EXT4_I(inode)->i_disksize is what we
> should be using to determine whether an extension will be performed or
> not? Because, my understanding is that i_size is what holds the actual value
> of what the file size is expected to be and hence the reason why we previously
> updated the i_disksize to i_size using ext4_update_i_disksize().
So i_size is how inode size actually appears to user. i_disksize is what
inode size really is on disk. And because of orphan handling and similar
stuff requiring i_rwsem protection, we need to use the slow path waiting
for DIO to complete whenever the direct IO is beyond the on-disk version of
inode size. Possibly there are other places in DIO path that need to use
i_disksize instead of i_size as well - I'll check that once I see new
version of your patches.
> Also, there are cases where offset + count > EXT4_I(inode)->i_disksize,
> however offset + count < i_size_read(inode). So in that case we may take an
> incorrect path somewhere i.e. below where extend clause is true. Also, I feel
> as though we should try stick to using one value as the reference to determine
> whether we're performing an extension and not use i_disksize here and then
> i_size over there kind of thing as that leads to unnecessary confusion?
Well, when DIO extends past i_disksize, it needs to add inode to orphan
list, call truncate in case of failed write, etc. So extension of
i_disksize is what actually matters. I was speaking in the past about
i_size because I thought i_size == i_disksize for the cases we care for but
as you properly pointed out that isn't necessarily the case.
> > ...
> > iomap_dio_rw(...,
> > extend ? ext4_dio_extend_write_end_io : ext4_dio_write_end_io);
> >
> > and ext4_dio_write_end_io() will just take care of conversion of unwritten
> > extents on successful IO completion, while ext4_dio_extend_write_end_io()
> > will take care of all the complex stuff with orphan handling, extension
> > of inode size, and truncation of blocks beyond EOF - and it can do that
> > because it is guaranteed to run under the protection of i_rwsem held in
> > ext4_dio_write_iter().
> >
> > Alternatively, we could also just pass NULL instead of
> > ext4_dio_extend_write_end_io() and just do all the work explicitely in
> > ext4_dio_write_iter() in the 'extend' case. That might be actually the most
> > transparent option...
>
> Well, with the changes to ext4_handle_inode_extension() conditions that you
> recommended in patch 2/6, then I can't see why we'd need two separate
> ->end_io() handlers as we'd just abort early if we're not extending?
The problem is that the condition I've suggested for patch 2/6 will be
actually racy if we use i_disksize for comparison. Consider the following
situation:
CPU1 CPU2
fd1 = open("file"); fd2 = open("file", O_DIRECT);
/* Delalloc write */
pwrite(fd1, buf, 4096, 16384);
/* O_DIRECT write */
pwrite(fd2, buf, 4096, 4096)
i_disksize == 0 so we have to add
inode to orphan list
submit DIO
writeback happens, i_disksize extended
to 20480.
DIO completes ->
ext4_dio_write_end_io() - sees
big i_disksize so does not
touch orphan list and inode is
wrongly left there.
And we cannot remove inode unconditionally from the orphan list in
ext4_dio_write_end_io() as when DIO starts already below i_disksize,
ext4_dio_write_end_io() may get called without i_rwsem protection.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
next prev parent reply other threads:[~2019-09-25 8:40 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-09-12 11:03 [PATCH v3 0/6] ext4: port direct IO to iomap infrastructure Matthew Bobrowski
2019-09-12 11:03 ` [PATCH v3 1/6] ext4: introduce direct IO read path using " Matthew Bobrowski
2019-09-16 12:00 ` Christoph Hellwig
2019-09-16 13:07 ` Matthew Bobrowski
2019-09-12 11:04 ` [PATCH v3 2/6] ext4: move inode extension/truncate code out from ext4_iomap_end() Matthew Bobrowski
2019-09-23 16:21 ` Jan Kara
2019-09-24 9:50 ` Matthew Bobrowski
2019-09-24 13:13 ` Jan Kara
2019-09-12 11:04 ` [PATCH v3 3/6] iomap: split size and error for iomap_dio_rw ->end_io Matthew Bobrowski
2019-09-12 11:04 ` [PATCH v3 4/6] ext4: reorder map.m_flags checks in ext4_iomap_begin() Matthew Bobrowski
2019-09-16 12:05 ` Christoph Hellwig
2019-09-17 12:48 ` Matthew Bobrowski
2019-09-23 15:08 ` Jan Kara
2019-09-24 9:35 ` Matthew Bobrowski
2019-09-12 11:04 ` [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure Matthew Bobrowski
2019-09-16 4:37 ` Ritesh Harjani
2019-09-16 10:14 ` Matthew Bobrowski
2019-09-16 12:12 ` Christoph Hellwig
2019-09-16 22:37 ` Matthew Bobrowski
2019-09-17 9:00 ` Ritesh Harjani
2019-09-17 9:02 ` Christoph Hellwig
2019-09-17 10:12 ` Ritesh Harjani
2019-09-17 12:39 ` Matthew Bobrowski
2019-09-24 10:57 ` Jan Kara
2019-09-17 9:06 ` Christoph Hellwig
2019-09-17 11:31 ` Matthew Bobrowski
2019-09-20 13:24 ` Matthew Bobrowski
2019-09-23 21:10 ` Jan Kara
2019-09-24 10:29 ` Matthew Bobrowski
2019-09-24 14:13 ` Jan Kara
2019-09-25 7:14 ` Matthew Bobrowski
2019-09-25 8:40 ` Jan Kara [this message]
2019-09-12 11:05 ` [PATCH v3 6/6] ext4: cleanup legacy buffer_head direct IO code Matthew Bobrowski
2019-09-16 12:06 ` Christoph Hellwig
2019-09-16 12:53 ` Matthew Bobrowski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190925084031.GA23277@quack2.suse.cz \
--to=jack@suse.cz \
--cc=adilger.kernel@dilger.ca \
--cc=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=hch@infradead.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=mbobrowski@mbobrowski.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).