From: Matthew Bobrowski <mbobrowski@mbobrowski.org>
To: Jan Kara <jack@suse.cz>
Cc: tytso@mit.edu, adilger.kernel@dilger.ca,
linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
david@fromorbit.com, hch@infradead.org, darrick.wong@oracle.com
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure
Date: Wed, 25 Sep 2019 17:14:29 +1000 [thread overview]
Message-ID: <20190925071429.GA27699@bobrowski> (raw)
In-Reply-To: <20190924141321.GC11819@quack2.suse.cz>
On Tue, Sep 24, 2019 at 04:13:21PM +0200, Jan Kara wrote:
> On Tue 24-09-19 20:29:26, Matthew Bobrowski wrote:
> > On Mon, Sep 23, 2019 at 11:10:11PM +0200, Jan Kara wrote:
> > > On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> > > > + if (offset + count > i_size_read(inode) ||
> > > > + offset + count > EXT4_I(inode)->i_disksize) {
> > > > + ext4_update_i_disksize(inode, inode->i_size);
> > > > + extend = true;
> > > > + }
> > >
> > > This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
> > > you need to also have transaction started and call ext4_mark_inode_dirty()
> > > to actually journal the change of i_disksize (ext4_update_i_disksize()
> > > updates only the in-memory copy of the entry). Also the direct IO code
> > > needs to add the inode to the orphan list so that in case of crash, blocks
> > > allocated beyond EOF get truncated on next mount. That is the whole point
> > > of this excercise with i_disksize after all.
> > >
> > > But I'm wondering if i_disksize update is needed. Truncate cannot be in
> > > progress (we hold i_rwsem) and dirty pages will be flushed by
> > > iomap_dio_rw() before we start to allocate any blocks. So it should be
> > > enough to have here:
> >
> > Well, I initially thought the same, however doing some research shows that we
> > have the following edge case:
> > - 45d8ec4d9fd54
> > and
> > - 73fdad00b208b
> >
> > In fact you can reproduce the exact same i_size corruption issue by running
> > the generic/475 xfstests mutitple times, as articulated within
> > 45d8ec4d9fd54. So with that, I'm kind of confused and thinking that there may
> > be a problem that resides elsewhere that may need addressing?
>
> Right, I forgot about the special case explained in 45d8ec4d9fd54 where
> there's unwritted delalloc write beyond range where DIO write happens.
>
> > > if (offset + count > i_size_read(inode)) {
> > > /*
> > > * Add inode to orphan list so that blocks allocated beyond
> > > * EOF get properly truncated in case of crash.
> > > */
> > > start transaction handle
> > > add inode to orphan list
> > > stop transaction handle
> > > }
> > >
> > > And just leave i_disksize at whatever it currently is.
> >
> > I originally had the code which added the inode to the orphan list here, but
> > then I thought to myself that it'd make more sense to actually do this step
> > closer to the point where we've managed to successfully allocate the required
> > blocks for the write. This prevents the need to spray orphan list clean up
> > code all over the place just to cover the case that a write which had intended
> > to extend the inode beyond i_size had failed prematurely (i.e. before block
> > allocation). So, hence the reason why I thought having it in
> > ext4_iomap_begin() would make more sense, because at that point in the write
> > path, there is enough/or more assurance to make the call around whether we
> > will in fact be able to perform the write which will be extending beyond
> > i_size, or not and consequently whether the inode should be placed onto the
> > orphan list?
> >
> > Ideally I'd like to turn this statement into:
> >
> > if (offset + count > i_size_read(inode))
> > extend = true;
> >
> > Maybe I'm missing something here and there's actually a really good reason for
> > doing this nice and early? What are your thoughts about what I've mentioned
> > above?
>
> Well, the slight trouble with adding inode to orphan list in
> ext4_iomap_begin() is that then it is somewhat difficult to tell whether
> you need to remove it when IO is done because there's no way how to
> propagate that information from ext4_iomap_begin() and checking against
> i_disksize is unreliable because it can change (due to writeback of
> delalloc pages) while direct IO is running. But I think we can overcome
> that by splitting our end_io functions to two - ext4_dio_write_end_io() and
> ext4_dio_extend_write_end_io(). So:
>
> WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
> /*
> * Need to check against i_disksize as there may be dellalloc writes
> * pending.
> */
> if (offset + count > EXT4_I(inode)->i_disksize)
> extend = true;
Hm... I'm not entirely convinced that EXT4_I(inode)->i_disksize is what we
should be using to determine whether an extension will be performed or
not? Because, my understanding is that i_size is what holds the actual value
of what the file size is expected to be and hence the reason why we previously
updated the i_disksize to i_size using ext4_update_i_disksize().
Also, there are cases where offset + count > EXT4_I(inode)->i_disksize,
however offset + count < i_size_read(inode). So in that case we may take an
incorrect path somewhere i.e. below where extend clause is true. Also, I feel
as though we should try stick to using one value as the reference to determine
whether we're performing an extension and not use i_disksize here and then
i_size over there kind of thing as that leads to unnecessary confusion?
> ...
> iomap_dio_rw(...,
> extend ? ext4_dio_extend_write_end_io : ext4_dio_write_end_io);
>
> and ext4_dio_write_end_io() will just take care of conversion of unwritten
> extents on successful IO completion, while ext4_dio_extend_write_end_io()
> will take care of all the complex stuff with orphan handling, extension
> of inode size, and truncation of blocks beyond EOF - and it can do that
> because it is guaranteed to run under the protection of i_rwsem held in
> ext4_dio_write_iter().
>
> Alternatively, we could also just pass NULL instead of
> ext4_dio_extend_write_end_io() and just do all the work explicitely in
> ext4_dio_write_iter() in the 'extend' case. That might be actually the most
> transparent option...
Well, with the changes to ext4_handle_inode_extension() conditions that you
recommended in patch 2/6, then I can't see why we'd need two separate
->end_io() handlers as we'd just abort early if we're not extending?
> But at this point there are so many suggestions in flight that I need to
> see current state of the code again to be able to tell anything useful :).
Heh, true. I will post through an updated patch series taking into account
most of the recommendations put forward for this series version and then we
can have a discussion based on that. :)
--<M>--
next prev parent reply other threads:[~2019-09-25 7:14 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-09-12 11:03 [PATCH v3 0/6] ext4: port direct IO to iomap infrastructure Matthew Bobrowski
2019-09-12 11:03 ` [PATCH v3 1/6] ext4: introduce direct IO read path using " Matthew Bobrowski
2019-09-16 12:00 ` Christoph Hellwig
2019-09-16 13:07 ` Matthew Bobrowski
2019-09-12 11:04 ` [PATCH v3 2/6] ext4: move inode extension/truncate code out from ext4_iomap_end() Matthew Bobrowski
2019-09-23 16:21 ` Jan Kara
2019-09-24 9:50 ` Matthew Bobrowski
2019-09-24 13:13 ` Jan Kara
2019-09-12 11:04 ` [PATCH v3 3/6] iomap: split size and error for iomap_dio_rw ->end_io Matthew Bobrowski
2019-09-12 11:04 ` [PATCH v3 4/6] ext4: reorder map.m_flags checks in ext4_iomap_begin() Matthew Bobrowski
2019-09-16 12:05 ` Christoph Hellwig
2019-09-17 12:48 ` Matthew Bobrowski
2019-09-23 15:08 ` Jan Kara
2019-09-24 9:35 ` Matthew Bobrowski
2019-09-12 11:04 ` [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure Matthew Bobrowski
2019-09-16 4:37 ` Ritesh Harjani
2019-09-16 10:14 ` Matthew Bobrowski
2019-09-16 12:12 ` Christoph Hellwig
2019-09-16 22:37 ` Matthew Bobrowski
2019-09-17 9:00 ` Ritesh Harjani
2019-09-17 9:02 ` Christoph Hellwig
2019-09-17 10:12 ` Ritesh Harjani
2019-09-17 12:39 ` Matthew Bobrowski
2019-09-24 10:57 ` Jan Kara
2019-09-17 9:06 ` Christoph Hellwig
2019-09-17 11:31 ` Matthew Bobrowski
2019-09-20 13:24 ` Matthew Bobrowski
2019-09-23 21:10 ` Jan Kara
2019-09-24 10:29 ` Matthew Bobrowski
2019-09-24 14:13 ` Jan Kara
2019-09-25 7:14 ` Matthew Bobrowski [this message]
2019-09-25 8:40 ` Jan Kara
2019-09-12 11:05 ` [PATCH v3 6/6] ext4: cleanup legacy buffer_head direct IO code Matthew Bobrowski
2019-09-16 12:06 ` Christoph Hellwig
2019-09-16 12:53 ` Matthew Bobrowski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190925071429.GA27699@bobrowski \
--to=mbobrowski@mbobrowski.org \
--cc=adilger.kernel@dilger.ca \
--cc=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.