From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Long Li <leo.lilong@huawei.com>,
brauner@kernel.org, djwong@kernel.org, cem@kernel.org,
linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
yi.zhang@huawei.com, houtao1@huawei.com, yangerkun@huawei.com
Subject: Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
Date: Tue, 3 Dec 2024 09:54:41 -0500 [thread overview]
Message-ID: <Z08bsQ07cilOsUKi@bfoster> (raw)
In-Reply-To: <Z05oJqT7983ifKqv@dread.disaster.area>
On Tue, Dec 03, 2024 at 01:08:38PM +1100, Dave Chinner wrote:
> On Mon, Dec 02, 2024 at 10:26:14AM -0500, Brian Foster wrote:
> > On Sat, Nov 30, 2024 at 09:39:29PM +0800, Long Li wrote:
> > > When performing fsstress test with this patch set, there is a very low probability of
> > > encountering an issue where isize is less than ioend->io_offset in iomap_add_to_ioend.
> > > After investigation, this was found to be caused by concurrent with truncate operations.
> > > Consider a scenario with 4K block size and a file size of 12K.
> > >
> > > //write back [8K, 12K] //truncate file to 4K
> > > ---------------------- ----------------------
> > > iomap_writepage_map xfs_setattr_size
>
> folio is locked here
>
> > > iomap_writepage_handle_eof
> > > truncate_setsize
> > > i_size_write(inode, newsize) //update inode size to 4K
>
...
> > >
> > > It appears that in extreme cases, folios beyond EOF might be written back,
> > > resulting in situations where isize is less than pos. In such cases,
> > > maybe we should not trim the io_size further.
> > >
> >
> > Hmm.. it might be wise to characterize this further to determine whether
> > there are potentially larger problems to address before committing to
> > anything. For example, assuming truncate acquires ilock and does
> > xfs_itruncate_extents() and whatnot before this ioend submits/completes,
>
> I don't think xfs_itruncate_extents() is the concern here - that
> happens after the page cache and writeback has been sorted out and
> the ILOCK has been taken and the page cache state should
> have already been sorted out. truncate_setsize() does that for us;
> it guarantees that all writeback in the truncate down range has
> been completed and the page cache invalidated.
>
Ah this is what I was missing on previous read through.
truncate_inode_pages_range() waits on writeback in its second pass. I
was initially confused by what would have prevented removing a writeback
folio in the first pass, but that is handled a level down in
find_lock_entries() where it skips locked||writeback folios and thus
leaves them for the second pass.
> We hold the MMAP_LOCK (filemap_invalidate_lock()) so no new pages
> can be instantiated over the range whilst we are running
> xfs_itruncate_extents(). hence once truncate_setsize() returns, we
> are guaranteed that there will be no IO in progress or can be
> started over the range we are removing.
>
> Really, the issue is that writeback mappings have to be able to
> handle the range being mapped suddenly appear to be beyond EOF.
> This behaviour is a longstanding writeback constraint, and is what
> iomap_writepage_handle_eof() is attempting to handle.
>
> We handle this by only sampling i_size_read() whilst we have the
> folio locked and can determine the action we should take with that
> folio (i.e. nothing, partial zeroing, or skip altogether). Once
> we've made the decision that the folio is within EOF and taken
> action on it (i.e. moved the folio to writeback state), we cannot
> then resample the inode size because a truncate may have started
> and changed the inode size.
>
> We have to complete the mapping of the folio to disk blocks - the
> disk block mapping is guaranteed to be valid for the life of the IO
> because the folio is locked and under writeback - and submit the IO
> so that truncate_pagecache() will unblock and invalidate the folio
> when the IO completes.
>
> Hence writeback vs truncate serialisation is really dependent on
> only sampling the inode size -once- whilst the dirty folio we are
> writing back is locked.
>
Not sure I see how this is a serialization dependency given that
writeback completion also samples i_size. But no matter, it seems a
reasonable implementation to me to make the submission path consistent
in handling eof.
I wonder if this could just use end_pos returned from
iomap_writepage_handle_eof()?
Brian
> I suspect that we can store and pass the sampled inode size through
> the block mapping and ioend management code so it is constant for
> the entire folio IO submission process, but whether we can do that
> and still fix the orginal issue that we are trying to fix is not
> something I've considered at this point....
>
> > does anything in that submission or completion path detect and handle
> > this scenario gracefully? What if the ioend happens to be unwritten
> > post-eof preallocation and completion wants to convert blocks that might
> > no longer exist in the file..?
>
> That can't happen because writeback must complete before
> truncate_setsize() will be allowed to remove the pages from the
> cache before xfs_itruncate_extents() can run.
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
next prev parent reply other threads:[~2024-12-03 14:53 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-27 6:35 [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Long Li
2024-11-27 6:35 ` [PATCH v5 2/2] xfs: clean up xfs_end_ioend() to reuse local variables Long Li
2024-11-27 16:07 ` Darrick J. Wong
2024-11-27 16:28 ` [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Darrick J. Wong
2024-11-28 6:38 ` Long Li
2024-11-28 3:33 ` Christoph Hellwig
2024-11-30 13:39 ` Long Li
2024-12-02 15:26 ` Brian Foster
2024-12-03 2:08 ` Dave Chinner
2024-12-03 14:54 ` Brian Foster [this message]
2024-12-03 21:12 ` Dave Chinner
2024-12-04 12:05 ` Brian Foster
2024-12-04 9:06 ` Long Li
2024-12-04 12:05 ` Brian Foster
2024-12-06 3:36 ` Long Li
2024-12-04 9:00 ` Long Li
2024-12-04 12:17 ` Brian Foster
2024-12-05 12:47 ` Long Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z08bsQ07cilOsUKi@bfoster \
--to=bfoster@redhat.com \
--cc=brauner@kernel.org \
--cc=cem@kernel.org \
--cc=david@fromorbit.com \
--cc=djwong@kernel.org \
--cc=houtao1@huawei.com \
--cc=leo.lilong@huawei.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=yangerkun@huawei.com \
--cc=yi.zhang@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.