public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Long Li <leo.lilong@huawei.com>,
	brauner@kernel.org, djwong@kernel.org, cem@kernel.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	yi.zhang@huawei.com, houtao1@huawei.com, yangerkun@huawei.com
Subject: Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
Date: Tue, 3 Dec 2024 09:54:41 -0500	[thread overview]
Message-ID: <Z08bsQ07cilOsUKi@bfoster> (raw)
In-Reply-To: <Z05oJqT7983ifKqv@dread.disaster.area>

On Tue, Dec 03, 2024 at 01:08:38PM +1100, Dave Chinner wrote:
> On Mon, Dec 02, 2024 at 10:26:14AM -0500, Brian Foster wrote:
> > On Sat, Nov 30, 2024 at 09:39:29PM +0800, Long Li wrote:
> > > When performing fsstress test with this patch set, there is a very low probability of
> > > encountering an issue where isize is less than ioend->io_offset in iomap_add_to_ioend.
> > > After investigation, this was found to be caused by concurrent with truncate operations.
> > > Consider a scenario with 4K block size and a file size of 12K.
> > > 
> > > //write back [8K, 12K]           //truncate file to 4K
> > > ----------------------          ----------------------
> > > iomap_writepage_map             xfs_setattr_size
> 
> folio is locked here
> 
> > >   iomap_writepage_handle_eof
> > >                                   truncate_setsize
> > > 				    i_size_write(inode, newsize)  //update inode size to 4K
> 
...
> > > 
> > > It appears that in extreme cases, folios beyond EOF might be written back,
> > > resulting in situations where isize is less than pos. In such cases,
> > > maybe we should not trim the io_size further.
> > > 
> > 
> > Hmm.. it might be wise to characterize this further to determine whether
> > there are potentially larger problems to address before committing to
> > anything. For example, assuming truncate acquires ilock and does
> > xfs_itruncate_extents() and whatnot before this ioend submits/completes,
> 
> I don't think xfs_itruncate_extents() is the concern here - that
> happens after the page cache and writeback has been sorted out and
> the ILOCK has been taken and the page cache state should
> have already been sorted out. truncate_setsize() does that for us;
> it guarantees that all writeback in the truncate down range has
> been completed and the page cache invalidated.
> 

Ah this is what I was missing on previous read through.
truncate_inode_pages_range() waits on writeback in its second pass. I
was initially confused by what would have prevented removing a writeback
folio in the first pass, but that is handled a level down in
find_lock_entries() where it skips locked||writeback folios and thus
leaves them for the second pass.

> We hold the MMAP_LOCK (filemap_invalidate_lock()) so no new pages
> can be instantiated over the range whilst we are running
> xfs_itruncate_extents(). hence once truncate_setsize() returns, we
> are guaranteed that there will be no IO in progress or can be
> started over the range we are removing.
> 
> Really, the issue is that writeback mappings have to be able to
> handle the range being mapped suddenly appear to be beyond EOF.
> This behaviour is a longstanding writeback constraint, and is what
> iomap_writepage_handle_eof() is attempting to handle.
> 
> We handle this by only sampling i_size_read() whilst we have the
> folio locked and can determine the action we should take with that
> folio (i.e. nothing, partial zeroing, or skip altogether). Once
> we've made the decision that the folio is within EOF and taken
> action on it (i.e. moved the folio to writeback state), we cannot
> then resample the inode size because a truncate may have started
> and changed the inode size.
> 
> We have to complete the mapping of the folio to disk blocks - the
> disk block mapping is guaranteed to be valid for the life of the IO
> because the folio is locked and under writeback - and submit the IO
> so that truncate_pagecache() will unblock and invalidate the folio
> when the IO completes.
> 
> Hence writeback vs truncate serialisation is really dependent on
> only sampling the inode size -once- whilst the dirty folio we are
> writing back is locked.
> 

Not sure I see how this is a serialization dependency given that
writeback completion also samples i_size. But no matter, it seems a
reasonable implementation to me to make the submission path consistent
in handling eof.

I wonder if this could just use end_pos returned from
iomap_writepage_handle_eof()?

Brian

> I suspect that we can store and pass the sampled inode size through
> the block mapping and ioend management code so it is constant for
> the entire folio IO submission process, but whether we can do that
> and still fix the orginal issue that we are trying to fix is not
> something I've considered at this point....
> 
> > does anything in that submission or completion path detect and handle
> > this scenario gracefully? What if the ioend happens to be unwritten
> > post-eof preallocation and completion wants to convert blocks that might
> > no longer exist in the file..?
> 
> That can't happen because writeback must complete before
> truncate_setsize() will be allowed to remove the pages from the
> cache before xfs_itruncate_extents() can run.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


  reply	other threads:[~2024-12-03 14:53 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-27  6:35 [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Long Li
2024-11-27  6:35 ` [PATCH v5 2/2] xfs: clean up xfs_end_ioend() to reuse local variables Long Li
2024-11-27 16:07   ` Darrick J. Wong
2024-11-27 16:28 ` [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Darrick J. Wong
2024-11-28  6:38   ` Long Li
2024-11-28  3:33 ` Christoph Hellwig
2024-11-30 13:39 ` Long Li
2024-12-02 15:26   ` Brian Foster
2024-12-03  2:08     ` Dave Chinner
2024-12-03 14:54       ` Brian Foster [this message]
2024-12-03 21:12         ` Dave Chinner
2024-12-04 12:05           ` Brian Foster
2024-12-04  9:06         ` Long Li
2024-12-04 12:05           ` Brian Foster
2024-12-06  3:36             ` Long Li
2024-12-04  9:00       ` Long Li
2024-12-04 12:17         ` Brian Foster
2024-12-05 12:47           ` Long Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z08bsQ07cilOsUKi@bfoster \
    --to=bfoster@redhat.com \
    --cc=brauner@kernel.org \
    --cc=cem@kernel.org \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=houtao1@huawei.com \
    --cc=leo.lilong@huawei.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=yangerkun@huawei.com \
    --cc=yi.zhang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox