Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Long Li <leo.lilong@huawei.com>
Cc: Dave Chinner <david@fromorbit.com>,
	brauner@kernel.org, djwong@kernel.org, cem@kernel.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	yi.zhang@huawei.com, houtao1@huawei.com, yangerkun@huawei.com
Subject: Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
Date: Wed, 4 Dec 2024 07:17:45 -0500	[thread overview]
Message-ID: <Z1BIab8G3KmXuyfS@bfoster> (raw)
In-Reply-To: <Z1AaPNoN_z5EQxFQ@localhost.localdomain>

On Wed, Dec 04, 2024 at 05:00:44PM +0800, Long Li wrote:
> On Tue, Dec 03, 2024 at 01:08:38PM +1100, Dave Chinner wrote:
> > On Mon, Dec 02, 2024 at 10:26:14AM -0500, Brian Foster wrote:
> > > On Sat, Nov 30, 2024 at 09:39:29PM +0800, Long Li wrote:
> > > > When performing fsstress test with this patch set, there is a very low probability of
> > > > encountering an issue where isize is less than ioend->io_offset in iomap_add_to_ioend.
> > > > After investigation, this was found to be caused by concurrent with truncate operations.
> > > > Consider a scenario with 4K block size and a file size of 12K.
> > > > 
> > > > //write back [8K, 12K]           //truncate file to 4K
> > > > ----------------------          ----------------------
> > > > iomap_writepage_map             xfs_setattr_size
> > 
> > folio is locked here
> > 
> > > >   iomap_writepage_handle_eof
> > > >                                   truncate_setsize
> > > > 				    i_size_write(inode, newsize)  //update inode size to 4K
> > 
> > truncate_setsize() is supposed to invalidate whole pages beyond
> > EOF before completing, yes?
> > 
> > /**
> >  * truncate_setsize - update inode and pagecache for a new file size
> >  * @inode: inode
> >  * @newsize: new file size
> >  *
> >  * truncate_setsize updates i_size and performs pagecache truncation (if
> >  * necessary) to @newsize. It will be typically be called from the filesystem's
> >  * setattr function when ATTR_SIZE is passed in.
> >  *
> >  * Must be called with a lock serializing truncates and writes (generally
> >  * i_rwsem but e.g. xfs uses a different lock) and before all filesystem
> >  * specific block truncation has been performed.
> >  */
> > void truncate_setsize(struct inode *inode, loff_t newsize)
> > {
> >         loff_t oldsize = inode->i_size;
> > 
> >         i_size_write(inode, newsize);
> >         if (newsize > oldsize)
> >                 pagecache_isize_extended(inode, oldsize, newsize);
> >         truncate_pagecache(inode, newsize);
> > }
> > EXPORT_SYMBOL(truncate_setsize);
> > 
> > Note that this says "serialising truncates and writes" - the
> > emphasis needs to be placed on "writes" here, not "writeback". The
> > comment about XFS is also stale - it uses the i_rwsem here like
> > all other filesystems now.
> > 
> > The issue demonstrated above is -write back- racing against
> > truncate_setsize(), not writes. And -write back- is only serialised
> > against truncate_pagecache() by folio locks and state, not inode
> > locks. hence any change to the inode size in truncate can and will
> > race with writeback in progress.
> > 
> > Hence writeback needs to be able to handle folios end up beyond
> > EOF at any time during writeback. i.e. once we have a folio locked
> > in writeback and we've checked against i_size_read() for validity,
> > it needs to be considered a valid offset all the way through to
> > IO completion.
> > 
> > 
> > > >   iomap_writepage_map_blocks
> > > >     iomap_add_to_ioend
> > > >            < iszie < ioend->io_offset>
> > > > 	   <iszie = 4K,  ioend->io_offset=8K>
> > 
> > Ah, so the bug fix adds a new call to i_size_read() in the IO
> > submission path? I suspect that is the underlying problem leading
> > to the observed behaviour....
> > 
> > > > 
> > > > It appears that in extreme cases, folios beyond EOF might be written back,
> > > > resulting in situations where isize is less than pos. In such cases,
> > > > maybe we should not trim the io_size further.
> > > > 
> > > 
> > > Hmm.. it might be wise to characterize this further to determine whether
> > > there are potentially larger problems to address before committing to
> > > anything. For example, assuming truncate acquires ilock and does
> > > xfs_itruncate_extents() and whatnot before this ioend submits/completes,
> > 
> > I don't think xfs_itruncate_extents() is the concern here - that
> > happens after the page cache and writeback has been sorted out and
> > the ILOCK has been taken and the page cache state should
> > have already been sorted out. truncate_setsize() does that for us;
> > it guarantees that all writeback in the truncate down range has
> > been completed and the page cache invalidated.
> > 
> > We hold the MMAP_LOCK (filemap_invalidate_lock()) so no new pages
> > can be instantiated over the range whilst we are running
> > xfs_itruncate_extents(). hence once truncate_setsize() returns, we
> > are guaranteed that there will be no IO in progress or can be
> > started over the range we are removing.
> > 
> > Really, the issue is that writeback mappings have to be able to
> > handle the range being mapped suddenly appear to be beyond EOF.
> > This behaviour is a longstanding writeback constraint, and is what
> > iomap_writepage_handle_eof() is attempting to handle.
> > 
> > We handle this by only sampling i_size_read() whilst we have the
> > folio locked and can determine the action we should take with that
> > folio (i.e. nothing, partial zeroing, or skip altogether). Once
> > we've made the decision that the folio is within EOF and taken
> > action on it (i.e. moved the folio to writeback state), we cannot
> > then resample the inode size because a truncate may have started
> > and changed the inode size.
> > 
> 
> My understanding is the issue isn't that we can't sample the inode size. 
> The key point is that writeback mapping must be able to handle cases where
> the mapped range suddenly appears beyond EOF. If we can handle such
> cases properly, wouldn't sampling still be possible?
> 

I think so. I wouldn't harp too much on this as I think we're tripping
over words. ISTM the critical thing is that the folio is handled
properly wrt the truncate operation, which should be facilitated by the
folio lock.

> Coming back to our current issue, during writeback mapping, we sample
> the inode size to determine if the ioend is within EOF and attempt to
> trim io_size. Concurrent truncate operations may update the inode size,
> causing the pos of write back beyond EOF. In such cases, we simply don't
> trim io_size, which seems like a viable approach.
> 

Perhaps. I'm not claiming it isn't functional. But to Dave's (more
elaborated) point and in light of the racy i_size issue you've
uncovered, what bugs me also about this is that this creates an internal
inconsistency in the submission codepath.

I.e., the top level code does one thing based on one value of i_size,
then the ioend construction does another, and the logic is not directly
correlated so there is no real guarantee changes in one area correlate
to the other. IME, this increases potential for future bugs and adds
maintenance burden.

A simple example to consider might be.. suppose sometime in the future
we determine there is a selective case where we do want to allow a
post-eof writeback. As of right now, all that really requires is
adjustment to the "handle_eof()" logic and the rest of the codepath does
the right thing agnostic to outside operations like truncate. I think
there's value if we can preserve that invariant going forward.

FWIW, I'm not objecting to the alternative if something in the above
reasoning is wrong. I'm just trying to prioritize keeping things simple
and maintainable, particularly since truncate is kind of a complicated
beast as it is.

Brian

> Thanks,
> Long Li
>

next prev parent reply	other threads:[~2024-12-04 12:16 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-27  6:35 [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Long Li
2024-11-27  6:35 ` [PATCH v5 2/2] xfs: clean up xfs_end_ioend() to reuse local variables Long Li
2024-11-27 16:07   ` Darrick J. Wong
2024-11-27 16:28 ` [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Darrick J. Wong
2024-11-28  6:38   ` Long Li
2024-11-28  3:33 ` Christoph Hellwig
2024-11-30 13:39 ` Long Li
2024-12-02 15:26   ` Brian Foster
2024-12-03  2:08     ` Dave Chinner
2024-12-03 14:54       ` Brian Foster
2024-12-03 21:12         ` Dave Chinner
2024-12-04 12:05           ` Brian Foster
2024-12-04  9:06         ` Long Li
2024-12-04 12:05           ` Brian Foster
2024-12-06  3:36             ` Long Li
2024-12-04  9:00       ` Long Li
2024-12-04 12:17         ` Brian Foster [this message]
2024-12-05 12:47           ` Long Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z1BIab8G3KmXuyfS@bfoster \
    --to=bfoster@redhat.com \
    --cc=brauner@kernel.org \
    --cc=cem@kernel.org \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=houtao1@huawei.com \
    --cc=leo.lilong@huawei.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=yangerkun@huawei.com \
    --cc=yi.zhang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox