Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Long Li <leo.lilong@huawei.com>,
	brauner@kernel.org, djwong@kernel.org, cem@kernel.org,
	linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	yi.zhang@huawei.com, houtao1@huawei.com, yangerkun@huawei.com
Subject: Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
Date: Wed, 4 Dec 2024 07:05:16 -0500	[thread overview]
Message-ID: <Z1BFfIk_iDAh2uwF@bfoster> (raw)
In-Reply-To: <Z090Jd06yjgh_Q-y@dread.disaster.area>

On Wed, Dec 04, 2024 at 08:12:05AM +1100, Dave Chinner wrote:
> On Tue, Dec 03, 2024 at 09:54:41AM -0500, Brian Foster wrote:
> > On Tue, Dec 03, 2024 at 01:08:38PM +1100, Dave Chinner wrote:
> > > On Mon, Dec 02, 2024 at 10:26:14AM -0500, Brian Foster wrote:
> > > > On Sat, Nov 30, 2024 at 09:39:29PM +0800, Long Li wrote:
> > > We hold the MMAP_LOCK (filemap_invalidate_lock()) so no new pages
> > > can be instantiated over the range whilst we are running
> > > xfs_itruncate_extents(). hence once truncate_setsize() returns, we
> > > are guaranteed that there will be no IO in progress or can be
> > > started over the range we are removing.
> > > 
> > > Really, the issue is that writeback mappings have to be able to
> > > handle the range being mapped suddenly appear to be beyond EOF.
> > > This behaviour is a longstanding writeback constraint, and is what
> > > iomap_writepage_handle_eof() is attempting to handle.
> > > 
> > > We handle this by only sampling i_size_read() whilst we have the
> > > folio locked and can determine the action we should take with that
> > > folio (i.e. nothing, partial zeroing, or skip altogether). Once
> > > we've made the decision that the folio is within EOF and taken
> > > action on it (i.e. moved the folio to writeback state), we cannot
> > > then resample the inode size because a truncate may have started
> > > and changed the inode size.
> > > 
> > > We have to complete the mapping of the folio to disk blocks - the
> > > disk block mapping is guaranteed to be valid for the life of the IO
> > > because the folio is locked and under writeback - and submit the IO
> > > so that truncate_pagecache() will unblock and invalidate the folio
> > > when the IO completes.
> > > 
> > > Hence writeback vs truncate serialisation is really dependent on
> > > only sampling the inode size -once- whilst the dirty folio we are
> > > writing back is locked.
> > > 
> > 
> > Not sure I see how this is a serialization dependency given that
> > writeback completion also samples i_size.
> 
> Ah, I didn't explain what I meant very clearly, did I?
> 
> What I mean was we can't sample i_size in the IO path without
> specific checking/serialisation against truncate operations. And
> that means once we have partially zeroed the contents of a EOF
> straddling folio, we can't then sample the EOF again to determine
> the length of valid data in the folio as this can race with truncate
> and result in a different size for the data in the folio than we
> prepared it for.
> 

Ok, I think we're just saying the same thing using different words.

> > But no matter, it seems a
> > reasonable implementation to me to make the submission path consistent
> > in handling eof.
> 
> Yes, the IO completion path does sample it again via xfs_new_eof().
> However, as per above, it has specific checking for truncate down
> races and handles them:
> 
> /*
>  * If this I/O goes past the on-disk inode size update it unless it would
>  * be past the current in-core inode size.
>  */
> static inline xfs_fsize_t
> xfs_new_eof(struct xfs_inode *ip, xfs_fsize_t new_size)
> {
>         xfs_fsize_t i_size = i_size_read(VFS_I(ip));
> 
> >>>>    if (new_size > i_size || new_size < 0)
> >>>>            new_size = i_size;
>         return new_size > ip->i_disk_size ? new_size : 0;
> }
> 
> If we have a truncate_setsize() called for a truncate down whilst
> this IO is in progress, then xfs_new_eof() will see the new, smaller
> inode isize. The clamp on new_size handles this situation, and we
> then only triggers an update if the on-disk size is still smaller
> than the new truncated size (i.e. the IO being completed is still
> partially within the new EOF from the truncate down).
> 
> So I don't think there's an issue here at all at IO completion;
> it handles truncate down races cleanly...
> 

Agree.. this was kind of the point of the submit side trimming. I'm not
sure a second sample of i_size on submission for trimming purposes
affects this in any problematic way either.

Brian

> > I wonder if this could just use end_pos returned from
> > iomap_writepage_handle_eof()?
> 
> Yeah, that was what I was thinking, but I haven't looked at the code
> for long enough to have any real idea of whether that is sufficient
> or not.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>

next prev parent reply	other threads:[~2024-12-04 12:03 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-27  6:35 [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Long Li
2024-11-27  6:35 ` [PATCH v5 2/2] xfs: clean up xfs_end_ioend() to reuse local variables Long Li
2024-11-27 16:07   ` Darrick J. Wong
2024-11-27 16:28 ` [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Darrick J. Wong
2024-11-28  6:38   ` Long Li
2024-11-28  3:33 ` Christoph Hellwig
2024-11-30 13:39 ` Long Li
2024-12-02 15:26   ` Brian Foster
2024-12-03  2:08     ` Dave Chinner
2024-12-03 14:54       ` Brian Foster
2024-12-03 21:12         ` Dave Chinner
2024-12-04 12:05           ` Brian Foster [this message]
2024-12-04  9:06         ` Long Li
2024-12-04 12:05           ` Brian Foster
2024-12-06  3:36             ` Long Li
2024-12-04  9:00       ` Long Li
2024-12-04 12:17         ` Brian Foster
2024-12-05 12:47           ` Long Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z1BFfIk_iDAh2uwF@bfoster \
    --to=bfoster@redhat.com \
    --cc=brauner@kernel.org \
    --cc=cem@kernel.org \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=houtao1@huawei.com \
    --cc=leo.lilong@huawei.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=yangerkun@huawei.com \
    --cc=yi.zhang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.