Re: [PATCH] tmpfs: zero post-eof folio range on file extension

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>,
	linux-mm@kvack.org, Matthew Wilcox <willy@infradead.org>,
	Usama Arif <usamaarif64@gmail.com>
Subject: Re: [PATCH] tmpfs: zero post-eof folio range on file extension
Date: Mon, 14 Jul 2025 10:38:57 -0400	[thread overview]
Message-ID: <aHUWgRP4NKq6mKME@bfoster> (raw)
In-Reply-To: <18c5d84b-1449-411e-8cd7-ee8c6af37677@linux.alibaba.com>

On Mon, Jul 14, 2025 at 11:05:35AM +0800, Baolin Wang wrote:
> 
> 
> On 2025/7/12 04:15, Brian Foster wrote:
> > On Fri, Jul 11, 2025 at 12:08:16PM -0400, Brian Foster wrote:
> > > On Fri, Jul 11, 2025 at 11:50:05AM +0800, Baolin Wang wrote:
> > > > 
> > > > 
> > > > On 2025/7/11 06:20, Hugh Dickins wrote:
> > > > > On Thu, 10 Jul 2025, Baolin Wang wrote:
> > > > > > On 2025/7/9 15:57, Hugh Dickins wrote:
> > > > > ...
> > > > > > > 
> > > > > > > The problem is with huge pages (or large folios) in shmem_writeout():
> > > > > > > what goes in as a large folio may there have to be split into small
> > > > > > > pages; or it may be swapped out as one large folio, but fragmentation
> > > > > > > at swapin time demand that it be split into small pages when swapped in.
> > > > > > 
> > > > > > Good point.
> > > > > > 
> > > > > > > So, if there has been swapout since the large folio was modified beyond
> > > > > > > EOF, the folio that shmem_zero_eof() brings in does not guarantee what
> > > > > > > length needs to be zeroed.
> > > > > > > 
> > > > > > > We could set that aside as a deficiency to be fixed later on: that
> > > > > > > would not be unreasonable, but I'm guessing that won't satisfy you.
> > > > > > > 
> > > > > > > We could zero the maximum (the remainder of PMD size I believe) in
> > > > > > > shmem_zero_eof(): looping over small folios within the range, skipping
> > > > > > > !uptodate ones (but we do force them uptodate when swapping out, in
> > > > > > > order to keep the space reservation).  TBH I've ignored that as a bad
> > > > > > > option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
> > > > > > 
> > > > > > However, IIUC, if the large folios are split in shmem_writeout(), and those
> > > > > > small folios which beyond EOF will be dropped and freed in
> > > > > > __split_unmapped_folio(), should we still consider them?
> > > > > 
> > > > > You're absolutely right about the normal case, and thank you for making
> > > > > that point.  Had I forgotten that when writing?  Or was I already
> > > > > jumping ahead to the problem case?  I don't recall, but was certainly
> > > > > wrong for not mentioning it.
> > > > > 
> > > > > The abnormal case is when there's a "fallocend" beyond i_size (or beyond
> > > > > the small page extent spanning i_size) i.e. fallocate() has promised to
> > > > > keep pages allocated beyond EOF.  In that case, __split_unmapped_folio()
> > > > > is keeping those pages.
> > > > 
> > > > Ah, yes, you are right.
> > > > 
> > > > > There could well be some optimization, involving fallocend, to avoid
> > > > > zeroing more than necessary; but I wouldn't want to say what in a hurry,
> > > > > it's quite confusing!
> > > > 
> > > > Like you said, not only can a large folio split occur during swapout, but it
> > > > can also happen during a punch hole operation. Moreover, considering the
> > > > abnormal case of fallocate() you mentioned, we should find a more common
> > > > approach to mitigate the impact of fallocate().
> > > > 
> > > > For instance, when splitting, we could clear the 'uptodate' flag for these
> > > > EOF small folios that are beyond 'i_size' but less than the 'fallocend', so
> > > > that these EOF small folios will be re-initialized if they are used again.
> > > > What do you think?
> > > > 
> > > ...
> > > 
> > > Hi Baolin,
> > > 
> > > So I'm still digesting Hugh's clarification wrt the falloc case, but I'm
> > > a little curious here given that I intended to implement the writeout
> > > zeroing suggestion regardless of that discussion..
> > > 
> > > I see the hole punch case falls into truncate_inode_[partial_]folio(),
> > > which looks to me like it handles zeroing. The full truncate case just
> > > tosses the folio of course, but the partial case zeroes according to the
> > > target range prior to doing any potential split from that codepath.
> > > 
> > > That looks kind of similar to what I have prototyped for the
> > > shmem_writeout() case: tail zero the EOF straddling folio before falling
> > > into the split call. [1] Does that not solve the same general issue in
> > > the swapout path as potentially clearing uptodate via the split? I'm
> > > mainly trying to understand if that is just a potential alternative
> > > approach, or if this solves a corner case that I'm missing. Hm?
> > > 
> > 
> > Ok, after playing around a bit I think I see what I was missing. I
> > misinterpreted that the punch case is only going to zero in the target
> > range of the punch. So if you have something like a 1M file backed by an
> > fallocated 2M folio, map write the whole 2M, then punch the last 4k of
> > the file, you end up with the non-zeroed smaller folios beyond EOF. This
> > means that even with a zero of the eof folio, a truncate up over those
> > folios won't be zeroed.
> 
> Right.
> 
> > I need to think on it some more, but I take it this means that
> > essentially 1. any uptodate range/folio beyond EOF needs to be zeroed on
> > swapout (which I think is analogous to your earlier prototype logic) [1]
> > and 2. shmem_zero_eof() needs to turn into something like
> > shmem_zero_range().
> 
> Like we discussed, only considering swapout is not enough; it's necessary to
> consider all cases of large folio splits, such as swapout, punch hole,
> migration, shmem shrinker, etc. In the future, if there are other cases of
> splits, the impact on EOF folios will also need to be considered (should
> zero them before split). IMHO, this could lead to complexity and
> uncontrollability.
> 

Ok. FWIW, the purpose of the swap time zeroing in this case is not
necessarily to be a solution purely on its own. Rather (and to Hugh's
earlier point about the zeroing needing to cover a range vs. just
relying on the eof folio size), it's probably more ideal if that eof
zeroing code can assume post-eof swapped out folios are always zeroed.

But anyways, I'll try to shoot for something like that for a v2. I also
want to see if I can figure a way for a bit more thorough testing. We
can revisit from there if there are better options and/or furher gaps to
consider. Thanks again for the comments.

Brian

> So my suggestion is to address this issue during the split process, and it
> seems feasible to make EOF small folios not 'uptodate' during the split.
> Anyway, you can investigate further.
> 
> > The latter would zero a range of uptodate folios between current EOF and
> > the start of the extending operation, rather than just the EOF folio.
> > This is actually pretty consistent with traditional fs (see
> > xfs_file_write_zero_eof() for example) behavior. I was originally
> > operating under assumption that this wasn't necessary for tmpfs given
> > traditional pagecache post-eof behavior, but that has clearly proven
> > false.
> > 
> > Brian
> > 
> > [1] I'm also wondering if another option here is to just clear_uptodate
> > any uptodate folio that fully starts beyond EOF. I.e., if the folio
> > straddles EOF then partial zero as below, if the folio is beyond EOF
> > then clear uptodate and let the existing code further down zero it.
> > 
> > > If the former, I suspect we'd need to tail zero on writeout regardless
> > > of folio size. Given that, and IIUC that clearing uptodate as such will
> > > basically cause the split folios to fall back into the !uptodate -> zero
> > > -> mark_uptodate sequence of shmem_writeout(), I wonder what the
> > > advantage of that is. It feels a bit circular to me when considered
> > > along with the tail zeroing below, but again I'm peeling away at
> > > complexity as I go here.. ;) Thoughts?
> > > 
> > > Brian
> > > 
> > > [1] prototype writeout logic:
> > > 
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index 634e499b6197..535021ae5a2f 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -1579,7 +1579,8 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
> > >   	struct inode *inode = mapping->host;
> > >   	struct shmem_inode_info *info = SHMEM_I(inode);
> > >   	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> > > -	pgoff_t index;
> > > +	loff_t i_size = i_size_read(inode);
> > > +	pgoff_t index = i_size >> PAGE_SHIFT;
> > >   	int nr_pages;
> > >   	bool split = false;
> > > @@ -1592,6 +1593,17 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
> > >   	if (!total_swap_pages)
> > >   		goto redirty;
> > > +	/*
> > > +	 * If the folio straddles EOF, the tail portion must be zeroed on
> > > +	 * every swapout.
> > > +	 */
> > > +	if (folio_test_uptodate(folio) &&
> > > +	    folio->index <= index && folio_next_index(folio) > index) {
> > > +		size_t from = offset_in_folio(folio, i_size);
> > > +		if (from)
> > > +			folio_zero_segment(folio, from, folio_size(folio));
> > > +	}
> > > +
> > >   	/*
> > >   	 * If CONFIG_THP_SWAP is not enabled, the large folio should be
> > >   	 * split when swapping.
> > > 
> > > 
> > 
>

next prev parent reply	other threads:[~2025-07-14 14:35 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-25 18:49 [PATCH] tmpfs: zero post-eof folio range on file extension Brian Foster
2025-06-25 19:21 ` Matthew Wilcox
2025-06-26  5:35   ` Hugh Dickins
2025-06-26 12:55     ` Brian Foster
2025-06-27  3:21       ` Baolin Wang
2025-06-27 11:54         ` Brian Foster
2025-07-09  7:57 ` Hugh Dickins
2025-07-10  6:47   ` Baolin Wang
2025-07-10 22:20     ` Hugh Dickins
2025-07-11  3:50       ` Baolin Wang
2025-07-11  7:50         ` Hugh Dickins
2025-07-11  8:42           ` Baolin Wang
2025-07-11 16:08         ` Brian Foster
2025-07-11 20:15           ` Brian Foster
2025-07-14  3:05             ` Baolin Wang
2025-07-14 14:38               ` Brian Foster [this message]
2025-07-10 12:36   ` Brian Foster
2025-07-10 23:02     ` Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aHUWgRP4NKq6mKME@bfoster \
    --to=bfoster@redhat.com \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=hughd@google.com \
    --cc=linux-mm@kvack.org \
    --cc=usamaarif64@gmail.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).