Re: [PATCH] tmpfs: zero post-eof folio range on file extension

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>,
	linux-mm@kvack.org, Matthew Wilcox <willy@infradead.org>,
	Usama Arif <usamaarif64@gmail.com>
Subject: Re: [PATCH] tmpfs: zero post-eof folio range on file extension
Date: Fri, 11 Jul 2025 16:15:14 -0400	[thread overview]
Message-ID: <aHFw0jNehHYXQzRh@bfoster> (raw)
In-Reply-To: <aHE28N9tawdf4FGZ@bfoster>

On Fri, Jul 11, 2025 at 12:08:16PM -0400, Brian Foster wrote:
> On Fri, Jul 11, 2025 at 11:50:05AM +0800, Baolin Wang wrote:
> > 
> > 
> > On 2025/7/11 06:20, Hugh Dickins wrote:
> > > On Thu, 10 Jul 2025, Baolin Wang wrote:
> > > > On 2025/7/9 15:57, Hugh Dickins wrote:
> > > ...
> > > > > 
> > > > > The problem is with huge pages (or large folios) in shmem_writeout():
> > > > > what goes in as a large folio may there have to be split into small
> > > > > pages; or it may be swapped out as one large folio, but fragmentation
> > > > > at swapin time demand that it be split into small pages when swapped in.
> > > > 
> > > > Good point.
> > > > 
> > > > > So, if there has been swapout since the large folio was modified beyond
> > > > > EOF, the folio that shmem_zero_eof() brings in does not guarantee what
> > > > > length needs to be zeroed.
> > > > > 
> > > > > We could set that aside as a deficiency to be fixed later on: that
> > > > > would not be unreasonable, but I'm guessing that won't satisfy you.
> > > > > 
> > > > > We could zero the maximum (the remainder of PMD size I believe) in
> > > > > shmem_zero_eof(): looping over small folios within the range, skipping
> > > > > !uptodate ones (but we do force them uptodate when swapping out, in
> > > > > order to keep the space reservation).  TBH I've ignored that as a bad
> > > > > option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
> > > > 
> > > > However, IIUC, if the large folios are split in shmem_writeout(), and those
> > > > small folios which beyond EOF will be dropped and freed in
> > > > __split_unmapped_folio(), should we still consider them?
> > > 
> > > You're absolutely right about the normal case, and thank you for making
> > > that point.  Had I forgotten that when writing?  Or was I already
> > > jumping ahead to the problem case?  I don't recall, but was certainly
> > > wrong for not mentioning it.
> > > 
> > > The abnormal case is when there's a "fallocend" beyond i_size (or beyond
> > > the small page extent spanning i_size) i.e. fallocate() has promised to
> > > keep pages allocated beyond EOF.  In that case, __split_unmapped_folio()
> > > is keeping those pages.
> > 
> > Ah, yes, you are right.
> > 
> > > There could well be some optimization, involving fallocend, to avoid
> > > zeroing more than necessary; but I wouldn't want to say what in a hurry,
> > > it's quite confusing!
> > 
> > Like you said, not only can a large folio split occur during swapout, but it
> > can also happen during a punch hole operation. Moreover, considering the
> > abnormal case of fallocate() you mentioned, we should find a more common
> > approach to mitigate the impact of fallocate().
> > 
> > For instance, when splitting, we could clear the 'uptodate' flag for these
> > EOF small folios that are beyond 'i_size' but less than the 'fallocend', so
> > that these EOF small folios will be re-initialized if they are used again.
> > What do you think?
> > 
> ...
> 
> Hi Baolin,
> 
> So I'm still digesting Hugh's clarification wrt the falloc case, but I'm
> a little curious here given that I intended to implement the writeout
> zeroing suggestion regardless of that discussion..
> 
> I see the hole punch case falls into truncate_inode_[partial_]folio(),
> which looks to me like it handles zeroing. The full truncate case just
> tosses the folio of course, but the partial case zeroes according to the
> target range prior to doing any potential split from that codepath.
> 
> That looks kind of similar to what I have prototyped for the
> shmem_writeout() case: tail zero the EOF straddling folio before falling
> into the split call. [1] Does that not solve the same general issue in
> the swapout path as potentially clearing uptodate via the split? I'm
> mainly trying to understand if that is just a potential alternative
> approach, or if this solves a corner case that I'm missing. Hm?
> 

Ok, after playing around a bit I think I see what I was missing. I
misinterpreted that the punch case is only going to zero in the target
range of the punch. So if you have something like a 1M file backed by an
fallocated 2M folio, map write the whole 2M, then punch the last 4k of
the file, you end up with the non-zeroed smaller folios beyond EOF. This
means that even with a zero of the eof folio, a truncate up over those
folios won't be zeroed.

I need to think on it some more, but I take it this means that
essentially 1. any uptodate range/folio beyond EOF needs to be zeroed on
swapout (which I think is analogous to your earlier prototype logic) [1]
and 2. shmem_zero_eof() needs to turn into something like
shmem_zero_range().

The latter would zero a range of uptodate folios between current EOF and
the start of the extending operation, rather than just the EOF folio.
This is actually pretty consistent with traditional fs (see
xfs_file_write_zero_eof() for example) behavior. I was originally
operating under assumption that this wasn't necessary for tmpfs given
traditional pagecache post-eof behavior, but that has clearly proven
false.

Brian

[1] I'm also wondering if another option here is to just clear_uptodate
any uptodate folio that fully starts beyond EOF. I.e., if the folio
straddles EOF then partial zero as below, if the folio is beyond EOF
then clear uptodate and let the existing code further down zero it.

> If the former, I suspect we'd need to tail zero on writeout regardless
> of folio size. Given that, and IIUC that clearing uptodate as such will
> basically cause the split folios to fall back into the !uptodate -> zero
> -> mark_uptodate sequence of shmem_writeout(), I wonder what the
> advantage of that is. It feels a bit circular to me when considered
> along with the tail zeroing below, but again I'm peeling away at
> complexity as I go here.. ;) Thoughts?
> 
> Brian
> 
> [1] prototype writeout logic:
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 634e499b6197..535021ae5a2f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1579,7 +1579,8 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
>  	struct inode *inode = mapping->host;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> -	pgoff_t index;
> +	loff_t i_size = i_size_read(inode);
> +	pgoff_t index = i_size >> PAGE_SHIFT;
>  	int nr_pages;
>  	bool split = false;
>  
> @@ -1592,6 +1593,17 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
>  	if (!total_swap_pages)
>  		goto redirty;
>  
> +	/*
> +	 * If the folio straddles EOF, the tail portion must be zeroed on
> +	 * every swapout.
> +	 */
> +	if (folio_test_uptodate(folio) &&
> +	    folio->index <= index && folio_next_index(folio) > index) {
> +		size_t from = offset_in_folio(folio, i_size);
> +		if (from)
> +			folio_zero_segment(folio, from, folio_size(folio));
> +	}
> +
>  	/*
>  	 * If CONFIG_THP_SWAP is not enabled, the large folio should be
>  	 * split when swapping.
> 
>

next prev parent reply	other threads:[~2025-07-11 20:11 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-25 18:49 [PATCH] tmpfs: zero post-eof folio range on file extension Brian Foster
2025-06-25 19:21 ` Matthew Wilcox
2025-06-26  5:35   ` Hugh Dickins
2025-06-26 12:55     ` Brian Foster
2025-06-27  3:21       ` Baolin Wang
2025-06-27 11:54         ` Brian Foster
2025-07-09  7:57 ` Hugh Dickins
2025-07-10  6:47   ` Baolin Wang
2025-07-10 22:20     ` Hugh Dickins
2025-07-11  3:50       ` Baolin Wang
2025-07-11  7:50         ` Hugh Dickins
2025-07-11  8:42           ` Baolin Wang
2025-07-11 16:08         ` Brian Foster
2025-07-11 20:15           ` Brian Foster [this message]
2025-07-14  3:05             ` Baolin Wang
2025-07-14 14:38               ` Brian Foster
2025-07-10 12:36   ` Brian Foster
2025-07-10 23:02     ` Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aHFw0jNehHYXQzRh@bfoster \
    --to=bfoster@redhat.com \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=hughd@google.com \
    --cc=linux-mm@kvack.org \
    --cc=usamaarif64@gmail.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).