Re: [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: linux-mm@kvack.org, Hugh Dickins <hughd@google.com>
Subject: Re: [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout
Date: Thu, 20 Nov 2025 09:12:29 -0500	[thread overview]
Message-ID: <aR8hzf2gbJsFnTgZ@bfoster> (raw)
In-Reply-To: <8e766a2b-0d54-4905-9f67-53ef1397b8dc@linux.alibaba.com>

On Thu, Nov 20, 2025 at 09:57:40AM +0800, Baolin Wang wrote:
> 
> 
> On 2025/11/19 22:08, Brian Foster wrote:
> > On Wed, Nov 19, 2025 at 11:53:41AM +0800, Baolin Wang wrote:
> > > 
> > > 
> > > On 2025/11/18 22:39, Brian Foster wrote:
> > > > On Tue, Nov 18, 2025 at 10:33:44AM +0800, Baolin Wang wrote:
> > > > > 
> > > > > 
> > > > > On 2025/11/13 00:25, Brian Foster wrote:
> > > > > > As a first step to facilitate efficient post-eof zeroing in tmpfs,
> > > > > > zero post-eof uptodate folios at swap out time. This ensures that
> > > > > > post-eof ranges are zeroed "on disk" (i.e. analogous to traditional
> > > > > > pagecache writeback) and facilitates zeroing on file size changes by
> > > > > > allowing it to not have to swap in.
> > > > > > 
> > > > > > Note that shmem_writeout() already zeroes !uptodate folios so this
> > > > > > introduces some duplicate logic. We'll clean this up in the next
> > > > > > patch.
> > > > > > 
> > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > > > ---
> > > > > >     mm/shmem.c | 19 +++++++++++++++++--
> > > > > >     1 file changed, 17 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > > > index 0a25ee095b86..5fb3c911894f 100644
> > > > > > --- a/mm/shmem.c
> > > > > > +++ b/mm/shmem.c
> > > > > > @@ -1577,6 +1577,8 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> > > > > >     	struct inode *inode = mapping->host;
> > > > > >     	struct shmem_inode_info *info = SHMEM_I(inode);
> > > > > >     	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> > > > > > +	loff_t i_size = i_size_read(inode);
> > > > > > +	pgoff_t end_index = DIV_ROUND_UP(i_size, PAGE_SIZE);
> > > > > >     	pgoff_t index;
> > > > > >     	int nr_pages;
> > > > > >     	bool split = false;
> > > > > > @@ -1596,8 +1598,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> > > > > >     	 * (unless fallocate has been used to preallocate beyond EOF).
> > > > > >     	 */
> > > > > >     	if (folio_test_large(folio)) {
> > > > > > -		index = shmem_fallocend(inode,
> > > > > > -			DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
> > > > > > +		index = shmem_fallocend(inode, end_index);
> > > > > >     		if ((index > folio->index && index < folio_next_index(folio)) ||
> > > > > >     		    !IS_ENABLED(CONFIG_THP_SWAP))
> > > > > >     			split = true;
> > > > > > @@ -1647,6 +1648,20 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> > > > > >     		folio_mark_uptodate(folio);
> > > > > >     	}
> > > > > > +	/*
> > > > > > +	 * Ranges beyond EOF must be zeroed at writeout time. This mirrors
> > > > > > +	 * traditional writeback behavior and facilitates zeroing on file size
> > > > > > +	 * changes without having to swap back in.
> > > > > > +	 */
> > > > > > +	if (folio_next_index(folio) >= end_index) {
> > > > > > +		size_t from = offset_in_folio(folio, i_size);
> > > > > > +
> > > > > > +		if (index >= end_index) {
> > > > > > +			folio_zero_segment(folio, 0, folio_size(folio));
> > > > > > +		} else if (from)
> > > > > > +			folio_zero_segment(folio, from, folio_size(folio));
> > > > > > +	}
> > > > > 
> > > > > As I mentioned before[1], if a large folio is beyond EOF, it will be split
> > > > > in shmem_writeout(), and those small folios beyond EOF will be dropped and
> > > > > freed in __folio_split(). Of course, there's another special case as Hugh
> > > > > mentioned: when there's a 'fallocend' beyond i_size (e.g., fallocate()), it
> > > > > will keep the pages allocated beyond EOF after the split. However, your
> > > > > 'end_index' here does not consider 'fallocend,' so it seems to me that this
> > > > > portion of the code doesn't actually take effect.
> > > > > 
> > > > 
> > > > Hi Boalin,
> > > 
> > > s/Boalin/Baolin :)
> > > 
> > 
> > Sorry, Baolin! ;)
> > 
> > > > 
> > > > So I get that split post-eof folios can fall off depending on fallocate
> > > > status. I'm not sure what you mean by considering fallocend, however.
> > > > ISTM that fallocend contributes to the boundary where we decide to split
> > > > and/or preserve, but i_size is what is relevant for zeroing. It's not
> > > > clear to me if you're suggesting the logic is potentially spurious, or
> > > > this might not actually be zeroing correctly due to falloc interactions.
> > > > Can you clarify the concern please? Thanks.
> > > 
> > > Sorry for not being clear enough (for my quick response yesterday). After
> > > thinking more, I want to divide this into 3 cases to clearly explain the
> > > logic here:
> > > 
> > 
> > No worries. Thanks for breaking it down. Much easier to discuss this
> > way.
> > 
> > > 1. Without fallocate(), if a large folio is beyond EOF (i.e. i_size), it
> > > will be split in shmem_writeout(), and those small folios beyond EOF will be
> > > dropped and freed in __folio_split(). So, your changes should also have no
> > > impact, because after the split, ‘folio_next_index(folio)’ is definitely <=
> > > ‘end_index’. So the logic is correct.
> > > 
> > 
> > Ack, but we still want to zero any post-eof portion of a small folio
> > straddling i_size...
> 
> Yes.
> 
> > > 2. With fallocate(), If a large folio is beyond EOF (i.e. i_size) but
> > > smaller than 'fallocend', the folio will not be split. So, we should zero
> > > the post-EOF part. Because 'index' (now representing 'fallocend') is greater
> > > than 'end_index', you are zeroing the entire large folio, which does not
> > > seem correct to me.
> > > 
> > 
> > Unless CONFIG_THP_SWAP is disabled (no idea how likely that is, seems
> > like an arch thing), in which case it seems we always split (and then
> > the split path will still use fallocend to determine whether to toss or
> > preserve).
> > 
> > But otherwise yes, partial zeroing should be the same for the large
> > folio across EOF case, it just happens to be a larger folio size...
> > 
> > > if (index >= end_index) {
> > > 	folio_zero_segment(folio, 0, folio_size(folio));
> > > } else if ...
> > > 
> > > I think you should only zero the range from 'from' to 'folio_size(folio)' of
> > > this large folio in this case. Right?
> > > 
> > 
> > However index is folio->index here, not fallocend. index is reassigned a
> > bit further down the function just after the block try_split: lands in:
> > 
> >          index = folio->index;
> >          nr_pages = folio_nr_pages(folio);
> > 
> > This wasn't introduced by this patch, FWIW, but I suppose we could make
> > the fallocend block use a local for clarity.
> 
> Thanks for correcting me. I mistakenly assumed that the 'index' represents
> 'fallocend' from your patch:
> 
> +		index = shmem_fallocend(inode, end_index);
> 
> > So given that, the logic is effectively if the folio starts at or beyond
> > the first page size index beyond EOF, zero the whole thing. That seems
> > pretty straightforward to me, so I'm not clear on why we'd need to
> > consider whether the folio is large or not at this point.
> 
> Yes, you are right. After you corrected me, I understand that index refers
> to folio->index.
> 
> > > 3. With fallocate(), If a large folio is beyond EOF (i.e. i_size) and also
> > > beyond 'fallocend', the large folio will be split to small folios. If we
> > > continue attempting to write out these small folios beyond EOF, we need to
> > > zero the entire mall folios at this point. So, the logic looks correct
> > > (because 'index' > 'end_index').
> > > 
> > 
> > Ack..
> > 
> > > Based on the above analysis, I believe the logic should be:
> > > 
> > > if (folio_next_index(folio) >= end_index) {
> > > 	size_t from = offset_in_folio(folio, i_size);
> > > 
> > > 	if (!folio_test_large(folio) && index >= end_index)
> > > 		folio_zero_segment(folio, 0, folio_size(folio));
> > > 	else if (from)
> > > 		folio_zero_segment(folio, from, folio_size(folio));
> > > }
> > > 
> > > The logic here is a bit complex, please correct me if I misunderstood you.
> > > 
> > 
> > Hmm.. so I'm not really sure about the large folio check. Have you
> > reviewed the next patch, by chance? It occurs to me that I probably
> > split these two up wrongly. I probably should have split off the
> > existing !uptodate zeroing into a separate hunk in patch 1 (i.e. as a
> > non functional change, refactoring patch) and then introduce the
> > functional change in patch 2. I'll try that for v3.
> > 
> > But in the meantime, this is the logic after patch 2:
> > 
> > 	/*
> > 	 * Ranges beyond EOF must be zeroed at writeout time. This mirrors
> > 	 * traditional writeback behavior and facilitates zeroing on file size
> > 	 * changes without having to swap back in.
> > 	 */
> > 	if (!folio_test_uptodate(folio) ||
> > 	    folio_next_index(folio) >= end_index) {
> > 		size_t from = offset_in_folio(folio, i_size);
> > 
> > 		if (!folio_test_uptodate(folio) || index >= end_index) {
> > 			folio_zero_segment(folio, 0, folio_size(folio));
> > 			flush_dcache_folio(folio);
> > 			folio_mark_uptodate(folio);
> > 		} else if (from)
> > 			folio_zero_segment(folio, from, folio_size(folio));
> > 	}
> > 
> > So factoring out the preexisting uptodate logic, this looks mostly
> > equivalent to what you posted above with the exception of the large
> > folio check. I could be wrong, but it kind of sounds like that is maybe
> > due to confusion over the index value.. hm?
> 
> Right. My example is wrong due to confusion over the index value. I think
> you are right.
> 
> > 
> > In any event, I'm trying to make this logic as simple and clear as
> > possible. The idea here is basically that any folio being written out
> > that is either !uptodate or at least partially beyond EOF needs zeroing.
> > 
> > The split logic earlier in the function simply dictates what folios make
> > it to this point (to be written out vs. tossed) or not, and otherwise we
> > don't really have to care about state wrt zeroing post-eof folio ranges
> > (particularly if any of that splitting logic happens to change in the
> > future). Let me know if I'm missing something. Thanks again.
> 
> Thanks for your work and explanation. Now I think your patch is correct, and
> I will continue to review the following patches. Feel free to add:
> 
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> 
> One nit: using 'fallocend' instead of 'index' can avoid confusion:)
> 

Great, thanks! Agreed, I'll fold this in for v3..

Brian

> diff --git a/mm/shmem.c b/mm/shmem.c
> index 371af9e322d5..7f7bdb7944cc 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1586,8 +1586,8 @@ int shmem_writeout(struct folio *folio, struct
> swap_iocb **plug,
>          * (unless fallocate has been used to preallocate beyond EOF).
>          */
>         if (folio_test_large(folio)) {
> -               index = shmem_fallocend(inode, end_index);
> -               if ((index > folio->index && index <
> folio_next_index(folio)) ||
> +               pgoff_t fallocend = shmem_fallocend(inode, end_index);
> +               if ((fallocend > folio->index && fallocend <
> folio_next_index(folio)) ||
>                     !IS_ENABLED(CONFIG_THP_SWAP))
>                         split = true;
>         }
>

next prev parent reply	other threads:[~2025-11-20 14:12 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-12 16:25 [PATCH v2 0/3] tmpfs: zero post-eof ranges on file extension Brian Foster
2025-11-12 16:25 ` [PATCH v2 1/3] tmpfs: zero post-eof uptodate folios on swapout Brian Foster
2025-11-18  2:33   ` Baolin Wang
2025-11-18 14:39     ` Brian Foster
2025-11-19  3:53       ` Baolin Wang
2025-11-19 14:08         ` Brian Foster
2025-11-20  1:57           ` Baolin Wang
2025-11-20 14:12             ` Brian Foster [this message]
2025-11-12 16:25 ` [PATCH v2 2/3] tmpfs: combine !uptodate and post-eof zeroing logic at swapout Brian Foster
2025-11-20  2:56   ` Baolin Wang
2025-11-20 14:14     ` Brian Foster
2025-11-12 16:25 ` [PATCH v2 3/3] tmpfs: zero post-eof ranges on file extension Brian Foster
2025-11-20  5:57   ` Baolin Wang
2025-11-20 14:21     ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aR8hzf2gbJsFnTgZ@bfoster \
    --to=bfoster@redhat.com \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=hughd@google.com \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.