* [PATCH] tmpfs: zero post-eof folio range on file extension
@ 2025-06-25 18:49 Brian Foster
  2025-06-25 19:21 ` Matthew Wilcox
  2025-07-09  7:57 ` Hugh Dickins
  0 siblings, 2 replies; 18+ messages in thread
From: Brian Foster @ 2025-06-25 18:49 UTC (permalink / raw)
  To: linux-mm; +Cc: Hugh Dickins, Baolin Wang

Most traditional filesystems zero the post-eof portion of the eof
folio at writeback time, or when the file size is extended by
truncate or extending writes. This ensures that the previously
post-eof range of the folio is zeroed before it is exposed to the
file.

tmpfs doesn't implement the writeback path the way a traditional
filesystem does, so zeroing behavior won't be exactly the same.
However, it can still perform explicit zeroing from the various
operations that extend a file and expose a post-eof portion of the
eof folio. The current lack of zeroing is observed via failure of
fstests test generic/363 on tmpfs. This test injects post-eof mapped
writes in certain situations to detect gaps in zeroing behavior.

Add a new eof zeroing helper for file extending operations. Look up
the current eof folio, and if one exists, zero the range about to be
exposed. This allows generic/363 to pass on tmpfs.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

Hi all,

This survives the aforementioned reproducer, an fstests regression run, and
~100m fsx operations without issues. Let me know if there are any other
recommended tests for tmpfs and I'm happy to run them. Otherwise, a
couple notes as I'm not terribly familiar with tmpfs...

First, I used _get_partial_folio() because we really only want to zero
an eof folio if one has been previously allocated. My understanding is
that lookup path will avoid unnecessary folio allocation in such cases,
but let me know if that's wrong.

Also, it seems that the falloc path leaves newly preallocated folios
!uptodate until they are used. This had me wondering if perhaps
shmem_zero_eof() could just skip out if the eof folio happens to be
!uptodate. Hm?

Thoughts, reviews, flames appreciated.

Brian

 mm/shmem.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 3a5a65b1f41a..4bb96c24fb9e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1077,6 +1077,29 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
 	return folio;
 }
 
+/*
+ * Zero any post-EOF range of the EOF folio about to be exposed by size
+ * extension.
+ */
+static void shmem_zero_eof(struct inode *inode, loff_t pos)
+{
+	struct folio *folio;
+	loff_t i_size = i_size_read(inode);
+	size_t from, len;
+
+	folio = shmem_get_partial_folio(inode, i_size >> PAGE_SHIFT);
+	if (!folio)
+		return;
+
+	/* zero to the end of the folio or start of extending operation */
+	from = offset_in_folio(folio, i_size);
+	len = min_t(loff_t, folio_size(folio) - from, pos - i_size);
+	folio_zero_range(folio, from, len);
+
+	folio_unlock(folio);
+	folio_put(folio);
+}
+
 /*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
@@ -1302,6 +1325,8 @@ static int shmem_setattr(struct mnt_idmap *idmap,
 			return -EPERM;
 
 		if (newsize != oldsize) {
+			if (newsize > oldsize)
+				shmem_zero_eof(inode, newsize);
 			error = shmem_reacct_size(SHMEM_I(inode)->flags,
 					oldsize, newsize);
 			if (error)
@@ -3464,6 +3489,8 @@ static ssize_t shmem_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	ret = file_update_time(file);
 	if (ret)
 		goto unlock;
+	if (iocb->ki_pos > i_size_read(inode))
+		shmem_zero_eof(inode, iocb->ki_pos);
 	ret = generic_perform_write(iocb, from);
 unlock:
 	inode_unlock(inode);
@@ -3791,8 +3818,15 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		cond_resched();
 	}
 
-	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
+	/*
+	 * The post-eof portion of the eof folio isn't guaranteed to be zeroed
+	 * by fallocate, so zero through the end of the fallocated range
+	 * instead of the start.
+	 */
+	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
+		shmem_zero_eof(inode, offset + len);
 		i_size_write(inode, offset + len);
+	}
 undone:
 	spin_lock(&inode->i_lock);
 	inode->i_private = NULL;
-- 
2.49.0




* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-06-25 18:49 [PATCH] tmpfs: zero post-eof folio range on file extension Brian Foster
@ 2025-06-25 19:21 ` Matthew Wilcox
  2025-06-26  5:35   ` Hugh Dickins
  2025-07-09  7:57 ` Hugh Dickins
  1 sibling, 1 reply; 18+ messages in thread
From: Matthew Wilcox @ 2025-06-25 19:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-mm, Hugh Dickins, Baolin Wang

On Wed, Jun 25, 2025 at 02:49:30PM -0400, Brian Foster wrote:
> Most traditional filesystems zero the post-eof portion of the eof
> folio at writeback time, or when the file size is extended by
> truncate or extending writes. This ensures that the previously
> post-eof range of the folio is zeroed before it is exposed to the
> file.
> 
> tmpfs doesn't implement the writeback path the way a traditional
> filesystem does, so zeroing behavior won't be exactly the same.
> However, it can still perform explicit zeroing from the various
> operations that extend a file and expose a post-eof portion of the
> eof folio. The current lack of zeroing is observed via failure of
> fstests test generic/363 on tmpfs. This test injects post-eof mapped
> writes in certain situations to detect gaps in zeroing behavior.
> 
> Add a new eof zeroing helper for file extending operations. Look up
> the current eof folio, and if one exists, zero the range about to be
> exposed. This allows generic/363 to pass on tmpfs.

This seems like what I did here?

https://lore.kernel.org/linux-fsdevel/20230202204428.3267832-4-willy@infradead.org/

Which fix should we prefer?



* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-06-25 19:21 ` Matthew Wilcox
@ 2025-06-26  5:35   ` Hugh Dickins
  2025-06-26 12:55     ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Hugh Dickins @ 2025-06-26  5:35 UTC (permalink / raw)
  To: Matthew Wilcox, Brian Foster; +Cc: linux-mm, Hugh Dickins, Baolin Wang

On Wed, 25 Jun 2025, Matthew Wilcox wrote:
> On Wed, Jun 25, 2025 at 02:49:30PM -0400, Brian Foster wrote:
> > Most traditional filesystems zero the post-eof portion of the eof
> > folio at writeback time, or when the file size is extended by
> > truncate or extending writes. This ensures that the previously
> > post-eof range of the folio is zeroed before it is exposed to the
> > file.
> > 
> > tmpfs doesn't implement the writeback path the way a traditional
> > filesystem does, so zeroing behavior won't be exactly the same.
> > However, it can still perform explicit zeroing from the various
> > operations that extend a file and expose a post-eof portion of the
> > eof folio. The current lack of zeroing is observed via failure of
> > fstests test generic/363 on tmpfs. This test injects post-eof mapped
> > writes in certain situations to detect gaps in zeroing behavior.
> > 
> > Add a new eof zeroing helper for file extending operations. Look up
> > the current eof folio, and if one exists, zero the range about to be
> > exposed. This allows generic/363 to pass on tmpfs.
> 
> This seems like what I did here?
> 
> https://lore.kernel.org/linux-fsdevel/20230202204428.3267832-4-willy@infradead.org/
> 
> Which fix should we prefer?

Thank you Brian, thank you Matthew.

Of course my immediate reaction is to prefer the smaller patch,
Matthew's, but generic/363 still fails with that one: I need to look
into what each of you is doing (and that mail thread from 2023) and
make up my own mind.

(I'm worried why I had no copy of Matthew's 2023 patch: it's sadly
common for me to bury patches for long periods of time, but not
usually to lose them completely. But that is my worry, not yours.)

I haven't been much concerned by generic/363 failing on tmpfs:
but promise to respect your efforts in due course.

I procrastinate "in due course" because (a) the full correct answer
will depend on what happens with large folios, and (b) large folio
work in 6.14 changed (I'll say broke) the behaviour of huge=always
at eof (I expected a danger of that, and thought I checked before
6.14-rc settled, but must have messed up my checking somehow).

So there's more (and more urgent) to sort out before settling on
the right generic/363 fix.

Thanks,
Hugh



* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-06-26  5:35   ` Hugh Dickins
@ 2025-06-26 12:55     ` Brian Foster
  2025-06-27  3:21       ` Baolin Wang
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2025-06-26 12:55 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Matthew Wilcox, linux-mm, Baolin Wang

On Wed, Jun 25, 2025 at 10:35:44PM -0700, Hugh Dickins wrote:
> On Wed, 25 Jun 2025, Matthew Wilcox wrote:
> > On Wed, Jun 25, 2025 at 02:49:30PM -0400, Brian Foster wrote:
> > > Most traditional filesystems zero the post-eof portion of the eof
> > > folio at writeback time, or when the file size is extended by
> > > truncate or extending writes. This ensures that the previously
> > > post-eof range of the folio is zeroed before it is exposed to the
> > > file.
> > > 
> > > tmpfs doesn't implement the writeback path the way a traditional
> > > filesystem does, so zeroing behavior won't be exactly the same.
> > > However, it can still perform explicit zeroing from the various
> > > operations that extend a file and expose a post-eof portion of the
> > > eof folio. The current lack of zeroing is observed via failure of
> > > fstests test generic/363 on tmpfs. This test injects post-eof mapped
> > > writes in certain situations to detect gaps in zeroing behavior.
> > > 
> > > Add a new eof zeroing helper for file extending operations. Look up
> > > the current eof folio, and if one exists, zero the range about to be
> > > exposed. This allows generic/363 to pass on tmpfs.
> > 
> > This seems like what I did here?
> > 
> > https://lore.kernel.org/linux-fsdevel/20230202204428.3267832-4-willy@infradead.org/
> > 
> > Which fix should we prefer?
> 

Quite similar, indeed. This is actually about the same as my initial
prototype when I was just trying to confirm the problem for truncate. As
Hugh notes, we still need to cover other size extension paths
(fallocate, buffered write).

> Thank you Brian, thank you Matthew.
> 
> Of course my immediate reaction is to prefer the smaller patch,
> Matthew's, but generic/363 still fails with that one: I need to look
> into what each of you is doing (and that mail thread from 2023) and
> make up my own mind.
> 

FWIW, I started off with something trying to use shmem_undo_range() and
eventually moved toward the current patch because I couldn't get it
working quite right. Explicit zeroing seemed like less complexity than
calling through undo_range() on various operations, primarily due to
less risk of unintentional side effect. It's possible (likely :P) that
was just user error on my part, so now that I have a functional patch I
can revisit that approach if it is preferred.

However, one thing I wasn't aware of at the time was the additional
zeroing needed into the target range on fallocate, so that might require
care to avoid immediately punching out the folios that were
fallocated just prior. I suspect this would mean we'd need a helper or
something to restrict the range to undo to just the eof folio. That
seems like a plausible approach, I'm just not so sure the final result
will end up much smaller or simpler than this patch.

> (I'm worried why I had no copy of Matthew's 2023 patch: it's sadly
> common for me to bury patches for long periods of time, but not
> usually to lose them completely. But that is my worry, not yours.)
> 
> I haven't been much concerned by generic/383 failing on tmpfs:
> but promise to respect your efforts in due course.
> 

It's certainly not the bug of the century. ;) I added the test somewhat
recently because we had bigger zeroing issues on other filesystems and I
noticed we had no decent test coverage.

> I procrastinate "in due course" because (a) the full correct answwer
> will depend on what happens with large folios, and (b) large folio
> work in 6.14 changed (I'll say broke) the behaviour of huge=always
> at eof (I expected a danger of that, and thought I checked before
> 6.14-rc settled, but must have messed up my checking somehow).
> 

Interesting.. I assume huge=always refers to a mount option. I need to
give that a test.

I'm also curious if either of you have any thoughts on the uptodate
question. Does filtering zeroing on uptodate folios ensure we'd zero
only folios that were previously written to?

Thanks for the comments.

Brian

> So there's more (and more urgent) to sort out before settling on
> the right generic/383 fix.
> 
> Thanks,
> Hugh
> 




* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-06-26 12:55     ` Brian Foster
@ 2025-06-27  3:21       ` Baolin Wang
  2025-06-27 11:54         ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-06-27  3:21 UTC (permalink / raw)
  To: Brian Foster, Hugh Dickins; +Cc: Matthew Wilcox, linux-mm



On 2025/6/26 20:55, Brian Foster wrote:
> On Wed, Jun 25, 2025 at 10:35:44PM -0700, Hugh Dickins wrote:
>> On Wed, 25 Jun 2025, Matthew Wilcox wrote:
>>> On Wed, Jun 25, 2025 at 02:49:30PM -0400, Brian Foster wrote:
>>>> Most traditional filesystems zero the post-eof portion of the eof
>>>> folio at writeback time, or when the file size is extended by
>>>> truncate or extending writes. This ensures that the previously
>>>> post-eof range of the folio is zeroed before it is exposed to the
>>>> file.
>>>>
>>>> tmpfs doesn't implement the writeback path the way a traditional
>>>> filesystem does, so zeroing behavior won't be exactly the same.
>>>> However, it can still perform explicit zeroing from the various
>>>> operations that extend a file and expose a post-eof portion of the
>>>> eof folio. The current lack of zeroing is observed via failure of
>>>> fstests test generic/363 on tmpfs. This test injects post-eof mapped
>>>> writes in certain situations to detect gaps in zeroing behavior.
>>>>
>>>> Add a new eof zeroing helper for file extending operations. Look up
>>>> the current eof folio, and if one exists, zero the range about to be
>>>> exposed. This allows generic/363 to pass on tmpfs.
>>>
>>> This seems like what I did here?
>>>
>>> https://lore.kernel.org/linux-fsdevel/20230202204428.3267832-4-willy@infradead.org/
>>>
>>> Which fix should we prefer?
>>
> 
> Quite similar, indeed. This is actually about the same as my initial
> prototype when I was just trying to confirm the problem for truncate. As
> Hugh notes, we still need to cover other size extension paths
> (fallocate, buffered write).
> 
>> Thank you Brian, thank you Matthew.
>>
>> Of course my immediate reaction is to prefer the smaller patch,
>> Matthew's, but generic/363 still fails with that one: I need to look
>> into what each of you is doing (and that mail thread from 2023) and
>> make up my own mind.
>>
> 
> FWIW, I started off with something trying to use shmem_undo_range() and
> eventually moved toward the current patch because I couldn't get it
> working quite right. Explicit zeroing seemed like less complexity than
> calling through undo_range() on various operations, primarily due to
> less risk of unintentional side effect. It's possible (likely :P) that
> was just user error on my part, so now that I have a functional patch I
> can revisit that approach if it is preferred.

I also tried using shmem_truncate_range() to fix this issue, but I 
failed. Ultimately, at least for me, I think the fix is simple and works.

> However one thing I wasn't aware of at the time was the additional
> zeroing needed into the target range on fallocate, so that might require
> care to avoid not immediately punching out the folios that were
> fallocated just prior. I suspect this would mean we'd need a helper or
> something to restrict the range to undo to just the eof folio. That
> seems like a plausible approach, I'm just not so sure the final result
> will end up much smaller or simpler than this patch.
> 
>> (I'm worried why I had no copy of Matthew's 2023 patch: it's sadly
>> common for me to bury patches for long periods of time, but not
>> usually to lose them completely. But that is my worry, not yours.)
>>
>> I haven't been much concerned by generic/383 failing on tmpfs:
>> but promise to respect your efforts in due course.
>>
> 
> It's certainly not the bug of the century. ;) I added the test somewhat
> recently because we had bigger zeroing issues on other filesystems and I
> noticed we had no decent test coverage.
> 
>> I procrastinate "in due course" because (a) the full correct answwer
>> will depend on what happens with large folios, and (b) large folio
>> work in 6.14 changed (I'll say broke) the behaviour of huge=always
>> at eof (I expected a danger of that, and thought I checked before
>> 6.14-rc settled, but must have messed up my checking somehow).
>>
> 
> Interesting.. I assume huge=always refers to a mount option. I need to
> give that a test.

I tested your patch by adding the 'transparent_hugepage_tmpfs=always' 
command line parameter, which will change the default huge policy to 
'huge=always' for tmpfs mounts. Your patch also passed the generic/363 
test with the tmpfs 'huge=always' mount option.

> I'm also curious if either of you have any thoughts on the uptodate
> question. Does filtering zeroing on uptodate folios ensure we'd zero
> only folios that were previously written to?

Yes, I think so. The caller will handle the !uptodate folios. So I changed
your patch to:

static void shmem_zero_eof(struct inode *inode, loff_t pos)
{
         struct folio *folio;
         loff_t i_size = i_size_read(inode);
         size_t from, len;

         folio = shmem_get_partial_folio(inode, i_size >> PAGE_SHIFT);
         if (!folio)
                 return;

         if (folio_test_uptodate(folio)) {
                 /* zero to the end of the folio or start of extending operation */
                 from = offset_in_folio(folio, i_size);
                 len = min_t(loff_t, folio_size(folio) - from, pos - i_size);
                 folio_zero_range(folio, from, len);
         }

         folio_unlock(folio);
         folio_put(folio);
}

>> So there's more (and more urgent) to sort out before settling on
>> the right generic/383 fix.

However, let's still wait to see if Hugh has any better ideas.



* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-06-27  3:21       ` Baolin Wang
@ 2025-06-27 11:54         ` Brian Foster
  0 siblings, 0 replies; 18+ messages in thread
From: Brian Foster @ 2025-06-27 11:54 UTC (permalink / raw)
  To: Baolin Wang; +Cc: Hugh Dickins, Matthew Wilcox, linux-mm

On Fri, Jun 27, 2025 at 11:21:56AM +0800, Baolin Wang wrote:
> 
> 
> On 2025/6/26 20:55, Brian Foster wrote:
> > On Wed, Jun 25, 2025 at 10:35:44PM -0700, Hugh Dickins wrote:
> > > On Wed, 25 Jun 2025, Matthew Wilcox wrote:
> > > > On Wed, Jun 25, 2025 at 02:49:30PM -0400, Brian Foster wrote:
> > > > > Most traditional filesystems zero the post-eof portion of the eof
> > > > > folio at writeback time, or when the file size is extended by
> > > > > truncate or extending writes. This ensures that the previously
> > > > > post-eof range of the folio is zeroed before it is exposed to the
> > > > > file.
> > > > > 
> > > > > tmpfs doesn't implement the writeback path the way a traditional
> > > > > filesystem does, so zeroing behavior won't be exactly the same.
> > > > > However, it can still perform explicit zeroing from the various
> > > > > operations that extend a file and expose a post-eof portion of the
> > > > > eof folio. The current lack of zeroing is observed via failure of
> > > > > fstests test generic/363 on tmpfs. This test injects post-eof mapped
> > > > > writes in certain situations to detect gaps in zeroing behavior.
> > > > > 
> > > > > Add a new eof zeroing helper for file extending operations. Look up
> > > > > the current eof folio, and if one exists, zero the range about to be
> > > > > exposed. This allows generic/363 to pass on tmpfs.
> > > > 
> > > > This seems like what I did here?
> > > > 
> > > > https://lore.kernel.org/linux-fsdevel/20230202204428.3267832-4-willy@infradead.org/
> > > > 
> > > > Which fix should we prefer?
> > > 
> > 
> > Quite similar, indeed. This is actually about the same as my initial
> > prototype when I was just trying to confirm the problem for truncate. As
> > Hugh notes, we still need to cover other size extension paths
> > (fallocate, buffered write).
> > 
> > > Thank you Brian, thank you Matthew.
> > > 
> > > Of course my immediate reaction is to prefer the smaller patch,
> > > Matthew's, but generic/363 still fails with that one: I need to look
> > > into what each of you is doing (and that mail thread from 2023) and
> > > make up my own mind.
> > > 
> > 
> > FWIW, I started off with something trying to use shmem_undo_range() and
> > eventually moved toward the current patch because I couldn't get it
> > working quite right. Explicit zeroing seemed like less complexity than
> > calling through undo_range() on various operations, primarily due to
> > less risk of unintentional side effect. It's possible (likely :P) that
> > was just user error on my part, so now that I have a functional patch I
> > can revisit that approach if it is preferred.
> 
> I also tried using shmem_truncate_range() to fix this issue, but I failed.
> Ultimately, at least for me, I think the fix is simple and works.
> 

Ok, thanks.

> > However one thing I wasn't aware of at the time was the additional
> > zeroing needed into the target range on fallocate, so that might require
> > care to avoid not immediately punching out the folios that were
> > fallocated just prior. I suspect this would mean we'd need a helper or
> > something to restrict the range to undo to just the eof folio. That
> > seems like a plausible approach, I'm just not so sure the final result
> > will end up much smaller or simpler than this patch.
> > 
> > > (I'm worried why I had no copy of Matthew's 2023 patch: it's sadly
> > > common for me to bury patches for long periods of time, but not
> > > usually to lose them completely. But that is my worry, not yours.)
> > > 
> > > I haven't been much concerned by generic/383 failing on tmpfs:
> > > but promise to respect your efforts in due course.
> > > 
> > 
> > It's certainly not the bug of the century. ;) I added the test somewhat
> > recently because we had bigger zeroing issues on other filesystems and I
> > noticed we had no decent test coverage.
> > 
> > > I procrastinate "in due course" because (a) the full correct answwer
> > > will depend on what happens with large folios, and (b) large folio
> > > work in 6.14 changed (I'll say broke) the behaviour of huge=always
> > > at eof (I expected a danger of that, and thought I checked before
> > > 6.14-rc settled, but must have messed up my checking somehow).
> > > 
> > 
> > Interesting.. I assume huge=always refers to a mount option. I need to
> > give that a test.
> 
> I tested your patch by adding the 'transparent_hugepage_tmpfs=always'
> command line parameter, which will change the default huge policy to
> 'huge=always' for tmpfs mounts. Your patch also passed the generic/363 test
> with the tmpfs 'huge=always' mount option.
> 

Thanks. I ran a full auto fstests run with huge=always and an uptodate
tweak yesterday and also didn't see any regressions. My change was
pretty much the same as yours below other than I inverted the logic.
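Roughly along these lines (just a sketch of the inverted check, not the
exact diff I ran):

        folio = shmem_get_partial_folio(inode, i_size >> PAGE_SHIFT);
        if (!folio)
                return;

        /* skip folios never read or written, e.g. fresh fallocate preallocations */
        if (!folio_test_uptodate(folio)) {
                folio_unlock(folio);
                folio_put(folio);
                return;
        }

        /* zero to the end of the folio or start of extending operation */
        from = offset_in_folio(folio, i_size);
        len = min_t(loff_t, folio_size(folio) - from, pos - i_size);
        folio_zero_range(folio, from, len);

        folio_unlock(folio);
        folio_put(folio);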

I do see two new test failures with huge=always in generic/285 and
generic/436, but that appears to be related to the mount option and not
the patch. Both of those seem to be seek related, FWIW.

> > I'm also curious if either of you have any thoughts on the uptodate
> > question. Does filtering zeroing on uptodate folios ensure we'd zero
> > only folios that were previously written to?
> 
> Yes, I think so. Caller will handle the !uptodate folios. So I change your
> patch to:
> 
> static void shmem_zero_eof(struct inode *inode, loff_t pos)
> {
>         struct folio *folio;
>         loff_t i_size = i_size_read(inode);
>         size_t from, len;
> 
>         folio = shmem_get_partial_folio(inode, i_size >> PAGE_SHIFT);
>         if (!folio)
>                 return;
> 
>         if (folio_test_uptodate(folio)) {
>                 /* zero to the end of the folio or start of extending operation */
>                 from = offset_in_folio(folio, i_size);
>                 len = min_t(loff_t, folio_size(folio) - from, pos - i_size);
>                 folio_zero_range(folio, from, len);
>         }
> 
>         folio_unlock(folio);
>         folio_put(folio);
> }
> 
> > > So there's more (and more urgent) to sort out before settling on
> > > the right generic/383 fix.
> 
> However, let's still wait to see if Hugh has any better ideas.
> 

Ack.. I'll give it a bit and then follow up with a v2 later unless I
hear otherwise. Thanks again, appreciate the feedback.

Brian




* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-06-25 18:49 [PATCH] tmpfs: zero post-eof folio range on file extension Brian Foster
  2025-06-25 19:21 ` Matthew Wilcox
@ 2025-07-09  7:57 ` Hugh Dickins
  2025-07-10  6:47   ` Baolin Wang
  2025-07-10 12:36   ` Brian Foster
  1 sibling, 2 replies; 18+ messages in thread
From: Hugh Dickins @ 2025-07-09  7:57 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-mm, Hugh Dickins, Baolin Wang, Matthew Wilcox, Usama Arif

Thanks for the suggestions; let me come back to comment on your original.

On Wed, 25 Jun 2025, Brian Foster wrote:
>  

Although Matthew's alternative commit was insufficient and wrong
(because use of truncation risks deleting folios already promised
by fallocate), I found his brief commit message very very helpful
in understanding your patch - it makes clear what POSIX promises,
is a better justification than just passing an xfstest you wrote,
and explains why you're right to target places where i_size changes:

POSIX requires that "If the file size is increased, the extended area
shall appear as if it were zero-filled".  It is possible to use mmap to
write past EOF and that data will become visible instead of zeroes.

Please add that paragraph here, right at the head of your commit message.

> Most traditional filesystems zero the post-eof portion of the eof
> folio at writeback time, or when the file size is extended by
> truncate or extending writes. This ensures that the previously
> post-eof range of the folio is zeroed before it is exposed to the
> file.
> 
> tmpfs doesn't implement the writeback path the way a traditional
> filesystem does, so zeroing behavior won't be exactly the same.
> However, it can still perform explicit zeroing from the various
> operations that extend a file and expose a post-eof portion of the
> eof folio. The current lack of zeroing is observed via failure of
> fstests test generic/363 on tmpfs. This test injects post-eof mapped
> writes in certain situations to detect gaps in zeroing behavior.
> 
> Add a new eof zeroing helper for file extending operations. Look up
> the current eof folio, and if one exists, zero the range about to be
> exposed. This allows generic/363 to pass on tmpfs.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> Hi all,
> 
> This survives the aforemented reproducer, an fstests regression run, and
> ~100m fsx operations without issues. Let me know if there are any other
> recommended tests for tmpfs and I'm happy to run them. Otherwise, a
> couple notes as I'm not terribly familiar with tmpfs...
> 
> First, I used _get_partial_folio() because we really only want to zero
> an eof folio if one has been previously allocated. My understanding is
> that lookup path will avoid unnecessary folio allocation in such cases,
> but let me know if that's wrong.
> 
> Also, it seems that the falloc path leaves newly preallocated folios
> !uptodate until they are used. This had me wondering if perhaps
> shmem_zero_eof() could just skip out if the eof folio happens to be
> !uptodate. Hm?

Yes, you were right to think that it's better to skip the !uptodates,
and Baolin made a good suggestion there (though I'll unravel it a bit).

> 
> Thoughts, reviews, flames appreciated.

Sorry, much as you'd appreciate a flame, I cannot oblige: I think you've
done a very good job here (but it's not ready yet), and you've done it
in such a way that huge=always passes generic/363 with no trouble.

I did keep on wanting to change this and that of your patch below; but
then later came around to seeing why your choices were better than what
I was going to suggest.

I had difficulty getting deep enough into it, but think I'm there now.
And have identified one missed aspect, which rather changes around what
you should do - I'd have preferred to get into that at the end, but
since it affects what shmem_zero_eof() should look like, I'd better
talk about it here first.

The problem is with huge pages (or large folios) in shmem_writeout():
what goes in as a large folio may there have to be split into small
pages; or it may be swapped out as one large folio, but fragmentation
at swapin time demands that it be split into small pages when swapped in.

So, if there has been swapout since the large folio was modified beyond
EOF, the folio that shmem_zero_eof() brings in does not guarantee what
length needs to be zeroed.

We could set that aside as a deficiency to be fixed later on: that
would not be unreasonable, but I'm guessing that won't satisfy you.

We could zero the maximum (the remainder of PMD size I believe) in
shmem_zero_eof(): looping over small folios within the range, skipping
!uptodate ones (but we do force them uptodate when swapping out, in
order to keep the space reservation).  TBH I've ignored that as a bad
option, but it doesn't seem so bad to me now: ugly, but maybe not bad.

The solution I've had in mind (and pursue in comments below) is to do
the EOF zeroing in shmem_writeout() before it splits; and then avoid
swapin in shmem_zero_eof() when i_size is raised.

That solution partly inspired by the shmem symlink uninit-value bug
https://lore.kernel.org/linux-mm/670793eb.050a0220.8109b.0003.GAE@google.com/
which I haven't rushed to fix, but ought to be fixed along with this one
(by "along with" I don't mean that both have to be fixed in one single
patch, but it makes sense to consider them together).  I was inclined
not to zero the whole page in shmem_symlink(), but zero before swapout.

It worries me that an EOF page might be swapped out and in a thousand
times, but i_size set only once: I'm going for a solution which memsets
a thousand times rather than once?  But if that actually shows up as an
issue in any workload, then we can add a shmem inode flag to say whether
the EOF folio has been exposed via mmap (or symlink) so may need zeroing.

What's your preference?  My comments below assume the latter solution,
but that may be wrong.

> 
> Brian
> 
>  mm/shmem.c | 36 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 3a5a65b1f41a..4bb96c24fb9e 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1077,6 +1077,29 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
>  	return folio;
>  }
>  
> +/*
> + * Zero any post-EOF range of the EOF folio about to be exposed by size
> + * extension.
> + */
> +static void shmem_zero_eof(struct inode *inode, loff_t pos)

Please change that "pos" to, perhaps, "newsize": I was initially quite
confused by that argument to shmem_zero_eof(), until I grasped that
oldsize (really, the important one here) is calculated inside.

> +{
> +	struct folio *folio;
> +	loff_t i_size = i_size_read(inode);
> +	size_t from, len;
> +
> +	folio = shmem_get_partial_folio(inode, i_size >> PAGE_SHIFT);

I deceived myself into thinking shmem_get_folio(,,,,SGP_READ) was better,
but no, that doesn't handle large folio spanning EOF in the way needed
here.  It frustrates and depresses me that there's no appropriate SGP_
for this.  It used to be the case that everything must go through
shmem_getpage(), but then large folio truncation came to need that
odd shmem_get_partial_folio() bypass.

Lacking a suitable SGP_, you did well to choose it for further use, but
I now think you should go your own way, duplicating its filemap_get_entry()
and !xa_is_value block (with folio_test_uptodate check added in - while
folio locked I suppose, though I'm not certain that's necessary),
without doing the shmem_get_folio(,,,,SGP_READ) to swapin.
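Something like this, perhaps (an untested sketch; the helper name is just
for illustration): look up a folio already present in the page cache,
without swapping in or allocating, and skip it unless it's uptodate:

static struct folio *shmem_find_eof_folio(struct inode *inode, pgoff_t index)
{
        struct folio *folio = filemap_get_entry(inode->i_mapping, index);

        if (!folio || xa_is_value(folio))
                return NULL;
        folio_lock(folio);
        if (folio->mapping != inode->i_mapping ||
            !folio_test_uptodate(folio)) {
                folio_unlock(folio);
                folio_put(folio);
                return NULL;
        }
        return folio;
}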

> +	if (!folio)
> +		return;
> +
> +	/* zero to the end of the folio or start of extending operation */
> +	from = offset_in_folio(folio, i_size);

If from == 0, doesn't that mean we're looking at the *next* folio,
and do not need to zero it at all?  A case that would have been ruled
out by a simple alignment check before, in the old days of all-small
pages; but nowadays we cannot tell without looking at the folio itself.
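That is, something like (sketch only):

        from = offset_in_folio(folio, i_size);
        if (from) {
                /* EOF falls inside this folio: zero to folio end or new size */
                len = min_t(loff_t, folio_size(folio) - from, pos - i_size);
                folio_zero_range(folio, from, len);
        }
        folio_unlock(folio);
        folio_put(folio);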

> +	len = min_t(loff_t, folio_size(folio) - from, pos - i_size);
> +	folio_zero_range(folio, from, len);
> +
> +	folio_unlock(folio);
> +	folio_put(folio);
> +}
> +
>  /*
>   * Remove range of pages and swap entries from page cache, and free them.
>   * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
> @@ -1302,6 +1325,8 @@ static int shmem_setattr(struct mnt_idmap *idmap,
>  			return -EPERM;
>  
>  		if (newsize != oldsize) {
> +			if (newsize > oldsize)
> +				shmem_zero_eof(inode, newsize);

Nit, and feel free to disagree, but I'd slightly prefer you to do that
a few lines later, after the error return, before the i_size_write()
and update_mtime.

>  			error = shmem_reacct_size(SHMEM_I(inode)->flags,
>  					oldsize, newsize);
>  			if (error)
> @@ -3464,6 +3489,8 @@ static ssize_t shmem_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  	ret = file_update_time(file);
>  	if (ret)
>  		goto unlock;
> +	if (iocb->ki_pos > i_size_read(inode))
> +		shmem_zero_eof(inode, iocb->ki_pos);

Okay.  I was inclined to ask you to move it to where shmem_write_end()
does its i_size_write() and zeroing to make uptodate, that might be a
more natural location.  But came to realize that there are several
reasons why your choice is easier, and any attempt to rearrange likely
to be a waste of your time (not to mention folio deadlock).

>  	ret = generic_perform_write(iocb, from);
>  unlock:
>  	inode_unlock(inode);
> @@ -3791,8 +3818,15 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>  		cond_resched();
>  	}
>  
> -	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> +	/*
> +	 * The post-eof portion of the eof folio isn't guaranteed to be zeroed
> +	 * by fallocate, so zero through the end of the fallocated range
> +	 * instead of the start.
> +	 */

That comment continues to mystify me (the end instead of the start??):
but there's an easy answer, just delete it, the code is easier to
understand without it!

> +	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
> +		shmem_zero_eof(inode, offset + len);
>  		i_size_write(inode, offset + len);
> +	}
>  undone:
>  	spin_lock(&inode->i_lock);
>  	inode->i_private = NULL;
> -- 
> 2.49.0

I don't have a proposal for how the code in shmem_writeout() should look,
and maybe you won't agree that's where the change should be made.

Thanks,
Hugh



* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-07-09  7:57 ` Hugh Dickins
@ 2025-07-10  6:47   ` Baolin Wang
  2025-07-10 22:20     ` Hugh Dickins
  2025-07-10 12:36   ` Brian Foster
  1 sibling, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-07-10  6:47 UTC (permalink / raw)
  To: Hugh Dickins, Brian Foster; +Cc: linux-mm, Matthew Wilcox, Usama Arif



On 2025/7/9 15:57, Hugh Dickins wrote:
> Thanks for suggestions, let me come back to comment on your original.
> 
> On Wed, 25 Jun 2025, Brian Foster wrote:
>>   
> 
> Although Matthew's alternative commit was insufficient and wrong
> (because use of truncation risks deleting folios already promised
> by fallocate), I found his brief commit message very very helpful
> in understanding your patch - it makes clear what POSIX promises,
> is a better justification than just passing an xfstest you wrote,
> and explains why you're right to target places where i_size changes:
> 
> POSIX requires that "If the file size is increased, the extended area
> shall appear as if it were zero-filled".  It is possible to use mmap to
> write past EOF and that data will become visible instead of zeroes.
> 
> Please add that paragraph here, right at the head of your commit message.
> 
>> Most traditional filesystems zero the post-eof portion of the eof
>> folio at writeback time, or when the file size is extended by
>> truncate or extending writes. This ensures that the previously
>> post-eof range of the folio is zeroed before it is exposed to the
>> file.
>>
>> tmpfs doesn't implement the writeback path the way a traditional
>> filesystem does, so zeroing behavior won't be exactly the same.
>> However, it can still perform explicit zeroing from the various
>> operations that extend a file and expose a post-eof portion of the
>> eof folio. The current lack of zeroing is observed via failure of
>> fstests test generic/363 on tmpfs. This test injects post-eof mapped
>> writes in certain situations to detect gaps in zeroing behavior.
>>
>> Add a new eof zeroing helper for file extending operations. Look up
>> the current eof folio, and if one exists, zero the range about to be
>> exposed. This allows generic/363 to pass on tmpfs.
>>
>> Signed-off-by: Brian Foster <bfoster@redhat.com>
>> ---
>>
>> Hi all,
>>
>> This survives the aforemented reproducer, an fstests regression run, and
>> ~100m fsx operations without issues. Let me know if there are any other
>> recommended tests for tmpfs and I'm happy to run them. Otherwise, a
>> couple notes as I'm not terribly familiar with tmpfs...
>>
>> First, I used _get_partial_folio() because we really only want to zero
>> an eof folio if one has been previously allocated. My understanding is
>> that lookup path will avoid unnecessary folio allocation in such cases,
>> but let me know if that's wrong.
>>
>> Also, it seems that the falloc path leaves newly preallocated folios
>> !uptodate until they are used. This had me wondering if perhaps
>> shmem_zero_eof() could just skip out if the eof folio happens to be
>> !uptodate. Hm?
> 
> Yes, you were right to think that it's better to skip the !uptodates,
> and Baolin made good suggestion there (though I'll unravel it a bit).
> 
>>
>> Thoughts, reviews, flames appreciated.
> 
> Sorry, much as you'd appreciate a flame, I cannot oblige: I think you've
> done a very good job here (but it's not ready yet), and you've done it
> in such a way that huge=always passes generic/363 with no trouble.
> 
> I did keep on wanting to change this and that of your patch below; but
> then later came around to seeing why your choices were better than what
> I was going to suggest.
> 
> I had difficulty getting deep enough into it, but think I'm there now.
> And have identified one missed aspect, which rather changes around what
> you should do - I'd have preferred to get into that at the end, but
> since it affects what shmem_zero_eof() should look like, I'd better
> talk about it here first.
> 
> The problem is with huge pages (or large folios) in shmem_writeout():
> what goes in as a large folio may there have to be split into small
> pages; or it may be swapped out as one large folio, but fragmentation
> at swapin time demand that it be split into small pages when swapped in.

Good point.

> So, if there has been swapout since the large folio was modified beyond
> EOF, the folio that shmem_zero_eof() brings in does not guarantee what
> length needs to be zeroed.
> 
> We could set that aside as a deficiency to be fixed later on: that
> would not be unreasonable, but I'm guessing that won't satisfy you.
> 
> We could zero the maximum (the remainder of PMD size I believe) in
> shmem_zero_eof(): looping over small folios within the range, skipping
> !uptodate ones (but we do force them uptodate when swapping out, in
> order to keep the space reservation).  TBH I've ignored that as a bad
> option, but it doesn't seem so bad to me now: ugly, but maybe not bad.

However, IIUC, if the large folios are split in shmem_writeout(), the
small folios beyond EOF will be dropped and freed in
__split_unmapped_folio(), so should we still consider them?

Please correct me if I missed anything.

> The solution I've had in mind (and pursue in comments below) is to do
> the EOF zeroing in shmem_writeout() before it splits; and then avoid
> swapin in shmem_zero_eof() when i_size is raised.
> 
> That solution partly inspired by the shmem symlink uninit-value bug
> https://lore.kernel.org/linux-mm/670793eb.050a0220.8109b.0003.GAE@google.com/
> which I haven't rushed to fix, but ought to be fixed along with this one
> (by "along with" I don't mean that both have to be fixed in one single
> patch, but it makes sense to consider them together).  I was inclined
> not to zero the whole page in shmem_symlink(), but zero before swapout.
> 
> It worries me that an EOF page might be swapped out and in a thousand
> times, but i_size set only once: I'm going for a solution which memsets
> a thousand times rather than once?  But if that actually shows up as an
> issue in any workload, then we can add a shmem inode flag to say whether
> the EOF folio has been exposed via mmap (or symlink) so may need zeroing.
> 
> What's your preference?  My comments below assume the latter solution,
> but that may be wrong.



* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-07-09  7:57 ` Hugh Dickins
  2025-07-10  6:47   ` Baolin Wang
@ 2025-07-10 12:36   ` Brian Foster
  2025-07-10 23:02     ` Hugh Dickins
  1 sibling, 1 reply; 18+ messages in thread
From: Brian Foster @ 2025-07-10 12:36 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-mm, Baolin Wang, Matthew Wilcox, Usama Arif

On Wed, Jul 09, 2025 at 12:57:35AM -0700, Hugh Dickins wrote:
> Thanks for suggestions, let me come back to comment on your original.
> 
> On Wed, 25 Jun 2025, Brian Foster wrote:
> >  
> 
> Although Matthew's alternative commit was insufficient and wrong
> (because use of truncation risks deleting folios already promised
> by fallocate), I found his brief commit message very very helpful
> in understanding your patch - it makes clear what POSIX promises,
> is a better justification than just passing an xfstest you wrote,
> and explains why you're right to target places where i_size changes:
> 
> POSIX requires that "If the file size is increased, the extended area
> shall appear as if it were zero-filled".  It is possible to use mmap to
> write past EOF and that data will become visible instead of zeroes.
> 
> Please add that paragraph here, right at the head of your commit message.
> 

Indeed, this is a better description. I'll incorporate it.

> > Most traditional filesystems zero the post-eof portion of the eof
> > folio at writeback time, or when the file size is extended by
> > truncate or extending writes. This ensures that the previously
> > post-eof range of the folio is zeroed before it is exposed to the
> > file.
> > 
> > tmpfs doesn't implement the writeback path the way a traditional
> > filesystem does, so zeroing behavior won't be exactly the same.
> > However, it can still perform explicit zeroing from the various
> > operations that extend a file and expose a post-eof portion of the
> > eof folio. The current lack of zeroing is observed via failure of
> > fstests test generic/363 on tmpfs. This test injects post-eof mapped
> > writes in certain situations to detect gaps in zeroing behavior.
> > 
> > Add a new eof zeroing helper for file extending operations. Look up
> > the current eof folio, and if one exists, zero the range about to be
> > exposed. This allows generic/363 to pass on tmpfs.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > Hi all,
> > 
> > This survives the aforemented reproducer, an fstests regression run, and
> > ~100m fsx operations without issues. Let me know if there are any other
> > recommended tests for tmpfs and I'm happy to run them. Otherwise, a
> > couple notes as I'm not terribly familiar with tmpfs...
> > 
> > First, I used _get_partial_folio() because we really only want to zero
> > an eof folio if one has been previously allocated. My understanding is
> > that lookup path will avoid unnecessary folio allocation in such cases,
> > but let me know if that's wrong.
> > 
> > Also, it seems that the falloc path leaves newly preallocated folios
> > !uptodate until they are used. This had me wondering if perhaps
> > shmem_zero_eof() could just skip out if the eof folio happens to be
> > !uptodate. Hm?
> 
> Yes, you were right to think that it's better to skip the !uptodates,
> and Baolin made good suggestion there (though I'll unravel it a bit).
> 
> > 
> > Thoughts, reviews, flames appreciated.
> 
> Sorry, much as you'd appreciate a flame, I cannot oblige: I think you've
> done a very good job here (but it's not ready yet), and you've done it
> in such a way that huge=always passes generic/363 with no trouble.
> 

Thanks, I appreciate the feedback from you both..

> I did keep on wanting to change this and that of your patch below; but
> then later came around to seeing why your choices were better than what
> I was going to suggest.
> 
> I had difficulty getting deep enough into it, but think I'm there now.
> And have identified one missed aspect, which rather changes around what
> you should do - I'd have preferred to get into that at the end, but
> since it affects what shmem_zero_eof() should look like, I'd better
> talk about it here first.
> 
> The problem is with huge pages (or large folios) in shmem_writeout():
> what goes in as a large folio may there have to be split into small
> pages; or it may be swapped out as one large folio, but fragmentation
> at swapin time demand that it be split into small pages when swapped in.
> 
> So, if there has been swapout since the large folio was modified beyond
> EOF, the folio that shmem_zero_eof() brings in does not guarantee what
> length needs to be zeroed.
> 
> We could set that aside as a deficiency to be fixed later on: that
> would not be unreasonable, but I'm guessing that won't satisfy you.
> 

So I was reading through some of this code yesterday and playing around
with forcing swapout of an EOF folio, and I had the same observation as
noted in Baolin's followup: it looks like large folios are always split
across EOF, at which point post-eof folios are dropped because generally
the pagecache doesn't track post-eof folios.

A quick experiment with a mapped write to a 2MB folio in a 1MB file shows
that the folio is indeed split on swapout. It looks like the large folio
is not immediately reconstructed on swapin, but rather something in the
background reconstitutes it such that once the post-eof range is
accessible again, it is effectively zeroed. I'm assuming this is not due
to explicit zeroing, but rather a side effect of those post-eof folios
being tossed (vs. swapped) and reallocated, but I could certainly be
wrong about that.

> We could zero the maximum (the remainder of PMD size I believe) in
> shmem_zero_eof(): looping over small folios within the range, skipping
> !uptodate ones (but we do force them uptodate when swapping out, in
> order to keep the space reservation).  TBH I've ignored that as a bad
> option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
> 
> The solution I've had in mind (and pursue in comments below) is to do
> the EOF zeroing in shmem_writeout() before it splits; and then avoid
> swapin in shmem_zero_eof() when i_size is raised.
> 

Indeed, this is similar to traditional writeback behavior in that fully
post-eof folios are skipped (presumed to be racing with a truncate) and
a straddling EOF folio is partially zeroed at writeback time.

I actually observed this difference in behavior when first looking into
this issue on tmpfs, but I didn't have enough context to draw the
parallel to swapout, so thanks for bringing this up.

> That solution partly inspired by the shmem symlink uninit-value bug
> https://lore.kernel.org/linux-mm/670793eb.050a0220.8109b.0003.GAE@google.com/
> which I haven't rushed to fix, but ought to be fixed along with this one
> (by "along with" I don't mean that both have to be fixed in one single
> patch, but it makes sense to consider them together).  I was inclined
> not to zero the whole page in shmem_symlink(), but zero before swapout.
> 

I'll have to take a closer look at that one..

> It worries me that an EOF page might be swapped out and in a thousand
> times, but i_size set only once: I'm going for a solution which memsets
> a thousand times rather than once?  But if that actually shows up as an
> issue in any workload, then we can add a shmem inode flag to say whether
> the EOF folio has been exposed via mmap (or symlink) so may need zeroing.
> 
> What's your preference?  My comments below assume the latter solution,
> but that may be wrong.
> 

I actually think swapout time zeroing is a nice functional improvement
over this current patch. I'd be less inclined to think that frequent
swap cycles are more problematic than swapping folios in (and presumably
back out) just for partial zeroing due to file changes that don't even
necessarily need the associated data. This is also generally more
consistent with traditional fs/pagecache behavior, which IMO is a good
thing.

So I suppose what I'm saying is that I like the prospective approach to
also zero at shmem_writeout() and instead not swap in folios purely for
partial eof zeroing purposes, just perhaps not for the exact reasons
stated here. Of course the advantage could be that the code looks the
same regardless, so if that folio splitting behavior ever changed in the
future, the zeroing code would hopefully continue to do the right thing.
Thoughts?
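
To make sure I'm following, the writeout side would be something like the
below (a rough sketch, untested; the exact placement before any split in
shmem_writeout() and the locals are my assumptions):

        struct inode *inode = folio->mapping->host;
        loff_t i_size = i_size_read(inode);
        size_t from = offset_in_folio(folio, i_size);

        /*
         * If this folio straddles EOF, zero the post-EOF tail before it is
         * split or swapped out, so stale mmap'd data beyond EOF can't
         * survive a swap cycle.
         */
        if (folio_contains(folio, i_size >> PAGE_SHIFT) && from)
                folio_zero_range(folio, from, folio_size(folio) - from);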

> > 
> > Brian
> > 
> >  mm/shmem.c | 36 +++++++++++++++++++++++++++++++++++-
> >  1 file changed, 35 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 3a5a65b1f41a..4bb96c24fb9e 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1077,6 +1077,29 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
> >  	return folio;
> >  }
> >  
> > +/*
> > + * Zero any post-EOF range of the EOF folio about to be exposed by size
> > + * extension.
> > + */
> > +static void shmem_zero_eof(struct inode *inode, loff_t pos)
> 
> Please change that "pos" to, perhaps, "newsize": I was initiailly quite
> confused by that argument to shmem_zero_eof(), until I grasped that
> oldsize (really, the important one here) is calculated inside.
> 

Hmm.. newsize isn't necessarily accurate either as this tends to point
to the start offset of the extending operation, not necessarily the new
inode size. That said, I'm not particular so I'll change it unless
somebody chimes in otherwise.

> > +{
> > +	struct folio *folio;
> > +	loff_t i_size = i_size_read(inode);
> > +	size_t from, len;
> > +
> > +	folio = shmem_get_partial_folio(inode, i_size >> PAGE_SHIFT);
> 
> I deceived myself into thinking shmem_get_folio(,,,,SGP_READ) was better,
> but no, that doesn't handle large folio spanning EOF in the way needed
> here.  It frustrates and depresses me that there's no appropriate SGP_
> for this.  It used to be the case that everything must go through
> shmem_getpage(), but then large folio truncation came to need that
> odd shmem_get_partial_folio() bypass.
> 
> Lacking a suitable SGP_, you did well to choose it for further use, but
> I now think you should go your own way, duplicating its filemap_get_entry()
> and !xa_is_value block (with folio_test_uptodate check added in - while
> folio locked I suppose, though I'm not certain that's necessary),
> without doing the shmem_get_folio(,,,,SGP_READ) to swapin.
> 

Ok, makes sense.

> > +	if (!folio)
> > +		return;
> > +
> > +	/* zero to the end of the folio or start of extending operation */
> > +	from = offset_in_folio(folio, i_size);
> 
> If from == 0, doesn't that mean we're looking at the *next* folio,
> and do not need to zero it at all?  A case that would have been ruled
> out by a simple alignment check before, in the old days of all-small
> pages; but nowadays we cannot tell without looking at the folio itself.
> 
> > +	len = min_t(loff_t, folio_size(folio) - from, pos - i_size);
> > +	folio_zero_range(folio, from, len);
> > +
> > +	folio_unlock(folio);
> > +	folio_put(folio);
> > +}
> > +
> >  /*
> >   * Remove range of pages and swap entries from page cache, and free them.
> >   * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
> > @@ -1302,6 +1325,8 @@ static int shmem_setattr(struct mnt_idmap *idmap,
> >  			return -EPERM;
> >  
> >  		if (newsize != oldsize) {
> > +			if (newsize > oldsize)
> > +				shmem_zero_eof(inode, newsize);
> 
> Nit, and feel free to disagree, but I'd slightly prefer you to do that
> a few lines later, after the error return, before the i_size_write()
> and update_mtime.
> 
> >  			error = shmem_reacct_size(SHMEM_I(inode)->flags,
> >  					oldsize, newsize);
> >  			if (error)
> > @@ -3464,6 +3489,8 @@ static ssize_t shmem_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >  	ret = file_update_time(file);
> >  	if (ret)
> >  		goto unlock;
> > +	if (iocb->ki_pos > i_size_read(inode))
> > +		shmem_zero_eof(inode, iocb->ki_pos);
> 
> Okay.  I was inclined to ask you to move it to where shmem_write_end()
> does its i_size_write() and zeroing to make uptodate, that might be a
> more natural locationi.  But came to realize that there are several
> reasons why your choice is easier, and any attempt to rearrange likely
> to be a waste of your time (not to mention folio deadlock).
> 
> >  	ret = generic_perform_write(iocb, from);
> >  unlock:
> >  	inode_unlock(inode);
> > @@ -3791,8 +3818,15 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
> >  		cond_resched();
> >  	}
> >  
> > -	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
> > +	/*
> > +	 * The post-eof portion of the eof folio isn't guaranteed to be zeroed
> > +	 * by fallocate, so zero through the end of the fallocated range
> > +	 * instead of the start.
> > +	 */
> 
> That comment continues to mystify me (the end instead of the start??):
> but there's an easy answer, just delete it, the code is easier to
> understand without it!
> 

Hah, I admit I often read back my own comments and wonder what I
was thinking. ;) The purpose of this one was to try to explain why we
pass offset + len rather than the typical offset in this particular
case. I'll think about whether that can be stated more clearly, or
otherwise just drop the comment.

The rest of the feedback sounds reasonable to me at first glance, but
I'll take a closer look at those changes once I have the bigger picture
swapout thing worked out. Thanks again for the review and let me know if
any of this sounds misguided..

Brian

> > +	if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
> > +		shmem_zero_eof(inode, offset + len);
> >  		i_size_write(inode, offset + len);
> > +	}
> >  undone:
> >  	spin_lock(&inode->i_lock);
> >  	inode->i_private = NULL;
> > -- 
> > 2.49.0
> 
> I don't have a proposal for how the code in shmem_writeout() should look,
> and maybe you won't agree that's where the change should be made.
> 
> Thanks,
> Hugh
> 




* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-07-10  6:47   ` Baolin Wang
@ 2025-07-10 22:20     ` Hugh Dickins
  2025-07-11  3:50       ` Baolin Wang
  0 siblings, 1 reply; 18+ messages in thread
From: Hugh Dickins @ 2025-07-10 22:20 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Hugh Dickins, Brian Foster, linux-mm, Matthew Wilcox, Usama Arif

On Thu, 10 Jul 2025, Baolin Wang wrote:
> On 2025/7/9 15:57, Hugh Dickins wrote:
...
> > 
> > The problem is with huge pages (or large folios) in shmem_writeout():
> > what goes in as a large folio may there have to be split into small
> > pages; or it may be swapped out as one large folio, but fragmentation
> > at swapin time demand that it be split into small pages when swapped in.
> 
> Good point.
> 
> > So, if there has been swapout since the large folio was modified beyond
> > EOF, the folio that shmem_zero_eof() brings in does not guarantee what
> > length needs to be zeroed.
> > 
> > We could set that aside as a deficiency to be fixed later on: that
> > would not be unreasonable, but I'm guessing that won't satisfy you.
> > 
> > We could zero the maximum (the remainder of PMD size I believe) in
> > shmem_zero_eof(): looping over small folios within the range, skipping
> > !uptodate ones (but we do force them uptodate when swapping out, in
> > order to keep the space reservation).  TBH I've ignored that as a bad
> > option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
> 
> However, IIUC, if the large folios are split in shmem_writeout(), and those
> small folios which beyond EOF will be dropped and freed in
> __split_unmapped_folio(), should we still consider them?

You're absolutely right about the normal case, and thank you for making
that point.  Had I forgotten that when writing?  Or was I already
jumping ahead to the problem case?  I don't recall, but was certainly
wrong for not mentioning it.

The abnormal case is when there's a "fallocend" beyond i_size (or beyond
the small page extent spanning i_size) i.e. fallocate() has promised to
keep pages allocated beyond EOF.  In that case, __split_unmapped_folio()
is keeping those pages.
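
For concreteness, that is the plain keep-size case; a minimal illustration
(assuming fd and cur_size are the file's descriptor and current size,
nothing tmpfs-specific in the call itself):

	/*
	 * Sketch only: with KEEP_SIZE the pages from cur_size up to
	 * "fallocend" (here cur_size + 1M) stay allocated but i_size is
	 * not raised; those are the pages __split_unmapped_folio() keeps.
	 */
	fallocate(fd, FALLOC_FL_KEEP_SIZE, cur_size, 1024 * 1024);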

There could well be some optimization, involving fallocend, to avoid
zeroing more than necessary; but I wouldn't want to say what in a hurry,
it's quite confusing!

Hugh


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-07-10 12:36   ` Brian Foster
@ 2025-07-10 23:02     ` Hugh Dickins
  0 siblings, 0 replies; 18+ messages in thread
From: Hugh Dickins @ 2025-07-10 23:02 UTC (permalink / raw)
  To: Brian Foster
  Cc: Hugh Dickins, linux-mm, Baolin Wang, Matthew Wilcox, Usama Arif

On Thu, 10 Jul 2025, Brian Foster wrote:
> On Wed, Jul 09, 2025 at 12:57:35AM -0700, Hugh Dickins wrote:
...
> > 
> > The problem is with huge pages (or large folios) in shmem_writeout():
> > what goes in as a large folio may there have to be split into small
> > pages; or it may be swapped out as one large folio, but fragmentation
> > at swapin time demand that it be split into small pages when swapped in.
> > 
> > So, if there has been swapout since the large folio was modified beyond
> > EOF, the folio that shmem_zero_eof() brings in does not guarantee what
> > length needs to be zeroed.
> > 
> > We could set that aside as a deficiency to be fixed later on: that
> > would not be unreasonable, but I'm guessing that won't satisfy you.
> > 
> 
> So I was reading through some of this code yesterday and playing around
> with forcing swapout of an EOF folio, and I had the same observation as
> noted in Baolin's followup: it looks like large folios are always split
> across EOF, at which point post-eof folios are dropped because generally
> the pagecache doesn't track post-eof folios.

Generally yes, but as in reply to Baolin, not in the fallocend case.

> 
> A quick experiment to map write to a 2MB folio in a 1MB sized file shows
> that the folio is indeed split on swapout. It looks like the large folio
> is not immediately reconstructed on swapin, but rather something in the
> background reconstitutes it such that once the post-eof range is
> accessible again, it is effectively zeroed. I'm assuming this is not due
> to explicit zeroing, but rather a side effect of those post-eof folios
> being tossed (vs. swapped) and reallocated, but I could certainly be
> wrong about that.

Sounds right: that will have been khugepaged reconstituting it.

> 
> > We could zero the maximum (the remainder of PMD size I believe) in
> > shmem_zero_eof(): looping over small folios within the range, skipping
> > !uptodate ones (but we do force them uptodate when swapping out, in
> > order to keep the space reservation).  TBH I've ignored that as a bad
> > option, but it doesn't seem so bad to me now: ugly, but maybe not bad.

I have to confess that in the meantime I've grown rather to think I was
too obsessed with doing it at swapout, and that this "ugly" solution is better.
But I expect you'll try it out (in mind or in code) one way and the other,
and make your own decision which way is better.

A realization which pushed me in this direction, not decisive but a push:
there can be other reasons for the huge page getting split, not just
swapout and swapin.  Notably (only?) hole-punch somewhere in that
EOF-spanning huge page.  Could be before the EOF or after: in either
case the huge page is (likely to be) split into small pages, and
shmem_zero_eof()'s folio_size(folio) gives too small an estimate of
what might need zeroing.

That could be fixed with a further shmem_zero_eof() call somewhere
in the hole-punching path; but that won't be necessary if
shmem_zero_eof() knows to go beyond small folio size (in the
fallocend case only? perhaps, but I haven't thought it through).

> > 
> > The solution I've had in mind (and pursue in comments below) is to do
> > the EOF zeroing in shmem_writeout() before it splits; and then avoid
> > swapin in shmem_zero_eof() when i_size is raised.
> > 
> 
> Indeed, this is similar to traditional writeback behavior in that fully
> post-eof folios are skipped (presumed to be racing with a truncate) and
> a straddling EOF folio is partially zeroed at writeback time.
> 
> I actually observed this difference in behavior when first looking into
> this issue on tmpfs, but I didn't have enough context to draw the
> parallel to swapout, so thanks for bringing this up.
> 
> > That solution partly inspired by the shmem symlink uninit-value bug
> > https://lore.kernel.org/linux-mm/670793eb.050a0220.8109b.0003.GAE@google.com/
> > which I haven't rushed to fix, but ought to be fixed along with this one
> > (by "along with" I don't mean that both have to be fixed in one single
> > patch, but it makes sense to consider them together).  I was inclined
> > not to zero the whole page in shmem_symlink(), but zero before swapout.
> > 
> 
> I'll have to take a closer look at that one..

And I've grown towards thinking (as I expect everybody else would) that
we should simply zero the rest of the page at shmem_symlink() time.
"An abundance of caution" makes me afraid to add the overhead there,
but maybe the right thing to do is the obvious thing, and make it
more complicated if/when anyone notices and complains.

> 
> > It worries me that an EOF page might be swapped out and in a thousand
> > times, but i_size set only once: I'm going for a solution which memsets
> > a thousand times rather than once?  But if that actually shows up as an
> > issue in any workload, then we can add a shmem inode flag to say whether
> > the EOF folio has been exposed via mmap (or symlink) so may need zeroing.
> > 
> > What's your preference?  My comments below assume the latter solution,
> > but that may be wrong.
> > 
> 
> I actually think swapout time zeroing is a nice functional improvement
> over this current patch. I'd be less inclined to think that frequent
> swap cycles are more problematic than swapping folios in (and presumably
> back out) just for partial zeroing due to file changes that don't even
> necessarily need the associated data. This is also generally more
> consistent with traditional fs/pagecache behavior, which IMO is a good
> thing.
> 
> So I suppose what I'm saying is that I like the prospective approach to
> also zero at shmem_writeout() and instead not swap in folios purely for
> partial eof zeroing purposes, just perhaps not for the exact reasons
> stated here. Of course the advantage could be that the code looks the
> same regardless, so if that folio splitting behavior ever changed in the
> future, the zeroing code would hopefully continue to do the right thing.
> Thoughts?

Whereas I liked the way you do it at well-defined user-call times,
rather than when the kernel behind-the-scenes decides it might want
to swap out the page.  But I did check with ext4, and verified there
that the post-EOF non-data vanishes without user-call intervention,
I assume on writeback as expected.  So although I like your choice
of user-call times, it's not a functional requirement at all.

Looks like we've neatly exchanged positions.
After you, Alphonse!

Hugh


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-07-10 22:20     ` Hugh Dickins
@ 2025-07-11  3:50       ` Baolin Wang
  2025-07-11  7:50         ` Hugh Dickins
  2025-07-11 16:08         ` Brian Foster
  0 siblings, 2 replies; 18+ messages in thread
From: Baolin Wang @ 2025-07-11  3:50 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Brian Foster, linux-mm, Matthew Wilcox, Usama Arif



On 2025/7/11 06:20, Hugh Dickins wrote:
> On Thu, 10 Jul 2025, Baolin Wang wrote:
>> On 2025/7/9 15:57, Hugh Dickins wrote:
> ...
>>>
>>> The problem is with huge pages (or large folios) in shmem_writeout():
>>> what goes in as a large folio may there have to be split into small
>>> pages; or it may be swapped out as one large folio, but fragmentation
>>> at swapin time demand that it be split into small pages when swapped in.
>>
>> Good point.
>>
>>> So, if there has been swapout since the large folio was modified beyond
>>> EOF, the folio that shmem_zero_eof() brings in does not guarantee what
>>> length needs to be zeroed.
>>>
>>> We could set that aside as a deficiency to be fixed later on: that
>>> would not be unreasonable, but I'm guessing that won't satisfy you.
>>>
>>> We could zero the maximum (the remainder of PMD size I believe) in
>>> shmem_zero_eof(): looping over small folios within the range, skipping
>>> !uptodate ones (but we do force them uptodate when swapping out, in
>>> order to keep the space reservation).  TBH I've ignored that as a bad
>>> option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
>>
>> However, IIUC, if the large folios are split in shmem_writeout(), and those
>> small folios which beyond EOF will be dropped and freed in
>> __split_unmapped_folio(), should we still consider them?
> 
> You're absolutely right about the normal case, and thank you for making
> that point.  Had I forgotten that when writing?  Or was I already
> jumping ahead to the problem case?  I don't recall, but was certainly
> wrong for not mentioning it.
> 
> The abnormal case is when there's a "fallocend" beyond i_size (or beyond
> the small page extent spanning i_size) i.e. fallocate() has promised to
> keep pages allocated beyond EOF.  In that case, __split_unmapped_folio()
> is keeping those pages.

Ah, yes, you are right.

> There could well be some optimization, involving fallocend, to avoid
> zeroing more than necessary; but I wouldn't want to say what in a hurry,
> it's quite confusing!

Like you said, not only can a large folio split occur during swapout,
but it can also happen during a punch hole operation. Moreover,
considering the abnormal case of fallocate() you mentioned, we should
find a more general approach to mitigate the impact of fallocate().

For instance, when splitting, we could clear the 'uptodate' flag for 
these EOF small folios that are beyond 'i_size' but less than the 
'fallocend', so that these EOF small folios will be re-initialized if 
they are used again. What do you think?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ce130225a8e5..2ccb442525d1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3546,6 +3546,18 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
                        lru_add_split_folio(origin_folio, release, lruvec,
                                        list);

+                       /*
+                        * fallocate() will keep folios allocated beyond EOF, we should
+                        * clear the uptodate flag for these folios to re-zero them
+                        * if necessary.
+                        */
+                       if (shmem_mapping(mapping)) {
+                               loff_t i_size = i_size_read(mapping->host);
+
+                               if (i_size < end && release->index >= i_size)
+                                       folio_clear_uptodate(release);
+                       }
+
                        /* Some pages can be beyond EOF: drop them from cache */
                        if (release->index >= end) {
                                if (shmem_mapping(mapping))


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-07-11  3:50       ` Baolin Wang
@ 2025-07-11  7:50         ` Hugh Dickins
  2025-07-11  8:42           ` Baolin Wang
  2025-07-11 16:08         ` Brian Foster
  1 sibling, 1 reply; 18+ messages in thread
From: Hugh Dickins @ 2025-07-11  7:50 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Hugh Dickins, Brian Foster, linux-mm, Matthew Wilcox, Usama Arif

On Fri, 11 Jul 2025, Baolin Wang wrote:
> On 2025/7/11 06:20, Hugh Dickins wrote:
> > On Thu, 10 Jul 2025, Baolin Wang wrote:
> >> On 2025/7/9 15:57, Hugh Dickins wrote:
> > ...
> >>>
> >>> The problem is with huge pages (or large folios) in shmem_writeout():
> >>> what goes in as a large folio may there have to be split into small
> >>> pages; or it may be swapped out as one large folio, but fragmentation
> >>> at swapin time demand that it be split into small pages when swapped in.
> >>
> >> Good point.
> >>
> >>> So, if there has been swapout since the large folio was modified beyond
> >>> EOF, the folio that shmem_zero_eof() brings in does not guarantee what
> >>> length needs to be zeroed.
> >>>
> >>> We could set that aside as a deficiency to be fixed later on: that
> >>> would not be unreasonable, but I'm guessing that won't satisfy you.
> >>>
> >>> We could zero the maximum (the remainder of PMD size I believe) in
> >>> shmem_zero_eof(): looping over small folios within the range, skipping
> >>> !uptodate ones (but we do force them uptodate when swapping out, in
> >>> order to keep the space reservation).  TBH I've ignored that as a bad
> >>> option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
> >>
> >> However, IIUC, if the large folios are split in shmem_writeout(), and those
> >> small folios which beyond EOF will be dropped and freed in
> >> __split_unmapped_folio(), should we still consider them?
> > 
> > You're absolutely right about the normal case, and thank you for making
> > that point.  Had I forgotten that when writing?  Or was I already
> > jumping ahead to the problem case?  I don't recall, but was certainly
> > wrong for not mentioning it.
> > 
> > The abnormal case is when there's a "fallocend" beyond i_size (or beyond
> > the small page extent spanning i_size) i.e. fallocate() has promised to
> > keep pages allocated beyond EOF.  In that case, __split_unmapped_folio()
> > is keeping those pages.
> 
> Ah, yes, you are right.
> 
> > There could well be some optimization, involving fallocend, to avoid
> > zeroing more than necessary; but I wouldn't want to say what in a hurry,
> > it's quite confusing!
> 
> Like you said, not only can a large folio split occur during swapout, but it
> can also happen during a punch hole operation. Moreover, considering the
> abnormal case of fallocate() you mentioned, we should find a more common
> approach to mitigate the impact of fallocate().
> 
> For instance, when splitting, we could clear the 'uptodate' flag for these EOF
> small folios that are beyond 'i_size' but less than the 'fallocend', so that
> these EOF small folios will be re-initialized if they are used again. What do
> you think?

First impression: that's a great idea, much better than anything I was
proposing.  Let's hope I don't perceive some drawback overnight and renege.

I don't love your patch below so much, we would probably want to gather
the shmem_mapping() peculiarities together better (and seeing that
repeated i_size_read(): IIRC 32-bit locking doesn't allow it in there).
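
For illustration only (helper name invented, untested, and the
i_size_read() placement question above still applies), I mean something
shaped like this on the shmem side, called from __split_unmapped_folio():

/* hypothetical sketch only: gather the shmem split quirks in one place */
void shmem_split_clear_beyond_eof(struct address_space *mapping,
				  struct folio *release, pgoff_t end)
{
	pgoff_t eof = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);

	/* kept beyond EOF only because of fallocate(): force re-zero on reuse */
	if (release->index >= eof && release->index < end)
		folio_clear_uptodate(release);
}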

And there tends to be an assumption (don't ask me where) that a page once
uptodate remains that way until it's freed: maybe no problem before it's
inserted in the xarray (as you have), or maybe better before unfreezing,
or maybe the page lock is already enough.

Those are my initial reactions.
Hugh

> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ce130225a8e5..2ccb442525d1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3546,6 +3546,18 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>                         lru_add_split_folio(origin_folio, release, lruvec,
>                                         list);
> 
> +                       /*
> +                        * fallocate() will keep folios allocated beyond EOF, we should
> +                        * clear the uptodate flag for these folios to re-zero them
> +                        * if necessary.
> +                        */
> +                       if (shmem_mapping(mapping)) {
> +                               loff_t i_size = i_size_read(mapping->host);
> +
> +                               if (i_size < end && release->index >= i_size)
> +                                       folio_clear_uptodate(release);
> +                       }
> +
>                         /* Some pages can be beyond EOF: drop them from cache */
>                         if (release->index >= end) {
>                                 if (shmem_mapping(mapping))


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-07-11  7:50         ` Hugh Dickins
@ 2025-07-11  8:42           ` Baolin Wang
  0 siblings, 0 replies; 18+ messages in thread
From: Baolin Wang @ 2025-07-11  8:42 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Brian Foster, linux-mm, Matthew Wilcox, Usama Arif



On 2025/7/11 15:50, Hugh Dickins wrote:
> On Fri, 11 Jul 2025, Baolin Wang wrote:
>> On 2025/7/11 06:20, Hugh Dickins wrote:
>>> On Thu, 10 Jul 2025, Baolin Wang wrote:
>>>> On 2025/7/9 15:57, Hugh Dickins wrote:
>>> ...
>>>>>
>>>>> The problem is with huge pages (or large folios) in shmem_writeout():
>>>>> what goes in as a large folio may there have to be split into small
>>>>> pages; or it may be swapped out as one large folio, but fragmentation
>>>>> at swapin time demand that it be split into small pages when swapped in.
>>>>
>>>> Good point.
>>>>
>>>>> So, if there has been swapout since the large folio was modified beyond
>>>>> EOF, the folio that shmem_zero_eof() brings in does not guarantee what
>>>>> length needs to be zeroed.
>>>>>
>>>>> We could set that aside as a deficiency to be fixed later on: that
>>>>> would not be unreasonable, but I'm guessing that won't satisfy you.
>>>>>
>>>>> We could zero the maximum (the remainder of PMD size I believe) in
>>>>> shmem_zero_eof(): looping over small folios within the range, skipping
>>>>> !uptodate ones (but we do force them uptodate when swapping out, in
>>>>> order to keep the space reservation).  TBH I've ignored that as a bad
>>>>> option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
>>>>
>>>> However, IIUC, if the large folios are split in shmem_writeout(), and those
>>>> small folios which beyond EOF will be dropped and freed in
>>>> __split_unmapped_folio(), should we still consider them?
>>>
>>> You're absolutely right about the normal case, and thank you for making
>>> that point.  Had I forgotten that when writing?  Or was I already
>>> jumping ahead to the problem case?  I don't recall, but was certainly
>>> wrong for not mentioning it.
>>>
>>> The abnormal case is when there's a "fallocend" beyond i_size (or beyond
>>> the small page extent spanning i_size) i.e. fallocate() has promised to
>>> keep pages allocated beyond EOF.  In that case, __split_unmapped_folio()
>>> is keeping those pages.
>>
>> Ah, yes, you are right.
>>
>>> There could well be some optimization, involving fallocend, to avoid
>>> zeroing more than necessary; but I wouldn't want to say what in a hurry,
>>> it's quite confusing!
>>
>> Like you said, not only can a large folio split occur during swapout, but it
>> can also happen during a punch hole operation. Moreover, considering the
>> abnormal case of fallocate() you mentioned, we should find a more common
>> approach to mitigate the impact of fallocate().
>>
>> For instance, when splitting, we could clear the 'uptodate' flag for these EOF
>> small folios that are beyond 'i_size' but less than the 'fallocend', so that
>> these EOF small folios will be re-initialized if they are used again. What do
>> you think?
> 
> First impression: that's a great idea, much better than anything I was
> proposing.  Let's hope I don't perceive some drawback overnight and renege.
> 
> I don't love your patch below so much, we would probably want to gather
> the shmem_mapping() peculiarities together better (and seeing that
> repeated i_size_read(): IIRC 32-bit locking doesn't allow it in there).

Absolutely yes, the following code just shows my thinking. :)

> And there tends to be an assumption (don't ask me where) that a page once
> uptodate remains that way until it's freed: maybe no problem before it's
> inserted in the xarray (as you have), or maybe better before unfreezing,
> or maybe the page lock is already enough.

These EOF small folios are unmapped, frozen, locked, and not yet in the 
xarray. It seems there is no other way to get them (pfn walker could, 
but the page lock or freeze will abort the pfn walker process), so it 
appears to be safe to change the 'uptodate' flag. Anyway, let me still 
do some investigation.

> Those are my initial reactions.

Thanks for comments.

>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index ce130225a8e5..2ccb442525d1 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3546,6 +3546,18 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>                          lru_add_split_folio(origin_folio, release, lruvec,
>>                                          list);
>>
>> +                       /*
>> +                        * fallocate() will keep folios allocated beyond EOF, we should
>> +                        * clear the uptodate flag for these folios to re-zero them
>> +                        * if necessary.
>> +                        */
>> +                       if (shmem_mapping(mapping)) {
>> +                               loff_t i_size = i_size_read(mapping->host);
>> +
>> +                               if (i_size < end && release->index >= i_size)
>> +                                       folio_clear_uptodate(release);
>> +                       }
>> +
>>                          /* Some pages can be beyond EOF: drop them from cache */
>>                          if (release->index >= end) {
>>                                  if (shmem_mapping(mapping))



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-07-11  3:50       ` Baolin Wang
  2025-07-11  7:50         ` Hugh Dickins
@ 2025-07-11 16:08         ` Brian Foster
  2025-07-11 20:15           ` Brian Foster
  1 sibling, 1 reply; 18+ messages in thread
From: Brian Foster @ 2025-07-11 16:08 UTC (permalink / raw)
  To: Baolin Wang; +Cc: Hugh Dickins, linux-mm, Matthew Wilcox, Usama Arif

On Fri, Jul 11, 2025 at 11:50:05AM +0800, Baolin Wang wrote:
> 
> 
> On 2025/7/11 06:20, Hugh Dickins wrote:
> > On Thu, 10 Jul 2025, Baolin Wang wrote:
> > > On 2025/7/9 15:57, Hugh Dickins wrote:
> > ...
> > > > 
> > > > The problem is with huge pages (or large folios) in shmem_writeout():
> > > > what goes in as a large folio may there have to be split into small
> > > > pages; or it may be swapped out as one large folio, but fragmentation
> > > > at swapin time demand that it be split into small pages when swapped in.
> > > 
> > > Good point.
> > > 
> > > > So, if there has been swapout since the large folio was modified beyond
> > > > EOF, the folio that shmem_zero_eof() brings in does not guarantee what
> > > > length needs to be zeroed.
> > > > 
> > > > We could set that aside as a deficiency to be fixed later on: that
> > > > would not be unreasonable, but I'm guessing that won't satisfy you.
> > > > 
> > > > We could zero the maximum (the remainder of PMD size I believe) in
> > > > shmem_zero_eof(): looping over small folios within the range, skipping
> > > > !uptodate ones (but we do force them uptodate when swapping out, in
> > > > order to keep the space reservation).  TBH I've ignored that as a bad
> > > > option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
> > > 
> > > However, IIUC, if the large folios are split in shmem_writeout(), and those
> > > small folios which beyond EOF will be dropped and freed in
> > > __split_unmapped_folio(), should we still consider them?
> > 
> > You're absolutely right about the normal case, and thank you for making
> > that point.  Had I forgotten that when writing?  Or was I already
> > jumping ahead to the problem case?  I don't recall, but was certainly
> > wrong for not mentioning it.
> > 
> > The abnormal case is when there's a "fallocend" beyond i_size (or beyond
> > the small page extent spanning i_size) i.e. fallocate() has promised to
> > keep pages allocated beyond EOF.  In that case, __split_unmapped_folio()
> > is keeping those pages.
> 
> Ah, yes, you are right.
> 
> > There could well be some optimization, involving fallocend, to avoid
> > zeroing more than necessary; but I wouldn't want to say what in a hurry,
> > it's quite confusing!
> 
> Like you said, not only can a large folio split occur during swapout, but it
> can also happen during a punch hole operation. Moreover, considering the
> abnormal case of fallocate() you mentioned, we should find a more common
> approach to mitigate the impact of fallocate().
> 
> For instance, when splitting, we could clear the 'uptodate' flag for these
> EOF small folios that are beyond 'i_size' but less than the 'fallocend', so
> that these EOF small folios will be re-initialized if they are used again.
> What do you think?
> 
...

Hi Baolin,

So I'm still digesting Hugh's clarification wrt the falloc case, but I'm
a little curious here given that I intended to implement the writeout
zeroing suggestion regardless of that discussion..

I see the hole punch case falls into truncate_inode_[partial_]folio(),
which looks to me like it handles zeroing. The full truncate case just
tosses the folio of course, but the partial case zeroes according to the
target range prior to doing any potential split from that codepath.

That looks kind of similar to what I have prototyped for the
shmem_writeout() case: tail zero the EOF straddling folio before falling
into the split call. [1] Does that not solve the same general issue in
the swapout path as potentially clearing uptodate via the split? I'm
mainly trying to understand if that is just a potential alternative
approach, or if this solves a corner case that I'm missing. Hm?

If the former, I suspect we'd need to tail zero on writeout regardless
of folio size. Given that, and IIUC that clearing uptodate as such will
basically cause the split folios to fall back into the !uptodate -> zero
-> mark_uptodate sequence of shmem_writeout(), I wonder what the
advantage of that is. It feels a bit circular to me when considered
along with the tail zeroing below, but again I'm peeling away at
complexity as I go here.. ;) Thoughts?

Brian

[1] prototype writeout logic:

diff --git a/mm/shmem.c b/mm/shmem.c
index 634e499b6197..535021ae5a2f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1579,7 +1579,8 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
 	struct inode *inode = mapping->host;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
-	pgoff_t index;
+	loff_t i_size = i_size_read(inode);
+	pgoff_t index = i_size >> PAGE_SHIFT;
 	int nr_pages;
 	bool split = false;
 
@@ -1592,6 +1593,17 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
 	if (!total_swap_pages)
 		goto redirty;
 
+	/*
+	 * If the folio straddles EOF, the tail portion must be zeroed on
+	 * every swapout.
+	 */
+	if (folio_test_uptodate(folio) &&
+	    folio->index <= index && folio_next_index(folio) > index) {
+		size_t from = offset_in_folio(folio, i_size);
+		if (from)
+			folio_zero_segment(folio, from, folio_size(folio));
+	}
+
 	/*
 	 * If CONFIG_THP_SWAP is not enabled, the large folio should be
 	 * split when swapping.



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-07-11 16:08         ` Brian Foster
@ 2025-07-11 20:15           ` Brian Foster
  2025-07-14  3:05             ` Baolin Wang
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2025-07-11 20:15 UTC (permalink / raw)
  To: Baolin Wang; +Cc: Hugh Dickins, linux-mm, Matthew Wilcox, Usama Arif

On Fri, Jul 11, 2025 at 12:08:16PM -0400, Brian Foster wrote:
> On Fri, Jul 11, 2025 at 11:50:05AM +0800, Baolin Wang wrote:
> > 
> > 
> > On 2025/7/11 06:20, Hugh Dickins wrote:
> > > On Thu, 10 Jul 2025, Baolin Wang wrote:
> > > > On 2025/7/9 15:57, Hugh Dickins wrote:
> > > ...
> > > > > 
> > > > > The problem is with huge pages (or large folios) in shmem_writeout():
> > > > > what goes in as a large folio may there have to be split into small
> > > > > pages; or it may be swapped out as one large folio, but fragmentation
> > > > > at swapin time demand that it be split into small pages when swapped in.
> > > > 
> > > > Good point.
> > > > 
> > > > > So, if there has been swapout since the large folio was modified beyond
> > > > > EOF, the folio that shmem_zero_eof() brings in does not guarantee what
> > > > > length needs to be zeroed.
> > > > > 
> > > > > We could set that aside as a deficiency to be fixed later on: that
> > > > > would not be unreasonable, but I'm guessing that won't satisfy you.
> > > > > 
> > > > > We could zero the maximum (the remainder of PMD size I believe) in
> > > > > shmem_zero_eof(): looping over small folios within the range, skipping
> > > > > !uptodate ones (but we do force them uptodate when swapping out, in
> > > > > order to keep the space reservation).  TBH I've ignored that as a bad
> > > > > option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
> > > > 
> > > > However, IIUC, if the large folios are split in shmem_writeout(), and those
> > > > small folios which beyond EOF will be dropped and freed in
> > > > __split_unmapped_folio(), should we still consider them?
> > > 
> > > You're absolutely right about the normal case, and thank you for making
> > > that point.  Had I forgotten that when writing?  Or was I already
> > > jumping ahead to the problem case?  I don't recall, but was certainly
> > > wrong for not mentioning it.
> > > 
> > > The abnormal case is when there's a "fallocend" beyond i_size (or beyond
> > > the small page extent spanning i_size) i.e. fallocate() has promised to
> > > keep pages allocated beyond EOF.  In that case, __split_unmapped_folio()
> > > is keeping those pages.
> > 
> > Ah, yes, you are right.
> > 
> > > There could well be some optimization, involving fallocend, to avoid
> > > zeroing more than necessary; but I wouldn't want to say what in a hurry,
> > > it's quite confusing!
> > 
> > Like you said, not only can a large folio split occur during swapout, but it
> > can also happen during a punch hole operation. Moreover, considering the
> > abnormal case of fallocate() you mentioned, we should find a more common
> > approach to mitigate the impact of fallocate().
> > 
> > For instance, when splitting, we could clear the 'uptodate' flag for these
> > EOF small folios that are beyond 'i_size' but less than the 'fallocend', so
> > that these EOF small folios will be re-initialized if they are used again.
> > What do you think?
> > 
> ...
> 
> Hi Baolin,
> 
> So I'm still digesting Hugh's clarification wrt the falloc case, but I'm
> a little curious here given that I intended to implement the writeout
> zeroing suggestion regardless of that discussion..
> 
> I see the hole punch case falls into truncate_inode_[partial_]folio(),
> which looks to me like it handles zeroing. The full truncate case just
> tosses the folio of course, but the partial case zeroes according to the
> target range prior to doing any potential split from that codepath.
> 
> That looks kind of similar to what I have prototyped for the
> shmem_writeout() case: tail zero the EOF straddling folio before falling
> into the split call. [1] Does that not solve the same general issue in
> the swapout path as potentially clearing uptodate via the split? I'm
> mainly trying to understand if that is just a potential alternative
> approach, or if this solves a corner case that I'm missing. Hm?
> 

Ok, after playing around a bit I think I see what I was missing. I had
missed that the punch case only zeroes within the target range of the
punch. So if you have something like a 1M file backed by an fallocated
2M folio, map write the whole 2M, then punch the last 4k of the file,
you end up with non-zeroed smaller folios beyond EOF. This means that
even with a zero of the eof folio, a truncate up over those folios
won't zero them.
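
For reference, the sequence I'm describing is roughly the following
(untested sketch, error handling omitted; it assumes tmpfs mounted with
huge=always so the file is backed by a single 2M folio, and the mapped
write past EOF relies on that folio being PMD-mapped, otherwise it would
SIGBUS; the path is made up):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t meg = 1024 * 1024;
	int fd = open("/mnt/tmpfs/testfile", O_CREAT | O_RDWR | O_TRUNC, 0644);
	char buf[16];
	char *p;

	ftruncate(fd, meg);				/* 1M file... */
	fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 2 * meg);	/* ...2M allocated */

	p = mmap(NULL, 2 * meg, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	memset(p, 0xab, 2 * meg);			/* dirty beyond EOF too */

	/* punch the last 4k of the file: splits the 2M folio */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  meg - 4096, 4096);

	ftruncate(fd, 2 * meg);				/* extend over the old tail */
	pread(fd, buf, sizeof(buf), meg + 4096);	/* expect zeroes here */
	printf("byte at 1M+4k: %#x\n", (unsigned char)buf[0]);

	munmap(p, 2 * meg);
	close(fd);
	return 0;
}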

I need to think on it some more, but I take it this means that
essentially 1. any uptodate range/folio beyond EOF needs to be zeroed on
swapout (which I think is analogous to your earlier prototype logic) [1]
and 2. shmem_zero_eof() needs to turn into something like
shmem_zero_range().

The latter would zero a range of uptodate folios between current EOF and
the start of the extending operation, rather than just the EOF folio.
This is actually pretty consistent with traditional fs (see
xfs_file_write_zero_eof() for example) behavior. I was originally
operating under the assumption that this wasn't necessary for tmpfs given
traditional pagecache post-eof behavior, but that has clearly proven
false.
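
Something shaped roughly like the following is what I have in mind
(purely a sketch, untested; it reuses shmem_get_partial_folio() from the
current patch and skips !uptodate folios, locking and edge cases not
thought through):

/* sketch of the shmem_zero_range() idea: zero already-present pagecache
 * between the current EOF and the start of the extending operation @pos */
static void shmem_zero_range(struct inode *inode, loff_t pos)
{
	loff_t i_size = i_size_read(inode);
	pgoff_t index = i_size >> PAGE_SHIFT;
	pgoff_t end = (pos - 1) >> PAGE_SHIFT;

	while (index <= end) {
		struct folio *folio;
		size_t from, len;

		folio = shmem_get_partial_folio(inode, index);
		if (!folio) {
			index++;
			continue;
		}

		/* zero from max(i_size, folio start) to min(folio end, pos) */
		from = max_t(loff_t, i_size, folio_pos(folio)) - folio_pos(folio);
		len = min_t(loff_t, folio_pos(folio) + folio_size(folio), pos) -
		      (folio_pos(folio) + from);
		if (folio_test_uptodate(folio))
			folio_zero_range(folio, from, len);

		index = folio_next_index(folio);
		folio_unlock(folio);
		folio_put(folio);
	}
}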

Brian

[1] I'm also wondering if another option here is to just clear_uptodate
any uptodate folio that starts fully beyond EOF. I.e., if the folio
straddles EOF then partial zero as below, if the folio is beyond EOF
then clear uptodate and let the existing code further down zero it.
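
In code terms, something like this folded into the prototype hunk quoted
below (again untested, just sketching the idea; i_size and the folio are
as in that hunk):

	if (folio_test_uptodate(folio)) {
		if (folio_pos(folio) >= i_size) {
			/* starts at/after EOF: let the existing !uptodate
			 * path further down re-zero it */
			folio_clear_uptodate(folio);
		} else if (folio_pos(folio) + folio_size(folio) > i_size) {
			/* straddles EOF: zero the tail beyond i_size */
			folio_zero_segment(folio, offset_in_folio(folio, i_size),
					   folio_size(folio));
		}
	}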

> If the former, I suspect we'd need to tail zero on writeout regardless
> of folio size. Given that, and IIUC that clearing uptodate as such will
> basically cause the split folios to fall back into the !uptodate -> zero
> -> mark_uptodate sequence of shmem_writeout(), I wonder what the
> advantage of that is. It feels a bit circular to me when considered
> along with the tail zeroing below, but again I'm peeling away at
> complexity as I go here.. ;) Thoughts?
> 
> Brian
> 
> [1] prototype writeout logic:
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 634e499b6197..535021ae5a2f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1579,7 +1579,8 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
>  	struct inode *inode = mapping->host;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> -	pgoff_t index;
> +	loff_t i_size = i_size_read(inode);
> +	pgoff_t index = i_size >> PAGE_SHIFT;
>  	int nr_pages;
>  	bool split = false;
>  
> @@ -1592,6 +1593,17 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
>  	if (!total_swap_pages)
>  		goto redirty;
>  
> +	/*
> +	 * If the folio straddles EOF, the tail portion must be zeroed on
> +	 * every swapout.
> +	 */
> +	if (folio_test_uptodate(folio) &&
> +	    folio->index <= index && folio_next_index(folio) > index) {
> +		size_t from = offset_in_folio(folio, i_size);
> +		if (from)
> +			folio_zero_segment(folio, from, folio_size(folio));
> +	}
> +
>  	/*
>  	 * If CONFIG_THP_SWAP is not enabled, the large folio should be
>  	 * split when swapping.
> 
> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-07-11 20:15           ` Brian Foster
@ 2025-07-14  3:05             ` Baolin Wang
  2025-07-14 14:38               ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Baolin Wang @ 2025-07-14  3:05 UTC (permalink / raw)
  To: Brian Foster; +Cc: Hugh Dickins, linux-mm, Matthew Wilcox, Usama Arif



On 2025/7/12 04:15, Brian Foster wrote:
> On Fri, Jul 11, 2025 at 12:08:16PM -0400, Brian Foster wrote:
>> On Fri, Jul 11, 2025 at 11:50:05AM +0800, Baolin Wang wrote:
>>>
>>>
>>> On 2025/7/11 06:20, Hugh Dickins wrote:
>>>> On Thu, 10 Jul 2025, Baolin Wang wrote:
>>>>> On 2025/7/9 15:57, Hugh Dickins wrote:
>>>> ...
>>>>>>
>>>>>> The problem is with huge pages (or large folios) in shmem_writeout():
>>>>>> what goes in as a large folio may there have to be split into small
>>>>>> pages; or it may be swapped out as one large folio, but fragmentation
>>>>>> at swapin time demand that it be split into small pages when swapped in.
>>>>>
>>>>> Good point.
>>>>>
>>>>>> So, if there has been swapout since the large folio was modified beyond
>>>>>> EOF, the folio that shmem_zero_eof() brings in does not guarantee what
>>>>>> length needs to be zeroed.
>>>>>>
>>>>>> We could set that aside as a deficiency to be fixed later on: that
>>>>>> would not be unreasonable, but I'm guessing that won't satisfy you.
>>>>>>
>>>>>> We could zero the maximum (the remainder of PMD size I believe) in
>>>>>> shmem_zero_eof(): looping over small folios within the range, skipping
>>>>>> !uptodate ones (but we do force them uptodate when swapping out, in
>>>>>> order to keep the space reservation).  TBH I've ignored that as a bad
>>>>>> option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
>>>>>
>>>>> However, IIUC, if the large folios are split in shmem_writeout(), and those
>>>>> small folios which beyond EOF will be dropped and freed in
>>>>> __split_unmapped_folio(), should we still consider them?
>>>>
>>>> You're absolutely right about the normal case, and thank you for making
>>>> that point.  Had I forgotten that when writing?  Or was I already
>>>> jumping ahead to the problem case?  I don't recall, but was certainly
>>>> wrong for not mentioning it.
>>>>
>>>> The abnormal case is when there's a "fallocend" beyond i_size (or beyond
>>>> the small page extent spanning i_size) i.e. fallocate() has promised to
>>>> keep pages allocated beyond EOF.  In that case, __split_unmapped_folio()
>>>> is keeping those pages.
>>>
>>> Ah, yes, you are right.
>>>
>>>> There could well be some optimization, involving fallocend, to avoid
>>>> zeroing more than necessary; but I wouldn't want to say what in a hurry,
>>>> it's quite confusing!
>>>
>>> Like you said, not only can a large folio split occur during swapout, but it
>>> can also happen during a punch hole operation. Moreover, considering the
>>> abnormal case of fallocate() you mentioned, we should find a more common
>>> approach to mitigate the impact of fallocate().
>>>
>>> For instance, when splitting, we could clear the 'uptodate' flag for these
>>> EOF small folios that are beyond 'i_size' but less than the 'fallocend', so
>>> that these EOF small folios will be re-initialized if they are used again.
>>> What do you think?
>>>
>> ...
>>
>> Hi Baolin,
>>
>> So I'm still digesting Hugh's clarification wrt the falloc case, but I'm
>> a little curious here given that I intended to implement the writeout
>> zeroing suggestion regardless of that discussion..
>>
>> I see the hole punch case falls into truncate_inode_[partial_]folio(),
>> which looks to me like it handles zeroing. The full truncate case just
>> tosses the folio of course, but the partial case zeroes according to the
>> target range prior to doing any potential split from that codepath.
>>
>> That looks kind of similar to what I have prototyped for the
>> shmem_writeout() case: tail zero the EOF straddling folio before falling
>> into the split call. [1] Does that not solve the same general issue in
>> the swapout path as potentially clearing uptodate via the split? I'm
>> mainly trying to understand if that is just a potential alternative
>> approach, or if this solves a corner case that I'm missing. Hm?
>>
> 
> Ok, after playing around a bit I think I see what I was missing. I
> misinterpreted that the punch case is only going to zero in the target
> range of the punch. So if you have something like a 1M file backed by an
> fallocated 2M folio, map write the whole 2M, then punch the last 4k of
> the file, you end up with the non-zeroed smaller folios beyond EOF. This
> means that even with a zero of the eof folio, a truncate up over those
> folios won't be zeroed.

Right.

> I need to think on it some more, but I take it this means that
> essentially 1. any uptodate range/folio beyond EOF needs to be zeroed on
> swapout (which I think is analogous to your earlier prototype logic) [1]
> and 2. shmem_zero_eof() needs to turn into something like
> shmem_zero_range().

As we discussed, only considering swapout is not enough; it's
necessary to consider all the cases where large folios are split, such as
swapout, punch hole, migration, the shmem shrinker, etc. If other split
cases are added in the future, their impact on EOF folios will also need
to be considered (they should be zeroed before the split). IMHO, this
could become complex and hard to keep under control.

So my suggestion is to address this issue during the split process, and 
it seems feasible to make EOF small folios not 'uptodate' during the 
split. Anyway, you can investigate further.

> The latter would zero a range of uptodate folios between current EOF and
> the start of the extending operation, rather than just the EOF folio.
> This is actually pretty consistent with traditional fs (see
> xfs_file_write_zero_eof() for example) behavior. I was originally
> operating under assumption that this wasn't necessary for tmpfs given
> traditional pagecache post-eof behavior, but that has clearly proven
> false.
> 
> Brian
> 
> [1] I'm also wondering if another option here is to just clear_uptodate
> any uptodate folio that fully starts beyond EOF. I.e., if the folio
> straddles EOF then partial zero as below, if the folio is beyond EOF
> then clear uptodate and let the existing code further down zero it.
> 
>> If the former, I suspect we'd need to tail zero on writeout regardless
>> of folio size. Given that, and IIUC that clearing uptodate as such will
>> basically cause the split folios to fall back into the !uptodate -> zero
>> -> mark_uptodate sequence of shmem_writeout(), I wonder what the
>> advantage of that is. It feels a bit circular to me when considered
>> along with the tail zeroing below, but again I'm peeling away at
>> complexity as I go here.. ;) Thoughts?
>>
>> Brian
>>
>> [1] prototype writeout logic:
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 634e499b6197..535021ae5a2f 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1579,7 +1579,8 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
>>   	struct inode *inode = mapping->host;
>>   	struct shmem_inode_info *info = SHMEM_I(inode);
>>   	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>> -	pgoff_t index;
>> +	loff_t i_size = i_size_read(inode);
>> +	pgoff_t index = i_size >> PAGE_SHIFT;
>>   	int nr_pages;
>>   	bool split = false;
>>   
>> @@ -1592,6 +1593,17 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
>>   	if (!total_swap_pages)
>>   		goto redirty;
>>   
>> +	/*
>> +	 * If the folio straddles EOF, the tail portion must be zeroed on
>> +	 * every swapout.
>> +	 */
>> +	if (folio_test_uptodate(folio) &&
>> +	    folio->index <= index && folio_next_index(folio) > index) {
>> +		size_t from = offset_in_folio(folio, i_size);
>> +		if (from)
>> +			folio_zero_segment(folio, from, folio_size(folio));
>> +	}
>> +
>>   	/*
>>   	 * If CONFIG_THP_SWAP is not enabled, the large folio should be
>>   	 * split when swapping.
>>
>>
> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] tmpfs: zero post-eof folio range on file extension
  2025-07-14  3:05             ` Baolin Wang
@ 2025-07-14 14:38               ` Brian Foster
  0 siblings, 0 replies; 18+ messages in thread
From: Brian Foster @ 2025-07-14 14:38 UTC (permalink / raw)
  To: Baolin Wang; +Cc: Hugh Dickins, linux-mm, Matthew Wilcox, Usama Arif

On Mon, Jul 14, 2025 at 11:05:35AM +0800, Baolin Wang wrote:
> 
> 
> On 2025/7/12 04:15, Brian Foster wrote:
> > On Fri, Jul 11, 2025 at 12:08:16PM -0400, Brian Foster wrote:
> > > On Fri, Jul 11, 2025 at 11:50:05AM +0800, Baolin Wang wrote:
> > > > 
> > > > 
> > > > On 2025/7/11 06:20, Hugh Dickins wrote:
> > > > > On Thu, 10 Jul 2025, Baolin Wang wrote:
> > > > > > On 2025/7/9 15:57, Hugh Dickins wrote:
> > > > > ...
> > > > > > > 
> > > > > > > The problem is with huge pages (or large folios) in shmem_writeout():
> > > > > > > what goes in as a large folio may there have to be split into small
> > > > > > > pages; or it may be swapped out as one large folio, but fragmentation
> > > > > > > at swapin time demand that it be split into small pages when swapped in.
> > > > > > 
> > > > > > Good point.
> > > > > > 
> > > > > > > So, if there has been swapout since the large folio was modified beyond
> > > > > > > EOF, the folio that shmem_zero_eof() brings in does not guarantee what
> > > > > > > length needs to be zeroed.
> > > > > > > 
> > > > > > > We could set that aside as a deficiency to be fixed later on: that
> > > > > > > would not be unreasonable, but I'm guessing that won't satisfy you.
> > > > > > > 
> > > > > > > We could zero the maximum (the remainder of PMD size I believe) in
> > > > > > > shmem_zero_eof(): looping over small folios within the range, skipping
> > > > > > > !uptodate ones (but we do force them uptodate when swapping out, in
> > > > > > > order to keep the space reservation).  TBH I've ignored that as a bad
> > > > > > > option, but it doesn't seem so bad to me now: ugly, but maybe not bad.
> > > > > > 
> > > > > > However, IIUC, if the large folios are split in shmem_writeout(), and those
> > > > > > small folios which beyond EOF will be dropped and freed in
> > > > > > __split_unmapped_folio(), should we still consider them?
> > > > > 
> > > > > You're absolutely right about the normal case, and thank you for making
> > > > > that point.  Had I forgotten that when writing?  Or was I already
> > > > > jumping ahead to the problem case?  I don't recall, but was certainly
> > > > > wrong for not mentioning it.
> > > > > 
> > > > > The abnormal case is when there's a "fallocend" beyond i_size (or beyond
> > > > > the small page extent spanning i_size) i.e. fallocate() has promised to
> > > > > keep pages allocated beyond EOF.  In that case, __split_unmapped_folio()
> > > > > is keeping those pages.
> > > > 
> > > > Ah, yes, you are right.
> > > > 
> > > > > There could well be some optimization, involving fallocend, to avoid
> > > > > zeroing more than necessary; but I wouldn't want to say what in a hurry,
> > > > > it's quite confusing!
> > > > 
> > > > Like you said, not only can a large folio split occur during swapout, but it
> > > > can also happen during a punch hole operation. Moreover, considering the
> > > > abnormal case of fallocate() you mentioned, we should find a more common
> > > > approach to mitigate the impact of fallocate().
> > > > 
> > > > For instance, when splitting, we could clear the 'uptodate' flag for these
> > > > EOF small folios that are beyond 'i_size' but less than the 'fallocend', so
> > > > that these EOF small folios will be re-initialized if they are used again.
> > > > What do you think?
> > > > 
> > > ...
> > > 
> > > Hi Baolin,
> > > 
> > > So I'm still digesting Hugh's clarification wrt the falloc case, but I'm
> > > a little curious here given that I intended to implement the writeout
> > > zeroing suggestion regardless of that discussion..
> > > 
> > > I see the hole punch case falls into truncate_inode_[partial_]folio(),
> > > which looks to me like it handles zeroing. The full truncate case just
> > > tosses the folio of course, but the partial case zeroes according to the
> > > target range prior to doing any potential split from that codepath.
> > > 
> > > That looks kind of similar to what I have prototyped for the
> > > shmem_writeout() case: tail zero the EOF straddling folio before falling
> > > into the split call. [1] Does that not solve the same general issue in
> > > the swapout path as potentially clearing uptodate via the split? I'm
> > > mainly trying to understand if that is just a potential alternative
> > > approach, or if this solves a corner case that I'm missing. Hm?
> > > 
> > 
> > Ok, after playing around a bit I think I see what I was missing. I
> > misinterpreted that the punch case is only going to zero in the target
> > range of the punch. So if you have something like a 1M file backed by an
> > fallocated 2M folio, map write the whole 2M, then punch the last 4k of
> > the file, you end up with the non-zeroed smaller folios beyond EOF. This
> > means that even with a zero of the eof folio, a truncate up over those
> > folios won't be zeroed.
> 
> Right.
> 
> > I need to think on it some more, but I take it this means that
> > essentially 1. any uptodate range/folio beyond EOF needs to be zeroed on
> > swapout (which I think is analogous to your earlier prototype logic) [1]
> > and 2. shmem_zero_eof() needs to turn into something like
> > shmem_zero_range().
> 
> Like we discussed, only considering swapout is not enough; it's necessary to
> consider all cases of large folio splits, such as swapout, punch hole,
> migration, shmem shrinker, etc. In the future, if there are other cases of
> splits, the impact on EOF folios will also need to be considered (should
> zero them before split). IMHO, this could lead to complexity and
> uncontrollability.
> 

Ok. FWIW, the purpose of the swap time zeroing in this case is not
necessarily to be a solution purely on its own. Rather (and to Hugh's
earlier point about the zeroing needing to cover a range vs. just
relying on the eof folio size), it's probably better if that eof
zeroing code can assume post-eof swapped out folios are always zeroed.

But anyway, I'll try to shoot for something like that for a v2. I also
want to see if I can figure out a way to do a bit more thorough testing.
We can revisit from there if there are better options and/or further
gaps to consider. Thanks again for the comments.

Brian

> So my suggestion is to address this issue during the split process, and it
> seems feasible to make EOF small folios not 'uptodate' during the split.
> Anyway, you can investigate further.
> 
> > The latter would zero a range of uptodate folios between current EOF and
> > the start of the extending operation, rather than just the EOF folio.
> > This is actually pretty consistent with traditional fs (see
> > xfs_file_write_zero_eof() for example) behavior. I was originally
> > operating under assumption that this wasn't necessary for tmpfs given
> > traditional pagecache post-eof behavior, but that has clearly proven
> > false.
> > 
> > Brian
> > 
> > [1] I'm also wondering if another option here is to just clear_uptodate
> > any uptodate folio that fully starts beyond EOF. I.e., if the folio
> > straddles EOF then partial zero as below, if the folio is beyond EOF
> > then clear uptodate and let the existing code further down zero it.
> > 
> > > If the former, I suspect we'd need to tail zero on writeout regardless
> > > of folio size. Given that, and IIUC that clearing uptodate as such will
> > > basically cause the split folios to fall back into the !uptodate -> zero
> > > -> mark_uptodate sequence of shmem_writeout(), I wonder what the
> > > advantage of that is. It feels a bit circular to me when considered
> > > along with the tail zeroing below, but again I'm peeling away at
> > > complexity as I go here.. ;) Thoughts?
> > > 
> > > Brian
> > > 
> > > [1] prototype writeout logic:
> > > 
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index 634e499b6197..535021ae5a2f 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -1579,7 +1579,8 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
> > >   	struct inode *inode = mapping->host;
> > >   	struct shmem_inode_info *info = SHMEM_I(inode);
> > >   	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> > > -	pgoff_t index;
> > > +	loff_t i_size = i_size_read(inode);
> > > +	pgoff_t index = i_size >> PAGE_SHIFT;
> > >   	int nr_pages;
> > >   	bool split = false;
> > > @@ -1592,6 +1593,17 @@ int shmem_writeout(struct folio *folio, struct writeback_control *wbc)
> > >   	if (!total_swap_pages)
> > >   		goto redirty;
> > > +	/*
> > > +	 * If the folio straddles EOF, the tail portion must be zeroed on
> > > +	 * every swapout.
> > > +	 */
> > > +	if (folio_test_uptodate(folio) &&
> > > +	    folio->index <= index && folio_next_index(folio) > index) {
> > > +		size_t from = offset_in_folio(folio, i_size);
> > > +		if (from)
> > > +			folio_zero_segment(folio, from, folio_size(folio));
> > > +	}
> > > +
> > >   	/*
> > >   	 * If CONFIG_THP_SWAP is not enabled, the large folio should be
> > >   	 * split when swapping.
> > > 
> > > 
> > 
> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-07-14 14:35 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-25 18:49 [PATCH] tmpfs: zero post-eof folio range on file extension Brian Foster
2025-06-25 19:21 ` Matthew Wilcox
2025-06-26  5:35   ` Hugh Dickins
2025-06-26 12:55     ` Brian Foster
2025-06-27  3:21       ` Baolin Wang
2025-06-27 11:54         ` Brian Foster
2025-07-09  7:57 ` Hugh Dickins
2025-07-10  6:47   ` Baolin Wang
2025-07-10 22:20     ` Hugh Dickins
2025-07-11  3:50       ` Baolin Wang
2025-07-11  7:50         ` Hugh Dickins
2025-07-11  8:42           ` Baolin Wang
2025-07-11 16:08         ` Brian Foster
2025-07-11 20:15           ` Brian Foster
2025-07-14  3:05             ` Baolin Wang
2025-07-14 14:38               ` Brian Foster
2025-07-10 12:36   ` Brian Foster
2025-07-10 23:02     ` Hugh Dickins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).