Regression in generic/749 with 8k fsblock size on 6.18-rc1

* Regression in generic/749 with 8k fsblock size on 6.18-rc1
@ 2025-10-14 17:52 Darrick J. Wong
  2025-10-15  7:39 ` Kirill A. Shutemov
  2025-10-15 15:59 ` Kiryl Shutsemau
  0 siblings, 2 replies; 12+ messages in thread
From: Darrick J. Wong @ 2025-10-14 17:52 UTC
  To: kirill, akpm; +Cc: linux-mm, linux-fsdevel, xfs, Matthew Wilcox

Hi there,

On 6.18-rc1, generic/749[1] running on XFS with an 8k fsblock size fails
with the following:

--- /run/fstests/bin/tests/generic/749.out	2025-07-15 14:45:15.170416031 -0700
+++ /var/tmp/fstests/generic/749.out.bad	2025-10-13 17:48:53.079872054 -0700
@@ -1,2 +1,10 @@
 QA output created by 749
+Expected SIGBUS when mmap() reading beyond page boundary
+Expected SIGBUS when mmap() writing beyond page boundary
+Expected SIGBUS when mmap() reading beyond page boundary
+Expected SIGBUS when mmap() writing beyond page boundary
+Expected SIGBUS when mmap() reading beyond page boundary
+Expected SIGBUS when mmap() writing beyond page boundary
+Expected SIGBUS when mmap() reading beyond page boundary
+Expected SIGBUS when mmap() writing beyond page boundary
 Silence is golden

This test creates small files of various sizes, maps the EOF block, and
checks that you can read and write to the mmap'd page up to (but not
beyond) the next page boundary.

For 8k fsblock filesystems on x86, the pagecache creates a single 8k
folio to cache the entire fsblock containing EOF.  If EOF is in the
first 4096 bytes of that 8k fsblock, then it should be possible to do a
mmap read/write of the first 4k, but not the second 4k.  Memory accesses
to the second 4096 bytes should produce a SIGBUS.
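
For anyone who wants to try this outside fstests, here is a minimal
userspace sketch of that kind of check.  It is not the generic/749
source.  The helper names (read_raises_sigbus, probe_small_file), the
/tmp template path and the two-page mapping length are assumptions made
for illustration, and it only probes reads, not writes.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf probe_env;

static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(probe_env, 1);	/* unwind out of the faulting load */
}

/* Return 1 if loading map[off] raises SIGBUS, 0 if the load succeeds. */
static int read_raises_sigbus(const volatile unsigned char *map, size_t off)
{
	struct sigaction sa, old;
	int faulted;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, &old);

	if (sigsetjmp(probe_env, 1) == 0) {
		(void)map[off];		/* volatile load, may fault */
		faulted = 0;
	} else {
		faulted = 1;		/* we got here via siglongjmp() */
	}

	sigaction(SIGBUS, &old, NULL);
	return faulted;
}

/*
 * Create a sparse temporary file of file_size bytes (hypothetical path
 * under /tmp; file_size is assumed to be at most one base page), map two
 * base pages of it, and probe one byte in each page.  Each result is 1 if
 * that page raised SIGBUS.  Returns 0 on success, -1 if setup failed.
 */
int probe_small_file(size_t file_size, int *first_page, int *second_page)
{
	char path[] = "/tmp/eof-probe-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *map;
	int fd;

	assert(first_page != NULL && second_page != NULL);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);
	if (page <= 0 || ftruncate(fd, (off_t)file_size) != 0) {
		close(fd);
		return -1;
	}

	map = mmap(NULL, 2 * (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;

	*first_page = read_raises_sigbus(map, 0);
	*second_page = read_raises_sigbus(map, (size_t)page);
	munmap(map, 2 * (size_t)page);
	return 0;
}
```

With a 100-byte file, the expectation described above is that the first
page reads fine and the second page raises SIGBUS.  Reproducing the
failure would also need the file to sit on a filesystem whose block
size is larger than the base page size.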

I think the changes introduced in the two patches:

 * mm/fault: Try to map the entire file folio in finish_fault()
 * mm/filemap: Map entire large folio faultaround

break that SIGBUS behavior by mapping the entire 8k folio into the
process.

Reverting these two patches on a 6.18-rc1 kernel makes the regression go
away, but only by clumsily reverting to the 6.17 behavior where the
pagecache touched each base page of a large folio instead of doing
something to the whole folio at once.  I don't know what would be a good
solution, since you only need to do page-at-a-time for the EOF page, but
there's not really a good way to coordinate with i_size updates.

Did your testing also demonstrate this regression?

--D

[1] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/749?h=for-next

* Re: Regression in generic/749 with 8k fsblock size on 6.18-rc1
  2025-10-14 17:52 Regression in generic/749 with 8k fsblock size on 6.18-rc1 Darrick J. Wong
@ 2025-10-15  7:39 ` Kirill A. Shutemov
  2025-10-15 17:45   ` Darrick J. Wong
  2025-10-15 15:59 ` Kiryl Shutsemau
  1 sibling, 1 reply; 12+ messages in thread
From: Kirill A. Shutemov @ 2025-10-15  7:39 UTC
  To: Darrick J. Wong, akpm; +Cc: linux-mm, linux-fsdevel, xfs, Matthew Wilcox

On Tue, Oct 14, 2025, at 18:52, Darrick J. Wong wrote:
> Did your testing also demonstrate this regression?

I have not reproduced the issue yet.

Could you check if this patch makes a difference:

https://gist.github.com/kiryl/a2c71057bec332240216cc425aca791a

-- 
Kiryl Shutsemau / Kirill A. Shutemov

* Re: Regression in generic/749 with 8k fsblock size on 6.18-rc1
  2025-10-14 17:52 Regression in generic/749 with 8k fsblock size on 6.18-rc1 Darrick J. Wong
  2025-10-15  7:39 ` Kirill A. Shutemov
@ 2025-10-15 15:59 ` Kiryl Shutsemau
  2025-10-15 17:57   ` Darrick J. Wong
  1 sibling, 1 reply; 12+ messages in thread
From: Kiryl Shutsemau @ 2025-10-15 15:59 UTC
  To: Darrick J. Wong; +Cc: akpm, linux-mm, linux-fsdevel, xfs, Matthew Wilcox

On Tue, Oct 14, 2025 at 10:52:14AM -0700, Darrick J. Wong wrote:
> Hi there,
> 
> On 6.18-rc1, generic/749[1] running on XFS with an 8k fsblock size fails
> with the following:
> 
> --- /run/fstests/bin/tests/generic/749.out	2025-07-15 14:45:15.170416031 -0700
> +++ /var/tmp/fstests/generic/749.out.bad	2025-10-13 17:48:53.079872054 -0700
> @@ -1,2 +1,10 @@
>  QA output created by 749
> +Expected SIGBUS when mmap() reading beyond page boundary
> +Expected SIGBUS when mmap() writing beyond page boundary
> +Expected SIGBUS when mmap() reading beyond page boundary
> +Expected SIGBUS when mmap() writing beyond page boundary
> +Expected SIGBUS when mmap() reading beyond page boundary
> +Expected SIGBUS when mmap() writing beyond page boundary
> +Expected SIGBUS when mmap() reading beyond page boundary
> +Expected SIGBUS when mmap() writing beyond page boundary
>  Silence is golden
> 
> This test creates small files of various sizes, maps the EOF block, and
> checks that you can read and write to the mmap'd page up to (but not
> beyond) the next page boundary.
> 
> For 8k fsblock filesystems on x86, the pagecache creates a single 8k
> folio to cache the entire fsblock containing EOF.  If EOF is in the
> first 4096 bytes of that 8k fsblock, then it should be possible to do a
> mmap read/write of the first 4k, but not the second 4k.  Memory accesses
> to the second 4096 bytes should produce a SIGBUS.

Does anybody actually rely on this behaviour (beyond xfstests)?

I think this behaviour existed before the recent changes, but it was
less prominent.

Like, tmpfs with huge=always would fault in a PMD if there's an order-9
folio in the page cache, regardless of i_size.

See filemap_map_pages->filemap_map_pmd() path.

I believe the same happens for large folios in other filesystems.

Some of this behaviour is hidden by the truncate path trying to split
large folios, split PMDs and unmap a range of PTEs. But the split can
fail, so we cannot rely on this for correctness.

I would like to understand more about expectations in real workloads
before committing to a fix.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

* Re: Regression in generic/749 with 8k fsblock size on 6.18-rc1
  2025-10-15  7:39 ` Kirill A. Shutemov
@ 2025-10-15 17:45   ` Darrick J. Wong
  0 siblings, 0 replies; 12+ messages in thread
From: Darrick J. Wong @ 2025-10-15 17:45 UTC
  To: Kirill A. Shutemov; +Cc: akpm, linux-mm, linux-fsdevel, xfs, Matthew Wilcox

On Wed, Oct 15, 2025 at 08:39:53AM +0100, Kirill A. Shutemov wrote:
> On Tue, Oct 14, 2025, at 18:52, Darrick J. Wong wrote:
> > Did your testing also demonstrate this regression?
> 
> I have not reproduced the issue yet.
> 
> Could you check if this patch makes a difference:
> 
> https://gist.github.com/kiryl/a2c71057bec332240216cc425aca791a

Yes, it does make the test failure go away:

FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 alder-mtr00 6.18.0-rc1-xfsx #rc1 SMP PREEMPT_DYNAMIC Wed Oct 15 10:34:11 PDT 2025
MKFS_OPTIONS  -- -f -b size=8192, /dev/sdf
MOUNT_OPTIONS -- -o uquota,gquota,pquota, /dev/sdf /opt

generic/749        9s
Ran: generic/749
Passed all 1 tests

Is it valid to i_size_read() in the two places you add them?  I /think/
the folio is locked in the filemap.c hunk.  I'm not as sure about the
finish_fault changes.  If the EOF folio's locked then I think it's the
case that anything trying to change the file size will block until the
folio lock drops.
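
To make the question concrete, here is a userspace model of the kind of
clamp an i_size check could feed.  It is not the contents of the gist
above, which is not reproduced in this thread.  The function name and
parameters are hypothetical, and it assumes i_size was sampled once by
the caller while nothing could change it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * folio_index: file offset of the folio's first page, in units of pages.
 * folio_nr:    number of base pages in the folio.
 * i_size:      file size in bytes, as sampled by the caller.
 *
 * Returns how many leading base pages of the folio hold at least one
 * byte below EOF, i.e. how many could be mapped without handing out a
 * page that lies wholly beyond EOF.
 */
size_t nr_pages_below_eof(size_t folio_index, size_t folio_nr,
			  size_t i_size, size_t page_size)
{
	size_t end_index;

	assert(page_size != 0);

	/* Index of the first page that lies wholly beyond EOF. */
	end_index = i_size / page_size + (i_size % page_size != 0);

	if (end_index <= folio_index)
		return 0;
	if (end_index - folio_index < folio_nr)
		return end_index - folio_index;
	return folio_nr;
}
```

For the 8k-folio case in this thread (folio_index 0, folio_nr 2, 4k
pages), a 100-byte file yields 1 and a 4097-byte file yields 2.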

<shrug> Thanks for your help, in any case :)

--D

> -- 
> Kiryl Shutsemau / Kirill A. Shutemov
> 

* Re: Regression in generic/749 with 8k fsblock size on 6.18-rc1
  2025-10-15 15:59 ` Kiryl Shutsemau
@ 2025-10-15 17:57   ` Darrick J. Wong
  2025-10-16 10:22     ` Kiryl Shutsemau
  0 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2025-10-15 17:57 UTC
  To: Kiryl Shutsemau; +Cc: akpm, linux-mm, linux-fsdevel, xfs, Matthew Wilcox

On Wed, Oct 15, 2025 at 04:59:03PM +0100, Kiryl Shutsemau wrote:
> On Tue, Oct 14, 2025 at 10:52:14AM -0700, Darrick J. Wong wrote:
> > Hi there,
> > 
> > On 6.18-rc1, generic/749[1] running on XFS with an 8k fsblock size fails
> > with the following:
> > 
> > --- /run/fstests/bin/tests/generic/749.out	2025-07-15 14:45:15.170416031 -0700
> > +++ /var/tmp/fstests/generic/749.out.bad	2025-10-13 17:48:53.079872054 -0700
> > @@ -1,2 +1,10 @@
> >  QA output created by 749
> > +Expected SIGBUS when mmap() reading beyond page boundary
> > +Expected SIGBUS when mmap() writing beyond page boundary
> > +Expected SIGBUS when mmap() reading beyond page boundary
> > +Expected SIGBUS when mmap() writing beyond page boundary
> > +Expected SIGBUS when mmap() reading beyond page boundary
> > +Expected SIGBUS when mmap() writing beyond page boundary
> > +Expected SIGBUS when mmap() reading beyond page boundary
> > +Expected SIGBUS when mmap() writing beyond page boundary
> >  Silence is golden
> > 
> > This test creates small files of various sizes, maps the EOF block, and
> > checks that you can read and write to the mmap'd page up to (but not
> > beyond) the next page boundary.
> > 
> > For 8k fsblock filesystems on x86, the pagecache creates a single 8k
> > folio to cache the entire fsblock containing EOF.  If EOF is in the
> > first 4096 bytes of that 8k fsblock, then it should be possible to do a
> > mmap read/write of the first 4k, but not the second 4k.  Memory accesses
> > to the second 4096 bytes should produce a SIGBUS.
> 
> Does anybody actually relies on this behaviour (beyond xfstests)?

Beats me, but the mmap manpage says:

       SIGBUS Attempted access to a page of the buffer that lies beyond
              the end of the mapped file.  For an explanation of the
              treatment of the bytes in the page that corresponds to the
              end of a mapped file that is not a multiple of the page
              size, see NOTES.

POSIX 2024 says:

The system shall always zero-fill any partial page at the end of an
object. Further, the system shall never write out any modified portions
of the last page of an object which are beyond its end. References
within the address range starting at pa and continuing for len bytes to
whole pages following the end of an object shall result in delivery of a
SIGBUS signal.

https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/functions/mmap.html#tag_17_345

From both I would surmise that it's a reasonable expectation that you
can't map basepages beyond EOF and have page faults on those pages
succeed.
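
Read that way, the rule can be written down as a small classifier.  This
is only a sketch of that reading: the enum and function names are made
up for illustration, offsets are byte offsets into the file, and the
mapping is assumed to start at file offset 0.

```c
#include <assert.h>
#include <stddef.h>

enum mmap_access {
	ACCESS_FILE_DATA,	/* byte is inside the file */
	ACCESS_ZERO_TAIL,	/* same page as EOF, past the last byte */
	ACCESS_SIGBUS,		/* whole page lies beyond EOF */
};

/* Round size up to the next multiple of page_size (a power of two). */
size_t round_up_to_page(size_t size, size_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (size + page_size - 1) & ~(page_size - 1);
}

/* Classify an access at byte offset off into a file of file_size bytes. */
enum mmap_access classify_access(size_t off, size_t file_size,
				 size_t page_size)
{
	if (off < file_size)
		return ACCESS_FILE_DATA;
	if (off < round_up_to_page(file_size, page_size))
		return ACCESS_ZERO_TAIL;
	return ACCESS_SIGBUS;
}
```

For a 100-byte file with 4k pages, offsets 100 through 4095 fall in the
zero-filled tail and offset 4096 is the first one expected to SIGBUS,
whatever size of folio the pagecache picked.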

> I think this behaviour existed before the recent changes, but it was
> less prominent.
> 
> Like, tmpfs with huge=always would fault-in PMD if there's order-9 folio
> in page cache regardless of i_size.
> 
> See filemap_map_pages->filemap_map_pmd() path.
> 
> I believe the same happens for large folios in other filesystems.

<shrug> The kernel SIGBUS'd as expected in 6.17.  For the 8k fsblock
case there indeed was a large folio caching the EOF, but then we were
also installing 4k PTE mappings.

(I'm not sure what happens if you actually have a PMD-sized page since
those are a little hard to force.)

> Some of this behaviour is hidden by truncate path trying to split large
> folios, split PMD and unmap a range of PTEs. But split can fail, so we
> cannot rely on this for correctness.
> 
> I would like to understand more about expectations in real workload
> before commit to a fix.

Yeah, I dislike the incongruities between byte-stream files vs mmapping
pages.  All the post-EOF zeroing logic is constantly getting broken in
subtle weird ways.

willy? :D

--D

> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov
> 

* Re: Regression in generic/749 with 8k fsblock size on 6.18-rc1
  2025-10-15 17:57   ` Darrick J. Wong
@ 2025-10-16 10:22     ` Kiryl Shutsemau
  2025-10-16 22:33       ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Kiryl Shutsemau @ 2025-10-16 10:22 UTC
  To: Darrick J. Wong, Matthew Wilcox, Luis Chamberlain, Pankaj Raghav,
	Zorro Lang
  Cc: akpm, linux-mm, linux-fsdevel, xfs

On Wed, Oct 15, 2025 at 10:57:26AM -0700, Darrick J. Wong wrote:
> On Wed, Oct 15, 2025 at 04:59:03PM +0100, Kiryl Shutsemau wrote:
> > On Tue, Oct 14, 2025 at 10:52:14AM -0700, Darrick J. Wong wrote:
> > > Hi there,
> > > 
> > > On 6.18-rc1, generic/749[1] running on XFS with an 8k fsblock size fails
> > > with the following:
> > > 
> > > --- /run/fstests/bin/tests/generic/749.out	2025-07-15 14:45:15.170416031 -0700
> > > +++ /var/tmp/fstests/generic/749.out.bad	2025-10-13 17:48:53.079872054 -0700
> > > @@ -1,2 +1,10 @@
> > >  QA output created by 749
> > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > +Expected SIGBUS when mmap() writing beyond page boundary
> > >  Silence is golden
> > > 
> > > This test creates small files of various sizes, maps the EOF block, and
> > > checks that you can read and write to the mmap'd page up to (but not
> > > beyond) the next page boundary.
> > > 
> > > For 8k fsblock filesystems on x86, the pagecache creates a single 8k
> > > folio to cache the entire fsblock containing EOF.  If EOF is in the
> > > first 4096 bytes of that 8k fsblock, then it should be possible to do a
> > > mmap read/write of the first 4k, but not the second 4k.  Memory accesses
> > > to the second 4096 bytes should produce a SIGBUS.
> > 
> > Does anybody actually relies on this behaviour (beyond xfstests)?
> 
> Beats me, but the mmap manpage says:
...
> POSIX 2024 says:
...
> From both I would surmise that it's a reasonable expectation that you
> can't map basepages beyond EOF and have page faults on those pages
> succeed.

<Added folks from the commit that introduced generic/749>

A modern kernel with large folios blurs the line of what a page is.

I don't want to play spec lawyer. Let's look at real workloads.

If there's anything that actually relies on this SIGBUS corner case,
let's see how we can fix the kernel. But it will cost some CPU cycles.

If it only broke a synthetic test case, I'm inclined to say WONTFIX.

Any opinions?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

* Re: Regression in generic/749 with 8k fsblock size on 6.18-rc1
  2025-10-16 10:22     ` Kiryl Shutsemau
@ 2025-10-16 22:33       ` Dave Chinner
  2025-10-17 14:28         ` Kiryl Shutsemau
  2025-10-21 17:02         ` Luis Chamberlain
  0 siblings, 2 replies; 12+ messages in thread
From: Dave Chinner @ 2025-10-16 22:33 UTC
  To: Kiryl Shutsemau
  Cc: Darrick J. Wong, Matthew Wilcox, Luis Chamberlain, Pankaj Raghav,
	Zorro Lang, akpm, linux-mm, linux-fsdevel, xfs

On Thu, Oct 16, 2025 at 11:22:00AM +0100, Kiryl Shutsemau wrote:
> On Wed, Oct 15, 2025 at 10:57:26AM -0700, Darrick J. Wong wrote:
> > On Wed, Oct 15, 2025 at 04:59:03PM +0100, Kiryl Shutsemau wrote:
> > > On Tue, Oct 14, 2025 at 10:52:14AM -0700, Darrick J. Wong wrote:
> > > > Hi there,
> > > > 
> > > > On 6.18-rc1, generic/749[1] running on XFS with an 8k fsblock size fails
> > > > with the following:
> > > > 
> > > > --- /run/fstests/bin/tests/generic/749.out	2025-07-15 14:45:15.170416031 -0700
> > > > +++ /var/tmp/fstests/generic/749.out.bad	2025-10-13 17:48:53.079872054 -0700
> > > > @@ -1,2 +1,10 @@
> > > >  QA output created by 749
> > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > >  Silence is golden
> > > > 
> > > > This test creates small files of various sizes, maps the EOF block, and
> > > > checks that you can read and write to the mmap'd page up to (but not
> > > > beyond) the next page boundary.
> > > > 
> > > > For 8k fsblock filesystems on x86, the pagecache creates a single 8k
> > > > folio to cache the entire fsblock containing EOF.  If EOF is in the
> > > > first 4096 bytes of that 8k fsblock, then it should be possible to do a
> > > > mmap read/write of the first 4k, but not the second 4k.  Memory accesses
> > > > to the second 4096 bytes should produce a SIGBUS.
> > > 
> > > Does anybody actually relies on this behaviour (beyond xfstests)?
> > 
> > Beats me, but the mmap manpage says:
> ...
> > POSIX 2024 says:
> ...
> > From both I would surmise that it's a reasonable expectation that you
> > can't map basepages beyond EOF and have page faults on those pages
> > succeed.
> 
> <Added folks form the commit that introduced generic/749>
> 
> Modern kernel with large folios blurs the line of what is the page.
> 
> I don't want play spec lawyer. Let's look at real workloads.

Or, more importantly, consider the security-related implications of
the change....

> If there's anything that actually relies on this SIGBUS corner case,
> let's see how we can fix the kernel. But it will cost some CPU cycles.
> 
> If it only broke syntactic test case, I'm inclined to say WONTFIX.
> 
> Any opinions?

Mapping beyond EOF ranges into userspace address spaces is a
potential security risk. If there is ever a zeroing-beyond-EOF bug
related to large folios (history tells us we are *guaranteed* to
screw this up somewhere in future), then allowing mapping all the
way to the end of the large folio could expose a -lot more- stale
kernel data to userspace than just what the tail of a PAGE_SIZE
faulted region would expose.

Hence allowing applications to successfully fault an (unpredictable)
distance far beyond EOF because the page cache used a large folio
spanning EOF seems, to me, to be a very undesirable behaviour to
expose to userspace.
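
To put rough numbers on that, a back-of-the-envelope helper follows.
It is an illustration, not kernel code: the function name is made up,
and it assumes the folio is naturally aligned in the file, so that the
mapping ends at the next multiple of the mapping granule after EOF.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes past EOF that become addressable when the mapping is cut off at
 * the next multiple of granule after file_size.  Pass the base page
 * size for page-at-a-time mapping, or the folio size if the whole folio
 * is mapped.
 */
size_t post_eof_bytes_reachable(size_t file_size, size_t granule)
{
	size_t rem;

	assert(granule != 0);
	rem = file_size % granule;
	return rem ? granule - rem : 0;
}
```

For a 100-byte file that is 3996 bytes with 4k pages, 8092 bytes with an
8k folio, and 2097052 bytes with a 2MiB (order-9) folio.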

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Regression in generic/749 with 8k fsblock size on 6.18-rc1
  2025-10-16 22:33       ` Dave Chinner
@ 2025-10-17 14:28         ` Kiryl Shutsemau
  2025-10-17 16:02           ` Darrick J. Wong
  2025-10-17 17:14           ` Matthew Wilcox
  2025-10-21 17:02         ` Luis Chamberlain
  1 sibling, 2 replies; 12+ messages in thread
From: Kiryl Shutsemau @ 2025-10-17 14:28 UTC
  To: Dave Chinner
  Cc: Darrick J. Wong, Matthew Wilcox, Luis Chamberlain, Pankaj Raghav,
	Zorro Lang, akpm, linux-mm, linux-fsdevel, xfs

On Fri, Oct 17, 2025 at 09:33:15AM +1100, Dave Chinner wrote:
> On Thu, Oct 16, 2025 at 11:22:00AM +0100, Kiryl Shutsemau wrote:
> > On Wed, Oct 15, 2025 at 10:57:26AM -0700, Darrick J. Wong wrote:
> > > On Wed, Oct 15, 2025 at 04:59:03PM +0100, Kiryl Shutsemau wrote:
> > > > On Tue, Oct 14, 2025 at 10:52:14AM -0700, Darrick J. Wong wrote:
> > > > > Hi there,
> > > > > 
> > > > > On 6.18-rc1, generic/749[1] running on XFS with an 8k fsblock size fails
> > > > > with the following:
> > > > > 
> > > > > --- /run/fstests/bin/tests/generic/749.out	2025-07-15 14:45:15.170416031 -0700
> > > > > +++ /var/tmp/fstests/generic/749.out.bad	2025-10-13 17:48:53.079872054 -0700
> > > > > @@ -1,2 +1,10 @@
> > > > >  QA output created by 749
> > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > >  Silence is golden
> > > > > 
> > > > > This test creates small files of various sizes, maps the EOF block, and
> > > > > checks that you can read and write to the mmap'd page up to (but not
> > > > > beyond) the next page boundary.
> > > > > 
> > > > > For 8k fsblock filesystems on x86, the pagecache creates a single 8k
> > > > > folio to cache the entire fsblock containing EOF.  If EOF is in the
> > > > > first 4096 bytes of that 8k fsblock, then it should be possible to do a
> > > > > mmap read/write of the first 4k, but not the second 4k.  Memory accesses
> > > > > to the second 4096 bytes should produce a SIGBUS.
> > > > 
> > > > Does anybody actually relies on this behaviour (beyond xfstests)?
> > > 
> > > Beats me, but the mmap manpage says:
> > ...
> > > POSIX 2024 says:
> > ...
> > > From both I would surmise that it's a reasonable expectation that you
> > > can't map basepages beyond EOF and have page faults on those pages
> > > succeed.
> > 
> > <Added folks form the commit that introduced generic/749>
> > 
> > Modern kernel with large folios blurs the line of what is the page.
> > 
> > I don't want play spec lawyer. Let's look at real workloads.
> 
> Or, more importantly, consider the security-related implications of
> the change....
> 
> > If there's anything that actually relies on this SIGBUS corner case,
> > let's see how we can fix the kernel. But it will cost some CPU cycles.
> > 
> > If it only broke syntactic test case, I'm inclined to say WONTFIX.
> > 
> > Any opinions?
> 
> Mapping beyond EOF ranges into userspace address spaces is a
> potential security risk. If there is ever a zeroing-beyond-EOF bug
> related to large folios (history tells us we are *guaranteed* to
> screw this up somewhere in future), then allowing mapping all the
> way to the end of the large folio could expose a -lot more- stale
> kernel data to userspace than just what the tail of a PAGE_SIZE
> faulted region would expose.

Could you point me to the details on a zeroing-beyond-EOF bug?
I don't have context here.

But if it is, as you are saying, *guaranteed* to happen again, maybe we
should slap __GFP_ZERO on page cache allocations? It will address the
problem at the root.

Although, I think you are being dramatic about "*guaranteed*"...

If we solved the problem of zeroing up to the PAGE_SIZE border, I don't
see why zeroing up to the folio_size() border is conceptually any
different. Might require some bug squashing, sure.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

* Re: Regression in generic/749 with 8k fsblock size on 6.18-rc1
  2025-10-17 14:28         ` Kiryl Shutsemau
@ 2025-10-17 16:02           ` Darrick J. Wong
  2025-10-17 17:00             ` Kiryl Shutsemau
  2025-10-17 17:14           ` Matthew Wilcox
  1 sibling, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2025-10-17 16:02 UTC
  To: Kiryl Shutsemau
  Cc: Dave Chinner, Matthew Wilcox, Luis Chamberlain, Pankaj Raghav,
	Zorro Lang, akpm, linux-mm, linux-fsdevel, xfs

On Fri, Oct 17, 2025 at 03:28:32PM +0100, Kiryl Shutsemau wrote:
> On Fri, Oct 17, 2025 at 09:33:15AM +1100, Dave Chinner wrote:
> > On Thu, Oct 16, 2025 at 11:22:00AM +0100, Kiryl Shutsemau wrote:
> > > On Wed, Oct 15, 2025 at 10:57:26AM -0700, Darrick J. Wong wrote:
> > > > On Wed, Oct 15, 2025 at 04:59:03PM +0100, Kiryl Shutsemau wrote:
> > > > > On Tue, Oct 14, 2025 at 10:52:14AM -0700, Darrick J. Wong wrote:
> > > > > > Hi there,
> > > > > > 
> > > > > > On 6.18-rc1, generic/749[1] running on XFS with an 8k fsblock size fails
> > > > > > with the following:
> > > > > > 
> > > > > > --- /run/fstests/bin/tests/generic/749.out	2025-07-15 14:45:15.170416031 -0700
> > > > > > +++ /var/tmp/fstests/generic/749.out.bad	2025-10-13 17:48:53.079872054 -0700
> > > > > > @@ -1,2 +1,10 @@
> > > > > >  QA output created by 749
> > > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > > >  Silence is golden
> > > > > > 
> > > > > > This test creates small files of various sizes, maps the EOF block, and
> > > > > > checks that you can read and write to the mmap'd page up to (but not
> > > > > > beyond) the next page boundary.
> > > > > > 
> > > > > > For 8k fsblock filesystems on x86, the pagecache creates a single 8k
> > > > > > folio to cache the entire fsblock containing EOF.  If EOF is in the
> > > > > > first 4096 bytes of that 8k fsblock, then it should be possible to do a
> > > > > > mmap read/write of the first 4k, but not the second 4k.  Memory accesses
> > > > > > to the second 4096 bytes should produce a SIGBUS.
> > > > > 
> > > > > Does anybody actually relies on this behaviour (beyond xfstests)?
> > > > 
> > > > Beats me, but the mmap manpage says:
> > > ...
> > > > POSIX 2024 says:
> > > ...
> > > > From both I would surmise that it's a reasonable expectation that you
> > > > can't map basepages beyond EOF and have page faults on those pages
> > > > succeed.
> > > 
> > > <Added folks form the commit that introduced generic/749>
> > > 
> > > Modern kernel with large folios blurs the line of what is the page.
> > > 
> > > I don't want play spec lawyer. Let's look at real workloads.
> > 
> > Or, more importantly, consider the security-related implications of
> > the change....
> > 
> > > If there's anything that actually relies on this SIGBUS corner case,
> > > let's see how we can fix the kernel. But it will cost some CPU cycles.
> > > 
> > > If it only broke syntactic test case, I'm inclined to say WONTFIX.
> > > 
> > > Any opinions?
> > 
> > Mapping beyond EOF ranges into userspace address spaces is a
> > potential security risk. If there is ever a zeroing-beyond-EOF bug
> > related to large folios (history tells us we are *guaranteed* to
> > screw this up somewhere in future), then allowing mapping all the
> > way to the end of the large folio could expose a -lot more- stale
> > kernel data to userspace than just what the tail of a PAGE_SIZE
> > faulted region would expose.
> 
> Could you point me to the details on a zeroing-beyond-EOF bug?
> I don't have context here.

Create a file whose size is neither aligned to PAGE_SIZE nor the fs
block size.  The pagecache only maps full folios, so the last folio in
the pagecache will have EOF in the middle of it.

So what do you put in the folio beyond EOF?  Most Linux filesystems
write zeroes to the post-EOF bytes at some point before writing the
block out to disk so that we don't persist random stale kernel memory.

Now you want to mmap that EOF folio into a userspace process.  It was
stupid to allow that because the contents of the folio beyond EOF are
undefined.  But we're stuck with this stupid API.

So now we need to zero the post-EOF folio contents before taking the
first fault on the mmap region, because we don't want the userspace
program to be able to load random stale kernel memory.

We also don't want programs to be able to store information in the mmap
region beyond EOF to prevent abuse, so writeback has to zero the post
EOF contents before writing the pagecache to disk.
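
As a userspace model of that zeroing step (not the kernel
implementation), the sketch below treats a plain buffer as the memory
of the folio spanning EOF.  The function name and parameters are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * buf models the memory of one folio caching file bytes
 * [folio_pos, folio_pos + folio_len).  Zero every byte at or past
 * i_size and return the number of bytes zeroed.
 */
size_t zero_post_eof(unsigned char *buf, size_t folio_pos, size_t folio_len,
		     size_t i_size)
{
	size_t start;

	assert(buf != NULL);

	if (i_size >= folio_pos + folio_len)
		return 0;	/* folio lies entirely inside the file */

	/* Offset within the folio of the first post-EOF byte. */
	start = i_size > folio_pos ? i_size - folio_pos : 0;
	memset(buf + start, 0, folio_len - start);
	return folio_len - start;
}
```

The same step has to run both before the first fault exposes the folio
to userspace and again at writeback, since a program may have stored
into the post-EOF tail in between.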

> But if it is, as you saying, *guaranteed* to happen again, maybe we
> should slap __GFP_ZERO on page cache allocations? It will address the
> problem at the root.

Weren't you complaining upthread about spending CPU cycles?  __GFP_ZERO
on every page loaded into the pagecache isn't free either.

> Although, I think you are being dramatic about "*guaranteed*"...

He's not, post-EOF folio zeroing has broken in weird subtle ways every
1-2 years for the nearly 20 years I've worked in filesystems.

> If we solved problem of zeroing upto PAGE_SIZE border, I don't see
> why zeroing upto folio_size() border any conceptually different.
> Might require some bug squeezing, sure.

We already do that, but that's not the issue here.

The issue here is that you are *breaking* XFS behavior that is
documented in the mmap manpage.  This worked as documented in 6.17, and
now it doesn't work.

--D

> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov
> 

* Re: Regression in generic/749 with 8k fsblock size on 6.18-rc1
  2025-10-17 16:02           ` Darrick J. Wong
@ 2025-10-17 17:00             ` Kiryl Shutsemau
  0 siblings, 0 replies; 12+ messages in thread
From: Kiryl Shutsemau @ 2025-10-17 17:00 UTC
  To: Darrick J. Wong, Linus Torvalds
  Cc: Dave Chinner, Matthew Wilcox, Luis Chamberlain, Pankaj Raghav,
	Zorro Lang, akpm, linux-mm, linux-fsdevel, xfs

On Fri, Oct 17, 2025 at 09:02:41AM -0700, Darrick J. Wong wrote:
> On Fri, Oct 17, 2025 at 03:28:32PM +0100, Kiryl Shutsemau wrote:
> > On Fri, Oct 17, 2025 at 09:33:15AM +1100, Dave Chinner wrote:
> > > On Thu, Oct 16, 2025 at 11:22:00AM +0100, Kiryl Shutsemau wrote:
> > > > On Wed, Oct 15, 2025 at 10:57:26AM -0700, Darrick J. Wong wrote:
> > > > > On Wed, Oct 15, 2025 at 04:59:03PM +0100, Kiryl Shutsemau wrote:
> > > > > > On Tue, Oct 14, 2025 at 10:52:14AM -0700, Darrick J. Wong wrote:
> > > > > > > Hi there,
> > > > > > > 
> > > > > > > On 6.18-rc1, generic/749[1] running on XFS with an 8k fsblock size fails
> > > > > > > with the following:
> > > > > > > 
> > > > > > > --- /run/fstests/bin/tests/generic/749.out	2025-07-15 14:45:15.170416031 -0700
> > > > > > > +++ /var/tmp/fstests/generic/749.out.bad	2025-10-13 17:48:53.079872054 -0700
> > > > > > > @@ -1,2 +1,10 @@
> > > > > > >  QA output created by 749
> > > > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > > > >  Silence is golden
> > > > > > > 
> > > > > > > This test creates small files of various sizes, maps the EOF block, and
> > > > > > > checks that you can read and write to the mmap'd page up to (but not
> > > > > > > beyond) the next page boundary.
> > > > > > > 
> > > > > > > For 8k fsblock filesystems on x86, the pagecache creates a single 8k
> > > > > > > folio to cache the entire fsblock containing EOF.  If EOF is in the
> > > > > > > first 4096 bytes of that 8k fsblock, then it should be possible to do a
> > > > > > > mmap read/write of the first 4k, but not the second 4k.  Memory accesses
> > > > > > > to the second 4096 bytes should produce a SIGBUS.
> > > > > > 
> > > > > > Does anybody actually relies on this behaviour (beyond xfstests)?
> > > > > 
> > > > > Beats me, but the mmap manpage says:
> > > > ...
> > > > > POSIX 2024 says:
> > > > ...
> > > > > From both I would surmise that it's a reasonable expectation that you
> > > > > can't map basepages beyond EOF and have page faults on those pages
> > > > > succeed.
> > > > 
> > > > <Added folks form the commit that introduced generic/749>
> > > > 
> > > > Modern kernel with large folios blurs the line of what is the page.
> > > > 
> > > > I don't want play spec lawyer. Let's look at real workloads.
> > > 
> > > Or, more importantly, consider the security-related implications of
> > > the change....
> > > 
> > > > If there's anything that actually relies on this SIGBUS corner case,
> > > > let's see how we can fix the kernel. But it will cost some CPU cycles.
> > > > 
> > > > If it only broke syntactic test case, I'm inclined to say WONTFIX.
> > > > 
> > > > Any opinions?
> > > 
> > > Mapping beyond EOF ranges into userspace address spaces is a
> > > potential security risk. If there is ever a zeroing-beyond-EOF bug
> > > related to large folios (history tells us we are *guaranteed* to
> > > screw this up somewhere in future), then allowing mapping all the
> > > way to the end of the large folio could expose a -lot more- stale
> > > kernel data to userspace than just what the tail of a PAGE_SIZE
> > > faulted region would expose.
> > 
> > Could you point me to the details on a zeroing-beyond-EOF bug?
> > I don't have context here.
> 
> Create a file whose size is neither aligned to PAGE_SIZE nor the fs
> block size.  The pagecache only maps full folios, so the last folio in
> the pagecache will have EOF in the middle of it.
> 
> So what do you put in the folio beyond EOF?  Most Linux filesystems
> write zeroes to the post-EOF bytes at some point before writing the
> block out to disk so that we don't persist random stale kernel memory.
> 
> Now you want to mmap that EOF folio into a userspace process.  It was
> stupid to allow that because the contents of the folio beyond EOF are
> undefined.  But we're stuck with this stupid API.
> 
> So now we need to zero the post-EOF folio contents before taking the
> first fault on the mmap region, because we don't want the userspace
> program to be able to load random stale kernel memory.
> 
> We also don't want programs to be able to store information in the mmap
> region beyond EOF to prevent abuse, so writeback has to zero the post
> EOF contents before writing the pagecache to disk.
>
> > But if it is, as you saying, *guaranteed* to happen again, maybe we
> > should slap __GFP_ZERO on page cache allocations? It will address the
> > problem at the root.
> 
> Weren't you complaining upthread about spending CPU cycles?  GFP_ZERO
> on every page loaded into the pagecache isn't free either.

+Linus.

True. __GFP_ZERO is a stupid solution.

I think the folio has to be fully populated when it is read in from
backing storage, before it is marked uptodate. If it crosses i_size, the
tail has to be zeroed. There is no additional overhead for folios that
are fully within i_size.
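
To make the arithmetic concrete, here is a minimal sketch in Python
rather than kernel C. The function name and parameters are hypothetical,
chosen for illustration only. It just computes which bytes of a folio
lie beyond i_size and would need zeroing before the folio is marked
uptodate; it is not how any filesystem actually implements this.

```python
def post_eof_zero_range(folio_pos, folio_size, i_size):
    """Return (offset_in_folio, length) of the bytes beyond i_size,
    or None if the folio lies fully within i_size.

    All names here are illustrative assumptions, not kernel APIs.
    folio_pos is the file offset of the first byte of the folio.
    """
    folio_end = folio_pos + folio_size
    if i_size >= folio_end:
        # Folio is fully within i_size: nothing to zero, no extra cost.
        return None
    # Zeroing starts at EOF, or at the folio start if EOF precedes it.
    start = max(i_size, folio_pos)
    return (start - folio_pos, folio_end - start)
```

For the 8k folio from the original report with EOF at byte 1000, this
gives offset 1000 and length 7192, i.e. the whole tail of the folio, not
just the tail of the first 4k page.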

But if you insist that it is inevitably going to be broken, __GFP_ZERO
would solve the data-leak problem at the root.


Whether to zero the memory again on writeback is less critical in my
view. It could only have whatever legitimate user wrote there and is not
a data leak. Or am I wrong?

> > Although, I think you are being dramatic about "*guaranteed*"...
> 
> He's not, post-EOF folio zeroing has broken in weird subtle ways every
> 1-2 years for the nearly 20 years I've worked in filesystems.
> 
> > If we solved problem of zeroing upto PAGE_SIZE border, I don't see
> > why zeroing upto folio_size() border any conceptually different.
> > Might require some bug squeezing, sure.
> 
> We already do that, but that's not the issue here.
> 
> The issue here is that you are *breaking* XFS behavior that is
> documented in the mmap manpage.  This worked as documented in 6.17, and
> now it doesn't work.

As I described, it was already broken, but in a less obvious way. Before
my recent changes, order-9 folios were mapped as PMDs regardless of
i_size. They *usually* get split on truncate, but that is not guaranteed
because the split can fail.

We can "fix" this too by giving up on mapping folios as PMDs (or
coalesced PTEs) if they cross the i_size boundary.

I think it is a bad trade-off. It will require more work in the page
fault path and reduce the TLB hit rate.
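
For illustration, the boundary being argued about can be sketched as
below. This is Python rather than kernel C, the function name is
hypothetical, and PAGE_SIZE = 4096 is an assumption matching x86. It
only shows how many base pages of a folio a fault could map if mapping
stopped at the page containing EOF; it is not the fault-path code.

```python
PAGE_SIZE = 4096  # assumption: x86 base page size


def mappable_pages(folio_pos, folio_size, i_size):
    """Number of leading base pages of a folio that may be mapped if
    faults must signal beyond the page that contains EOF.

    Illustrative names only. folio_pos is assumed to be page-aligned.
    """
    if i_size <= folio_pos:
        # The folio starts at or beyond EOF: nothing may be mapped.
        return 0
    # Round i_size up to a page boundary: the page holding EOF is
    # still mappable, pages after it are not.
    eof_page_end = -(-i_size // PAGE_SIZE) * PAGE_SIZE
    end = min(eof_page_end, folio_pos + folio_size)
    return (end - folio_pos) // PAGE_SIZE
```

With the 8k folio from the report and EOF inside the first 4096 bytes,
only one page is mappable and an access to the second 4k would signal.
For an order-9 (2M) folio the result is 512 only when i_size reaches
into the folio's last page.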

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

* Re: Regression in generic/749 with 8k fsblock size on 6.18-rc1
  2025-10-17 14:28         ` Kiryl Shutsemau
  2025-10-17 16:02           ` Darrick J. Wong
@ 2025-10-17 17:14           ` Matthew Wilcox
  1 sibling, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2025-10-17 17:14 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Dave Chinner, Darrick J. Wong, Luis Chamberlain, Pankaj Raghav,
	Zorro Lang, akpm, linux-mm, linux-fsdevel, xfs

On Fri, Oct 17, 2025 at 03:28:32PM +0100, Kiryl Shutsemau wrote:
> If we solved problem of zeroing upto PAGE_SIZE border, I don't see
> why zeroing upto folio_size() border any conceptually different.
> Might require some bug squeezing, sure.

I'm travelling right now and don't want to dig my way through the POSIX
spec to lawyer about this.  Last time I looked at this problem, I came
away convinced that it was a POSIX requirement that page faults beyond
the page which contains EOF must signal.

Even if not, it's a QoI issue and we've invested significant effort
keeping this guarantee.  Please just fix the bug.

* Re: Regression in generic/749 with 8k fsblock size on 6.18-rc1
  2025-10-16 22:33       ` Dave Chinner
  2025-10-17 14:28         ` Kiryl Shutsemau
@ 2025-10-21 17:02         ` Luis Chamberlain
  1 sibling, 0 replies; 12+ messages in thread
From: Luis Chamberlain @ 2025-10-21 17:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kiryl Shutsemau, Darrick J. Wong, Matthew Wilcox, Pankaj Raghav,
	Zorro Lang, akpm, linux-mm, linux-fsdevel, xfs

On Fri, Oct 17, 2025 at 09:33:15AM +1100, Dave Chinner wrote:
> On Thu, Oct 16, 2025 at 11:22:00AM +0100, Kiryl Shutsemau wrote:
> > On Wed, Oct 15, 2025 at 10:57:26AM -0700, Darrick J. Wong wrote:
> > > On Wed, Oct 15, 2025 at 04:59:03PM +0100, Kiryl Shutsemau wrote:
> > > > On Tue, Oct 14, 2025 at 10:52:14AM -0700, Darrick J. Wong wrote:
> > > > > Hi there,
> > > > > 
> > > > > On 6.18-rc1, generic/749[1] running on XFS with an 8k fsblock size fails
> > > > > with the following:
> > > > > 
> > > > > --- /run/fstests/bin/tests/generic/749.out	2025-07-15 14:45:15.170416031 -0700
> > > > > +++ /var/tmp/fstests/generic/749.out.bad	2025-10-13 17:48:53.079872054 -0700
> > > > > @@ -1,2 +1,10 @@
> > > > >  QA output created by 749
> > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > > +Expected SIGBUS when mmap() reading beyond page boundary
> > > > > +Expected SIGBUS when mmap() writing beyond page boundary
> > > > >  Silence is golden
> > > > > 
> > > > > This test creates small files of various sizes, maps the EOF block, and
> > > > > checks that you can read and write to the mmap'd page up to (but not
> > > > > beyond) the next page boundary.
> > > > > 
> > > > > For 8k fsblock filesystems on x86, the pagecache creates a single 8k
> > > > > folio to cache the entire fsblock containing EOF.  If EOF is in the
> > > > > first 4096 bytes of that 8k fsblock, then it should be possible to do a
> > > > > mmap read/write of the first 4k, but not the second 4k.  Memory accesses
> > > > > to the second 4096 bytes should produce a SIGBUS.
> > > > 
> > > > Does anybody actually relies on this behaviour (beyond xfstests)?
> > > 
> > > Beats me, but the mmap manpage says:
> > ...
> > > POSIX 2024 says:
> > ...
> > > From both I would surmise that it's a reasonable expectation that you
> > > can't map basepages beyond EOF and have page faults on those pages
> > > succeed.
> > 
> > <Added folks form the commit that introduced generic/749>
> > 
> > Modern kernel with large folios blurs the line of what is the page.
> > 
> > I don't want play spec lawyer. Let's look at real workloads.
> 
> Or, more importantly, consider the security-related implications of
> the change....
> 
> > If there's anything that actually relies on this SIGBUS corner case,
> > let's see how we can fix the kernel. But it will cost some CPU cycles.
> > 
> > If it only broke syntactic test case, I'm inclined to say WONTFIX.
> > 
> > Any opinions?
> 
> Mapping beyond EOF ranges into userspace address spaces is a
> potential security risk. If there is ever a zeroing-beyond-EOF bug
> related to large folios (history tells us we are *guaranteed* to
> screw this up somewhere in future), then allowing mapping all the
> way to the end of the large folio could expose a -lot more- stale
> kernel data to userspace than just what the tail of a PAGE_SIZE
> faulted region would expose.
> 
> Hence allowing applications to successfully fault a (unpredictable)
> distance far beyond EOF because the page cache used a large folio
> spanning EOF seems, to me, to be a very undesirable behaviour to
> expose to userspace.

In retrospect, having been involved in carefully crafting this test, I
think this was an overlooked but clearly valuable use case for the test.
It should be documented, as otherwise others may stumble upon it and be
tempted to fight it.

So extending the test docs to cover this concern is valuable.

  Luis

end of thread, other threads:[~2025-10-21 17:02 UTC | newest]

Thread overview: 12+ messages
2025-10-14 17:52 Regression in generic/749 with 8k fsblock size on 6.18-rc1 Darrick J. Wong
2025-10-15  7:39 ` Kirill A. Shutemov
2025-10-15 17:45   ` Darrick J. Wong
2025-10-15 15:59 ` Kiryl Shutsemau
2025-10-15 17:57   ` Darrick J. Wong
2025-10-16 10:22     ` Kiryl Shutsemau
2025-10-16 22:33       ` Dave Chinner
2025-10-17 14:28         ` Kiryl Shutsemau
2025-10-17 16:02           ` Darrick J. Wong
2025-10-17 17:00             ` Kiryl Shutsemau
2025-10-17 17:14           ` Matthew Wilcox
2025-10-21 17:02         ` Luis Chamberlain
