* [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
@ 2025-10-20 16:30 Kiryl Shutsemau
  2025-10-20 16:30 ` [PATCH 1/2] mm/memory: Do not populate page table entries beyond i_size Kiryl Shutsemau
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Kiryl Shutsemau @ 2025-10-20 16:30 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	linux-mm, linux-fsdevel, linux-kernel, Kiryl Shutsemau
From: Kiryl Shutsemau <kas@kernel.org>
I do NOT want the patches in this patchset to be applied. Instead, I
would like to discuss the semantics of large folios versus SIGBUS.
## Background
Accessing memory within a VMA, but beyond i_size rounded up to the next
page size, is supposed to generate SIGBUS.
This definition is simple when all pages are PAGE_SIZE, but with
large folios in the picture, this is no longer the case.
## Problem
Darrick reported[1] an xfstests regression in v6.18-rc1. generic/749
failed due to missing SIGBUS. This was caused by my recent changes that
try to fault in the whole folio where possible:
	19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
	357b92761d94 ("mm/filemap: map entire large folio faultaround")
These changes did not consider i_size when setting up PTEs, leading to
xfstest breakage.
However, the problem has been present in the kernel for a long time -
since huge tmpfs was introduced in 2016. The kernel happily maps
PMD-sized folios as PMD without checking i_size, and huge=always tmpfs
allocates PMD-sized folios on any write.
I considered this corner case when I implemented huge tmpfs, and my
conclusion was that no one in their right mind should rely on receiving
a SIGBUS signal when accessing beyond i_size. I cannot imagine how it
could be useful for a workload.
Generic/749 was introduced last year[2] with reference to POSIX, but no
real workloads were mentioned. It also acknowledged the tmpfs deviation
from the test case.
POSIX indeed says[3]:
	References within the address range starting at pa and
	continuing for len bytes to whole pages following the end of an
	object shall result in delivery of a SIGBUS signal.
Do we care about adhering strictly to this in the absence of real
workloads that rely on these semantics?
I think it is valuable to allow the kernel to map memory in larger
chunks -- whole folios -- to get TLB benefits (from both huge pages and
TLB coalescing). I value TLB hit rate over POSIX wording.
Any opinions?
See also discussion in the thread[1] with the report.
[1] https://lore.kernel.org/all/20251014175214.GW6188@frogsfrogsfrogs
[2] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/tests/generic/749?h=for-next&id=e4a6b119e5229599eac96235fb7e683b8a8bdc53
[3] https://pubs.opengroup.org/onlinepubs/9799919799/
Kiryl Shutsemau (2):
  mm/memory: Do not populate page table entries beyond i_size.
  mm/truncate: Unmap large folio on split failure
 mm/filemap.c  | 18 ++++++++++--------
 mm/memory.c   | 12 ++++++++++--
 mm/truncate.c | 29 ++++++++++++++++++++++++++---
 3 files changed, 46 insertions(+), 13 deletions(-)
-- 
2.50.1
^ permalink raw reply	[flat|nested] 15+ messages in thread
* [PATCH 1/2] mm/memory: Do not populate page table entries beyond i_size.
  2025-10-20 16:30 [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics Kiryl Shutsemau
@ 2025-10-20 16:30 ` Kiryl Shutsemau
  2025-10-20 16:30 ` [PATCH 2/2] mm/truncate: Unmap large folio on split failure Kiryl Shutsemau
  2025-10-20 23:28 ` [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics Dave Chinner
  2 siblings, 0 replies; 15+ messages in thread
From: Kiryl Shutsemau @ 2025-10-20 16:30 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	linux-mm, linux-fsdevel, linux-kernel, Kiryl Shutsemau
From: Kiryl Shutsemau <kas@kernel.org>
Accesses within a VMA, but beyond i_size rounded up to PAGE_SIZE, are
supposed to generate SIGBUS.
Recent changes attempted to fault in the full folio where possible. They
did not respect i_size, which led to populating PTEs beyond i_size and
breaking SIGBUS semantics.
Darrick reported generic/749 breakage because of this.
However, the problem existed before the recent changes. With huge=always
tmpfs, any write to a file leads to a PMD-size allocation. A subsequent
fault-in of the folio will install a PMD mapping regardless of i_size.
Fix filemap_map_pages() and finish_fault() to not install:
  - PTEs beyond i_size;
  - PMD mappings across i_size;
Not-yet-signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Fixes: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
Fixes: 357b92761d94 ("mm/filemap: map entire large folio faultaround")
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Reported-by: "Darrick J. Wong" <djwong@kernel.org>
---
 mm/filemap.c | 18 ++++++++++--------
 mm/memory.c  | 12 ++++++++++--
 2 files changed, 20 insertions(+), 10 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 13f0259d993c..0d251f6ab480 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3681,7 +3681,8 @@ static struct folio *next_uptodate_folio(struct xa_state *xas,
 static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 			struct folio *folio, unsigned long start,
 			unsigned long addr, unsigned int nr_pages,
-			unsigned long *rss, unsigned short *mmap_miss)
+			unsigned long *rss, unsigned short *mmap_miss,
+			pgoff_t file_end)
 {
 	unsigned int ref_from_caller = 1;
 	vm_fault_t ret = 0;
@@ -3697,7 +3698,8 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 	 */
 	addr0 = addr - start * PAGE_SIZE;
 	if (folio_within_vma(folio, vmf->vma) &&
-	    (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK)) {
+	    (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK) &&
+	    file_end >= folio_next_index(folio)) {
 		vmf->pte -= start;
 		page -= start;
 		addr = addr0;
@@ -3817,7 +3819,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	if (!folio)
 		goto out;
 
-	if (filemap_map_pmd(vmf, folio, start_pgoff)) {
+	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+	end_pgoff = min(end_pgoff, file_end);
+
+	if (file_end >= folio_next_index(folio) &&
+	    filemap_map_pmd(vmf, folio, start_pgoff)) {
 		ret = VM_FAULT_NOPAGE;
 		goto out;
 	}
@@ -3830,10 +3836,6 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		goto out;
 	}
 
-	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
-	if (end_pgoff > file_end)
-		end_pgoff = file_end;
-
 	folio_type = mm_counter_file(folio);
 	do {
 		unsigned long end;
@@ -3850,7 +3852,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		else
 			ret |= filemap_map_folio_range(vmf, folio,
 					xas.xa_index - folio->index, addr,
-					nr_pages, &rss, &mmap_miss);
+					nr_pages, &rss, &mmap_miss, file_end);
 
 		folio_unlock(folio);
 	} while ((folio = next_uptodate_folio(&xas, mapping, end_pgoff)) != NULL);
diff --git a/mm/memory.c b/mm/memory.c
index 74b45e258323..dfa5b437c9d9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5480,6 +5480,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 	int type, nr_pages;
 	unsigned long addr;
 	bool needs_fallback = false;
+	pgoff_t file_end = -1UL;
 
 fallback:
 	addr = vmf->address;
@@ -5501,8 +5502,14 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 			return ret;
 	}
 
+	if (vma->vm_file) {
+		struct inode *inode = vma->vm_file->f_mapping->host;
+		file_end = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+	}
+
 	if (pmd_none(*vmf->pmd)) {
-		if (folio_test_pmd_mappable(folio)) {
+		if (folio_test_pmd_mappable(folio) &&
+		    file_end >= folio_next_index(folio)) {
 			ret = do_set_pmd(vmf, folio, page);
 			if (ret != VM_FAULT_FALLBACK)
 				return ret;
@@ -5533,7 +5540,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 		if (unlikely(vma_off < idx ||
 			    vma_off + (nr_pages - idx) > vma_pages(vma) ||
 			    pte_off < idx ||
-			    pte_off + (nr_pages - idx)  > PTRS_PER_PTE)) {
+			    pte_off + (nr_pages - idx)  > PTRS_PER_PTE ||
+			    file_end < folio_next_index(folio))) {
 			nr_pages = 1;
 		} else {
 			/* Now we can set mappings for the whole large folio. */
-- 
2.50.1
^ permalink raw reply related	[flat|nested] 15+ messages in thread
* [PATCH 2/2] mm/truncate: Unmap large folio on split failure
  2025-10-20 16:30 [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics Kiryl Shutsemau
  2025-10-20 16:30 ` [PATCH 1/2] mm/memory: Do not populate page table entries beyond i_size Kiryl Shutsemau
@ 2025-10-20 16:30 ` Kiryl Shutsemau
  2025-10-20 23:28 ` [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics Dave Chinner
  2 siblings, 0 replies; 15+ messages in thread
From: Kiryl Shutsemau @ 2025-10-20 16:30 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	linux-mm, linux-fsdevel, linux-kernel, Kiryl Shutsemau
From: Kiryl Shutsemau <kas@kernel.org>
Accesses within a VMA, but beyond i_size rounded up to PAGE_SIZE, are
supposed to generate SIGBUS.
This behavior might not be respected on truncation.
During truncation, the kernel splits a large folio in order to reclaim
memory. As a side effect, it unmaps the folio and destroys its PMD
mappings. The folio will then be refaulted as PTEs, and the SIGBUS
semantics are preserved.
However, if the split fails, the PMD mappings are preserved and the user
will not receive SIGBUS on accesses within the PMD.
Unmap the folio on split failure. This will lead to it being refaulted
as PTEs and preserve the SIGBUS semantics.
Not-yet-signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 mm/truncate.c | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)
diff --git a/mm/truncate.c b/mm/truncate.c
index 91eb92a5ce4f..cdb698b5f7fa 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -177,6 +177,28 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
 	return 0;
 }
 
+static int try_folio_split_or_unmap(struct folio *folio, struct page *split_at)
+{
+	enum ttu_flags ttu_flags =
+		TTU_RMAP_LOCKED |
+		TTU_SYNC |
+		TTU_BATCH_FLUSH |
+		TTU_SPLIT_HUGE_PMD |
+		TTU_IGNORE_MLOCK;
+	int ret;
+
+	ret = try_folio_split(folio, split_at, NULL);
+
+	/*
+	 * If the split fails, unmap the folio, so it will be refaulted
+	 * with PTEs to respect SIGBUS semantics.
+	 */
+	if (ret)
+		try_to_unmap(folio, ttu_flags);
+
+	return ret;
+}
+
 /*
  * Handle partial folios.  The folio may be entirely within the
  * range if a split has raced with us.  If not, we zero the part of the
@@ -224,7 +246,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 		return true;
 
 	split_at = folio_page(folio, PAGE_ALIGN_DOWN(offset) / PAGE_SIZE);
-	if (!try_folio_split(folio, split_at, NULL)) {
+	if (!try_folio_split_or_unmap(folio, split_at)) {
 		/*
 		 * try to split at offset + length to make sure folios within
 		 * the range can be dropped, especially to avoid memory waste
@@ -249,12 +271,13 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 			goto out;
 
 		/*
+		 * Split the folio.
+		 *
 		 * make sure folio2 is large and does not change its mapping.
-		 * Its split result does not matter here.
 		 */
 		if (folio_test_large(folio2) &&
 		    folio2->mapping == folio->mapping)
-			try_folio_split(folio2, split_at2, NULL);
+			try_folio_split_or_unmap(folio2, split_at2);
 
 		folio_unlock(folio2);
 out:
-- 
2.50.1
^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
  2025-10-20 16:30 [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics Kiryl Shutsemau
  2025-10-20 16:30 ` [PATCH 1/2] mm/memory: Do not populate page table entries beyond i_size Kiryl Shutsemau
  2025-10-20 16:30 ` [PATCH 2/2] mm/truncate: Unmap large folio on split failure Kiryl Shutsemau
@ 2025-10-20 23:28 ` Dave Chinner
  2025-10-21  6:12   ` Christoph Hellwig
  2025-10-21  6:16   ` Kiryl Shutsemau
  2 siblings, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2025-10-20 23:28 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	linux-mm, linux-fsdevel, linux-kernel, Kiryl Shutsemau
On Mon, Oct 20, 2025 at 05:30:52PM +0100, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
> 
> I do NOT want the patches in this patchset to be applied. Instead, I
> would like to discuss the semantics of large folios versus SIGBUS.
> 
> ## Background
> 
> Accessing memory within a VMA, but beyond i_size rounded up to the next
> page size, is supposed to generate SIGBUS.
> 
> This definition is simple if all pages are PAGE_SIZE in size, but with
> large folios in the picture, it is no longer the case.
> 
> ## Problem
> 
> Darrick reported[1] an xfstests regression in v6.18-rc1. generic/749
> failed due to missing SIGBUS. This was caused by my recent changes that
> try to fault in the whole folio where possible:
> 
> 	19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
> 	357b92761d94 ("mm/filemap: map entire large folio faultaround")
> 
> These changes did not consider i_size when setting up PTEs, leading to
> xfstest breakage.
> 
> However, the problem has been present in the kernel for a long time -
> since huge tmpfs was introduced in 2016. The kernel happily maps
> PMD-sized folios as PMD without checking i_size. And huge=always tmpfs
> allocates PMD-size folios on any writes.
The tmpfs huge=always specific behaviour is not how regular
filesystems have behaved. It is niche, special case functionality
that has weird behaviours and, as such, it most definitely does not
set the standards for how all other filesystems should behave.
> I considered this corner case when I implemented a large tmpfs, and my
> conclusion was that no one in their right mind should rely on receiving
> a SIGBUS signal when accessing beyond i_size. I cannot imagine how it
> could be useful for the workload.
Lacking the imagination or knowledge to understand why a behaviour
exists does not mean that behaviour is unnecessary or that it should
be removed. It just means you didn't ask the people who knew why the
functionality exists...
> Generic/749 was introduced last year with reference to POSIX, but no
> real workloads were mentioned. It also acknowledged the tmpfs deviation
> from the test case.
> 
> POSIX indeed says[3]:
> 
> 	References within the address range starting at pa and
> 	continuing for len bytes to whole pages following the end of an
> 	object shall result in delivery of a SIGBUS signal.
> 
> Do we care about adhering strictly to this in the absence of real
> workloads that rely on these semantics?
We've already told you that we do, because mapping beyond EOF has
implications for how much stale data exposure occurs when the next
set of truncate/mmap() bugs is introduced.
> I think it is valuable to allow the kernel to map memory in larger
> chunks -- whole folios -- to get TLB benefits (from both huge pages and
> TLB coalescing). I value TLB hit rate over POSIX wording.
Feel free to do that for tmpfs, but for persistent filesystems the
existing POSIX SIGBUS behaviour needs to remain.
> Any opinions?
There are solid historic reasons for the existing behaviour and for
keeping it unchanged.  You aren't allowed to handwave them away
because you don't understand or care about them.
In critical paths like truncate, correctness and safety come first.
Performance is only a secondary consideration.  The overlap of
mmap() and truncate() is an area where we have had many, many bugs
and, at minimum, the current POSIX behaviour largely shields us from
serious stale data exposure events when those bugs (inevitably)
occur.
Fundamentally, we really don't care about the mapping/tlb
performance of the PTE fragments at EOF. Anyone using files large
enough to notice the TLB overhead improvements from mapping large
folios is not going to notice that the EOF mapping has a slightly
higher TLB miss overhead than everywhere else in the file.
Please just fix the regression.
-Dave.
-- 
Dave Chinner
david@fromorbit.com
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
  2025-10-20 23:28 ` [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics Dave Chinner
@ 2025-10-21  6:12   ` Christoph Hellwig
  2025-10-21  6:17     ` Kiryl Shutsemau
  2025-10-21  6:16   ` Kiryl Shutsemau
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2025-10-21  6:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kiryl Shutsemau, Andrew Morton, David Hildenbrand, Hugh Dickins,
	Matthew Wilcox, Alexander Viro, Christian Brauner,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	linux-mm, linux-fsdevel, linux-kernel, Kiryl Shutsemau
On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
> Fundamentally, we really don't care about the mapping/tlb
> performance of the PTE fragments at EOF. Anyone using files large
> enough to notice the TLB overhead improvements from mapping large
> folios is not going to notice that the EOF mapping has a slightly
> higher TLB miss overhead than everywhere else in the file.
> 
> Please just fix the regression.
Yeah.  I'm not even sure why we're having this discussion.  The
behavior is mandated, we have test cases for it and there is
literally no practical upside in changing the behavior from what
we've done forever and what is mandated in Posix.
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
  2025-10-20 23:28 ` [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics Dave Chinner
  2025-10-21  6:12   ` Christoph Hellwig
@ 2025-10-21  6:16   ` Kiryl Shutsemau
  2025-10-23 10:35     ` Kiryl Shutsemau
  2025-10-23 11:38     ` Dave Chinner
  1 sibling, 2 replies; 15+ messages in thread
From: Kiryl Shutsemau @ 2025-10-21  6:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	linux-mm, linux-fsdevel, linux-kernel
On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
> In critical paths like truncate, correctness and safety come first.
> Performance is only a secondary consideration.  The overlap of
> mmap() and truncate() is an area where we have had many, many bugs
> and, at minimum, the current POSIX behaviour largely shields us from
> serious stale data exposure events when those bugs (inevitably)
> occur.
How do you prevent writes via GUP racing with truncate()?
Something like this:
	CPU0				CPU1
fd = open("file")
p = mmap(fd)
whatever_syscall(p)
  get_user_pages(p, &page)
  				truncate("file");
  <write to page>
  put_page(page);
GUP can pin a page in the middle of a large folio well beyond the
truncation point. The folio will not be split on truncation because of
the elevated refcount from the pin.
I don't think this issue can be fundamentally fixed as long as we allow
GUP for file-backed memory.
If the filesystem side cannot handle a non-zeroed tail of a large folio,
these SIGBUS semantics only hide the issue instead of addressing it.
And the race above does not seem far-fetched to me.
-- 
  Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
  2025-10-21  6:12   ` Christoph Hellwig
@ 2025-10-21  6:17     ` Kiryl Shutsemau
  0 siblings, 0 replies; 15+ messages in thread
From: Kiryl Shutsemau @ 2025-10-21  6:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Andrew Morton, David Hildenbrand, Hugh Dickins,
	Matthew Wilcox, Alexander Viro, Christian Brauner,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	linux-mm, linux-fsdevel, linux-kernel
On Mon, Oct 20, 2025 at 11:12:40PM -0700, Christoph Hellwig wrote:
> On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
> > Fundamentally, we really don't care about the mapping/tlb
> > performance of the PTE fragments at EOF. Anyone using files large
> > enough to notice the TLB overhead improvements from mapping large
> > folios is not going to notice that the EOF mapping has a slightly
> > higher TLB miss overhead than everywhere else in the file.
> > 
> > Please just fix the regression.
> 
> Yeah.  I'm not even sure why we're having this discussion.  The
> behavior is mandated, we have test cases for it and there is
> literally no practical upside in changing the behavior from what
> we've done forever and what is mandated in Posix.
Okay, will fix.
-- 
  Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply	[flat|nested] 15+ messages in thread
* [PATCH 1/2] mm/memory: Do not populate page table entries beyond i_size.
@ 2025-10-21  6:35 Kiryl Shutsemau
  2025-10-21 12:08 ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Kiryl Shutsemau @ 2025-10-21  6:35 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	linux-mm, linux-fsdevel, linux-kernel, Kiryl Shutsemau
From: Kiryl Shutsemau <kas@kernel.org>
Accesses within a VMA, but beyond i_size rounded up to PAGE_SIZE, are
supposed to generate SIGBUS.
Recent changes attempted to fault in the full folio where possible. They
did not respect i_size, which led to populating PTEs beyond i_size and
breaking SIGBUS semantics.
Darrick reported generic/749 breakage because of this.
However, the problem existed before the recent changes. With huge=always
tmpfs, any write to a file leads to a PMD-size allocation. A subsequent
fault-in of the folio will install a PMD mapping regardless of i_size.
Fix filemap_map_pages() and finish_fault() to not install:
  - PTEs beyond i_size;
  - PMD mappings across i_size;
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Fixes: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
Fixes: 357b92761d94 ("mm/filemap: map entire large folio faultaround")
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Reported-by: "Darrick J. Wong" <djwong@kernel.org>
---
 mm/filemap.c | 18 ++++++++++--------
 mm/memory.c  | 12 ++++++++++--
 2 files changed, 20 insertions(+), 10 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 13f0259d993c..0d251f6ab480 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3681,7 +3681,8 @@ static struct folio *next_uptodate_folio(struct xa_state *xas,
 static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 			struct folio *folio, unsigned long start,
 			unsigned long addr, unsigned int nr_pages,
-			unsigned long *rss, unsigned short *mmap_miss)
+			unsigned long *rss, unsigned short *mmap_miss,
+			pgoff_t file_end)
 {
 	unsigned int ref_from_caller = 1;
 	vm_fault_t ret = 0;
@@ -3697,7 +3698,8 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 	 */
 	addr0 = addr - start * PAGE_SIZE;
 	if (folio_within_vma(folio, vmf->vma) &&
-	    (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK)) {
+	    (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK) &&
+	    file_end >= folio_next_index(folio)) {
 		vmf->pte -= start;
 		page -= start;
 		addr = addr0;
@@ -3817,7 +3819,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	if (!folio)
 		goto out;
 
-	if (filemap_map_pmd(vmf, folio, start_pgoff)) {
+	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+	end_pgoff = min(end_pgoff, file_end);
+
+	if (file_end >= folio_next_index(folio) &&
+	    filemap_map_pmd(vmf, folio, start_pgoff)) {
 		ret = VM_FAULT_NOPAGE;
 		goto out;
 	}
@@ -3830,10 +3836,6 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		goto out;
 	}
 
-	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
-	if (end_pgoff > file_end)
-		end_pgoff = file_end;
-
 	folio_type = mm_counter_file(folio);
 	do {
 		unsigned long end;
@@ -3850,7 +3852,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		else
 			ret |= filemap_map_folio_range(vmf, folio,
 					xas.xa_index - folio->index, addr,
-					nr_pages, &rss, &mmap_miss);
+					nr_pages, &rss, &mmap_miss, file_end);
 
 		folio_unlock(folio);
 	} while ((folio = next_uptodate_folio(&xas, mapping, end_pgoff)) != NULL);
diff --git a/mm/memory.c b/mm/memory.c
index 74b45e258323..dfa5b437c9d9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5480,6 +5480,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 	int type, nr_pages;
 	unsigned long addr;
 	bool needs_fallback = false;
+	pgoff_t file_end = -1UL;
 
 fallback:
 	addr = vmf->address;
@@ -5501,8 +5502,14 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 			return ret;
 	}
 
+	if (vma->vm_file) {
+		struct inode *inode = vma->vm_file->f_mapping->host;
+		file_end = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+	}
+
 	if (pmd_none(*vmf->pmd)) {
-		if (folio_test_pmd_mappable(folio)) {
+		if (folio_test_pmd_mappable(folio) &&
+		    file_end >= folio_next_index(folio)) {
 			ret = do_set_pmd(vmf, folio, page);
 			if (ret != VM_FAULT_FALLBACK)
 				return ret;
@@ -5533,7 +5540,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 		if (unlikely(vma_off < idx ||
 			    vma_off + (nr_pages - idx) > vma_pages(vma) ||
 			    pte_off < idx ||
-			    pte_off + (nr_pages - idx)  > PTRS_PER_PTE)) {
+			    pte_off + (nr_pages - idx)  > PTRS_PER_PTE ||
+			    file_end < folio_next_index(folio))) {
 			nr_pages = 1;
 		} else {
 			/* Now we can set mappings for the whole large folio. */
-- 
2.50.1
^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm/memory: Do not populate page table entries beyond i_size.
  2025-10-21  6:35 [PATCH 1/2] mm/memory: Do not populate page table entries beyond i_size Kiryl Shutsemau
@ 2025-10-21 12:08 ` David Hildenbrand
  2025-10-21 12:28   ` Kiryl Shutsemau
  0 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2025-10-21 12:08 UTC (permalink / raw)
  To: Kiryl Shutsemau, Andrew Morton, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Hugh Dickins
  Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	linux-mm, linux-fsdevel, linux-kernel, Kiryl Shutsemau
On 21.10.25 08:35, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
Subject: I'd drop the trailing "."
> 
> Accesses within a VMA, but beyond i_size rounded up to PAGE_SIZE, are
> supposed to generate SIGBUS.
> 
> Recent changes attempted to fault in the full folio where possible. They
> did not respect i_size, which led to populating PTEs beyond i_size and
> breaking SIGBUS semantics.
> 
> Darrick reported generic/749 breakage because of this.
> 
> However, the problem existed before the recent changes. With huge=always
> tmpfs, any write to a file leads to a PMD-size allocation. A subsequent
> fault-in of the folio will install a PMD mapping regardless of i_size.
Right, there are some legacy oddities with shmem in that area (e.g., 
"within_size" vs. "always" THP allocation control).
Let me CC Hugh: the behavior for shmem seems to date back to 2016.
> 
> Fix filemap_map_pages() and finish_fault() to not install:
>    - PTEs beyond i_size;
>    - PMD mappings across i_size;
Makes sense to me.
[...]
> +++ b/mm/memory.c
> @@ -5480,6 +5480,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>   	int type, nr_pages;
>   	unsigned long addr;
>   	bool needs_fallback = false;
> +	pgoff_t file_end = -1UL;
>   
>   fallback:
>   	addr = vmf->address;
> @@ -5501,8 +5502,14 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>   			return ret;
>   	}
>   
> +	if (vma->vm_file) {
> +		struct inode *inode = vma->vm_file->f_mapping->host;
empty line please
> +		file_end = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> +	}
> +
>   	if (pmd_none(*vmf->pmd)) {
> -		if (folio_test_pmd_mappable(folio)) {
> +		if (folio_test_pmd_mappable(folio) &&
> +		    file_end >= folio_next_index(folio)) {
>   			ret = do_set_pmd(vmf, folio, page);
>   			if (ret != VM_FAULT_FALLBACK)
>   				return ret;
> @@ -5533,7 +5540,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>   		if (unlikely(vma_off < idx ||
>   			    vma_off + (nr_pages - idx) > vma_pages(vma) ||
>   			    pte_off < idx ||
> -			    pte_off + (nr_pages - idx)  > PTRS_PER_PTE)) {
> +			    pte_off + (nr_pages - idx)  > PTRS_PER_PTE ||
While at it you could fix the double space before the ">".
> +			    file_end < folio_next_index(folio))) {
>   			nr_pages = 1;
>   		} else {
>   			/* Now we can set mappings for the whole large folio. */
Nothing else jumped at me.
-- 
Cheers
David / dhildenb
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH 1/2] mm/memory: Do not populate page table entries beyond i_size.
  2025-10-21 12:08 ` David Hildenbrand
@ 2025-10-21 12:28   ` Kiryl Shutsemau
  0 siblings, 0 replies; 15+ messages in thread
From: Kiryl Shutsemau @ 2025-10-21 12:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Hugh Dickins, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Rik van Riel, Harry Yoo, Johannes Weiner, Shakeel Butt,
	Baolin Wang, Darrick J. Wong, linux-mm, linux-fsdevel,
	linux-kernel
On Tue, Oct 21, 2025 at 02:08:44PM +0200, David Hildenbrand wrote:
> On 21.10.25 08:35, Kiryl Shutsemau wrote:
> > From: Kiryl Shutsemau <kas@kernel.org>
> 
> Subject: I'd drop the trailing "."
Ack.
> > 
> > Accesses within a VMA, but beyond i_size rounded up to PAGE_SIZE, are
> > supposed to generate SIGBUS.
> > 
> > Recent changes attempted to fault in the full folio where possible. They
> > did not respect i_size, which led to populating PTEs beyond i_size and
> > breaking SIGBUS semantics.
> > 
> > Darrick reported generic/749 breakage because of this.
> > 
> > However, the problem existed before the recent changes. With huge=always
> > tmpfs, any write to a file leads to a PMD-size allocation. A subsequent
> > fault-in of the folio will install a PMD mapping regardless of i_size.
> 
> Right, there are some legacy oddities with shmem in that area (e.g.,
> "within_size" vs. "always" THP allocation control).
> 
> Let me CC Hugh: the behavior for shmem seems to date back to 2016.
Yes, it is my huge tmpfs implementation that introduced this.
And Hugh is on CC.
> > 
> > Fix filemap_map_pages() and finish_fault() to not install:
> >    - PTEs beyond i_size;
> >    - PMD mappings across i_size;
> 
> Makes sense to me.
> 
> 
> [...]
> 
> > +++ b/mm/memory.c
> > @@ -5480,6 +5480,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> >   	int type, nr_pages;
> >   	unsigned long addr;
> >   	bool needs_fallback = false;
> > +	pgoff_t file_end = -1UL;
> >   fallback:
> >   	addr = vmf->address;
> > @@ -5501,8 +5502,14 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> >   			return ret;
> >   	}
> > +	if (vma->vm_file) {
> > +		struct inode *inode = vma->vm_file->f_mapping->host;
> 
> empty line please
Ack.
> 
> > +		file_end = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > +	}
> > +
> >   	if (pmd_none(*vmf->pmd)) {
> > -		if (folio_test_pmd_mappable(folio)) {
> > +		if (folio_test_pmd_mappable(folio) &&
> > +		    file_end >= folio_next_index(folio)) {
> >   			ret = do_set_pmd(vmf, folio, page);
> >   			if (ret != VM_FAULT_FALLBACK)
> >   				return ret;
> > @@ -5533,7 +5540,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> >   		if (unlikely(vma_off < idx ||
> >   			    vma_off + (nr_pages - idx) > vma_pages(vma) ||
> >   			    pte_off < idx ||
> > -			    pte_off + (nr_pages - idx)  > PTRS_PER_PTE)) {
> > +			    pte_off + (nr_pages - idx)  > PTRS_PER_PTE ||
> 
> While at it you could fix the double space before the ">".
Okay.
> > +			    file_end < folio_next_index(folio))) {
> >   			nr_pages = 1;
> >   		} else {
> >   			/* Now we can set mappings for the whole large folio. */
> 
> Nothing else jumped at me.
> 
> -- 
> Cheers
> 
> David / dhildenb
> 
-- 
  Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
  2025-10-21  6:16   ` Kiryl Shutsemau
@ 2025-10-23 10:35     ` Kiryl Shutsemau
  2025-10-23 11:38     ` Dave Chinner
  1 sibling, 0 replies; 15+ messages in thread
From: Kiryl Shutsemau @ 2025-10-23 10:35 UTC (permalink / raw)
  To: Dave Chinner, Jan Kara
  Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	linux-mm, linux-fsdevel, linux-kernel
On Tue, Oct 21, 2025 at 07:16:33AM +0100, Kiryl Shutsemau wrote:
> On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
> > In critical paths like truncate, correctness and safety come first.
> > Performance is only a secondary consideration.  The overlap of
> > mmap() and truncate() is an area where we have had many, many bugs
> > and, at minimum, the current POSIX behaviour largely shields us from
> > serious stale data exposure events when those bugs (inevitably)
> > occur.
> 
> How do you prevent writes via GUP racing with truncate()?
> 
> Something like this:
> 
> 	CPU0				CPU1
> fd = open("file")
> p = mmap(fd)
> whatever_syscall(p)
>   get_user_pages(p, &page)
>   				truncate("file");
>   <write to page>
>   put_page(page);
> 
> The GUP can pin a page in the middle of a large folio well beyond the
> truncation point. The folio will not be split on truncation due to the
> elevated pin.
> 
> I don't think this issue can be fundamentally fixed as long as we allow
> GUP for file-backed memory.
> 
> If the filesystem side cannot handle a non-zeroed tail of a large folio,
> these SIGBUS semantics only hide the issue instead of addressing it.
> 
> And the race above does not seem to be far-fetched to me.
Any comments?
Jan, I remember you worked a lot on making GUP semantics sane-ish for file
pages. Any clue whether I am just imagining a problem here?
-- 
  Kiryl Shutsemau / Kirill A. Shutemov
* Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
  2025-10-21  6:16   ` Kiryl Shutsemau
  2025-10-23 10:35     ` Kiryl Shutsemau
@ 2025-10-23 11:38     ` Dave Chinner
  2025-10-23 15:48       ` Andreas Dilger
  1 sibling, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2025-10-23 11:38 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	linux-mm, linux-fsdevel, linux-kernel
On Tue, Oct 21, 2025 at 07:16:26AM +0100, Kiryl Shutsemau wrote:
> On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
> > In critical paths like truncate, correctness and safety come first.
> > Performance is only a secondary consideration.  The overlap of
> > mmap() and truncate() is an area where we have had many, many bugs
> > and, at minimum, the current POSIX behaviour largely shields us from
> > serious stale data exposure events when those bugs (inevitably)
> > occur.
> 
> How do you prevent writes via GUP racing with truncate()?
> 
> Something like this:
> 
> 	CPU0				CPU1
> fd = open("file")
> p = mmap(fd)
> whatever_syscall(p)
>   get_user_pages(p, &page)
>   				truncate("file");
>   <write to page>
>   put_page(page);
Forget about truncate, go look at the comment above
writable_file_mapping_allowed() about using GUP this way.
i.e. file-backed mmap/GUP is a known broken anti-pattern. We've
spent the past 15+ years telling people that it is unfixably broken
and they will crash their kernel or corrupt their data if they do
this.
This is not supported functionality because real world production
use ends up exposing problems with sync and background writeback
races, truncate races, fallocate() races, writes into holes, writes
into preallocated regions, writes over shared extents that require
copy-on-write, etc., etc., ad nauseam.
If anyone is using file-backed mappings like this, then when it
breaks they get to keep all the broken pieces to themselves.
> The GUP can pin a page in the middle of a large folio well beyond the
> truncation point. The folio will not be split on truncation due to the
> elevated pin.
> 
> I don't think this issue can be fundamentally fixed as long as we allow
> GUP for file-backed memory.
Yup, but that's the least of the problems with GUP on file-backed
pages...
> If the filesystem side cannot handle a non-zeroed tail of a large folio,
> these SIGBUS semantics only hide the issue instead of addressing it.
The objections raised have not related to whether a filesystem
"cannot handle" this case or not. The concerns are about a change of
behaviour in a well known, widely documented API, as well as the
significant increase in surface area of potential data exposure it
would enable should there be Yet Another Truncate Bug Again Once
More.
-Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
  2025-10-23 11:38     ` Dave Chinner
@ 2025-10-23 15:48       ` Andreas Dilger
  2025-10-24  6:50         ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Andreas Dilger @ 2025-10-23 15:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kiryl Shutsemau, Andrew Morton, David Hildenbrand, Hugh Dickins,
	Matthew Wilcox, Alexander Viro, Christian Brauner,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	linux-mm, linux-fsdevel, linux-kernel
> On Oct 23, 2025, at 5:38 AM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Oct 21, 2025 at 07:16:26AM +0100, Kiryl Shutsemau wrote:
>> On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
>>> In critical paths like truncate, correctness and safety come first.
>>> Performance is only a secondary consideration.  The overlap of
>>> mmap() and truncate() is an area where we have had many, many bugs
>>> and, at minimum, the current POSIX behaviour largely shields us from
>>> serious stale data exposure events when those bugs (inevitably)
>>> occur.
>> 
>> How do you prevent writes via GUP racing with truncate()?
>> 
>> Something like this:
>> 
>> 	CPU0				CPU1
>> fd = open("file")
>> p = mmap(fd)
>> whatever_syscall(p)
>>  get_user_pages(p, &page)
>>  				truncate("file");
>>  <write to page>
>>  put_page(page);
> 
> Forget about truncate, go look at the comment above
> writable_file_mapping_allowed() about using GUP this way.
> 
> i.e. file-backed mmap/GUP is a known broken anti-pattern. We've
> spent the past 15+ years telling people that it is unfixably broken
> and they will crash their kernel or corrupt their data if they do
> this.
> 
> This is not supported functionality because real world production
> use ends up exposing problems with sync and background writeback
> races, truncate races, fallocate() races, writes into holes, writes
> into preallocated regions, writes over shared extents that require
> copy-on-write, etc., etc., ad nauseam.
> 
> If anyone is using file-backed mappings like this, then when it
> breaks they get to keep all the broken pieces to themselves.
Should ftruncate("file") return ETXTBUSY in this case, so that users
and applications know this doesn't work/isn't safe?  Unfortunately,
today's application developers barely even know how IO is done, so
there is little chance that they would understand subtleties like this.
Cheers, Andreas
>> The GUP can pin a page in the middle of a large folio well beyond the
>> truncation point. The folio will not be split on truncation due to the
>> elevated pin.
>> 
>> I don't think this issue can be fundamentally fixed as long as we allow
>> GUP for file-backed memory.
> 
> Yup, but that's the least of the problems with GUP on file-backed
> pages...
> 
>> If the filesystem side cannot handle a non-zeroed tail of a large folio,
>> these SIGBUS semantics only hide the issue instead of addressing it.
> 
> The objections raised have not related to whether a filesystem
> "cannot handle" this case or not. The concerns are about a change of
> behaviour in a well known, widely documented API, as well as the
> significant increase in surface area of potential data exposure it
> would enable should there be Yet Another Truncate Bug Again Once
> More.
> 
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
> 
Cheers, Andreas
* Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
  2025-10-23 15:48       ` Andreas Dilger
@ 2025-10-24  6:50         ` Dave Chinner
  2025-10-24  7:43           ` David Hildenbrand
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2025-10-24  6:50 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Kiryl Shutsemau, Andrew Morton, David Hildenbrand, Hugh Dickins,
	Matthew Wilcox, Alexander Viro, Christian Brauner,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	linux-mm, linux-fsdevel, linux-kernel
On Thu, Oct 23, 2025 at 09:48:58AM -0600, Andreas Dilger wrote:
> > On Oct 23, 2025, at 5:38 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Oct 21, 2025 at 07:16:26AM +0100, Kiryl Shutsemau wrote:
> >> On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
> >>> In critical paths like truncate, correctness and safety come first.
> >>> Performance is only a secondary consideration.  The overlap of
> >>> mmap() and truncate() is an area where we have had many, many bugs
> >>> and, at minimum, the current POSIX behaviour largely shields us from
> >>> serious stale data exposure events when those bugs (inevitably)
> >>> occur.
> >> 
> >> How do you prevent writes via GUP racing with truncate()?
> >> 
> >> Something like this:
> >> 
> >> 	CPU0				CPU1
> >> fd = open("file")
> >> p = mmap(fd)
> >> whatever_syscall(p)
> >>  get_user_pages(p, &page)
> >>  				truncate("file");
> >>  <write to page>
> >>  put_page(page);
> > 
> > Forget about truncate, go look at the comment above
> > writable_file_mapping_allowed() about using GUP this way.
> > 
> > i.e. file-backed mmap/GUP is a known broken anti-pattern. We've
> > spent the past 15+ years telling people that it is unfixably broken
> > and they will crash their kernel or corrupt their data if they do
> > this.
> > 
> > This is not supported functionality because real world production
> > use ends up exposing problems with sync and background writeback
> > races, truncate races, fallocate() races, writes into holes, writes
> > into preallocated regions, writes over shared extents that require
> > copy-on-write, etc., etc., ad nauseam.
> > 
> > If anyone is using file-backed mappings like this, then when it
> > breaks they get to keep all the broken pieces to themselves.
> 
> Should ftruncate("file") return ETXTBUSY in this case, so that users
> and applications know this doesn't work/isn't safe?
No, it is better to block waiting for the GUP to release the
reference (see below), but the general problem is that we cannot
reliably discriminate GUP references from other page cache based
references just by looking at the folio resident in the page cache.
However, when FSDAX is being used, truncate does, in fact, block
waiting for GUP references to be released. fsdax does not use page
references to track in-use pages - the filesystem metadata tracks
allocated and free pages, not the mm/ subsystem. There are no
page cache references to the pages, because there is no page
cache. Hence we can use the difference between the map count and the
reference count to determine if there are any references we cannot
forcibly unmap (e.g. GUP) just by looking at the backing store folio
state.
Hence we can block truncate on non-mapcount references via the
layout lease hooks, i.e.:
->setattr
 xfs_vn_setattr
   xfs_break_layouts(BREAK_UNMAP)
      xfs_break_dax_layouts()
        dax_break_layout_inode()
	  dax_break_layout()
	    page = dax_layout_busy_page_range()
	      page = dax_busy_page()
		 /* page returned if it is held by GUP */
	    wait_page_idle(page)
	         /* blocks until extra ref counts go away */
and only when all the non-mapcount page references are gone across
the truncate range is the truncate allowed to proceed.
IIRC, we decided to block truncate and other operations that need
backing store access exclusion rather than return an error because
nobody expects operations like truncate to randomly fail like this.
Such behaviour would likely break applications in unexpected ways,
so it was decided to play it safe and block until the ref goes away.
This is one of the reasons for FOLL_LONGTERM being added - we
can't allow long-term pinning of file-backed fsdax pages (e.g. RDMA
regions using file-backed mappings) because then operations like
truncate can be blocked for hours/days/weeks. This situation is
checked via vma_is_fsdax() in mm/gup.c::check_vma_flags()...
> Unfortunately,
> today's application developers barely even know how IO is done, so
> there is little chance that they would understand subtleties like this.
I think that even the experienced developers who know how to do IO
struggle to understand this sort of thing. Most kernel developers
run screaming from GUP before it drives them insane, too. :/
-Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics
  2025-10-24  6:50         ` Dave Chinner
@ 2025-10-24  7:43           ` David Hildenbrand
  0 siblings, 0 replies; 15+ messages in thread
From: David Hildenbrand @ 2025-10-24  7:43 UTC (permalink / raw)
  To: Dave Chinner, Andreas Dilger
  Cc: Kiryl Shutsemau, Andrew Morton, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	linux-mm, linux-fsdevel, linux-kernel
On 24.10.25 08:50, Dave Chinner wrote:
> On Thu, Oct 23, 2025 at 09:48:58AM -0600, Andreas Dilger wrote:
>>> On Oct 23, 2025, at 5:38 AM, Dave Chinner <david@fromorbit.com> wrote:
>>> On Tue, Oct 21, 2025 at 07:16:26AM +0100, Kiryl Shutsemau wrote:
>>>> On Tue, Oct 21, 2025 at 10:28:02AM +1100, Dave Chinner wrote:
>>>>> In critical paths like truncate, correctness and safety come first.
>>>>> Performance is only a secondary consideration.  The overlap of
>>>>> mmap() and truncate() is an area where we have had many, many bugs
>>>>> and, at minimum, the current POSIX behaviour largely shields us from
>>>>> serious stale data exposure events when those bugs (inevitably)
>>>>> occur.
>>>>
>>>> How do you prevent writes via GUP racing with truncate()?
>>>>
>>>> Something like this:
>>>>
>>>> 	CPU0				CPU1
>>>> fd = open("file")
>>>> p = mmap(fd)
>>>> whatever_syscall(p)
>>>>   get_user_pages(p, &page)
>>>>   				truncate("file");
>>>>   <write to page>
>>>>   put_page(page);
>>>
>>> Forget about truncate, go look at the comment above
>>> writable_file_mapping_allowed() about using GUP this way.
>>>
>>> i.e. file-backed mmap/GUP is a known broken anti-pattern. We've
>>> spent the past 15+ years telling people that it is unfixably broken
>>> and they will crash their kernel or corrupt there data if they do
>>> this.
>>>
>>> This is not supported functionality because real world production
>>> use ends up exposing problems with sync and background writeback
>>> races, truncate races, fallocate() races, writes into holes, writes
>>> into preallocated regions, writes over shared extents that require
>>> copy-on-write, etc, etc, ad nausiem.
>>>
>>> If anyone is using filebacked mappings like this, then when it
>>> breaks they get to keep all the broken pieces to themselves.
>>
>> Should ftruncate("file") return ETXTBUSY in this case, so that users
>> and applications know this doesn't work/isn't safe?
> 
> No, it is better to block waiting for the GUP to release the
> reference (see below), but the general problem is that we cannot
> reliably discriminate GUP references from other page cache based
> references just by looking at the folio resident in the page cache.
Right. folio_maybe_dma_pinned() can have false positives for small 
folios, but also temporarily for large folios (speculative pins from 
GUP-fast).
In the future it might get more reliable at least for small folios when 
we are able to have a dedicated pincount.
(there is still the issue that some mechanisms that should be using 
pin_user_pages() are still using get_user_pages())
> 
> However, when FSDAX is being used, trucate does, in fact, block
> waiting for GUP references to be release. fsdax does not use page
> references to track in use pages - the filesystem metadata tracks
> allocated and free pages, not the mm/ subsystem. There are no
> page cache references to the pages, because there is no page
> cache. Hence we can use the difference between the map count and the
> reference count to determine if there are any references we cannot
> forcibly unmap (e.g. GUP) just by looking at the backing store folio
> state.
We can do the same for other folios as well. See folio_expected_ref_count().
Unexpected references can be from GUP, LRU caches, or other temporary 
ones from page migration, etc.
As we document for folio_expected_ref_count() it's racy for mapped 
folios, though: "Calling this function on a mapped folio will not result 
in a stable result, because nothing stops additional page table mappings 
from coming (e.g., fork()) or going (e.g., munmap())."
It only works reliably on unmapped folios when holding the folio lock.
-- 
Cheers
David / dhildenb
end of thread, other threads:[~2025-10-24  7:43 UTC | newest]
Thread overview: 15+ messages
2025-10-20 16:30 [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics Kiryl Shutsemau
2025-10-20 16:30 ` [PATCH 1/2] mm/memory: Do not populate page table entries beyond i_size Kiryl Shutsemau
2025-10-20 16:30 ` [PATCH 2/2] mm/truncate: Unmap large folio on split failure Kiryl Shutsemau
2025-10-20 23:28 ` [RFC, PATCH 0/2] Large folios vs. SIGBUS semantics Dave Chinner
2025-10-21  6:12   ` Christoph Hellwig
2025-10-21  6:17     ` Kiryl Shutsemau
2025-10-21  6:16   ` Kiryl Shutsemau
2025-10-23 10:35     ` Kiryl Shutsemau
2025-10-23 11:38     ` Dave Chinner
2025-10-23 15:48       ` Andreas Dilger
2025-10-24  6:50         ` Dave Chinner
2025-10-24  7:43           ` David Hildenbrand
  -- strict thread matches above, loose matches on Subject: below --
2025-10-21  6:35 [PATCH 1/2] mm/memory: Do not populate page table entries beyond i_size Kiryl Shutsemau
2025-10-21 12:08 ` David Hildenbrand
2025-10-21 12:28   ` Kiryl Shutsemau