[PATCHv2 0/2] Fix SIGBUS semantics with large folios

linux-fsdevel.vger.kernel.org archive mirror

* [PATCHv2 0/2] Fix SIGBUS semantics with large folios
@ 2025-10-23  9:32 Kiryl Shutsemau
  2025-10-23  9:32 ` [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size Kiryl Shutsemau
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Kiryl Shutsemau @ 2025-10-23  9:32 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kiryl Shutsemau

From: Kiryl Shutsemau <kas@kernel.org>

Accessing memory within a VMA, but beyond i_size rounded up to the next
page boundary, is supposed to generate SIGBUS.

Darrick reported[1] an xfstests regression in v6.18-rc1. generic/749
failed due to missing SIGBUS. This was caused by my recent changes that
try to fault in the whole folio where possible:

        19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
        357b92761d94 ("mm/filemap: map entire large folio faultaround")

These changes did not consider i_size when setting up PTEs, leading to
the xfstests breakage.

However, the problem has been present in the kernel for a long time -
since huge tmpfs was introduced in 2016. The kernel happily maps
PMD-sized folios as PMD without checking i_size. And huge=always tmpfs
allocates PMD-size folios on any writes.

I considered this corner case when I implemented huge tmpfs, and my
conclusion was that no one in their right mind should rely on receiving
a SIGBUS signal when accessing beyond i_size. I cannot imagine how it
could be useful for the workload.

But apparently filesystem folks care a lot about preserving strict
SIGBUS semantics.

Generic/749 was introduced[2] last year with reference to POSIX, but no
real workloads were mentioned. It also acknowledged the tmpfs deviation
from the test case.

POSIX indeed says[3]:

        References within the address range starting at pa and
        continuing for len bytes to whole pages following the end of an
        object shall result in delivery of a SIGBUS signal.
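
As a rough userspace illustration of the semantics quoted above (not part
of the patchset): the helper name read_raises_sigbus() and the use of /tmp
are assumptions made for this sketch. It maps a temporary file in a child
process, reads one byte, and reports whether the child died with SIGBUS.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a temporary file of file_size bytes, map
 * map_len bytes of it in a child process and read one byte at offset.
 * Returns 1 if the child was killed by SIGBUS, 0 if the read succeeded,
 * -1 on any setup failure.
 */
static int read_raises_sigbus(off_t file_size, size_t map_len, size_t offset)
{
	char path[] = "/tmp/sigbus-demo-XXXXXX";
	int fd, status, ret = -1;
	pid_t pid;

	assert(offset < map_len);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);

	if (ftruncate(fd, file_size) != 0)
		goto out;

	pid = fork();
	if (pid < 0)
		goto out;

	if (pid == 0) {
		/* Child: map the file and touch the requested byte. */
		void *map = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);
		volatile char c;

		if (map == MAP_FAILED)
			_exit(2);
		c = ((volatile char *)map)[offset];
		(void)c;
		_exit(0);
	}

	/* Parent: classify how the child terminated. */
	if (waitpid(pid, &status, 0) != pid)
		goto out;
	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGBUS)
		ret = 1;
	else if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
		ret = 0;
out:
	close(fd);
	return ret;
}
```

With a one-byte file and a two-page mapping, a read at offset 100 should
succeed (the rest of the partial page reads as zeroes), while a read in the
second page should raise SIGBUS on a kernel that enforces the strict
semantics.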

The patchset fixes the regression introduced by recent changes as well
as more subtle SIGBUS breakage due to split failure on truncation.

v2:
 - Fix try_to_unmap() flags;
 - Add warning if try_to_unmap() fails to unmap the folio;
 - Adjust comments and commit messages;
 - Whitespace fixes;
v1:
 - Drop RFC;
 - Add Signed-off-bys;

[1] https://lore.kernel.org/all/20251014175214.GW6188@frogsfrogsfrogs
[2] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/tests/generic/749?h=for-next&id=e4a6b119e5229599eac96235fb7e683b8a8bdc53
[3] https://pubs.opengroup.org/onlinepubs/9799919799/

Kiryl Shutsemau (2):
  mm/memory: Do not populate page table entries beyond i_size
  mm/truncate: Unmap large folio on split failure

 mm/filemap.c  | 18 ++++++++++--------
 mm/memory.c   | 13 +++++++++++--
 mm/truncate.c | 31 +++++++++++++++++++++++++------
 3 files changed, 46 insertions(+), 16 deletions(-)

-- 
2.50.1


* [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-23  9:32 [PATCHv2 0/2] Fix SIGBUS semantics with large folios Kiryl Shutsemau
@ 2025-10-23  9:32 ` Kiryl Shutsemau
  2025-10-23 20:49   ` Andrew Morton
                     ` (2 more replies)
  2025-10-23  9:32 ` [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure Kiryl Shutsemau
  2025-10-23 17:47 ` [PATCHv2 0/2] Fix SIGBUS semantics with large folios Darrick J. Wong
  2 siblings, 3 replies; 29+ messages in thread
From: Kiryl Shutsemau @ 2025-10-23  9:32 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kiryl Shutsemau

From: Kiryl Shutsemau <kas@kernel.org>

Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
supposed to generate SIGBUS.

Recent changes attempted to fault in the full folio where possible. They
did not respect i_size, which led to populating PTEs beyond i_size and
breaking SIGBUS semantics.

Darrick reported generic/749 breakage because of this.

However, the problem existed before the recent changes. With huge=always
tmpfs, any write to a file leads to a PMD-size allocation. A subsequent
fault on the folio installs a PMD mapping regardless of i_size.

Fix filemap_map_pages() and finish_fault() to not install:
  - PTEs beyond i_size;
  - PMD mappings across i_size;
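
The i_size check can be modelled in userspace roughly as follows. This is
only a sketch: div_round_up(), file_end_pages() and folio_within_file() are
hypothetical stand-ins for DIV_ROUND_UP(), the file_end computation and the
"file_end >= folio_next_index(folio)" test in the diff, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the kernel's DIV_ROUND_UP() for unsigned values. */
static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	assert(d != 0);
	return (n + d - 1) / d;
}

/*
 * Number of pages needed to cover i_size bytes. This is the value
 * finish_fault() stores in file_end in the patch; filemap_map_pages()
 * uses the same expression minus one, i.e. the index of the last page.
 */
static uint64_t file_end_pages(uint64_t i_size, uint64_t page_size)
{
	return div_round_up(i_size, page_size);
}

/*
 * Hypothetical stand-in for "file_end >= folio_next_index(folio)":
 * true if a folio starting at page index `index` and spanning
 * `nr_pages` pages does not extend past the page containing EOF.
 */
static bool folio_within_file(uint64_t index, uint64_t nr_pages,
			      uint64_t i_size, uint64_t page_size)
{
	uint64_t next_index = index + nr_pages;	/* folio_next_index() */

	return file_end_pages(i_size, page_size) >= next_index;
}
```

For example, with 4 KiB pages a 512-page folio at index 0 passes the check
for an i_size of 2 MiB, but not for an i_size of 2 MiB minus one page, in
which case the whole-folio mapping is skipped.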

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
Fixes: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
Fixes: 357b92761d94 ("mm/filemap: map entire large folio faultaround")
Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Reported-by: "Darrick J. Wong" <djwong@kernel.org>
---
 mm/filemap.c | 18 ++++++++++--------
 mm/memory.c  | 13 +++++++++++--
 2 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 13f0259d993c..0d251f6ab480 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3681,7 +3681,8 @@ static struct folio *next_uptodate_folio(struct xa_state *xas,
 static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 			struct folio *folio, unsigned long start,
 			unsigned long addr, unsigned int nr_pages,
-			unsigned long *rss, unsigned short *mmap_miss)
+			unsigned long *rss, unsigned short *mmap_miss,
+			pgoff_t file_end)
 {
 	unsigned int ref_from_caller = 1;
 	vm_fault_t ret = 0;
@@ -3697,7 +3698,8 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 	 */
 	addr0 = addr - start * PAGE_SIZE;
 	if (folio_within_vma(folio, vmf->vma) &&
-	    (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK)) {
+	    (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK) &&
+	    file_end >= folio_next_index(folio)) {
 		vmf->pte -= start;
 		page -= start;
 		addr = addr0;
@@ -3817,7 +3819,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	if (!folio)
 		goto out;
 
-	if (filemap_map_pmd(vmf, folio, start_pgoff)) {
+	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+	end_pgoff = min(end_pgoff, file_end);
+
+	if (file_end >= folio_next_index(folio) &&
+	    filemap_map_pmd(vmf, folio, start_pgoff)) {
 		ret = VM_FAULT_NOPAGE;
 		goto out;
 	}
@@ -3830,10 +3836,6 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		goto out;
 	}
 
-	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
-	if (end_pgoff > file_end)
-		end_pgoff = file_end;
-
 	folio_type = mm_counter_file(folio);
 	do {
 		unsigned long end;
@@ -3850,7 +3852,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		else
 			ret |= filemap_map_folio_range(vmf, folio,
 					xas.xa_index - folio->index, addr,
-					nr_pages, &rss, &mmap_miss);
+					nr_pages, &rss, &mmap_miss, file_end);
 
 		folio_unlock(folio);
 	} while ((folio = next_uptodate_folio(&xas, mapping, end_pgoff)) != NULL);
diff --git a/mm/memory.c b/mm/memory.c
index 74b45e258323..9bbe59e6922f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5480,6 +5480,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 	int type, nr_pages;
 	unsigned long addr;
 	bool needs_fallback = false;
+	pgoff_t file_end = -1UL;
 
 fallback:
 	addr = vmf->address;
@@ -5501,8 +5502,15 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 			return ret;
 	}
 
+	if (vma->vm_file) {
+		struct inode *inode = vma->vm_file->f_mapping->host;
+
+		file_end = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+	}
+
 	if (pmd_none(*vmf->pmd)) {
-		if (folio_test_pmd_mappable(folio)) {
+		if (folio_test_pmd_mappable(folio) &&
+		    file_end >= folio_next_index(folio)) {
 			ret = do_set_pmd(vmf, folio, page);
 			if (ret != VM_FAULT_FALLBACK)
 				return ret;
@@ -5533,7 +5541,8 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
 		if (unlikely(vma_off < idx ||
 			    vma_off + (nr_pages - idx) > vma_pages(vma) ||
 			    pte_off < idx ||
-			    pte_off + (nr_pages - idx)  > PTRS_PER_PTE)) {
+			    pte_off + (nr_pages - idx) > PTRS_PER_PTE ||
+			    file_end < folio_next_index(folio))) {
 			nr_pages = 1;
 		} else {
 			/* Now we can set mappings for the whole large folio. */
-- 
2.50.1



* Re: [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-23  9:32 ` [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size Kiryl Shutsemau
@ 2025-10-23 20:49   ` Andrew Morton
  2025-10-23 20:54     ` David Hildenbrand
  2025-10-24 15:42   ` David Hildenbrand
  2025-10-27  8:20   ` Hugh Dickins
  2 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2025-10-23 20:49 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: David Hildenbrand, Hugh Dickins, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Rik van Riel, Harry Yoo, Johannes Weiner, Shakeel Butt,
	Baolin Wang, Darrick J. Wong, Dave Chinner, linux-mm,
	linux-fsdevel, linux-kernel, Kiryl Shutsemau

On Thu, 23 Oct 2025 10:32:50 +0100 Kiryl Shutsemau <kirill@shutemov.name> wrote:

> From: Kiryl Shutsemau <kas@kernel.org>
> 
> Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
> supposed to generate SIGBUS.
> 
> Recent changes attempted to fault in full folio where possible. They did
> not respect i_size, which led to populating PTEs beyond i_size and
> breaking SIGBUS semantics.
> 
> Darrick reported generic/749 breakage because of this.
> 
> However, the problem existed before the recent changes. With huge=always
> tmpfs, any write to a file leads to PMD-size allocation. Following the
> fault-in of the folio will install PMD mapping regardless of i_size.
> 
> Fix filemap_map_pages() and finish_fault() to not install:
>   - PTEs beyond i_size;
>   - PMD mappings across i_size;
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> Fixes: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")

Sep 28 2025

> Fixes: 357b92761d94 ("mm/filemap: map entire large folio faultaround")

Sep 28 2025

> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")

Jul 26 2016

eek, what's this one doing here?  Are we asking -stable maintainers
to backport this patch into nine years worth of kernels?

I'll remove this Fixes: line for now...




* Re: [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-23 20:49   ` Andrew Morton
@ 2025-10-23 20:54     ` David Hildenbrand
  2025-10-23 21:36       ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-10-23 20:54 UTC (permalink / raw)
  To: Andrew Morton, Kiryl Shutsemau
  Cc: Hugh Dickins, Matthew Wilcox, Alexander Viro, Christian Brauner,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kiryl Shutsemau

On 23.10.25 22:49, Andrew Morton wrote:
> On Thu, 23 Oct 2025 10:32:50 +0100 Kiryl Shutsemau <kirill@shutemov.name> wrote:
> 
>> From: Kiryl Shutsemau <kas@kernel.org>
>>
>> Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
>> supposed to generate SIGBUS.
>>
>> Recent changes attempted to fault in full folio where possible. They did
>> not respect i_size, which led to populating PTEs beyond i_size and
>> breaking SIGBUS semantics.
>>
>> Darrick reported generic/749 breakage because of this.
>>
>> However, the problem existed before the recent changes. With huge=always
>> tmpfs, any write to a file leads to PMD-size allocation. Following the
>> fault-in of the folio will install PMD mapping regardless of i_size.
>>
>> Fix filemap_map_pages() and finish_fault() to not install:
>>    - PTEs beyond i_size;
>>    - PMD mappings across i_size;
>>
>> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
>> Fixes: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
> 
> Sep 28 2025
> 
>> Fixes: 357b92761d94 ("mm/filemap: map entire large folio faultaround")
> 
> Sep 28 2025
> 
>> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> 
> Jul 26 2016
> 
> eek, what's this one doing here?  Are we asking -stable maintainers
> to backport this patch into nine years worth of kernels?
> 
> I'll remove this Fixes: line for now...

Ehm, why? It sure is a fix for that. We can indicate to which stable
versions we want to have it backported.

And yes, it might be all stable kernels.

-- 
Cheers

David / dhildenb



* Re: [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-23 20:54     ` David Hildenbrand
@ 2025-10-23 21:36       ` Andrew Morton
  2025-10-24  9:26         ` Kiryl Shutsemau
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2025-10-23 21:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Kiryl Shutsemau, Hugh Dickins, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Rik van Riel, Harry Yoo, Johannes Weiner, Shakeel Butt,
	Baolin Wang, Darrick J. Wong, Dave Chinner, linux-mm,
	linux-fsdevel, linux-kernel, Kiryl Shutsemau

On Thu, 23 Oct 2025 22:54:49 +0200 David Hildenbrand <david@redhat.com> wrote:

> >> Fixes: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
> > 
> > Sep 28 2025
> > 
> >> Fixes: 357b92761d94 ("mm/filemap: map entire large folio faultaround")
> > 
> > Sep 28 2025
> > 
> >> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> > 
> > Jul 26 2016
> > 
> > eek, what's this one doing here?  Are we asking -stable maintainers
> > to backport this patch into nine years worth of kernels?
> > 
> > I'll remove this Fixes: line for now...
> 
> Ehm, why?

Because the Sep 28 2025 Fixes: totally fooled me and because this
doesn't apply to 6.17, let alone to 6.ancient.

> It sure is a fix for that. We can indicate to which stable 
> versions we want to have it backported.
> 
> And yes, it might be all stable kernels.

No probs, thanks for clarifying.  I'll restore the

	Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
	Cc: <stable@vger.kernel.org>

and shall let others sort out the backporting issues.


* Re: [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-23 21:36       ` Andrew Morton
@ 2025-10-24  9:26         ` Kiryl Shutsemau
  2025-10-26  4:54           ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Kiryl Shutsemau @ 2025-10-24  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Hugh Dickins, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Rik van Riel, Harry Yoo, Johannes Weiner, Shakeel Butt,
	Baolin Wang, Darrick J. Wong, Dave Chinner, linux-mm,
	linux-fsdevel, linux-kernel

On Thu, Oct 23, 2025 at 02:36:24PM -0700, Andrew Morton wrote:
> On Thu, 23 Oct 2025 22:54:49 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
> > >> Fixes: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
> > > 
> > > Sep 28 2025
> > > 
> > >> Fixes: 357b92761d94 ("mm/filemap: map entire large folio faultaround")
> > > 
> > > Sep 28 2025
> > > 
> > >> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> > > 
> > > Jul 26 2016
> > > 
> > > eek, what's this one doing here?  Are we asking -stable maintainers
> > > to backport this patch into nine years worth of kernels?
> > > 
> > > I'll remove this Fixes: line for now...
> > 
> > Ehm, why?
> 
> Because the Sep 28 2025 Fixes: totally fooled me and because this
> doesn't apply to 6.17, let alone to 6.ancient.
> 
> > It sure is a fix for that. We can indicate to which stable 
> > versions we want to have it backported.
> > 
> > And yes, it might be all stable kernels.
> 
> No probs, thanks for clarifying.  I'll restore the
> 
> 	Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> 	Cc: <stable@vger.kernel.org>
> 
> and shall let others sort out the backporting issues.

One possible way to relax the backporting requirements is to backport
only to kernels where we can have a writable file mapping on a filesystem
with backing storage (non-shmem).

Maybe

Fixes: 01c70267053d ("fs: add a filesystem flag for THPs")

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-24  9:26         ` Kiryl Shutsemau
@ 2025-10-26  4:54           ` Andrew Morton
  0 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2025-10-26  4:54 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: David Hildenbrand, Hugh Dickins, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Rik van Riel, Harry Yoo, Johannes Weiner, Shakeel Butt,
	Baolin Wang, Darrick J. Wong, Dave Chinner, linux-mm,
	linux-fsdevel, linux-kernel

On Fri, 24 Oct 2025 10:26:05 +0100 Kiryl Shutsemau <kirill@shutemov.name> wrote:

> > Because the Sep 28 2025 Fixes: totally fooled me and because this
> > doesn't apply to 6.17, let alone to 6.ancient.
> > 
> > > It sure is a fix for that. We can indicate to which stable 
> > > versions we want to have it backported.
> > > 
> > > And yes, it might be all stable kernels.
> > 
> > No probs, thanks for clarifying.  I'll restore the
> > 
> > 	Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> > 	Cc: <stable@vger.kernel.org>
> > 
> > and shall let others sort out the backporting issues.
> 
> One possible way to relax backporting requirements is only to backport
> to kernels where we can have writable file mapping to filesystem with a
> backing storage (non-shmem).
> 
> Maybe
> 
> Fixes: 01c70267053d ("fs: add a filesystem flag for THPs")

OK, thanks, I changed it to

Link: https://lkml.kernel.org/r/20251023093251.54146-2-kirill@shutemov.name
Fixes: 01c70267053d ("fs: add a filesystem flag for THPs")
Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
...
Cc: <stable@vger.kernel.org>


* Re: [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-23  9:32 ` [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size Kiryl Shutsemau
  2025-10-23 20:49   ` Andrew Morton
@ 2025-10-24 15:42   ` David Hildenbrand
  2025-10-24 19:32     ` Kirill A. Shutemov
  2025-10-27  8:20   ` Hugh Dickins
  2 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-10-24 15:42 UTC (permalink / raw)
  To: Kiryl Shutsemau, Andrew Morton, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kiryl Shutsemau

On 23.10.25 11:32, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
> 
> Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
> supposed to generate SIGBUS.
> 
> Recent changes attempted to fault in full folio where possible. They did
> not respect i_size, which led to populating PTEs beyond i_size and
> breaking SIGBUS semantics.
> 
> Darrick reported generic/749 breakage because of this.
> 
> However, the problem existed before the recent changes. With huge=always
> tmpfs, any write to a file leads to PMD-size allocation. Following the
> fault-in of the folio will install PMD mapping regardless of i_size.
> 
> Fix filemap_map_pages() and finish_fault() to not install:
>    - PTEs beyond i_size;
>    - PMD mappings across i_size;
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> Fixes: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
> Fixes: 357b92761d94 ("mm/filemap: map entire large folio faultaround")
> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> Reported-by: "Darrick J. Wong" <djwong@kernel.org>
> ---

Some of the code in here might deserve some cleanups IMHO :)

[...]

>   	addr0 = addr - start * PAGE_SIZE;
>   	if (folio_within_vma(folio, vmf->vma) &&
> -	    (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK)) {
> +	    (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK) &&

Isn't this just testing whether addr0 is aligned to folio_size(folio)? 
(given that we don't support folios > PMD_SIZE), like

	IS_ALIGNED(addr0, folio_size(folio))

Anyhow, unrelated to this patch ...



Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb



* Re: [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-24 15:42   ` David Hildenbrand
@ 2025-10-24 19:32     ` Kirill A. Shutemov
  2025-10-27  9:34       ` David Hildenbrand
  0 siblings, 1 reply; 29+ messages in thread
From: Kirill A. Shutemov @ 2025-10-24 19:32 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kiryl Shutsemau



On Fri, Oct 24, 2025, at 16:42, David Hildenbrand wrote:
> On 23.10.25 11:32, Kiryl Shutsemau wrote:
>>   	addr0 = addr - start * PAGE_SIZE;
>>   	if (folio_within_vma(folio, vmf->vma) &&
>> -	    (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK)) {
>> +	    (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK) &&
>
> Isn't this just testing whether addr0 is aligned to folio_size(folio)? 
> (given that we don't support folios > PMD_SIZE), like
>
> 	IS_ALIGNED(addr0, folio_size(folio))

Actually, no. The VMA may not be aligned to folio_size().
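
To illustrate the difference between the two checks, here is a small
userspace model. It is a sketch only: PMD_SIZE_MODEL assumes a 2 MiB PMD
(as on x86-64 with 4 KiB pages), and both helper names are made up for
this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PMD_SIZE_MODEL	(2ULL << 20)		/* assumed: 2 MiB PMD */
#define PMD_MASK_MODEL	(~(PMD_SIZE_MODEL - 1))

/* The check used in the patch: folio start and end share one PMD. */
static bool folio_in_single_pmd(uint64_t addr0, uint64_t folio_size)
{
	assert(folio_size != 0);
	return (addr0 & PMD_MASK_MODEL) ==
	       ((addr0 + folio_size - 1) & PMD_MASK_MODEL);
}

/* The suggested alternative: IS_ALIGNED(addr0, folio_size). */
static bool addr_aligned_to_folio(uint64_t addr0, uint64_t folio_size)
{
	/* Like IS_ALIGNED(), this only makes sense for power-of-two sizes. */
	assert(folio_size != 0 && (folio_size & (folio_size - 1)) == 0);
	return (addr0 & (folio_size - 1)) == 0;
}
```

A 16 KiB folio mapped at 0x201000 is not aligned to its own size, yet it
sits entirely within one PMD range, so the two predicates disagree for it.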

-- 
Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-24 19:32     ` Kirill A. Shutemov
@ 2025-10-27  9:34       ` David Hildenbrand
  0 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-10-27  9:34 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andrew Morton, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kiryl Shutsemau

On 24.10.25 21:32, Kirill A. Shutemov wrote:
> 
> 
> On Fri, Oct 24, 2025, at 16:42, David Hildenbrand wrote:
>> On 23.10.25 11:32, Kiryl Shutsemau wrote:
>>>    	addr0 = addr - start * PAGE_SIZE;
>>>    	if (folio_within_vma(folio, vmf->vma) &&
>>> -	    (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK)) {
>>> +	    (addr0 & PMD_MASK) == ((addr0 + folio_size(folio) - 1) & PMD_MASK) &&
>>
>> Isn't this just testing whether addr0 is aligned to folio_size(folio)?
>> (given that we don't support folios > PMD_SIZE), like
>>
>> 	IS_ALIGNED(addr0, folio_size(folio))
> 
> Actually, no. VMA can be not aligned to folio_size().

Ah, I missed that we can also have folio sizes besides PMD_SIZE here.

So it's all about testing whether the complete folio would be mapped by 
a single page table.

(a helper would be nice, but cannot immediately come up with a good name)

-- 
Cheers

David / dhildenb



* Re: [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-23  9:32 ` [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size Kiryl Shutsemau
  2025-10-23 20:49   ` Andrew Morton
  2025-10-24 15:42   ` David Hildenbrand
@ 2025-10-27  8:20   ` Hugh Dickins
  2025-10-27  9:14     ` Kiryl Shutsemau
  2025-10-27  9:22     ` David Hildenbrand
  2 siblings, 2 replies; 29+ messages in thread
From: Hugh Dickins @ 2025-10-27  8:20 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kiryl Shutsemau

On Thu, 23 Oct 2025, Kiryl Shutsemau wrote:

> From: Kiryl Shutsemau <kas@kernel.org>
> 
> Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
> supposed to generate SIGBUS.
> 
> Recent changes attempted to fault in full folio where possible. They did
> not respect i_size, which led to populating PTEs beyond i_size and
> breaking SIGBUS semantics.
> 
> Darrick reported generic/749 breakage because of this.
> 
> However, the problem existed before the recent changes. With huge=always
> tmpfs, any write to a file leads to PMD-size allocation. Following the
> fault-in of the folio will install PMD mapping regardless of i_size.
> 
> Fix filemap_map_pages() and finish_fault() to not install:
>   - PTEs beyond i_size;
>   - PMD mappings across i_size;

Sorry for coming in late as usual, and complicating matters.

> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> Fixes: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
> Fixes: 357b92761d94 ("mm/filemap: map entire large folio faultaround")

ACK to restoring the correct POSIX behaviour to those filesystems
which are being given large folios beyond EOF transparently,
without any huge= mount option to permit it.

> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")

But NAK to regressing the intentional behaviour of huge=always
on shmem/tmpfs: the page size, whenever possible, is PMD-sized.  In
6.18-rc huge=always is currently (thanks to Baolin) behaving correctly
again, as it had done for nine years: I insist we do not re-break it.

Andrew, please drop this version (and no need to worry about backports).

I'm guessing that yet another ugly shmem_file() or shmem_mapping()
exception should be good enough - I doubt you need to consider the
huge= option, just go by whether there is a huge folio already there -
though that would have an implication for the following patch.

(But what do I mean by "huge folio" above?  Do I mean large or do
I mean pmd_mappable?  It's the huge=always pmd_mappable folios I
care not to break, the mTHPy ones can be argued either way.)

Hugh


* Re: [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-27  8:20   ` Hugh Dickins
@ 2025-10-27  9:14     ` Kiryl Shutsemau
  2025-10-27  9:22     ` David Hildenbrand
  1 sibling, 0 replies; 29+ messages in thread
From: Kiryl Shutsemau @ 2025-10-27  9:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Rik van Riel, Harry Yoo, Johannes Weiner, Shakeel Butt,
	Baolin Wang, Darrick J. Wong, Dave Chinner, linux-mm,
	linux-fsdevel, linux-kernel

On Mon, Oct 27, 2025 at 01:20:42AM -0700, Hugh Dickins wrote:
> On Thu, 23 Oct 2025, Kiryl Shutsemau wrote:
> 
> > From: Kiryl Shutsemau <kas@kernel.org>
> > 
> > Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
> > supposed to generate SIGBUS.
> > 
> > Recent changes attempted to fault in full folio where possible. They did
> > not respect i_size, which led to populating PTEs beyond i_size and
> > breaking SIGBUS semantics.
> > 
> > Darrick reported generic/749 breakage because of this.
> > 
> > However, the problem existed before the recent changes. With huge=always
> > tmpfs, any write to a file leads to PMD-size allocation. Following the
> > fault-in of the folio will install PMD mapping regardless of i_size.
> > 
> > Fix filemap_map_pages() and finish_fault() to not install:
> >   - PTEs beyond i_size;
> >   - PMD mappings across i_size;
> 
> Sorry for coming in late as usual, and complicating matters.
> 
> > 
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> > Fixes: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
> > Fixes: 357b92761d94 ("mm/filemap: map entire large folio faultaround")
> 
> ACK to restoring the correct POSIX behaviour to those filesystems
> which are being given large folios beyond EOF transparently,
> without any huge= mount option to permit it.
> 
> > Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> 
> But NAK to regressing the intentional behaviour of huge=always
> on shmem/tmpfs: the page size, whenever possible, is PMD-sized.  In
> 6.18-rc huge=always is currently (thanks to Baolin) behaving correctly
> again, as it had done for nine years: I insist we do not re-break it.
> 
> Andrew, please drop this version (and no need to worry about backports).
> 
> I'm guessing that yet another ugly shmem_file() or shmem_mapping()
> exception should be good enough - I doubt you need to consider the
> huge= option, just go by whether there is a huge folio already there -
> though that would have an implication for the following patch.
> 
> (But what do I mean by "huge folio" above?  Do I mean large or do
> I mean pmd_mappable?  It's the huge=always pmd_mappable folios I
> care not to break, the mTHPy ones can be argued either way.)

I assume you want the same exception for the second patch as well?

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-27  8:20   ` Hugh Dickins
  2025-10-27  9:14     ` Kiryl Shutsemau
@ 2025-10-27  9:22     ` David Hildenbrand
  2025-10-29  8:31       ` Hugh Dickins
  1 sibling, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-10-27  9:22 UTC (permalink / raw)
  To: Hugh Dickins, Kiryl Shutsemau
  Cc: Andrew Morton, Matthew Wilcox, Alexander Viro, Christian Brauner,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kiryl Shutsemau

On 27.10.25 09:20, Hugh Dickins wrote:
> On Thu, 23 Oct 2025, Kiryl Shutsemau wrote:
> 
>> From: Kiryl Shutsemau <kas@kernel.org>
>>
>> Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
>> supposed to generate SIGBUS.
>>
>> Recent changes attempted to fault in full folio where possible. They did
>> not respect i_size, which led to populating PTEs beyond i_size and
>> breaking SIGBUS semantics.
>>
>> Darrick reported generic/749 breakage because of this.
>>
>> However, the problem existed before the recent changes. With huge=always
>> tmpfs, any write to a file leads to PMD-size allocation. Following the
>> fault-in of the folio will install PMD mapping regardless of i_size.
>>
>> Fix filemap_map_pages() and finish_fault() to not install:
>>    - PTEs beyond i_size;
>>    - PMD mappings across i_size;
> 
> Sorry for coming in late as usual, and complicating matters.
> 

No problem, we CCed you on earlier versions to get your input, and we 
were speculating that shmem behavior might be intended (one way or the 
other).

>>
>> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
>> Fixes: 19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
>> Fixes: 357b92761d94 ("mm/filemap: map entire large folio faultaround")
> 
> ACK to restoring the correct POSIX behaviour to those filesystems
> which are being given large folios beyond EOF transparently,
> without any huge= mount option to permit it.
> 
>> Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> 
> But NAK to regressing the intentional behaviour of huge=always
> on shmem/tmpfs: the page size, whenever possible, is PMD-sized.  In
> 6.18-rc huge=always is currently (thanks to Baolin) behaving correctly
> again, as it had done for nine years: I insist we do not re-break it.

Just so we are on the same page: this is not about which folio sizes we 
allocate (like what Baolin fixed) but what/how much to map.

I guess this patch would imply the following changes:

1) A file whose size is not PMD-aligned will not have its last
(unaligned) part mapped by PMDs.

2) Once a file grows, the part that was previously last would not be
mapped by PMDs.
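
To make the two points above concrete, here is a small userspace model
(not kernel code). It assumes x86_64 sizes (4KiB pages, 2MiB PMDs) and
assumes the v2 rule is "a PMD may be installed only if its whole range
ends at or before round_up(i_size, PAGE_SIZE)". The MODEL_* constants
and model_pmd_allowed_v2() are hypothetical names used for illustration
only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86_64 sizes; this is a userspace model, not kernel code. */
#define MODEL_PAGE_SIZE 4096ULL
#define MODEL_PMD_SIZE  (2ULL << 20)

/*
 * Hypothetical helper: may the PMD-sized file range with index
 * pmd_index be mapped by a single PMD under the assumed v2 rule,
 * i.e. does the range end at or before round_up(i_size, PAGE_SIZE)?
 */
static bool model_pmd_allowed_v2(uint64_t pmd_index, uint64_t i_size)
{
	uint64_t eof, end;

	/* Guard against overflow in the multiplication below. */
	assert(pmd_index < UINT64_MAX / MODEL_PMD_SIZE);

	/* i_size rounded up to the next page boundary. */
	eof = (i_size + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
	end = (pmd_index + 1) * MODEL_PMD_SIZE;

	return end <= eof;
}
```

With a 3MiB file, the range at index 0 qualifies and the range at index 1
(which crosses EOF) does not; index 1 only qualifies once the file has
grown to 4MiB, and an existing PTE mapping is not upgraded by the growth
itself.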


Of course, we would only have mapped the last part of the file by PMDs
if the VMA had been large enough in the first place. I'm curious: is
mapping beyond EOF something that applications commonly do with shmem
files?

-- 
Cheers

David / dhildenb



* Re: [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-27  9:22     ` David Hildenbrand
@ 2025-10-29  8:31       ` Hugh Dickins
  2025-10-29 10:11         ` Kiryl Shutsemau
  0 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2025-10-29  8:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Kiryl Shutsemau, Andrew Morton, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kiryl Shutsemau

On Mon, 27 Oct 2025, David Hildenbrand wrote:
...
> 
> Just so we are on the same page: this is not about which folio sizes we
> allocate (like what Baolin fixed) but what/how much to map.
> 
> I guess this patch here would imply the following changes
> 
> 1) A file with a size that is not PMD aligned will have the last (unaligned
> part) not mapped by PMDs.
> 
> 2) Once growing a file, the previously-last-part would not be mapped by PMDs.

Yes, the v2 patch was so, and the v3 patch fixes it.

khugepaged might have fixed it up later on, I suppose.

Hmm, does hpage_collapse_scan_file() or collapse_pte_mapped_thp()
want a modification, to prevent reinserting a PMD after a failed
non-shmem truncation folio_split?  And collapse_file() after a
successful non-shmem truncation folio_split?

Conversely, shouldn't MADV_COLLAPSE be happy to give you a PMD
if the map size permits, even when spanning EOF?

> 
> Of course, we would have only mapped the last part of the file by PMDs if the
> VMA would have been large enough in the first place. I'm curious, is that
> something that is commonly done by applications with shmem files (map beyond
> eof)?

Setting aside the very common case of mapping a fraction of PAGE_SIZE
beyond EOF...

I do not know whether it's common to map a >= PAGE_SIZE fraction of
HPAGE_PMD_SIZE beyond EOF, but it has often been sensible to do so.
For example, imagine (using x86_64 numbers) a 4MiB map of a 3MiB
file on huge tmpfs, requiring two TLB entries for the whole file.
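
The arithmetic behind that example can be sketched with a userspace
model (not kernel code). It assumes 4KiB pages and 2MiB PMDs, a mapping
that starts at file offset 0 on a PMD-aligned address, every page of the
file being touched, and one TLB entry per installed PMD or PTE. The
MODEL_* constants and model_tlb_entries() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86_64 sizes; this is a userspace model, not kernel code. */
#define MODEL_PAGE_SIZE 4096ULL
#define MODEL_PMD_SIZE  (2ULL << 20)

/*
 * Hypothetical helper: count the page table entries (and so, in this
 * model, TLB entries) needed to cover the file through a mapping of
 * map_len bytes starting at offset 0.
 */
static uint64_t model_tlb_entries(uint64_t file_size, uint64_t map_len,
				  bool pmd_may_span_eof)
{
	uint64_t eof, limit, entries = 0;

	assert(map_len % MODEL_PAGE_SIZE == 0);

	/* i_size rounded up to the next page boundary. */
	eof = (file_size + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
	/* Only pages below both EOF and the end of the mapping count. */
	limit = eof < map_len ? eof : map_len;

	for (uint64_t start = 0; start < limit; start += MODEL_PMD_SIZE) {
		uint64_t end = start + MODEL_PMD_SIZE;
		bool fits_vma = end <= map_len;
		bool below_eof = end <= eof;

		if (fits_vma && (below_eof || pmd_may_span_eof))
			entries += 1;	/* one PMD covers the range */
		else			/* fall back to one PTE per page */
			entries += (limit - start) / MODEL_PAGE_SIZE;
	}

	return entries;
}
```

In this model a 4MiB map of a 3MiB file needs 2 entries when a PMD may
span EOF, but 1 PMD plus 256 PTEs when it may not. A 3MiB map of the
same file needs the 257 entries either way, which is why mapping the
extra 1MiB beyond EOF can be the sensible choice.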

Hugh


* Re: [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-29  8:31       ` Hugh Dickins
@ 2025-10-29 10:11         ` Kiryl Shutsemau
  2025-10-30  5:59           ` Hugh Dickins
  0 siblings, 1 reply; 29+ messages in thread
From: Kiryl Shutsemau @ 2025-10-29 10:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Rik van Riel, Harry Yoo, Johannes Weiner, Shakeel Butt,
	Baolin Wang, Darrick J. Wong, Dave Chinner, linux-mm,
	linux-fsdevel, linux-kernel

On Wed, Oct 29, 2025 at 01:31:45AM -0700, Hugh Dickins wrote:
> On Mon, 27 Oct 2025, David Hildenbrand wrote:
> ...
> > 
> > Just so we are on the same page: this is not about which folio sizes we
> > allocate (like what Baolin fixed) but what/how much to map.
> > 
> > I guess this patch here would imply the following changes
> > 
> > 1) A file with a size that is not PMD aligned will have the last (unaligned
> > part) not mapped by PMDs.
> > 
> > 2) Once growing a file, the previously-last-part would not be mapped by PMDs.
> 
> Yes, the v2 patch was so, and the v3 patch fixes it.
> 
> khugepaged might have fixed it up later on, I suppose.
> 
> Hmm, does hpage_collapse_scan_file() or collapse_pte_mapped_thp()
> want a modification, to prevent reinserting a PMD after a failed
> non-shmem truncation folio_split?  And collapse_file() after a
> successful non-shmem truncation folio_split?

I operated on the assumption that file collapse is still lazy, as I
wrote it back in the day, and doesn't install PMDs. That *seems* to be
true for khugepaged, but not for MADV_COLLAPSE.

Hm...

> Conversely, shouldn't MADV_COLLAPSE be happy to give you a PMD
> if the map size permits, even when spanning EOF?

Filesystem folks say that allowing the folio to be mapped beyond
round_up(i_size, PAGE_SIZE) is a correctness issue, not only a POSIX
violation.
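
For reference, the boundary in question can be written down as a tiny
userspace model (not kernel code). It assumes 4KiB pages and a mapping
that starts at file offset 0; MODEL_PAGE_SIZE, model_sigbus_boundary()
and model_access_gets_sigbus() are hypothetical names, and accesses
outside the VMA (SIGSEGV) are deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size; this is a userspace model, not kernel code. */
#define MODEL_PAGE_SIZE 4096ULL

/* First offset that must fault: round_up(i_size, PAGE_SIZE). */
static uint64_t model_sigbus_boundary(uint64_t i_size)
{
	return (i_size + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
}

/*
 * Hypothetical helper: is an access at 'offset' (inside a VMA of
 * vma_len bytes mapping the file from offset 0) expected to raise
 * SIGBUS under the strict semantics discussed in this thread?
 */
static bool model_access_gets_sigbus(uint64_t offset, uint64_t i_size,
				     uint64_t vma_len)
{
	/* Accesses outside the VMA are a different failure mode. */
	assert(offset < vma_len);

	return offset >= model_sigbus_boundary(i_size);
}
```

For a 5000-byte file mapped with a 16KiB VMA, offsets up to 8191 are
fine (the tail of the last page reads as zeroes), and the first offset
expected to raise SIGBUS is 8192.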

I am considering dropping 'install_pmd' from collapse_pte_mapped_thp() so
that the fault path is the source of truth on whether a PMD can be
installed.

Objections?

> > Of course, we would have only mapped the last part of the file by PMDs if the
> > VMA would have been large enough in the first place. I'm curious, is that
> > something that is commonly done by applications with shmem files (map beyond
> > eof)?
> 
> Setting aside the very common case of mapping a fraction of PAGE_SIZE
> beyond EOF...
> 
> I do not know whether it's common to map a >= PAGE_SIZE fraction of
> HPAGE_PMD_SIZE beyond EOF, but it has often been sensible to do so.
> For example, imagine (using x86_64 numbers) a 4MiB map of a 3MiB
> file on huge tmpfs, requiring two TLB entries for the whole file.

I am all for ignoring POSIX here. But I am in the minority.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-29 10:11         ` Kiryl Shutsemau
@ 2025-10-30  5:59           ` Hugh Dickins
  2025-10-30 17:08             ` Kiryl Shutsemau
  0 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2025-10-30  5:59 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Hugh Dickins, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel

On Wed, 29 Oct 2025, Kiryl Shutsemau wrote:
> On Wed, Oct 29, 2025 at 01:31:45AM -0700, Hugh Dickins wrote:
> > On Mon, 27 Oct 2025, David Hildenbrand wrote:
> > ...
> > > 
> > > Just so we are on the same page: this is not about which folio sizes we
> > > allocate (like what Baolin fixed) but what/how much to map.
> > > 
> > > I guess this patch here would imply the following changes
> > > 
> > > 1) A file with a size that is not PMD aligned will have the last (unaligned
> > > part) not mapped by PMDs.
> > > 
> > > 2) Once growing a file, the previously-last-part would not be mapped by PMDs.
> > 
> > Yes, the v2 patch was so, and the v3 patch fixes it.
> > 
> > khugepaged might have fixed it up later on, I suppose.
> > 
> > Hmm, does hpage_collapse_scan_file() or collapse_pte_mapped_thp()
> > want a modification, to prevent reinserting a PMD after a failed
> > non-shmem truncation folio_split?  And collapse_file() after a
> > successful non-shmem truncation folio_split?
> 
> I operated from an assumption that file collapse is still lazy as I
> wrote it back it the days and doesn't install PMDs. It *seems* to be
> true for khugepaged, but not MADV_COLLAPSE.
> 
> Hm...
> 
> > Conversely, shouldn't MADV_COLLAPSE be happy to give you a PMD
> > if the map size permits, even when spanning EOF?
> 
> Filesystem folks say allowing the folio to be mapped beyond
> round_up(i_size, PAGE_SIZE) is a correctness issue, not only POSIX
> violation.
> 
> I consider dropping 'install_pmd' from collapse_pte_mapped_thp() so the
> fault path is source of truth of whether PMD can be installed or not.

(Didn't you yourself just recently enhance that?)

> 
> Objections?

Yes, I would probably object (or perhaps want to allow until EOF);
but now it looks to me like we can agree no change is needed there.

I was mistaken in raising those khugepaged/MADV_COLLAPSE doubts,
because file_thp_enabled(vma) is checked in the !shmem !anonymous
!dax case, and file_thp_enabled(vma) still limits to
CONFIG_READ_ONLY_THP_FOR_FS=y, refusing to allow collapse if anyone
has the file open for writing (and you cannot truncate or hole-punch
without write permission); and pagecache is invalidated afterwards
if there are any THPs when reopened for writing (presumably for
page_mkwrite()-ish consistency reasons, which you interestingly
pointed to in another mail where I had worried about ENOSPC after
split failure).

But shmem is simple, does not use page_mkwrite(), and is fine to
continue with install_pmd here, just as it's fine to continue
with huge page spanning EOF as you're now allowing in v3.

But please double check my conclusion there, it's so easy to
get lost in the maze of hugepage permissions and prohibitions.

Hugh


* Re: [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size
  2025-10-30  5:59           ` Hugh Dickins
@ 2025-10-30 17:08             ` Kiryl Shutsemau
  0 siblings, 0 replies; 29+ messages in thread
From: Kiryl Shutsemau @ 2025-10-30 17:08 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Hildenbrand, Andrew Morton, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Rik van Riel, Harry Yoo, Johannes Weiner, Shakeel Butt,
	Baolin Wang, Darrick J. Wong, Dave Chinner, linux-mm,
	linux-fsdevel, linux-kernel

On Wed, Oct 29, 2025 at 10:59:24PM -0700, Hugh Dickins wrote:
> On Wed, 29 Oct 2025, Kiryl Shutsemau wrote:
> > On Wed, Oct 29, 2025 at 01:31:45AM -0700, Hugh Dickins wrote:
> > > On Mon, 27 Oct 2025, David Hildenbrand wrote:
> > > ...
> > > > 
> > > > Just so we are on the same page: this is not about which folio sizes we
> > > > allocate (like what Baolin fixed) but what/how much to map.
> > > > 
> > > > I guess this patch here would imply the following changes
> > > > 
> > > > 1) A file with a size that is not PMD aligned will have the last (unaligned
> > > > part) not mapped by PMDs.
> > > > 
> > > > 2) Once growing a file, the previously-last-part would not be mapped by PMDs.
> > > 
> > > Yes, the v2 patch was so, and the v3 patch fixes it.
> > > 
> > > khugepaged might have fixed it up later on, I suppose.
> > > 
> > > Hmm, does hpage_collapse_scan_file() or collapse_pte_mapped_thp()
> > > want a modification, to prevent reinserting a PMD after a failed
> > > non-shmem truncation folio_split?  And collapse_file() after a
> > > successful non-shmem truncation folio_split?
> > 
> > I operated from an assumption that file collapse is still lazy as I
> > wrote it back it the days and doesn't install PMDs. It *seems* to be
> > true for khugepaged, but not MADV_COLLAPSE.
> > 
> > Hm...
> > 
> > > Conversely, shouldn't MADV_COLLAPSE be happy to give you a PMD
> > > if the map size permits, even when spanning EOF?
> > 
> > Filesystem folks say allowing the folio to be mapped beyond
> > round_up(i_size, PAGE_SIZE) is a correctness issue, not only POSIX
> > violation.
> > 
> > I consider dropping 'install_pmd' from collapse_pte_mapped_thp() so the
> > fault path is source of truth of whether PMD can be installed or not.
> 
> (Didn't you yourself just recently enhance that?)

I failed to adjust my mental model :P

> > 
> > Objections?
> 
> Yes, I would probably object (or perhaps want to allow until EOF);
> but now it looks to me like we can agree no change is needed there.
> 
> I was mistaken in raising those khugepaged/MADV_COLLAPSE doubts,
> because file_thp_enabled(vma) is checked in the !shmem !anonymous
> !dax case, and file_thp_enabled(vma) still limits to
> CONFIG_READ_ONLY_THP_FOR_FS=y, refusing to allow collapse if anyone
> has the file open for writing (and you cannot truncate or hole-punch
> without write permission); and pagecache is invalidated afterwards
> if there are any THPs when reopened for writing (presumably for
> page_mkwrite()-ish consistency reasons, which you interestingly
> pointed to in another mail where I had worried about ENOSPC after
> split failure).
> 
> But shmem is simple, does not use page_mkwrite(), and is fine to
> continue with install_pmd here, just as it's fine to continue
> with huge page spanning EOF as you're now allowing in v3.
> 
> But please double check my conclusion there, it's so easy to
> get lost in the maze of hugepage permissions and prohibitions.

Your analysis looks correct to me.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure
  2025-10-23  9:32 [PATCHv2 0/2] Fix SIGBUS semantics with large folios Kiryl Shutsemau
  2025-10-23  9:32 ` [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size Kiryl Shutsemau
@ 2025-10-23  9:32 ` Kiryl Shutsemau
  2025-10-23 20:56   ` Andrew Morton
                     ` (2 more replies)
  2025-10-23 17:47 ` [PATCHv2 0/2] Fix SIGBUS semantics with large folios Darrick J. Wong
  2 siblings, 3 replies; 29+ messages in thread
From: Kiryl Shutsemau @ 2025-10-23  9:32 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kiryl Shutsemau

From: Kiryl Shutsemau <kas@kernel.org>

Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
supposed to generate SIGBUS.

This behavior might not be respected on truncation.

During truncation, the kernel splits a large folio in order to reclaim
memory. As a side effect, it unmaps the folio and destroys PMD mappings
of the folio. The folio will be refaulted as PTEs and SIGBUS semantics
are preserved.

However, if the split fails, PMD mappings are preserved and the user
will not receive SIGBUS on any accesses within the PMD.

Unmap the folio on split failure. It will lead to refault as PTEs and
preserve SIGBUS semantics.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 mm/truncate.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 91eb92a5ce4f..304c383ccbf0 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -177,6 +177,28 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
 	return 0;
 }
 
+static int try_folio_split_or_unmap(struct folio *folio, struct page *split_at)
+{
+	enum ttu_flags ttu_flags =
+		TTU_SYNC |
+		TTU_SPLIT_HUGE_PMD |
+		TTU_IGNORE_MLOCK;
+	int ret;
+
+	ret = try_folio_split(folio, split_at, NULL);
+
+	/*
+	 * If the split fails, unmap the folio, so it will be refaulted
+	 * with PTEs to respect SIGBUS semantics.
+	 */
+	if (ret) {
+		try_to_unmap(folio, ttu_flags);
+		WARN_ON(folio_mapped(folio));
+	}
+
+	return ret;
+}
+
 /*
  * Handle partial folios.  The folio may be entirely within the
  * range if a split has raced with us.  If not, we zero the part of the
@@ -224,7 +246,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 		return true;
 
 	split_at = folio_page(folio, PAGE_ALIGN_DOWN(offset) / PAGE_SIZE);
-	if (!try_folio_split(folio, split_at, NULL)) {
+	if (!try_folio_split_or_unmap(folio, split_at)) {
 		/*
 		 * try to split at offset + length to make sure folios within
 		 * the range can be dropped, especially to avoid memory waste
@@ -248,13 +270,10 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 		if (!folio_trylock(folio2))
 			goto out;
 
-		/*
-		 * make sure folio2 is large and does not change its mapping.
-		 * Its split result does not matter here.
-		 */
+		/* make sure folio2 is large and does not change its mapping */
 		if (folio_test_large(folio2) &&
 		    folio2->mapping == folio->mapping)
-			try_folio_split(folio2, split_at2, NULL);
+			try_folio_split_or_unmap(folio2, split_at2);
 
 		folio_unlock(folio2);
 out:
-- 
2.50.1



* Re: [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure
  2025-10-23  9:32 ` [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure Kiryl Shutsemau
@ 2025-10-23 20:56   ` Andrew Morton
  2025-10-24  9:05     ` Kiryl Shutsemau
  2025-10-24 15:43   ` David Hildenbrand
  2025-10-27 10:10   ` Hugh Dickins
  2 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2025-10-23 20:56 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: David Hildenbrand, Hugh Dickins, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Rik van Riel, Harry Yoo, Johannes Weiner, Shakeel Butt,
	Baolin Wang, Darrick J. Wong, Dave Chinner, linux-mm,
	linux-fsdevel, linux-kernel, Kiryl Shutsemau

On Thu, 23 Oct 2025 10:32:51 +0100 Kiryl Shutsemau <kirill@shutemov.name> wrote:

> Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
> supposed to generate SIGBUS.
> 
> This behavior might not be respected on truncation.
> 
> During truncation, the kernel splits a large folio in order to reclaim
> memory. As a side effect, it unmaps the folio and destroys PMD mappings
> of the folio. The folio will be refaulted as PTEs and SIGBUS semantics
> are preserved.
> 
> However, if the split fails, PMD mappings are preserved and the user
> will not receive SIGBUS on any accesses within the PMD.
> 
> Unmap the folio on split failure. It will lead to refault as PTEs and
> preserve SIGBUS semantics.

This conflicts significantly with mm-hotfixes's
https://lore.kernel.org/all/20251017013630.139907-1-ziy@nvidia.com/T/#u,
which is cc:stable.

What to do here?


* Re: [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure
  2025-10-23 20:56   ` Andrew Morton
@ 2025-10-24  9:05     ` Kiryl Shutsemau
  0 siblings, 0 replies; 29+ messages in thread
From: Kiryl Shutsemau @ 2025-10-24  9:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Hildenbrand, Hugh Dickins, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Rik van Riel, Harry Yoo, Johannes Weiner, Shakeel Butt,
	Baolin Wang, Darrick J. Wong, Dave Chinner, linux-mm,
	linux-fsdevel, linux-kernel

On Thu, Oct 23, 2025 at 01:56:44PM -0700, Andrew Morton wrote:
> On Thu, 23 Oct 2025 10:32:51 +0100 Kiryl Shutsemau <kirill@shutemov.name> wrote:
> 
> > Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
> > supposed to generate SIGBUS.
> > 
> > This behavior might not be respected on truncation.
> > 
> > During truncation, the kernel splits a large folio in order to reclaim
> > memory. As a side effect, it unmaps the folio and destroys PMD mappings
> > of the folio. The folio will be refaulted as PTEs and SIGBUS semantics
> > are preserved.
> > 
> > However, if the split fails, PMD mappings are preserved and the user
> > will not receive SIGBUS on any accesses within the PMD.
> > 
> > Unmap the folio on split failure. It will lead to refault as PTEs and
> > preserve SIGBUS semantics.
> 
> This conflicts significantly with mm-hotfixes's
> https://lore.kernel.org/all/20251017013630.139907-1-ziy@nvidia.com/T/#u,
> whcih is cc:stable.
> 
> What do do here?

The patch below applies cleanly onto mm-everything.

Let me know if you want to solve the conflict the other way around. I can
rebase Zi's patch on top of my patchset.

From 3ebc2c6690928def2b123e5f44014c02011cfc65 Mon Sep 17 00:00:00 2001
From: Kiryl Shutsemau <kas@kernel.org>
Date: Mon, 20 Oct 2025 14:08:21 +0100
Subject: [PATCH] mm/truncate: Unmap large folio on split failure

Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
supposed to generate SIGBUS.

This behavior might not be respected on truncation.

During truncation, the kernel splits a large folio in order to reclaim
memory. As a side effect, it unmaps the folio and destroys PMD mappings
of the folio. The folio will be refaulted as PTEs and SIGBUS semantics
are preserved.

However, if the split fails, PMD mappings are preserved and the user
will not receive SIGBUS on any accesses within the PMD.

Unmap the folio on split failure. It will lead to refault as PTEs and
preserve SIGBUS semantics.

Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
---
 mm/truncate.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 9210cf808f5c..6936b8e88e72 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -177,6 +177,29 @@ int truncate_inode_folio(struct address_space *mapping, struct folio *folio)
 	return 0;
 }
 
+static int try_folio_split_or_unmap(struct folio *folio, struct page *split_at,
+				    unsigned long min_order)
+{
+	enum ttu_flags ttu_flags =
+		TTU_SYNC |
+		TTU_SPLIT_HUGE_PMD |
+		TTU_IGNORE_MLOCK;
+	int ret;
+
+	ret = try_folio_split_to_order(folio, split_at, min_order);
+
+	/*
+	 * If the split fails, unmap the folio, so it will be refaulted
+	 * with PTEs to respect SIGBUS semantics.
+	 */
+	if (ret) {
+		try_to_unmap(folio, ttu_flags);
+		WARN_ON(folio_mapped(folio));
+	}
+
+	return ret;
+}
+
 /*
  * Handle partial folios.  The folio may be entirely within the
  * range if a split has raced with us.  If not, we zero the part of the
@@ -226,7 +249,7 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 
 	min_order = mapping_min_folio_order(folio->mapping);
 	split_at = folio_page(folio, PAGE_ALIGN_DOWN(offset) / PAGE_SIZE);
-	if (!try_folio_split_to_order(folio, split_at, min_order)) {
+	if (!try_folio_split_or_unmap(folio, split_at, min_order)) {
 		/*
 		 * try to split at offset + length to make sure folios within
 		 * the range can be dropped, especially to avoid memory waste
@@ -250,13 +273,10 @@ bool truncate_inode_partial_folio(struct folio *folio, loff_t start, loff_t end)
 		if (!folio_trylock(folio2))
 			goto out;
 
-		/*
-		 * make sure folio2 is large and does not change its mapping.
-		 * Its split result does not matter here.
-		 */
+		/* make sure folio2 is large and does not change its mapping */
 		if (folio_test_large(folio2) &&
 		    folio2->mapping == folio->mapping)
-			try_folio_split_to_order(folio2, split_at2, min_order);
+			try_folio_split_or_unmap(folio2, split_at2, min_order);
 
 		folio_unlock(folio2);
 out:
-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure
  2025-10-23  9:32 ` [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure Kiryl Shutsemau
  2025-10-23 20:56   ` Andrew Morton
@ 2025-10-24 15:43   ` David Hildenbrand
  2025-10-27 10:10   ` Hugh Dickins
  2 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-10-24 15:43 UTC (permalink / raw)
  To: Kiryl Shutsemau, Andrew Morton, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner
  Cc: Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kiryl Shutsemau

On 23.10.25 11:32, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
> 
> Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
> supposed to generate SIGBUS.
> 
> This behavior might not be respected on truncation.
> 
> During truncation, the kernel splits a large folio in order to reclaim
> memory. As a side effect, it unmaps the folio and destroys PMD mappings
> of the folio. The folio will be refaulted as PTEs and SIGBUS semantics
> are preserved.
> 
> However, if the split fails, PMD mappings are preserved and the user
> will not receive SIGBUS on any accesses within the PMD.
> 
> Unmap the folio on split failure. It will lead to refault as PTEs and
> preserve SIGBUS semantics.
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> ---

Thanks!

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb



* Re: [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure
  2025-10-23  9:32 ` [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure Kiryl Shutsemau
  2025-10-23 20:56   ` Andrew Morton
  2025-10-24 15:43   ` David Hildenbrand
@ 2025-10-27 10:10   ` Hugh Dickins
  2025-10-27 10:38     ` David Hildenbrand
  2025-10-27 10:40     ` Kiryl Shutsemau
  2 siblings, 2 replies; 29+ messages in thread
From: Hugh Dickins @ 2025-10-27 10:10 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kiryl Shutsemau

On Thu, 23 Oct 2025, Kiryl Shutsemau wrote:

> From: Kiryl Shutsemau <kas@kernel.org>
> 
> Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
> supposed to generate SIGBUS.
> 
> This behavior might not be respected on truncation.
> 
> During truncation, the kernel splits a large folio in order to reclaim
> memory. As a side effect, it unmaps the folio and destroys PMD mappings
> of the folio. The folio will be refaulted as PTEs and SIGBUS semantics
> are preserved.
> 
> However, if the split fails, PMD mappings are preserved and the user
> will not receive SIGBUS on any accesses within the PMD.
> 
> Unmap the folio on split failure. It will lead to refault as PTEs and
> preserve SIGBUS semantics.
> 
> Signed-off-by: Kiryl Shutsemau <kas@kernel.org>

It's taking me too long to understand what truncate_inode_partial_folio()
had become before your changes, to be very sure of your changes to it.

But if your commit does indeed achieve what's intended, then I have
no objection to it applying to shmem/tmpfs as well as other filesystems:
we always hope that a split will succeed, so I don't mind you tightening
up what is done when it fails.

However, a few points that have occurred to me...

If 1/2's exception for shmem/tmpfs huge=always does the simple thing,
of just judging by whether a huge page is already there in the file
(without reference to mount option), which I think is okay: then
this 2/2 will not be doing anything useful on shmem/tmpfs, because
when the split fails, the huge page will remain, and after 2/2's
unmap it will just get remapped by PMD again afterwards, so why
bother to unmap at all in the shmem/tmpfs case?

But it's arguable whether it would then be worth making an
exception for shmem/tmpfs here in 2/2 - how much do we care about
optimizing failed splits?  At least a comment I guess, but you
might prefer to do it quite differently.

Aside from shmem/tmpfs, it does seem to me that this patch is
doing more work than it needs to (but how many lines of source
do we want to add to avoid doing work in the failed split case?):

The intent is to enable SIGBUS beyond EOF: but the changes are
being applied unnecessarily to hole-punch in addition to truncation.

Does the folio2 part actually need to unmap again?  And if it does,
then what about when its trylock failed?  But it's hole-punch anyway.

And a final nit: I'd delete that WARN_ON(folio_mapped(folio)) myself,
all it could ever achieve is perhaps a very rare syzbot report which
nobody would want to spend time on.

Hugh


* Re: [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure
  2025-10-27 10:10   ` Hugh Dickins
@ 2025-10-27 10:38     ` David Hildenbrand
  2025-10-27 10:40     ` Kiryl Shutsemau
  1 sibling, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-10-27 10:38 UTC (permalink / raw)
  To: Hugh Dickins, Kiryl Shutsemau
  Cc: Andrew Morton, Matthew Wilcox, Alexander Viro, Christian Brauner,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel,
	Kiryl Shutsemau

> 
> And a final nit: I'd delete that WARN_ON(folio_mapped(folio)) myself,
> all it could ever achieve is perhaps a very rare syzbot report which
> nobody would want to spend time on.

I think that's the crucial part: what are we supposed to do if both
splitting and unmapping fail? Silently violate SIGBUS semantics like we
did before this patch?

So far our understanding was that for file pages, unmapping should not 
fail, but I am not 100% sure if that's really the case.

-- 
Cheers

David / dhildenb



* Re: [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure
  2025-10-27 10:10   ` Hugh Dickins
  2025-10-27 10:38     ` David Hildenbrand
@ 2025-10-27 10:40     ` Kiryl Shutsemau
  2025-10-29  9:12       ` Hugh Dickins
  1 sibling, 1 reply; 29+ messages in thread
From: Kiryl Shutsemau @ 2025-10-27 10:40 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Rik van Riel, Harry Yoo, Johannes Weiner, Shakeel Butt,
	Baolin Wang, Darrick J. Wong, Dave Chinner, linux-mm,
	linux-fsdevel, linux-kernel

On Mon, Oct 27, 2025 at 03:10:29AM -0700, Hugh Dickins wrote:
> On Thu, 23 Oct 2025, Kiryl Shutsemau wrote:
> 
> > From: Kiryl Shutsemau <kas@kernel.org>
> > 
> > Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
> > supposed to generate SIGBUS.
> > 
> > This behavior might not be respected on truncation.
> > 
> > During truncation, the kernel splits a large folio in order to reclaim
> > memory. As a side effect, it unmaps the folio and destroys PMD mappings
> > of the folio. The folio will be refaulted as PTEs and SIGBUS semantics
> > are preserved.
> > 
> > However, if the split fails, PMD mappings are preserved and the user
> > will not receive SIGBUS on any accesses within the PMD.
> > 
> > Unmap the folio on split failure. It will lead to refault as PTEs and
> > preserve SIGBUS semantics.
> > 
> > Signed-off-by: Kiryl Shutsemau <kas@kernel.org>
> 
> It's taking me too long to understand what truncate_inode_partial_folio()
> had become before your changes, to be very sure of your changes to it.
> 
> But if your commit does indeed achieve what's intended, then I have
> no objection to it applying to shmem/tmpfs as well as other filesystems:
> we always hope that a split will succeed, so I don't mind you tightening
> up what is done when it fails.
> 
> However, a few points that have occurred to me...
> 
> If 1/2's exception for shmem/tmpfs huge=always does the simple thing,
> of just judging by whether a huge page is already there in the file
> (without reference to mount option), which I think is okay: then
> this 2/2 will not be doing anything useful on shmem/tmpfs, because
> when the split fails, the huge page will remain, and after 2/2's
> unmap it will just get remapped by PMD again afterwards, so why
> bother to unmap at all in the shmem/tmpfs case?
> 
> But it's arguable whether it would then be worth making an
> exception for shmem/tmpfs here in 2/2 - how much do we care about
> optimizing failed splits?  At least a comment I guess, but you
> might prefer to do it quite differently.

It is easy enough to skip unmap for shmem.

> Aside from shmem/tmpfs, it does seem to me that this patch is
> doing more work than it needs to (but how many lines of source
> do we want to add to avoid doing work in the failed split case?):
> 
> The intent is to enable SIGBUS beyond EOF: but the changes are
> being applied unnecessarily to hole-punch in addition to truncation.

I am not sure how much it should apply to hole-punch. Filesystem folks talk
about writing to a folio beyond round_up(i_size, PAGE_SIZE) being
problematic for correctness. I have no clue if the same applies to
writing to hole-punched parts of the folio.
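
To make the round_up(i_size, PAGE_SIZE) boundary concrete, here is a
minimal userspace sketch of the arithmetic. The helper names
(round_up_pow2, sigbus_boundary, exposed_by_pmd_mapping) are hypothetical
and not kernel APIs, and the 4 KiB page and 2 MiB PMD sizes used in the
note below are assumptions that match common x86-64 configurations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a kernel API): round x up to a power-of-two align. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* First file offset at which a mapped access is expected to raise SIGBUS. */
static uint64_t sigbus_boundary(uint64_t i_size, uint64_t page_size)
{
	return round_up_pow2(i_size, page_size);
}

/*
 * Number of bytes past the SIGBUS boundary that stay accessible if the
 * folio covering EOF is mapped with a single PMD. Assumes the folio is
 * PMD-sized and PMD-aligned in the file.
 */
static uint64_t exposed_by_pmd_mapping(uint64_t i_size, uint64_t page_size,
				       uint64_t pmd_size)
{
	return round_up_pow2(i_size, pmd_size) -
	       sigbus_boundary(i_size, page_size);
}
```

With i_size = 5000, a 4096-byte page and a 2 MiB PMD, sigbus_boundary()
gives 8192 and exposed_by_pmd_mapping() gives 2088960, i.e. almost the
whole PMD is reachable without SIGBUS when the mapping is not split.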

Dave, any comments?

Hm. But if it is problematic it has to be caught on fault. We don't do
this. It will be silently mapped.

> Does the folio2 part actually need to unmap again?  And if it does,
> then what about when its trylock failed?  But it's hole-punch anyway.

I don't think we can do much for the !trylock case, unless we are willing
to upgrade it to folio_lock(). try_to_unmap() requires the folio to be
locked or we will race with fault.

Maybe folio_lock() is not too bad here for a freshly split folio.

> And a final nit: I'd delete that WARN_ON(folio_mapped(folio)) myself,
> all it could ever achieve is perhaps a very rare syzbot report which
> nobody would want to spend time on.

David asked for it. I am fine either way.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure
  2025-10-27 10:40     ` Kiryl Shutsemau
@ 2025-10-29  9:12       ` Hugh Dickins
  2025-10-29 10:21         ` Kiryl Shutsemau
  0 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2025-10-29  9:12 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Hugh Dickins, Andrew Morton, David Hildenbrand, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Darrick J. Wong,
	Dave Chinner, linux-mm, linux-fsdevel, linux-kernel

On Mon, 27 Oct 2025, Kiryl Shutsemau wrote:
> On Mon, Oct 27, 2025 at 03:10:29AM -0700, Hugh Dickins wrote:
...
> 
> > Aside from shmem/tmpfs, it does seem to me that this patch is
> > doing more work than it needs to (but how many lines of source
> > do we want to add to avoid doing work in the failed split case?):
> > 
> > The intent is to enable SIGBUS beyond EOF: but the changes are
> > being applied unnecessarily to hole-punch in addition to truncation.
> 
> I am not sure how much it should apply to hole-punch. Filesystem folks talk
> about writing to a folio beyond round_up(i_size, PAGE_SIZE) being
> problematic for correctness. I have no clue if the same applies to
> writing to hole-punched parts of the folio.
> 
> Dave, any comments?
> 
> Hm. But if it is problematic it has to be caught on fault. We don't do
> this. It will be silently mapped.

There are strict rules about what happens beyond i_size, hence this
patch.  But hole-punch has no persistent "i_size" to define it, and
silently remapping in a fresh zeroed page is the correct behaviour.

So the patch is making more work than is needed for hole-punch.

But I am thinking there of the view from above, from userspace.
If I think of the view from below, from the filesystem, then I'm
not at all sure how a filesystem is expected to deal with a failed
folio_split - and that goes beyond this patch, and therefore I
don't think it should concern you in this patch.

If a filesystem is asked to punch a hole, but mm cannot split the
folio which covers that hole, then page cache and filesystem are
left out of synch?  And if filesystem thinks one block has been
freed, and it's the last block in that filesystem, and it's then
given out to someone else, then our unsplit folio hides an ENOSPC?

Maybe this has all been well thought out, and each large folio
filesystem deals with it appropriately somehow; but I wouldn't
know, since a tmpfs is simply backed by its page cache.

Hugh


* Re: [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure
  2025-10-29  9:12       ` Hugh Dickins
@ 2025-10-29 10:21         ` Kiryl Shutsemau
  2025-10-29 15:19           ` Darrick J. Wong
  0 siblings, 1 reply; 29+ messages in thread
From: Kiryl Shutsemau @ 2025-10-29 10:21 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Rik van Riel, Harry Yoo, Johannes Weiner, Shakeel Butt,
	Baolin Wang, Darrick J. Wong, Dave Chinner, linux-mm,
	linux-fsdevel, linux-kernel

On Wed, Oct 29, 2025 at 02:12:48AM -0700, Hugh Dickins wrote:
> On Mon, 27 Oct 2025, Kiryl Shutsemau wrote:
> > On Mon, Oct 27, 2025 at 03:10:29AM -0700, Hugh Dickins wrote:
> ...
> > 
> > > Aside from shmem/tmpfs, it does seem to me that this patch is
> > > doing more work than it needs to (but how many lines of source
> > > do we want to add to avoid doing work in the failed split case?):
> > > 
> > > The intent is to enable SIGBUS beyond EOF: but the changes are
> > > being applied unnecessarily to hole-punch in addition to truncation.
> > 
> > > I am not sure how much it should apply to hole-punch. Filesystem folks talk
> > about writing to a folio beyond round_up(i_size, PAGE_SIZE) being
> > problematic for correctness. I have no clue if the same applies to
> > writing to hole-punched parts of the folio.
> > 
> > Dave, any comments?
> > 
> > > Hm. But if it is problematic it has to be caught on fault. We don't do
> > this. It will be silently mapped.
> 
> There are strict rules about what happens beyond i_size, hence this
> patch.  But hole-punch has no persistent "i_size" to define it, and
> silently remapping in a fresh zeroed page is the correct behaviour.

I missed that we seem to be issuing vm_ops->page_mkwrite() on remapping
the page, so it is not completely silent for the filesystem, which can do
its thing to re-allocate metadata (or whatever) after hole-punch.

So, I see unmap on punch-hole being justified.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure
  2025-10-29 10:21         ` Kiryl Shutsemau
@ 2025-10-29 15:19           ` Darrick J. Wong
  2025-10-29 17:10             ` Kiryl Shutsemau
  0 siblings, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2025-10-29 15:19 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Hugh Dickins, Andrew Morton, David Hildenbrand, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Dave Chinner,
	linux-mm, linux-fsdevel, linux-kernel

On Wed, Oct 29, 2025 at 10:21:53AM +0000, Kiryl Shutsemau wrote:
> On Wed, Oct 29, 2025 at 02:12:48AM -0700, Hugh Dickins wrote:
> > On Mon, 27 Oct 2025, Kiryl Shutsemau wrote:
> > > On Mon, Oct 27, 2025 at 03:10:29AM -0700, Hugh Dickins wrote:
> > ...
> > > 
> > > > Aside from shmem/tmpfs, it does seem to me that this patch is
> > > > doing more work than it needs to (but how many lines of source
> > > > do we want to add to avoid doing work in the failed split case?):
> > > > 
> > > > The intent is to enable SIGBUS beyond EOF: but the changes are
> > > > being applied unnecessarily to hole-punch in addition to truncation.
> > > 
> > > > I am not sure how much it should apply to hole-punch. Filesystem folks talk
> > > about writing to a folio beyond round_up(i_size, PAGE_SIZE) being
> > > problematic for correctness. I have no clue if the same applies to
> > > writing to hole-punched parts of the folio.
> > > 
> > > Dave, any comments?
> > > 
> > > > Hm. But if it is problematic it has to be caught on fault. We don't do
> > > this. It will be silently mapped.
> > 
> > There are strict rules about what happens beyond i_size, hence this
> > patch.  But hole-punch has no persistent "i_size" to define it, and
> > silently remapping in a fresh zeroed page is the correct behaviour.
> 
> I missed that we seem to be issuing vm_ops->page_mkwrite() on remapping
> the page, so it is not completely silent for the filesystem, which can do
> its thing to re-allocate metadata (or whatever) after hole-punch.
> 
> So, I see unmap on punch-hole being justified.

Most hole punching implementations in filesystems will take i_rwsem and
mmap_invalidate lock, flush the range to disk and unmap the pagecache
for all the fsblocks around that range, and only then update the file
space mappings.  If the unmap fails because a PMD couldn't be split,
then we'll just return that error to userspace and they can decide what
to do when fallocate() fails.
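
As a userspace-side illustration of that last point, the sketch below
punches a hole with fallocate() and hands any error back to the caller
instead of assuming success. It is only a sketch: the function name
punch_hole_demo, the temporary path under /tmp and the 4096-byte hole
length are assumptions made for the example, and it says nothing about
what any particular filesystem does internally.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define DEMO_HOLE_LEN 4096	/* assumed hole length for the example */

/* Returns 0 on success or a negative errno value on failure. */
static int punch_hole_demo(void)
{
	char path[] = "/tmp/punch-demo-XXXXXX";
	unsigned char buf[2 * DEMO_HOLE_LEN];
	struct stat st;
	size_t i;
	int ret = 0;
	int fd;

	assert(sizeof(buf) == 2 * DEMO_HOLE_LEN);

	fd = mkstemp(path);
	if (fd < 0)
		return -errno;
	unlink(path);

	/* Fill two hole-lengths of the file with a non-zero pattern. */
	memset(buf, 0xab, sizeof(buf));
	if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
		ret = -EIO;
		goto out;
	}

	/* Punch out the first half; KEEP_SIZE leaves i_size untouched. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, DEMO_HOLE_LEN) < 0) {
		ret = -errno;	/* e.g. -EOPNOTSUPP: the caller decides */
		goto out;
	}

	if (fstat(fd, &st) < 0 || st.st_size != (off_t)sizeof(buf)) {
		ret = -EIO;
		goto out;
	}

	if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
		ret = -EIO;
		goto out;
	}

	/* The hole must read back as zeroes, the rest must be intact. */
	for (i = 0; i < DEMO_HOLE_LEN; i++)
		if (buf[i] != 0)
			ret = -EIO;
	for (i = DEMO_HOLE_LEN; i < sizeof(buf); i++)
		if (buf[i] != 0xab)
			ret = -EIO;
out:
	close(fd);
	return ret;
}
```

A caller would typically treat -EOPNOTSUPP as "fall back to writing
zeroes" and propagate anything else.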

--D

> -- 
>   Kiryl Shutsemau / Kirill A. Shutemov
> 


* Re: [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure
  2025-10-29 15:19           ` Darrick J. Wong
@ 2025-10-29 17:10             ` Kiryl Shutsemau
  0 siblings, 0 replies; 29+ messages in thread
From: Kiryl Shutsemau @ 2025-10-29 17:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Hugh Dickins, Andrew Morton, David Hildenbrand, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Dave Chinner,
	linux-mm, linux-fsdevel, linux-kernel

On Wed, Oct 29, 2025 at 08:19:47AM -0700, Darrick J. Wong wrote:
> On Wed, Oct 29, 2025 at 10:21:53AM +0000, Kiryl Shutsemau wrote:
> > On Wed, Oct 29, 2025 at 02:12:48AM -0700, Hugh Dickins wrote:
> > > On Mon, 27 Oct 2025, Kiryl Shutsemau wrote:
> > > > On Mon, Oct 27, 2025 at 03:10:29AM -0700, Hugh Dickins wrote:
> > > ...
> > > > 
> > > > > Aside from shmem/tmpfs, it does seem to me that this patch is
> > > > > doing more work than it needs to (but how many lines of source
> > > > > do we want to add to avoid doing work in the failed split case?):
> > > > > 
> > > > > The intent is to enable SIGBUS beyond EOF: but the changes are
> > > > > being applied unnecessarily to hole-punch in addition to truncation.
> > > > 
> > > > > I am not sure how much it should apply to hole-punch. Filesystem folks talk
> > > > about writing to a folio beyond round_up(i_size, PAGE_SIZE) being
> > > > problematic for correctness. I have no clue if the same applies to
> > > > writing to hole-punched parts of the folio.
> > > > 
> > > > Dave, any comments?
> > > > 
> > > > > Hm. But if it is problematic it has to be caught on fault. We don't do
> > > > this. It will be silently mapped.
> > > 
> > > There are strict rules about what happens beyond i_size, hence this
> > > patch.  But hole-punch has no persistent "i_size" to define it, and
> > > silently remapping in a fresh zeroed page is the correct behaviour.
> > 
> > > I missed that we seem to be issuing vm_ops->page_mkwrite() on remapping
> > > the page, so it is not completely silent for the filesystem, which can do
> > > its thing to re-allocate metadata (or whatever) after hole-punch.
> > 
> > So, I see unmap on punch-hole being justified.
> 
> Most hole punching implementations in filesystems will take i_rwsem and
> mmap_invalidate lock, flush the range to disk and unmap the pagecache
> for all the fsblocks around that range, and only then update the file
> space mappings.  If the unmap fails because a PMD couldn't be split,
> then we'll just return that error to userspace and they can decide what
> to do when fallocate() fails.

Unmap does not fail, and a PMD mapping can be split at any time. But
splitting a large folio can fail if there is an external pin on it.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCHv2 0/2] Fix SIGBUS semantics with large folios
  2025-10-23  9:32 [PATCHv2 0/2] Fix SIGBUS semantics with large folios Kiryl Shutsemau
  2025-10-23  9:32 ` [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size Kiryl Shutsemau
  2025-10-23  9:32 ` [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure Kiryl Shutsemau
@ 2025-10-23 17:47 ` Darrick J. Wong
  2 siblings, 0 replies; 29+ messages in thread
From: Darrick J. Wong @ 2025-10-23 17:47 UTC (permalink / raw)
  To: Kiryl Shutsemau
  Cc: Andrew Morton, David Hildenbrand, Hugh Dickins, Matthew Wilcox,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Rik van Riel, Harry Yoo,
	Johannes Weiner, Shakeel Butt, Baolin Wang, Dave Chinner,
	linux-mm, linux-fsdevel, linux-kernel, Kiryl Shutsemau

On Thu, Oct 23, 2025 at 10:32:49AM +0100, Kiryl Shutsemau wrote:
> From: Kiryl Shutsemau <kas@kernel.org>
> 
> Accessing memory within a VMA, but beyond i_size rounded up to the next
> page size, is supposed to generate SIGBUS.
> 
> Darrick reported[1] an xfstests regression in v6.18-rc1. generic/749
> failed due to missing SIGBUS. This was caused by my recent changes that
> try to fault in the whole folio where possible:
> 
>         19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
>         357b92761d94 ("mm/filemap: map entire large folio faultaround")
> 
> These changes did not consider i_size when setting up PTEs, leading to
> xfstest breakage.
> 
> However, the problem has been present in the kernel for a long time -
> since huge tmpfs was introduced in 2016. The kernel happily maps
> PMD-sized folios as PMD without checking i_size. And huge=always tmpfs
> allocates PMD-size folios on any writes.
> 
> I considered this corner case when I implemented a large tmpfs, and my
> conclusion was that no one in their right mind should rely on receiving
> a SIGBUS signal when accessing beyond i_size. I cannot imagine how it
> could be useful for the workload.
> 
> But apparently filesystem folks care a lot about preserving strict
> SIGBUS semantics.
> 
> Generic/749 was introduced last year with reference to POSIX, but no
> real workloads were mentioned. It also acknowledged the tmpfs deviation
> from the test case.
> 
> POSIX indeed says[3]:
> 
>         References within the address range starting at pa and
>         continuing for len bytes to whole pages following the end of an
>         object shall result in delivery of a SIGBUS signal.
> 
> The patchset fixes the regression introduced by recent changes as well
> as more subtle SIGBUS breakage due to split failure on truncation.
> 

This fixes generic/749 for me, thanks!
Tested-by: "Darrick J. Wong" <djwong@kernel.org>
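
For readers without xfstests at hand, the POSIX rule quoted above can be
probed from userspace roughly as follows. This is a minimal sketch and
not the generic/749 test itself: the function name
probe_sigbus_beyond_eof and the temporary file under /tmp are
assumptions, and the result depends on the filesystem backing /tmp (a
huge=always tmpfs, for instance, may behave differently, as discussed in
this thread).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf sigbus_env;

static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(sigbus_env, 1);
}

/*
 * Map two pages of a one-byte file and read the first byte of the second
 * page, which lies wholly beyond EOF.
 * Returns 1 if SIGBUS was delivered, 0 if the read succeeded,
 * -1 if the setup failed.
 */
static int probe_sigbus_beyond_eof(void)
{
	char path[] = "/tmp/sigbus-probe-XXXXXX";
	struct sigaction sa, old;
	volatile int result = -1;
	volatile char *map;
	volatile char sink;
	void *addr;
	long page = sysconf(_SC_PAGESIZE);
	int fd;

	assert(page > 0);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);

	if (write(fd, "x", 1) != 1) {
		close(fd);
		return -1;
	}

	addr = mmap(NULL, 2 * (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		close(fd);
		return -1;
	}
	map = addr;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, &old) < 0) {
		munmap(addr, 2 * (size_t)page);
		close(fd);
		return -1;
	}

	if (sigsetjmp(sigbus_env, 1) == 0) {
		sink = map[0];		/* page containing EOF: must work */
		sink = map[page];	/* whole page beyond EOF */
		(void)sink;
		result = 0;
	} else {
		result = 1;		/* came back via the SIGBUS handler */
	}

	sigaction(SIGBUS, &old, NULL);
	munmap(addr, 2 * (size_t)page);
	close(fd);
	return result;
}
```

On a filesystem that honours the strict semantics the probe is expected
to return 1; a return of 0 means the page beyond EOF was mapped silently.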

--D

> v2:
>  - Fix try_to_unmap() flags;
>  - Add warning if try_to_unmap() fails to unmap the folio;
>  - Adjust comments and commit messages;
>  - Whitespace fixes;
> v1:
>  - Drop RFC;
>  - Add Signed-off-bys;
> 
> [1] https://lore.kernel.org/all/20251014175214.GW6188@frogsfrogsfrogs
> [2] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/tests/generic/749?h=for-next&id=e4a6b119e5229599eac96235fb7e683b8a8bdc53
> [3] https://pubs.opengroup.org/onlinepubs/9799919799/
> Kiryl Shutsemau (2):
>   mm/memory: Do not populate page table entries beyond i_size
>   mm/truncate: Unmap large folio on split failure
> 
>  mm/filemap.c  | 18 ++++++++++--------
>  mm/memory.c   | 13 +++++++++++--
>  mm/truncate.c | 31 +++++++++++++++++++++++++------
>  3 files changed, 46 insertions(+), 16 deletions(-)
> 
> -- 
> 2.50.1
> 
> 


end of thread, other threads:[~2025-10-30 17:08 UTC | newest]

Thread overview: 29+ messages
2025-10-23  9:32 [PATCHv2 0/2] Fix SIGBUS semantics with large folios Kiryl Shutsemau
2025-10-23  9:32 ` [PATCHv2 1/2] mm/memory: Do not populate page table entries beyond i_size Kiryl Shutsemau
2025-10-23 20:49   ` Andrew Morton
2025-10-23 20:54     ` David Hildenbrand
2025-10-23 21:36       ` Andrew Morton
2025-10-24  9:26         ` Kiryl Shutsemau
2025-10-26  4:54           ` Andrew Morton
2025-10-24 15:42   ` David Hildenbrand
2025-10-24 19:32     ` Kirill A. Shutemov
2025-10-27  9:34       ` David Hildenbrand
2025-10-27  8:20   ` Hugh Dickins
2025-10-27  9:14     ` Kiryl Shutsemau
2025-10-27  9:22     ` David Hildenbrand
2025-10-29  8:31       ` Hugh Dickins
2025-10-29 10:11         ` Kiryl Shutsemau
2025-10-30  5:59           ` Hugh Dickins
2025-10-30 17:08             ` Kiryl Shutsemau
2025-10-23  9:32 ` [PATCHv2 2/2] mm/truncate: Unmap large folio on split failure Kiryl Shutsemau
2025-10-23 20:56   ` Andrew Morton
2025-10-24  9:05     ` Kiryl Shutsemau
2025-10-24 15:43   ` David Hildenbrand
2025-10-27 10:10   ` Hugh Dickins
2025-10-27 10:38     ` David Hildenbrand
2025-10-27 10:40     ` Kiryl Shutsemau
2025-10-29  9:12       ` Hugh Dickins
2025-10-29 10:21         ` Kiryl Shutsemau
2025-10-29 15:19           ` Darrick J. Wong
2025-10-29 17:10             ` Kiryl Shutsemau
2025-10-23 17:47 ` [PATCHv2 0/2] Fix SIGBUS semantics with large folios Darrick J. Wong
