* [PATCH v2 1/1] mm/rmap: fix potential out-of-bounds page table access during batched unmap
@ 2025-06-27  6:23 Lance Yang
  2025-06-27  6:52 ` Barry Song
  2025-06-27 20:09 ` Andrew Morton
  0 siblings, 2 replies; 10+ messages in thread
From: Lance Yang @ 2025-06-27  6:23 UTC (permalink / raw)
  To: akpm, david, 21cnbao
  Cc: baolin.wang, chrisl, ioworker0, kasong, linux-arm-kernel,
	linux-kernel, linux-mm, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, huang.ying.caritas,
	zhengtangquan, riel, Liam.Howlett, vbabka, harry.yoo,
	mingzhe.yang, stable, Barry Song, Lance Yang

From: Lance Yang <lance.yang@linux.dev>

As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
can read past the end of a PTE table if a large folio is mapped starting at
the last entry of that table. It would be quite rare in practice, as
MADV_FREE typically splits the large folio ;)

So let's fix the potential out-of-bounds read by refactoring the logic into
a new helper, folio_unmap_pte_batch().

The new helper now correctly calculates the safe number of pages to scan by
limiting the operation to the boundaries of the current VMA and the PTE
table.

In addition, the "all-or-nothing" batching restriction is removed to
support partial batches. The reference counting is also cleaned up to use
folio_put_refs().

[1] https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com

Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
Cc: <stable@vger.kernel.org>
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Barry Song <baohua@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
v1 -> v2:
 - Update subject and changelog (per Barry)
 - https://lore.kernel.org/linux-mm/20250627025214.30887-1-lance.yang@linux.dev

 mm/rmap.c | 46 ++++++++++++++++++++++++++++------------------
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index fb63d9256f09..1320b88fab74 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1845,23 +1845,32 @@ void folio_remove_rmap_pud(struct folio *folio, struct page *page,
 #endif
 }
 
-/* We support batch unmapping of PTEs for lazyfree large folios */
-static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
-			struct folio *folio, pte_t *ptep)
+static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
+			struct page_vma_mapped_walk *pvmw,
+			enum ttu_flags flags, pte_t pte)
 {
 	const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
-	int max_nr = folio_nr_pages(folio);
-	pte_t pte = ptep_get(ptep);
+	unsigned long end_addr, addr = pvmw->address;
+	struct vm_area_struct *vma = pvmw->vma;
+	unsigned int max_nr;
+
+	if (flags & TTU_HWPOISON)
+		return 1;
+	if (!folio_test_large(folio))
+		return 1;
 
+	/* We may only batch within a single VMA and a single page table. */
+	end_addr = pmd_addr_end(addr, vma->vm_end);
+	max_nr = (end_addr - addr) >> PAGE_SHIFT;
+
+	/* We only support lazyfree batching for now ... */
 	if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
-		return false;
+		return 1;
 	if (pte_unused(pte))
-		return false;
-	if (pte_pfn(pte) != folio_pfn(folio))
-		return false;
+		return 1;
 
-	return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
-			       NULL, NULL) == max_nr;
+	return folio_pte_batch(folio, addr, pvmw->pte, pte, max_nr, fpb_flags,
+			       NULL, NULL, NULL);
 }
 
 /*
@@ -2024,9 +2033,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			if (pte_dirty(pteval))
 				folio_mark_dirty(folio);
 		} else if (likely(pte_present(pteval))) {
-			if (folio_test_large(folio) && !(flags & TTU_HWPOISON) &&
-			    can_batch_unmap_folio_ptes(address, folio, pvmw.pte))
-				nr_pages = folio_nr_pages(folio);
+			nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
 			end_addr = address + nr_pages * PAGE_SIZE;
 			flush_cache_range(vma, address, end_addr);
 
@@ -2206,13 +2213,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			hugetlb_remove_rmap(folio);
 		} else {
 			folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
-			folio_ref_sub(folio, nr_pages - 1);
 		}
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_drain_local();
-		folio_put(folio);
-		/* We have already batched the entire folio */
-		if (nr_pages > 1)
+		folio_put_refs(folio, nr_pages);
+
+		/*
+		 * If we are sure that we batched the entire folio and cleared
+		 * all PTEs, we can just optimize and stop right here.
+		 */
+		if (nr_pages == folio_nr_pages(folio))
 			goto walk_done;
 		continue;
 walk_abort:
-- 
2.49.0
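
To make the overread concrete, the following is a minimal userspace model of the capping arithmetic folio_unmap_pte_batch() performs above; the addresses, the 16-page folio size and the simplified pmd_addr_end() are assumptions chosen purely for illustration, not values taken from the patch.

```c
/*
 * Illustrative sketch only: models the batch cap from the patch with
 * made-up addresses; PMD_SIZE assumes x86-64 with 4K pages.
 */
#include <stdio.h>

#define PAGE_SHIFT	12
#define PMD_SIZE	(1UL << 21)		/* one PTE table covers 2MiB */
#define PMD_MASK	(~(PMD_SIZE - 1))

/* Same idea as the kernel's pmd_addr_end(): stop at the next PMD
 * boundary or at 'end', whichever comes first. */
static unsigned long pmd_addr_end(unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + PMD_SIZE) & PMD_MASK;

	return boundary < end ? boundary : end;
}

int main(void)
{
	unsigned long vm_end = 0x7f0000400000UL;	/* hypothetical VMA end */
	/* first page of a 16-page (64KiB) folio mapped at the last PTE slot */
	unsigned long addr = 0x7f00001ff000UL;

	unsigned long end_addr = pmd_addr_end(addr, vm_end);
	unsigned int max_nr = (end_addr - addr) >> PAGE_SHIFT;

	/*
	 * Before the fix, folio_pte_batch() was handed
	 * max_nr = folio_nr_pages() = 16 here, letting it read 15 PTE
	 * entries past the end of the page table.
	 */
	printf("batch capped at %u page(s)\n", max_nr);	/* prints 1 */
	return 0;
}
```

With the cap in place, the remaining pages of the folio are simply picked up by later iterations of the rmap walk, which is why the all-or-nothing restriction can be dropped.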




* Re: [PATCH v2 1/1] mm/rmap: fix potential out-of-bounds page table access during batched unmap
  2025-06-27  6:23 [PATCH v2 1/1] mm/rmap: fix potential out-of-bounds page table access during batched unmap Lance Yang
@ 2025-06-27  6:52 ` Barry Song
  2025-06-27  6:55   ` Barry Song
  2025-06-27 20:09 ` Andrew Morton
  1 sibling, 1 reply; 10+ messages in thread
From: Barry Song @ 2025-06-27  6:52 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, david, baolin.wang, chrisl, kasong, linux-arm-kernel,
	linux-kernel, linux-mm, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, huang.ying.caritas,
	zhengtangquan, riel, Liam.Howlett, vbabka, harry.yoo,
	mingzhe.yang, stable, Lance Yang

On Fri, Jun 27, 2025 at 6:23 PM Lance Yang <ioworker0@gmail.com> wrote:
>
> From: Lance Yang <lance.yang@linux.dev>
>
> As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
> can read past the end of a PTE table if a large folio is mapped starting at
> the last entry of that table. It would be quite rare in practice, as
> MADV_FREE typically splits the large folio ;)
>
> So let's fix the potential out-of-bounds read by refactoring the logic into
> a new helper, folio_unmap_pte_batch().
>
> The new helper now correctly calculates the safe number of pages to scan by
> limiting the operation to the boundaries of the current VMA and the PTE
> table.
>
> In addition, the "all-or-nothing" batching restriction is removed to
> support partial batches. The reference counting is also cleaned up to use
> folio_put_refs().
>
> [1] https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com
>

What about ?

As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
may read past the end of a PTE table when a large folio spans across two PMDs,
particularly after being remapped with mremap(). This patch fixes the
potential out-of-bounds access by capping the batch at vm_end and the PMD
boundary.

It also refactors the logic into a new helper, folio_unmap_pte_batch(),
which supports batching between 1 and folio_nr_pages. This improves code
clarity. Note that such cases are rare in practice, as MADV_FREE typically
splits large folios.

Thanks
Barry



* Re: [PATCH v2 1/1] mm/rmap: fix potential out-of-bounds page table access during batched unmap
  2025-06-27  6:52 ` Barry Song
@ 2025-06-27  6:55   ` Barry Song
  2025-06-27  7:15     ` Lance Yang
  0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2025-06-27  6:55 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, david, baolin.wang, chrisl, kasong, linux-arm-kernel,
	linux-kernel, linux-mm, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, huang.ying.caritas,
	zhengtangquan, riel, Liam.Howlett, vbabka, harry.yoo,
	mingzhe.yang, stable, Lance Yang

On Fri, Jun 27, 2025 at 6:52 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Jun 27, 2025 at 6:23 PM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > From: Lance Yang <lance.yang@linux.dev>
> >
> > As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
> > can read past the end of a PTE table if a large folio is mapped starting at
> > the last entry of that table. It would be quite rare in practice, as
> > MADV_FREE typically splits the large folio ;)
> >
> > So let's fix the potential out-of-bounds read by refactoring the logic into
> > a new helper, folio_unmap_pte_batch().
> >
> > The new helper now correctly calculates the safe number of pages to scan by
> > limiting the operation to the boundaries of the current VMA and the PTE
> > table.
> >
> > In addition, the "all-or-nothing" batching restriction is removed to
> > support partial batches. The reference counting is also cleaned up to use
> > folio_put_refs().
> >
> > [1] https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com
> >
>
> What about ?
>
> As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
> may read past the end of a PTE table when a large folio spans across two PMDs,
> particularly after being remapped with mremap(). This patch fixes the
> potential out-of-bounds access by capping the batch at vm_end and the PMD
> boundary.
>
> It also refactors the logic into a new helper, folio_unmap_pte_batch(),
> which supports batching between 1 and folio_nr_pages. This improves code
> clarity. Note that such cases are rare in practice, as MADV_FREE typically
> splits large folios.

Sorry, I meant that MADV_FREE typically splits large folios if the specified
range doesn't cover the entire folio.

>
> Thanks
> Barry



* Re: [PATCH v2 1/1] mm/rmap: fix potential out-of-bounds page table access during batched unmap
  2025-06-27  6:55   ` Barry Song
@ 2025-06-27  7:15     ` Lance Yang
  2025-06-27  7:36       ` Barry Song
  0 siblings, 1 reply; 10+ messages in thread
From: Lance Yang @ 2025-06-27  7:15 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, baolin.wang, chrisl, kasong, linux-arm-kernel,
	linux-kernel, linux-mm, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, huang.ying.caritas,
	zhengtangquan, riel, Liam.Howlett, vbabka, harry.yoo,
	mingzhe.yang, stable, Lance Yang



On 2025/6/27 14:55, Barry Song wrote:
> On Fri, Jun 27, 2025 at 6:52 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Fri, Jun 27, 2025 at 6:23 PM Lance Yang <ioworker0@gmail.com> wrote:
>>>
>>> From: Lance Yang <lance.yang@linux.dev>
>>>
>>> As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
>>> can read past the end of a PTE table if a large folio is mapped starting at
>>> the last entry of that table. It would be quite rare in practice, as
>>> MADV_FREE typically splits the large folio ;)
>>>
>>> So let's fix the potential out-of-bounds read by refactoring the logic into
>>> a new helper, folio_unmap_pte_batch().
>>>
>>> The new helper now correctly calculates the safe number of pages to scan by
>>> limiting the operation to the boundaries of the current VMA and the PTE
>>> table.
>>>
>>> In addition, the "all-or-nothing" batching restriction is removed to
>>> support partial batches. The reference counting is also cleaned up to use
>>> folio_put_refs().
>>>
>>> [1] https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com
>>>
>>
>> What about ?
>>
>> As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
>> may read past the end of a PTE table when a large folio spans across two PMDs,
>> particularly after being remapped with mremap(). This patch fixes the
>> potential out-of-bounds access by capping the batch at vm_end and the PMD
>> boundary.
>>
>> It also refactors the logic into a new helper, folio_unmap_pte_batch(),
>> which supports batching between 1 and folio_nr_pages. This improves code
>> clarity. Note that such cases are rare in practice, as MADV_FREE typically
>> splits large folios.
> 
> Sorry, I meant that MADV_FREE typically splits large folios if the specified
> range doesn't cover the entire folio.

Hmm... I got it wrong as well :( It's the partial coverage that triggers 
the split.

how about this revised version:

As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
may read past the end of a PTE table when a large folio spans across two
PMDs, particularly after being remapped with mremap(). This patch fixes
the potential out-of-bounds access by capping the batch at vm_end and the
PMD boundary.

It also refactors the logic into a new helper, folio_unmap_pte_batch(),
which supports batching between 1 and folio_nr_pages. This improves code
clarity. Note that such boundary-straddling cases are rare in practice, as
MADV_FREE will typically split a large folio if the advice range does not
cover the entire folio.




* Re: [PATCH v2 1/1] mm/rmap: fix potential out-of-bounds page table access during batched unmap
  2025-06-27  7:15     ` Lance Yang
@ 2025-06-27  7:36       ` Barry Song
  2025-06-27 10:13         ` David Hildenbrand
  0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2025-06-27  7:36 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, david, baolin.wang, chrisl, kasong, linux-arm-kernel,
	linux-kernel, linux-mm, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, huang.ying.caritas,
	zhengtangquan, riel, Liam.Howlett, vbabka, harry.yoo,
	mingzhe.yang, stable, Lance Yang

On Fri, Jun 27, 2025 at 7:15 PM Lance Yang <lance.yang@linux.dev> wrote:
>
>
>
> On 2025/6/27 14:55, Barry Song wrote:
> > On Fri, Jun 27, 2025 at 6:52 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Fri, Jun 27, 2025 at 6:23 PM Lance Yang <ioworker0@gmail.com> wrote:
> >>>
> >>> From: Lance Yang <lance.yang@linux.dev>
> >>>
> >>> As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
> >>> can read past the end of a PTE table if a large folio is mapped starting at
> >>> the last entry of that table. It would be quite rare in practice, as
> >>> MADV_FREE typically splits the large folio ;)
> >>>
> >>> So let's fix the potential out-of-bounds read by refactoring the logic into
> >>> a new helper, folio_unmap_pte_batch().
> >>>
> >>> The new helper now correctly calculates the safe number of pages to scan by
> >>> limiting the operation to the boundaries of the current VMA and the PTE
> >>> table.
> >>>
> >>> In addition, the "all-or-nothing" batching restriction is removed to
> >>> support partial batches. The reference counting is also cleaned up to use
> >>> folio_put_refs().
> >>>
> >>> [1] https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com
> >>>
> >>
> >> What about ?
> >>
> >> As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
> >> may read past the end of a PTE table when a large folio spans across two PMDs,
> >> particularly after being remapped with mremap(). This patch fixes the
> >> potential out-of-bounds access by capping the batch at vm_end and the PMD
> >> boundary.
> >>
> >> It also refactors the logic into a new helper, folio_unmap_pte_batch(),
> >> which supports batching between 1 and folio_nr_pages. This improves code
> >> clarity. Note that such cases are rare in practice, as MADV_FREE typically
> >> splits large folios.
> >
> > Sorry, I meant that MADV_FREE typically splits large folios if the specified
> > range doesn't cover the entire folio.
>
> Hmm... I got it wrong as well :( It's the partial coverage that triggers
> the split.
>
> how about this revised version:
>
> As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
> may read past the end of a PTE table when a large folio spans across two
> PMDs, particularly after being remapped with mremap(). This patch fixes
> the potential out-of-bounds access by capping the batch at vm_end and the
> PMD boundary.
>
> It also refactors the logic into a new helper, folio_unmap_pte_batch(),
> which supports batching between 1 and folio_nr_pages. This improves code
> clarity. Note that such boundary-straddling cases are rare in practice, as
> MADV_FREE will typically split a large folio if the advice range does not
> cover the entire folio.

I assume the out-of-bounds access must be fixed, even though it is very
unlikely. It might occur after a large folio is marked with MADV_FREE and
then remapped to an unaligned address, potentially crossing two PTE tables.
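
For illustration, the kind of userspace sequence described above might look roughly like the sketch below; the sizes, the fixed remap target and the assumption that THP backs the region with a single large folio are all hypothetical, and nothing here forces reclaim, it only sets up the layout.

```c
/*
 * Hypothetical sketch: shows how a lazyfree large folio could end up
 * with PTE mappings straddling a PTE-table (PMD) boundary.
 */
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ_2M	(2UL * 1024 * 1024)

int main(void)
{
	/* Anonymous region that THP may back with a single large folio. */
	char *buf = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	madvise(buf, SZ_2M, MADV_HUGEPAGE);
	memset(buf, 1, SZ_2M);			/* populate the folio */

	/* MADV_FREE over the whole folio: no split, it becomes lazyfree. */
	madvise(buf, SZ_2M, MADV_FREE);

	/*
	 * Move it to a non-2MiB-aligned address (made-up target) so its
	 * PTE mappings now cross a PTE-table boundary.
	 */
	void *moved = mremap(buf, SZ_2M, SZ_2M,
			     MREMAP_MAYMOVE | MREMAP_FIXED,
			     (void *)0x700000001000UL);
	if (moved == MAP_FAILED)
		return 1;

	/* Under memory pressure, try_to_unmap_one() would later walk the
	 * folio's PTEs starting near the end of one page table. */
	pause();
	return 0;
}
```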

A batch size between 2 and nr_pages - 1 is practically rare, as we typically
split large folios when MADV_FREE does not cover the entire folio range.
Cases where a batch of size between 2 and nr_pages - 1 occurs may only happen if a
large folio is partially unmapped after being marked MADV_FREE, which is
quite an unusual pattern in userspace.

Let's wait for David's feedback before preparing a new version :-)

Thanks
Barry



* Re: [PATCH v2 1/1] mm/rmap: fix potential out-of-bounds page table access during batched unmap
  2025-06-27  7:36       ` Barry Song
@ 2025-06-27 10:13         ` David Hildenbrand
  2025-06-27 15:29           ` Lance Yang
  0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-06-27 10:13 UTC (permalink / raw)
  To: Barry Song, Lance Yang
  Cc: akpm, baolin.wang, chrisl, kasong, linux-arm-kernel, linux-kernel,
	linux-mm, linux-riscv, lorenzo.stoakes, ryan.roberts,
	v-songbaohua, x86, huang.ying.caritas, zhengtangquan, riel,
	Liam.Howlett, vbabka, harry.yoo, mingzhe.yang, stable, Lance Yang

On 27.06.25 09:36, Barry Song wrote:
> On Fri, Jun 27, 2025 at 7:15 PM Lance Yang <lance.yang@linux.dev> wrote:
>>
>>
>>
>> On 2025/6/27 14:55, Barry Song wrote:
>>> On Fri, Jun 27, 2025 at 6:52 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>
>>>> On Fri, Jun 27, 2025 at 6:23 PM Lance Yang <ioworker0@gmail.com> wrote:
>>>>>
>>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>>
>>>>> As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
>>>>> can read past the end of a PTE table if a large folio is mapped starting at
>>>>> the last entry of that table. It would be quite rare in practice, as
>>>>> MADV_FREE typically splits the large folio ;)
>>>>>
>>>>> So let's fix the potential out-of-bounds read by refactoring the logic into
>>>>> a new helper, folio_unmap_pte_batch().
>>>>>
>>>>> The new helper now correctly calculates the safe number of pages to scan by
>>>>> limiting the operation to the boundaries of the current VMA and the PTE
>>>>> table.
>>>>>
>>>>> In addition, the "all-or-nothing" batching restriction is removed to
>>>>> support partial batches. The reference counting is also cleaned up to use
>>>>> folio_put_refs().
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com
>>>>>
>>>>
>>>> What about ?
>>>>
>>>> As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
>>>> may read past the end of a PTE table when a large folio spans across two PMDs,
>>>> particularly after being remapped with mremap(). This patch fixes the
>>>> potential out-of-bounds access by capping the batch at vm_end and the PMD
>>>> boundary.
>>>>
>>>> It also refactors the logic into a new helper, folio_unmap_pte_batch(),
>>>> which supports batching between 1 and folio_nr_pages. This improves code
>>>> clarity. Note that such cases are rare in practice, as MADV_FREE typically
>>>> splits large folios.
>>>
>>> Sorry, I meant that MADV_FREE typically splits large folios if the specified
>>> range doesn't cover the entire folio.
>>
>> Hmm... I got it wrong as well :( It's the partial coverage that triggers
>> the split.
>>
>> how about this revised version:
>>
>> As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
>> may read past the end of a PTE table when a large folio spans across two
>> PMDs, particularly after being remapped with mremap(). This patch fixes
>> the potential out-of-bounds access by capping the batch at vm_end and the
>> PMD boundary.
>>
>> It also refactors the logic into a new helper, folio_unmap_pte_batch(),
>> which supports batching between 1 and folio_nr_pages. This improves code
>> clarity. Note that such boundary-straddling cases are rare in practice, as
>> MADV_FREE will typically split a large folio if the advice range does not
>> cover the entire folio.
> 
> I assume the out-of-bounds access must be fixed, even though it is very
> unlikely. It might occur after a large folio is marked with MADV_FREE and
> then remapped to an unaligned address, potentially crossing two PTE tables.

Right. If it can be triggered from userspace, it doesn't matter how 
likely/common/whatever it is. It must be fixed.

> 
> A batch size between 2 and nr_pages - 1 is practically rare, as we typically
> split large folios when MADV_FREE does not cover the entire folio range.
> Cases where a batch of size 2 or nr_pages - 1 occurs may only happen if a
> large folio is partially unmapped after being marked MADV_FREE, which is
> quite an unusual pattern in userspace.

I think the point is rather "Simplify the code by not special-casing for 
completely mapped folios, there is no real reason why we cannot batch 
ranges that don't cover the complete folio.".

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 1/1] mm/rmap: fix potential out-of-bounds page table access during batched unmap
  2025-06-27 10:13         ` David Hildenbrand
@ 2025-06-27 15:29           ` Lance Yang
  2025-06-27 15:49             ` David Hildenbrand
  2025-06-27 22:42             ` Barry Song
  0 siblings, 2 replies; 10+ messages in thread
From: Lance Yang @ 2025-06-27 15:29 UTC (permalink / raw)
  To: David Hildenbrand, Barry Song
  Cc: akpm, baolin.wang, chrisl, kasong, linux-arm-kernel, linux-kernel,
	linux-mm, linux-riscv, lorenzo.stoakes, ryan.roberts,
	v-songbaohua, x86, huang.ying.caritas, zhengtangquan, riel,
	Liam.Howlett, vbabka, harry.yoo, mingzhe.yang, stable, Lance Yang



On 2025/6/27 18:13, David Hildenbrand wrote:
> On 27.06.25 09:36, Barry Song wrote:
>> On Fri, Jun 27, 2025 at 7:15 PM Lance Yang <lance.yang@linux.dev> wrote:
>>>
>>>
>>>
>>> On 2025/6/27 14:55, Barry Song wrote:
>>>> On Fri, Jun 27, 2025 at 6:52 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>>
>>>>> On Fri, Jun 27, 2025 at 6:23 PM Lance Yang <ioworker0@gmail.com> 
>>>>> wrote:
>>>>>>
>>>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>>>
>>>>>> As pointed out by David[1], the batched unmap logic in 
>>>>>> try_to_unmap_one()
>>>>>> can read past the end of a PTE table if a large folio is mapped 
>>>>>> starting at
>>>>>> the last entry of that table. It would be quite rare in practice, as
>>>>>> MADV_FREE typically splits the large folio ;)
>>>>>>
>>>>>> So let's fix the potential out-of-bounds read by refactoring the 
>>>>>> logic into
>>>>>> a new helper, folio_unmap_pte_batch().
>>>>>>
>>>>>> The new helper now correctly calculates the safe number of pages 
>>>>>> to scan by
>>>>>> limiting the operation to the boundaries of the current VMA and 
>>>>>> the PTE
>>>>>> table.
>>>>>>
>>>>>> In addition, the "all-or-nothing" batching restriction is removed to
>>>>>> support partial batches. The reference counting is also cleaned up 
>>>>>> to use
>>>>>> folio_put_refs().
>>>>>>
>>>>>> [1] https://lore.kernel.org/linux-mm/ 
>>>>>> a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com
>>>>>>
>>>>>
>>>>> What about ?
>>>>>
>>>>> As pointed out by David[1], the batched unmap logic in 
>>>>> try_to_unmap_one()
>>>>> may read past the end of a PTE table when a large folio spans 
>>>>> across two PMDs,
>>>>> particularly after being remapped with mremap(). This patch fixes the
>>>>> potential out-of-bounds access by capping the batch at vm_end and 
>>>>> the PMD
>>>>> boundary.
>>>>>
>>>>> It also refactors the logic into a new helper, 
>>>>> folio_unmap_pte_batch(),
>>>>> which supports batching between 1 and folio_nr_pages. This improves 
>>>>> code
>>>>> clarity. Note that such cases are rare in practice, as MADV_FREE 
>>>>> typically
>>>>> splits large folios.
>>>>
>>>> Sorry, I meant that MADV_FREE typically splits large folios if the 
>>>> specified
>>>> range doesn't cover the entire folio.
>>>
>>> Hmm... I got it wrong as well :( It's the partial coverage that triggers
>>> the split.
>>>
>>> how about this revised version:
>>>
>>> As pointed out by David[1], the batched unmap logic in 
>>> try_to_unmap_one()
>>> may read past the end of a PTE table when a large folio spans across two
>>> PMDs, particularly after being remapped with mremap(). This patch fixes
>>> the potential out-of-bounds access by capping the batch at vm_end and 
>>> the
>>> PMD boundary.
>>>
>>> It also refactors the logic into a new helper, folio_unmap_pte_batch(),
>>> which supports batching between 1 and folio_nr_pages. This improves code
>>> clarity. Note that such boundary-straddling cases are rare in 
>>> practice, as
>>> MADV_FREE will typically split a large folio if the advice range does 
>>> not
>>> cover the entire folio.
>>
>> I assume the out-of-bounds access must be fixed, even though it is very
>> unlikely. It might occur after a large folio is marked with MADV_FREE and
>> then remapped to an unaligned address, potentially crossing two PTE 
>> tables.
> 
> Right. If it can be triggered from userspace, it doesn't matter how 
> likely/common/whatever it is. It must be fixed.

Agreed. It must be fixed regardless of how rare the scenario is ;)

> 
>>
>> A batch size between 2 and nr_pages - 1 is practically rare, as we 
>> typically
>> split large folios when MADV_FREE does not cover the entire folio range.
>> Cases where a batch of size 2 or nr_pages - 1 occurs may only happen if a
>> large folio is partially unmapped after being marked MADV_FREE, which is
>> quite an unusual pattern in userspace.
> 
> I think the point is rather "Simplify the code by not special-casing for 
> completely mapped folios, there is no real reason why we cannot batch 
> ranges that don't cover the complete folio.".

Yeah. That makes the code cleaner and more generic, as there is no
strong reason to special-case for fully mapped folios ;)

Based on that, I think we're on the same page now. I'd like to post
the following commit message for the next version:

```
As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
may read past the end of a PTE table when a large folio's PTE mappings
are not fully contained within a single page table.

While this scenario might be rare, an issue triggerable from userspace must
be fixed regardless of its likelihood. This patch fixes the out-of-bounds
access by refactoring the logic into a new helper, folio_unmap_pte_batch().

The new helper correctly calculates the safe batch size by capping the
scan at both the VMA and PMD boundaries. To simplify the code, it also
supports partial batching (i.e., any number of pages from 1 up to the
calculated safe maximum), as there is no strong reason to special-case
for fully mapped folios.
```

So, wdyt?

Thanks,
Lance







* Re: [PATCH v2 1/1] mm/rmap: fix potential out-of-bounds page table access during batched unmap
  2025-06-27 15:29           ` Lance Yang
@ 2025-06-27 15:49             ` David Hildenbrand
  2025-06-27 22:42             ` Barry Song
  1 sibling, 0 replies; 10+ messages in thread
From: David Hildenbrand @ 2025-06-27 15:49 UTC (permalink / raw)
  To: Lance Yang, Barry Song
  Cc: akpm, baolin.wang, chrisl, kasong, linux-arm-kernel, linux-kernel,
	linux-mm, linux-riscv, lorenzo.stoakes, ryan.roberts,
	v-songbaohua, x86, huang.ying.caritas, zhengtangquan, riel,
	Liam.Howlett, vbabka, harry.yoo, mingzhe.yang, stable, Lance Yang

On 27.06.25 17:29, Lance Yang wrote:
> 
> 
> On 2025/6/27 18:13, David Hildenbrand wrote:
>> On 27.06.25 09:36, Barry Song wrote:
>>> On Fri, Jun 27, 2025 at 7:15 PM Lance Yang <lance.yang@linux.dev> wrote:
>>>>
>>>>
>>>>
>>>> On 2025/6/27 14:55, Barry Song wrote:
>>>>> On Fri, Jun 27, 2025 at 6:52 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>>>
>>>>>> On Fri, Jun 27, 2025 at 6:23 PM Lance Yang <ioworker0@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>>>>
>>>>>>> As pointed out by David[1], the batched unmap logic in
>>>>>>> try_to_unmap_one()
>>>>>>> can read past the end of a PTE table if a large folio is mapped
>>>>>>> starting at
>>>>>>> the last entry of that table. It would be quite rare in practice, as
>>>>>>> MADV_FREE typically splits the large folio ;)
>>>>>>>
>>>>>>> So let's fix the potential out-of-bounds read by refactoring the
>>>>>>> logic into
>>>>>>> a new helper, folio_unmap_pte_batch().
>>>>>>>
>>>>>>> The new helper now correctly calculates the safe number of pages
>>>>>>> to scan by
>>>>>>> limiting the operation to the boundaries of the current VMA and
>>>>>>> the PTE
>>>>>>> table.
>>>>>>>
>>>>>>> In addition, the "all-or-nothing" batching restriction is removed to
>>>>>>> support partial batches. The reference counting is also cleaned up
>>>>>>> to use
>>>>>>> folio_put_refs().
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/linux-mm/
>>>>>>> a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com
>>>>>>>
>>>>>>
>>>>>> What about ?
>>>>>>
>>>>>> As pointed out by David[1], the batched unmap logic in
>>>>>> try_to_unmap_one()
>>>>>> may read past the end of a PTE table when a large folio spans
>>>>>> across two PMDs,
>>>>>> particularly after being remapped with mremap(). This patch fixes the
>>>>>> potential out-of-bounds access by capping the batch at vm_end and
>>>>>> the PMD
>>>>>> boundary.
>>>>>>
>>>>>> It also refactors the logic into a new helper,
>>>>>> folio_unmap_pte_batch(),
>>>>>> which supports batching between 1 and folio_nr_pages. This improves
>>>>>> code
>>>>>> clarity. Note that such cases are rare in practice, as MADV_FREE
>>>>>> typically
>>>>>> splits large folios.
>>>>>
>>>>> Sorry, I meant that MADV_FREE typically splits large folios if the
>>>>> specified
>>>>> range doesn't cover the entire folio.
>>>>
>>>> Hmm... I got it wrong as well :( It's the partial coverage that triggers
>>>> the split.
>>>>
>>>> how about this revised version:
>>>>
>>>> As pointed out by David[1], the batched unmap logic in
>>>> try_to_unmap_one()
>>>> may read past the end of a PTE table when a large folio spans across two
>>>> PMDs, particularly after being remapped with mremap(). This patch fixes
>>>> the potential out-of-bounds access by capping the batch at vm_end and
>>>> the
>>>> PMD boundary.
>>>>
>>>> It also refactors the logic into a new helper, folio_unmap_pte_batch(),
>>>> which supports batching between 1 and folio_nr_pages. This improves code
>>>> clarity. Note that such boundary-straddling cases are rare in
>>>> practice, as
>>>> MADV_FREE will typically split a large folio if the advice range does
>>>> not
>>>> cover the entire folio.
>>>
>>> I assume the out-of-bounds access must be fixed, even though it is very
>>> unlikely. It might occur after a large folio is marked with MADV_FREE and
>>> then remapped to an unaligned address, potentially crossing two PTE
>>> tables.
>>
>> Right. If it can be triggered from userspace, it doesn't matter how
>> likely/common/whatever it is. It must be fixed.
> 
> Agreed. It must be fixed regardless of how rare the scenario is ;)
> 
>>
>>>
>>> A batch size between 2 and nr_pages - 1 is practically rare, as we
>>> typically
>>> split large folios when MADV_FREE does not cover the entire folio range.
>>> Cases where a batch of size 2 or nr_pages - 1 occurs may only happen if a
>>> large folio is partially unmapped after being marked MADV_FREE, which is
>>> quite an unusual pattern in userspace.
>>
>> I think the point is rather "Simplify the code by not special-casing for
>> completely mapped folios, there is no real reason why we cannot batch
>> ranges that don't cover the complete folio.".
> 
> Yeah. That makes the code cleaner and more generic, as there is no
> strong reason to special-case for fully mapped folios ;)
> 
> Based on that, I think we're on the same page now. I'd like to post
> the following commit message for the next version:
> 
> ```
> As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
> may read past the end of a PTE table when a large folio's PTE mappings
> are not fully contained within a single page table.
> 
> While this scenario might be rare, an issue triggerable from userspace must
> be fixed regardless of its likelihood. This patch fixes the out-of-bounds
> access by refactoring the logic into a new helper, folio_unmap_pte_batch().
> 
> The new helper correctly calculates the safe batch size by capping the
> scan at both the VMA and PMD boundaries. To simplify the code, it also
> supports partial batching (i.e., any number of pages from 1 up to the
> calculated safe maximum), as there is no strong reason to special-case
> for fully mapped folios.
> ```
> 
> So, wdyt?

Sounds good to me.

-- 
Cheers,

David / dhildenb




* Re: [PATCH v2 1/1] mm/rmap: fix potential out-of-bounds page table access during batched unmap
  2025-06-27  6:23 [PATCH v2 1/1] mm/rmap: fix potential out-of-bounds page table access during batched unmap Lance Yang
  2025-06-27  6:52 ` Barry Song
@ 2025-06-27 20:09 ` Andrew Morton
  1 sibling, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2025-06-27 20:09 UTC (permalink / raw)
  To: Lance Yang
  Cc: david, 21cnbao, baolin.wang, chrisl, kasong, linux-arm-kernel,
	linux-kernel, linux-mm, linux-riscv, lorenzo.stoakes,
	ryan.roberts, v-songbaohua, x86, huang.ying.caritas,
	zhengtangquan, riel, Liam.Howlett, vbabka, harry.yoo,
	mingzhe.yang, stable, Barry Song, Lance Yang

On Fri, 27 Jun 2025 14:23:19 +0800 Lance Yang <ioworker0@gmail.com> wrote:

> As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
> can read past the end of a PTE table if a large folio is mapped starting at
> the last entry of that table. It would be quite rare in practice, as
> MADV_FREE typically splits the large folio ;)
> 
> So let's fix the potential out-of-bounds read by refactoring the logic into
> a new helper, folio_unmap_pte_batch().
> 
> The new helper now correctly calculates the safe number of pages to scan by
> limiting the operation to the boundaries of the current VMA and the PTE
> table.
> 
> In addition, the "all-or-nothing" batching restriction is removed to
> support partial batches. The reference counting is also cleaned up to use
> folio_put_refs().

I'll queue this for testing while the updated changelog is being prepared.



* Re: [PATCH v2 1/1] mm/rmap: fix potential out-of-bounds page table access during batched unmap
  2025-06-27 15:29           ` Lance Yang
  2025-06-27 15:49             ` David Hildenbrand
@ 2025-06-27 22:42             ` Barry Song
  1 sibling, 0 replies; 10+ messages in thread
From: Barry Song @ 2025-06-27 22:42 UTC (permalink / raw)
  To: Lance Yang
  Cc: David Hildenbrand, akpm, baolin.wang, chrisl, kasong,
	linux-arm-kernel, linux-kernel, linux-mm, linux-riscv,
	lorenzo.stoakes, ryan.roberts, v-songbaohua, x86,
	huang.ying.caritas, zhengtangquan, riel, Liam.Howlett, vbabka,
	harry.yoo, mingzhe.yang, stable, Lance Yang

On Sat, Jun 28, 2025 at 3:29 AM Lance Yang <lance.yang@linux.dev> wrote:

[...]
>
> Based on that, I think we're on the same page now. I'd like to post
> the following commit message for the next version:
>
> ```
> As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
> may read past the end of a PTE table when a large folio's PTE mappings
> are not fully contained within a single page table.
>
> While this scenario might be rare, an issue triggerable from userspace must
> be fixed regardless of its likelihood. This patch fixes the out-of-bounds
> access by refactoring the logic into a new helper, folio_unmap_pte_batch().
>
> The new helper correctly calculates the safe batch size by capping the
> scan at both the VMA and PMD boundaries. To simplify the code, it also
> supports partial batching (i.e., any number of pages from 1 up to the
> calculated safe maximum), as there is no strong reason to special-case
> for fully mapped folios.
> ```
>
> So, wdyt?
>

Acked-by: Barry Song <baohua@kernel.org>

Thanks
Barry


