From: Harry Yoo <harry.yoo@oracle.com>
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org,
baolin.wang@linux.alibaba.com, chrisl@kernel.org,
david@redhat.com, ioworker0@gmail.com, kasong@tencent.com,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-riscv@lists.infradead.org,
lorenzo.stoakes@oracle.com, ryan.roberts@arm.com,
v-songbaohua@oppo.com, x86@kernel.org, ying.huang@intel.com,
zhengtangquan@oppo.com
Subject: Re: [PATCH v4 3/4] mm: Support batched unmap for lazyfree large folios during reclamation
Date: Tue, 1 Jul 2025 22:27:30 +0900
Message-ID: <aGPiQq4cIPDt-Ue-@hyeyoo>
In-Reply-To: <aGOyhvR-GaUYgLwQ@hyeyoo>
On Tue, Jul 01, 2025 at 07:03:50PM +0900, Harry Yoo wrote:
> On Fri, Feb 14, 2025 at 10:30:14PM +1300, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Currently, the PTEs and rmap of a large folio are removed one at a time.
> > This is not only slow but also causes the large folio to be unnecessarily
> > added to deferred_split, which can lead to races between the
> > deferred_split shrinker callback and memory reclamation. This patch
> > releases all PTEs and rmap entries in a batch.
> > Currently, it only handles lazyfree large folios.
> >
> > The microbenchmark below tries to reclaim 128MB of lazyfree large folios,
> > each 64KiB in size:
> >
> > #include <stdio.h>
> > #include <sys/mman.h>
> > #include <string.h>
> > #include <time.h>
> >
> > #define SIZE (128*1024*1024) // 128 MB
> >
> > unsigned long read_split_deferred(void)
> > {
> > 	FILE *file = fopen("/sys/kernel/mm/transparent_hugepage"
> > 			"/hugepages-64kB/stats/split_deferred", "r");
> > 	if (!file) {
> > 		perror("Error opening file");
> > 		return 0;
> > 	}
> >
> > 	unsigned long value;
> > 	if (fscanf(file, "%lu", &value) != 1) {
> > 		perror("Error reading value");
> > 		fclose(file);
> > 		return 0;
> > 	}
> >
> > 	fclose(file);
> > 	return value;
> > }
> >
> > int main(int argc, char *argv[])
> > {
> > 	while(1) {
> > 		volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
> > 				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >
> > 		memset((void *)p, 1, SIZE);
> >
> > 		madvise((void *)p, SIZE, MADV_FREE);
> >
> > 		clock_t start_time = clock();
> > 		unsigned long start_split = read_split_deferred();
> > 		madvise((void *)p, SIZE, MADV_PAGEOUT);
> > 		clock_t end_time = clock();
> > 		unsigned long end_split = read_split_deferred();
> >
> > 		double elapsed_time = (double)(end_time - start_time) / CLOCKS_PER_SEC;
> > 		printf("Time taken by reclamation: %f seconds, split_deferred: %lu\n",
> > 			elapsed_time, end_split - start_split);
> >
> > 		munmap((void *)p, SIZE);
> > 	}
> > 	return 0;
> > }
> >
> > w/o patch:
> > ~ # ./a.out
> > Time taken by reclamation: 0.177418 seconds, split_deferred: 2048
> > Time taken by reclamation: 0.178348 seconds, split_deferred: 2048
> > Time taken by reclamation: 0.174525 seconds, split_deferred: 2048
> > Time taken by reclamation: 0.171620 seconds, split_deferred: 2048
> > Time taken by reclamation: 0.172241 seconds, split_deferred: 2048
> > Time taken by reclamation: 0.174003 seconds, split_deferred: 2048
> > Time taken by reclamation: 0.171058 seconds, split_deferred: 2048
> > Time taken by reclamation: 0.171993 seconds, split_deferred: 2048
> > Time taken by reclamation: 0.169829 seconds, split_deferred: 2048
> > Time taken by reclamation: 0.172895 seconds, split_deferred: 2048
> > Time taken by reclamation: 0.176063 seconds, split_deferred: 2048
> > Time taken by reclamation: 0.172568 seconds, split_deferred: 2048
> > Time taken by reclamation: 0.171185 seconds, split_deferred: 2048
> > Time taken by reclamation: 0.170632 seconds, split_deferred: 2048
> > Time taken by reclamation: 0.170208 seconds, split_deferred: 2048
> > Time taken by reclamation: 0.174192 seconds, split_deferred: 2048
> > ...
> >
> > w/ patch:
> > ~ # ./a.out
> > Time taken by reclamation: 0.074231 seconds, split_deferred: 0
> > Time taken by reclamation: 0.071026 seconds, split_deferred: 0
> > Time taken by reclamation: 0.072029 seconds, split_deferred: 0
> > Time taken by reclamation: 0.071873 seconds, split_deferred: 0
> > Time taken by reclamation: 0.073573 seconds, split_deferred: 0
> > Time taken by reclamation: 0.071906 seconds, split_deferred: 0
> > Time taken by reclamation: 0.073604 seconds, split_deferred: 0
> > Time taken by reclamation: 0.075903 seconds, split_deferred: 0
> > Time taken by reclamation: 0.073191 seconds, split_deferred: 0
> > Time taken by reclamation: 0.071228 seconds, split_deferred: 0
> > Time taken by reclamation: 0.071391 seconds, split_deferred: 0
> > Time taken by reclamation: 0.071468 seconds, split_deferred: 0
> > Time taken by reclamation: 0.071896 seconds, split_deferred: 0
> > Time taken by reclamation: 0.072508 seconds, split_deferred: 0
> > Time taken by reclamation: 0.071884 seconds, split_deferred: 0
> > Time taken by reclamation: 0.072433 seconds, split_deferred: 0
> > Time taken by reclamation: 0.071939 seconds, split_deferred: 0
> > ...
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
>
> I'm still following the long discussions and follow-up patch series,
> but let me ask a possibly silly question here :)
>
> > mm/rmap.c | 72 ++++++++++++++++++++++++++++++++++++++-----------------
> > 1 file changed, 50 insertions(+), 22 deletions(-)
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 89e51a7a9509..8786704bd466 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1933,23 +1953,26 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >  			if (pte_dirty(pteval))
> >  				folio_mark_dirty(folio);
> >  		} else if (likely(pte_present(pteval))) {
> > -			flush_cache_page(vma, address, pfn);
> > -			/* Nuke the page table entry. */
> > -			if (should_defer_flush(mm, flags)) {
> > -				/*
> > -				 * We clear the PTE but do not flush so potentially
> > -				 * a remote CPU could still be writing to the folio.
> > -				 * If the entry was previously clean then the
> > -				 * architecture must guarantee that a clear->dirty
> > -				 * transition on a cached TLB entry is written through
> > -				 * and traps if the PTE is unmapped.
> > -				 */
> > -				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
> > +			if (folio_test_large(folio) && !(flags & TTU_HWPOISON) &&
> > +			    can_batch_unmap_folio_ptes(address, folio, pvmw.pte))
> > +				nr_pages = folio_nr_pages(folio);
> > +			end_addr = address + nr_pages * PAGE_SIZE;
> > +			flush_cache_range(vma, address, end_addr);
> >
> > -				set_tlb_ubc_flush_pending(mm, pteval, address, address + PAGE_SIZE);
> > -			} else {
> > -				pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > -			}
> > +			/* Nuke the page table entry. */
> > +			pteval = get_and_clear_full_ptes(mm, address, pvmw.pte, nr_pages, 0);
> > +			/*
> > +			 * We clear the PTE but do not flush so potentially
> > +			 * a remote CPU could still be writing to the folio.
> > +			 * If the entry was previously clean then the
> > +			 * architecture must guarantee that a clear->dirty
> > +			 * transition on a cached TLB entry is written through
> > +			 * and traps if the PTE is unmapped.
> > +			 */
> > +			if (should_defer_flush(mm, flags))
> > +				set_tlb_ubc_flush_pending(mm, pteval, address, end_addr);
>
> When the first pte of a PTE-mapped THP has _PAGE_PROTNONE bit set
> (by NUMA balancing), can set_tlb_ubc_flush_pending() mistakenly think that
> it doesn't need to flush the whole range, although some ptes in the range
> don't have the _PAGE_PROTNONE bit set?

No. In that case folio_pte_batch() would have returned
nr < folio_nr_pages(folio), so the range would not be batched in the
first place.

> > +			else
> > +				flush_tlb_range(vma, address, end_addr);
> >  			if (pte_dirty(pteval))
> >  				folio_mark_dirty(folio);
> >  		} else {
>
> --
> Cheers,
> Harry / Hyeonggon
--
Cheers,
Harry / Hyeonggon