* [RFC PATCH] mm: Enable khugepaged to operate on non-writable VMAs
@ 2025-09-01 7:48 Dev Jain
2025-09-01 8:23 ` Kiryl Shutsemau
2025-09-01 8:32 ` David Hildenbrand
0 siblings, 2 replies; 10+ messages in thread
From: Dev Jain @ 2025-09-01 7:48 UTC (permalink / raw)
To: akpm, david, kas, willy, hughd
Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, baohua, linux-mm, linux-kernel, Dev Jain
Currently khugepaged does not collapse a region which does not have a
single writable page. This is wasteful since, apart from any non-writable
memory mapped by the application, there are a lot of non-writable VMAs
which will benefit from collapsing - the VMAs of the executable, those
of the glibc, vvar and vdso, which won't be unmapped during the lifetime
of the process, as opposed to other VMAs which maybe unmapped. Therefore,
remove this restriction and allow khugepaged to collapse a VMA with
arbitrary protections.
Along with this, currently MADV_COLLAPSE does not perform a collapse on a
non-writable VMA, and this restriction is nowhere to be found on the
manpage - the restriction itself sounds wrong to me since the user knows
the protection of the memory it has mapped, so collapsing read-only
memory via madvise() should be a choice of the user which shouldn't
be overriden by the kernel.
I dug into the history of this and couldn't find any concrete reason of
the current behaviour - [1] is the v1 of the original khugepaged patch
which required all ptes to be writable. [2] is the v1 of the patch which
changed this behaviour to require at least one pte to be writable. The
closest thing I could find was: in response to [2], Kirill says in [3] -
"As a side effect it will effectively allow collapse in PROT_READ vmas,
right? I'm not convinced it's a good idea." (Although Kirill realizes in
[4] that this was not the intention of the patch).
I can see performance improvements on mmtests run on an arm64 machine
comparing with 6.17-rc2. (I) denotes statistically significant improvement,
(R) denotes statistically significant regression (Please ignore the
numbers in the middle column):
+------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
| mmtests/hackbench | process-pipes-1 (seconds) | 0.145 | -0.06% |
| | process-pipes-4 (seconds) | 0.4335 | -0.27% |
| | process-pipes-7 (seconds) | 0.823 | (I) -12.13% |
| | process-pipes-12 (seconds) | 1.3538333333333334 | (I) -5.32% |
| | process-pipes-21 (seconds) | 1.8971666666666664 | (I) -2.87% |
| | process-pipes-30 (seconds) | 2.5023333333333335 | (I) -3.39% |
| | process-pipes-48 (seconds) | 3.4305 | (I) -5.65% |
| | process-pipes-79 (seconds) | 4.245833333333334 | (I) -6.74% |
| | process-pipes-110 (seconds) | 5.114833333333333 | (I) -6.26% |
| | process-pipes-141 (seconds) | 6.1885 | (I) -4.99% |
| | process-pipes-172 (seconds) | 7.231833333333334 | (I) -4.45% |
| | process-pipes-203 (seconds) | 8.393166666666668 | (I) -3.65% |
| | process-pipes-234 (seconds) | 9.487499999999999 | (I) -3.45% |
| | process-pipes-256 (seconds) | 10.316166666666666 | (I) -3.47% |
| | process-sockets-1 (seconds) | 0.289 | 2.13% |
| | process-sockets-4 (seconds) | 0.7596666666666666 | 1.02% |
| | process-sockets-7 (seconds) | 1.1663333333333334 | -0.26% |
| | process-sockets-12 (seconds) | 1.8641666666666665 | -1.24% |
| | process-sockets-21 (seconds) | 3.0773333333333333 | 0.01% |
| | process-sockets-30 (seconds) | 4.2405 | -0.15% |
| | process-sockets-48 (seconds) | 6.459666666666666 | 0.15% |
| | process-sockets-79 (seconds) | 10.156833333333333 | 1.45% |
| | process-sockets-110 (seconds) | 14.317833333333333 | -1.64% |
| | process-sockets-141 (seconds) | 20.8735 | (I) -4.27% |
| | process-sockets-172 (seconds) | 26.205333333333332 | 0.30% |
| | process-sockets-203 (seconds) | 31.298000000000002 | -1.71% |
| | process-sockets-234 (seconds) | 36.104000000000006 | -1.94% |
| | process-sockets-256 (seconds) | 39.44016666666667 | -0.71% |
| | thread-pipes-1 (seconds) | 0.17550000000000002 | 0.66% |
| | thread-pipes-4 (seconds) | 0.44716666666666666 | 1.66% |
| | thread-pipes-7 (seconds) | 0.7345 | -0.17% |
| | thread-pipes-12 (seconds) | 1.405833333333333 | (I) -4.12% |
| | thread-pipes-21 (seconds) | 2.0113333333333334 | (I) -2.13% |
| | thread-pipes-30 (seconds) | 2.6648333333333336 | (I) -3.78% |
| | thread-pipes-48 (seconds) | 3.6341666666666668 | (I) -5.77% |
| | thread-pipes-79 (seconds) | 4.4085 | (I) -5.31% |
| | thread-pipes-110 (seconds) | 5.374666666666666 | (I) -6.12% |
| | thread-pipes-141 (seconds) | 6.385666666666666 | (I) -4.00% |
| | thread-pipes-172 (seconds) | 7.403000000000001 | (I) -3.01% |
| | thread-pipes-203 (seconds) | 8.570333333333332 | (I) -2.62% |
| | thread-pipes-234 (seconds) | 9.719166666666666 | (I) -2.00% |
| | thread-pipes-256 (seconds) | 10.552833333333334 | (I) -2.30% |
| | thread-sockets-1 (seconds) | 0.3065 | (R) 2.39% |
+------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
+------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
| mmtests/sysbench-mutex | sysbenchmutex-1 (usec) | 194.38333333333333 | -0.02% |
| | sysbenchmutex-4 (usec) | 200.875 | -0.02% |
| | sysbenchmutex-7 (usec) | 201.23000000000002 | 0.00% |
| | sysbenchmutex-12 (usec) | 201.77666666666664 | 0.12% |
| | sysbenchmutex-21 (usec) | 203.03 | -0.40% |
| | sysbenchmutex-30 (usec) | 203.285 | 0.08% |
| | sysbenchmutex-48 (usec) | 231.30000000000004 | 2.59% |
| | sysbenchmutex-79 (usec) | 362.075 | -0.80% |
| | sysbenchmutex-110 (usec) | 516.8233333333334 | -3.87% |
| | sysbenchmutex-128 (usec) | 593.3533333333334 | (I) -4.46% |
+------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
No regressions were observed with mm-selftests.
[1] https://lore.kernel.org/all/679861e2e81b32a0ae08.1264054854@v2.random/
[2] https://lore.kernel.org/all/1421999256-3881-1-git-send-email-ebru.akagunduz@gmail.com/
[3] https://lore.kernel.org/all/20150123113701.GB5975@node.dhcp.inet.fi/
[4] https://lore.kernel.org/all/20150123155802.GA7011@node.dhcp.inet.fi/
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
Based on mm-new.
Not very sure of the tracing parts which this patch changes. I have kept
the writable portion for the tracing to maintain backward compat, just
dropped it as a collapse condition.
include/trace/events/huge_memory.h | 2 +-
mm/khugepaged.c | 11 +++--------
2 files changed, 4 insertions(+), 9 deletions(-)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 2305df6cb485..f2472c1c132a 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -19,7 +19,7 @@
EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \
EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \
EM( SCAN_PTE_MAPPED_HUGEPAGE, "pte_mapped_hugepage") \
- EM( SCAN_PAGE_RO, "no_writable_page") \
+ EM( SCAN_PAGE_RO, "no_writable_page") /* deprecated */ \
EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \
EM( SCAN_PAGE_NULL, "page_null") \
EM( SCAN_SCAN_ABORT, "scan_aborted") \
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4ec324a4c1fe..5ef8482597a9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -39,7 +39,7 @@ enum scan_result {
SCAN_PTE_NON_PRESENT,
SCAN_PTE_UFFD_WP,
SCAN_PTE_MAPPED_HUGEPAGE,
- SCAN_PAGE_RO,
+ SCAN_PAGE_RO, /* deprecated */
SCAN_LACK_REFERENCED_PAGE,
SCAN_PAGE_NULL,
SCAN_SCAN_ABORT,
@@ -676,9 +676,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
writable = true;
}
- if (unlikely(!writable)) {
- result = SCAN_PAGE_RO;
- } else if (unlikely(cc->is_khugepaged && !referenced)) {
+ if (unlikely(cc->is_khugepaged && !referenced)) {
result = SCAN_LACK_REFERENCED_PAGE;
} else {
result = SCAN_SUCCEED;
@@ -1421,9 +1419,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
mmu_notifier_test_young(vma->vm_mm, _address)))
referenced++;
}
- if (!writable) {
- result = SCAN_PAGE_RO;
- } else if (cc->is_khugepaged &&
+ if (cc->is_khugepaged &&
(!referenced ||
(unmapped && referenced < HPAGE_PMD_NR / 2))) {
result = SCAN_LACK_REFERENCED_PAGE;
@@ -2830,7 +2826,6 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
case SCAN_PMD_NULL:
case SCAN_PTE_NON_PRESENT:
case SCAN_PTE_UFFD_WP:
- case SCAN_PAGE_RO:
case SCAN_LACK_REFERENCED_PAGE:
case SCAN_PAGE_NULL:
case SCAN_PAGE_COUNT:
--
2.30.2
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [RFC PATCH] mm: Enable khugepaged to operate on non-writable VMAs
2025-09-01 7:48 [RFC PATCH] mm: Enable khugepaged to operate on non-writable VMAs Dev Jain
@ 2025-09-01 8:23 ` Kiryl Shutsemau
2025-09-01 8:32 ` David Hildenbrand
2025-09-01 8:39 ` Dev Jain
2025-09-01 8:32 ` David Hildenbrand
1 sibling, 2 replies; 10+ messages in thread
From: Kiryl Shutsemau @ 2025-09-01 8:23 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, david, willy, hughd, ziy, baolin.wang, lorenzo.stoakes,
Liam.Howlett, npache, ryan.roberts, baohua, linux-mm,
linux-kernel
On Mon, Sep 01, 2025 at 01:18:17PM +0530, Dev Jain wrote:
> Currently khugepaged does not collapse a region which does not have a
> single writable page. This is wasteful since, apart from any non-writable
> memory mapped by the application, there are a lot of non-writable VMAs
> which will benefit from collapsing - the VMAs of the executable, those
> of the glibc, vvar and vdso, which won't be unmapped during the lifetime
> of the process, as opposed to other VMAs which maybe unmapped. Therefore,
> remove this restriction and allow khugepaged to collapse a VMA with
> arbitrary protections.
>
> Along with this, currently MADV_COLLAPSE does not perform a collapse on a
> non-writable VMA, and this restriction is nowhere to be found on the
> manpage - the restriction itself sounds wrong to me since the user knows
> the protection of the memory it has mapped, so collapsing read-only
> memory via madvise() should be a choice of the user which shouldn't
> be overriden by the kernel.
>
> I dug into the history of this and couldn't find any concrete reason of
> the current behaviour - [1] is the v1 of the original khugepaged patch
> which required all ptes to be writable. [2] is the v1 of the patch which
> changed this behaviour to require at least one pte to be writable. The
> closest thing I could find was: in response to [2], Kirill says in [3] -
> "As a side effect it will effectively allow collapse in PROT_READ vmas,
> right? I'm not convinced it's a good idea." (Although Kirill realizes in
> [4] that this was not the intention of the patch).
Hm. I don't see a justification for only collapsing writable pages.
> I can see performance improvements on mmtests run on an arm64 machine
> comparing with 6.17-rc2. (I) denotes statistically significant improvement,
> (R) denotes statistically significant regression (Please ignore the
> numbers in the middle column):
Could you give a summary instead of raw data? It is too much for commit
message. Raw data can be put under "---" for reference.
BTW, why did you pick hackbench as a benchmark? Seems random.
> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
> | mmtests/hackbench | process-pipes-1 (seconds) | 0.145 | -0.06% |
> | | process-pipes-4 (seconds) | 0.4335 | -0.27% |
> | | process-pipes-7 (seconds) | 0.823 | (I) -12.13% |
> | | process-pipes-12 (seconds) | 1.3538333333333334 | (I) -5.32% |
> | | process-pipes-21 (seconds) | 1.8971666666666664 | (I) -2.87% |
> | | process-pipes-30 (seconds) | 2.5023333333333335 | (I) -3.39% |
> | | process-pipes-48 (seconds) | 3.4305 | (I) -5.65% |
> | | process-pipes-79 (seconds) | 4.245833333333334 | (I) -6.74% |
> | | process-pipes-110 (seconds) | 5.114833333333333 | (I) -6.26% |
> | | process-pipes-141 (seconds) | 6.1885 | (I) -4.99% |
> | | process-pipes-172 (seconds) | 7.231833333333334 | (I) -4.45% |
> | | process-pipes-203 (seconds) | 8.393166666666668 | (I) -3.65% |
> | | process-pipes-234 (seconds) | 9.487499999999999 | (I) -3.45% |
> | | process-pipes-256 (seconds) | 10.316166666666666 | (I) -3.47% |
> | | process-sockets-1 (seconds) | 0.289 | 2.13% |
> | | process-sockets-4 (seconds) | 0.7596666666666666 | 1.02% |
> | | process-sockets-7 (seconds) | 1.1663333333333334 | -0.26% |
> | | process-sockets-12 (seconds) | 1.8641666666666665 | -1.24% |
> | | process-sockets-21 (seconds) | 3.0773333333333333 | 0.01% |
> | | process-sockets-30 (seconds) | 4.2405 | -0.15% |
> | | process-sockets-48 (seconds) | 6.459666666666666 | 0.15% |
> | | process-sockets-79 (seconds) | 10.156833333333333 | 1.45% |
> | | process-sockets-110 (seconds) | 14.317833333333333 | -1.64% |
> | | process-sockets-141 (seconds) | 20.8735 | (I) -4.27% |
> | | process-sockets-172 (seconds) | 26.205333333333332 | 0.30% |
> | | process-sockets-203 (seconds) | 31.298000000000002 | -1.71% |
> | | process-sockets-234 (seconds) | 36.104000000000006 | -1.94% |
> | | process-sockets-256 (seconds) | 39.44016666666667 | -0.71% |
> | | thread-pipes-1 (seconds) | 0.17550000000000002 | 0.66% |
> | | thread-pipes-4 (seconds) | 0.44716666666666666 | 1.66% |
> | | thread-pipes-7 (seconds) | 0.7345 | -0.17% |
> | | thread-pipes-12 (seconds) | 1.405833333333333 | (I) -4.12% |
> | | thread-pipes-21 (seconds) | 2.0113333333333334 | (I) -2.13% |
> | | thread-pipes-30 (seconds) | 2.6648333333333336 | (I) -3.78% |
> | | thread-pipes-48 (seconds) | 3.6341666666666668 | (I) -5.77% |
> | | thread-pipes-79 (seconds) | 4.4085 | (I) -5.31% |
> | | thread-pipes-110 (seconds) | 5.374666666666666 | (I) -6.12% |
> | | thread-pipes-141 (seconds) | 6.385666666666666 | (I) -4.00% |
> | | thread-pipes-172 (seconds) | 7.403000000000001 | (I) -3.01% |
> | | thread-pipes-203 (seconds) | 8.570333333333332 | (I) -2.62% |
> | | thread-pipes-234 (seconds) | 9.719166666666666 | (I) -2.00% |
> | | thread-pipes-256 (seconds) | 10.552833333333334 | (I) -2.30% |
> | | thread-sockets-1 (seconds) | 0.3065 | (R) 2.39% |
> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>
> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
> | mmtests/sysbench-mutex | sysbenchmutex-1 (usec) | 194.38333333333333 | -0.02% |
> | | sysbenchmutex-4 (usec) | 200.875 | -0.02% |
> | | sysbenchmutex-7 (usec) | 201.23000000000002 | 0.00% |
> | | sysbenchmutex-12 (usec) | 201.77666666666664 | 0.12% |
> | | sysbenchmutex-21 (usec) | 203.03 | -0.40% |
> | | sysbenchmutex-30 (usec) | 203.285 | 0.08% |
> | | sysbenchmutex-48 (usec) | 231.30000000000004 | 2.59% |
> | | sysbenchmutex-79 (usec) | 362.075 | -0.80% |
> | | sysbenchmutex-110 (usec) | 516.8233333333334 | -3.87% |
> | | sysbenchmutex-128 (usec) | 593.3533333333334 | (I) -4.46% |
> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>
> No regressions were observed with mm-selftests.
>
> [1] https://lore.kernel.org/all/679861e2e81b32a0ae08.1264054854@v2.random/
> [2] https://lore.kernel.org/all/1421999256-3881-1-git-send-email-ebru.akagunduz@gmail.com/
> [3] https://lore.kernel.org/all/20150123113701.GB5975@node.dhcp.inet.fi/
> [4] https://lore.kernel.org/all/20150123155802.GA7011@node.dhcp.inet.fi/
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> Based on mm-new.
>
> Not very sure of the tracing parts which this patch changes. I have kept
> the writable portion for the tracing to maintain backward compat, just
> dropped it as a collapse condition.
>
> include/trace/events/huge_memory.h | 2 +-
> mm/khugepaged.c | 11 +++--------
> 2 files changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 2305df6cb485..f2472c1c132a 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -19,7 +19,7 @@
> EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \
> EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \
> EM( SCAN_PTE_MAPPED_HUGEPAGE, "pte_mapped_hugepage") \
> - EM( SCAN_PAGE_RO, "no_writable_page") \
> + EM( SCAN_PAGE_RO, "no_writable_page") /* deprecated */ \
Why not remove SCAN_PAGE_RO?
> EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \
> EM( SCAN_PAGE_NULL, "page_null") \
> EM( SCAN_SCAN_ABORT, "scan_aborted") \
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4ec324a4c1fe..5ef8482597a9 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -39,7 +39,7 @@ enum scan_result {
> SCAN_PTE_NON_PRESENT,
> SCAN_PTE_UFFD_WP,
> SCAN_PTE_MAPPED_HUGEPAGE,
> - SCAN_PAGE_RO,
> + SCAN_PAGE_RO, /* deprecated */
> SCAN_LACK_REFERENCED_PAGE,
> SCAN_PAGE_NULL,
> SCAN_SCAN_ABORT,
> @@ -676,9 +676,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> writable = true;
> }
>
> - if (unlikely(!writable)) {
> - result = SCAN_PAGE_RO;
> - } else if (unlikely(cc->is_khugepaged && !referenced)) {
> + if (unlikely(cc->is_khugepaged && !referenced)) {
> result = SCAN_LACK_REFERENCED_PAGE;
> } else {
> result = SCAN_SUCCEED;
> @@ -1421,9 +1419,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> mmu_notifier_test_young(vma->vm_mm, _address)))
> referenced++;
> }
> - if (!writable) {
> - result = SCAN_PAGE_RO;
> - } else if (cc->is_khugepaged &&
> + if (cc->is_khugepaged &&
The only practical use of the writable is gone. The only other usage is
tracing which can be dropped to as it is not actionable anymore.
Could you drop writable? Maybe as a separate commit.
> (!referenced ||
> (unmapped && referenced < HPAGE_PMD_NR / 2))) {
> result = SCAN_LACK_REFERENCED_PAGE;
> @@ -2830,7 +2826,6 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> case SCAN_PMD_NULL:
> case SCAN_PTE_NON_PRESENT:
> case SCAN_PTE_UFFD_WP:
> - case SCAN_PAGE_RO:
> case SCAN_LACK_REFERENCED_PAGE:
> case SCAN_PAGE_NULL:
> case SCAN_PAGE_COUNT:
> --
> 2.30.2
>
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH] mm: Enable khugepaged to operate on non-writable VMAs
2025-09-01 7:48 [RFC PATCH] mm: Enable khugepaged to operate on non-writable VMAs Dev Jain
2025-09-01 8:23 ` Kiryl Shutsemau
@ 2025-09-01 8:32 ` David Hildenbrand
2025-09-01 8:50 ` Dev Jain
2025-09-01 15:26 ` Dev Jain
1 sibling, 2 replies; 10+ messages in thread
From: David Hildenbrand @ 2025-09-01 8:32 UTC (permalink / raw)
To: Dev Jain, akpm, kas, willy, hughd
Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, baohua, linux-mm, linux-kernel
On 01.09.25 09:48, Dev Jain wrote:
> Currently khugepaged does not collapse a region which does not have a
> single writable page. This is wasteful since, apart from any non-writable
> memory mapped by the application, there are a lot of non-writable VMAs
> which will benefit from collapsing - the VMAs of the executable, those
> of the glibc, vvar and vdso, which won't be unmapped during the lifetime
> of the process, as opposed to other VMAs which maybe unmapped.
Are these anonymous folios? ("VMAs of the executable"), or is you description
misleading?
> Therefore,
> remove this restriction and allow khugepaged to collapse a VMA with
> arbitrary protections.
>
> Along with this, currently MADV_COLLAPSE does not perform a collapse on a
> non-writable VMA, and this restriction is nowhere to be found on the
> manpage - the restriction itself sounds wrong to me since the user knows
> the protection of the memory it has mapped, so collapsing read-only
> memory via madvise() should be a choice of the user which shouldn't
> be overriden by the kernel.
>
> I dug into the history of this and couldn't find any concrete reason of
> the current behaviour - [1] is the v1 of the original khugepaged patch
> which required all ptes to be writable. [2] is the v1 of the patch which
> changed this behaviour to require at least one pte to be writable. The
> closest thing I could find was: in response to [2], Kirill says in [3] -
> "As a side effect it will effectively allow collapse in PROT_READ vmas,
> right? I'm not convinced it's a good idea." (Although Kirill realizes in
> [4] that this was not the intention of the patch).
>
> I can see performance improvements on mmtests run on an arm64 machine
> comparing with 6.17-rc2. (I) denotes statistically significant improvement,
> (R) denotes statistically significant regression (Please ignore the
> numbers in the middle column):
I once dug into that myself as well as part of
commit 1bafe96e89f056cb6e25d47451fb16aee2c7c4d0
Author: David Hildenbrand <david@redhat.com>
Date: Wed Apr 24 14:26:30 2024 +0200
mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()
where I noted:
Interestingly, khugepaged will only collapse an anonymous THP if at least
one PTE is writable. After fork(), that means that something (usually a
page fault) populated at least a single exclusive anonymous THP in that
PMD range.
The problem I was concerned with (also documented in that patch) should no
longer apply ever since we changed how folio_maybe_mapped_shared() operates.
So yes, I don't see a good reason to fail on R/O PTEs
>
> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
> | mmtests/hackbench | process-pipes-1 (seconds) | 0.145 | -0.06% |
> | | process-pipes-4 (seconds) | 0.4335 | -0.27% |
> | | process-pipes-7 (seconds) | 0.823 | (I) -12.13% |
> | | process-pipes-12 (seconds) | 1.3538333333333334 | (I) -5.32% |
> | | process-pipes-21 (seconds) | 1.8971666666666664 | (I) -2.87% |
> | | process-pipes-30 (seconds) | 2.5023333333333335 | (I) -3.39% |
> | | process-pipes-48 (seconds) | 3.4305 | (I) -5.65% |
> | | process-pipes-79 (seconds) | 4.245833333333334 | (I) -6.74% |
> | | process-pipes-110 (seconds) | 5.114833333333333 | (I) -6.26% |
> | | process-pipes-141 (seconds) | 6.1885 | (I) -4.99% |
> | | process-pipes-172 (seconds) | 7.231833333333334 | (I) -4.45% |
> | | process-pipes-203 (seconds) | 8.393166666666668 | (I) -3.65% |
> | | process-pipes-234 (seconds) | 9.487499999999999 | (I) -3.45% |
> | | process-pipes-256 (seconds) | 10.316166666666666 | (I) -3.47% |
> | | process-sockets-1 (seconds) | 0.289 | 2.13% |
> | | process-sockets-4 (seconds) | 0.7596666666666666 | 1.02% |
> | | process-sockets-7 (seconds) | 1.1663333333333334 | -0.26% |
> | | process-sockets-12 (seconds) | 1.8641666666666665 | -1.24% |
> | | process-sockets-21 (seconds) | 3.0773333333333333 | 0.01% |
> | | process-sockets-30 (seconds) | 4.2405 | -0.15% |
> | | process-sockets-48 (seconds) | 6.459666666666666 | 0.15% |
> | | process-sockets-79 (seconds) | 10.156833333333333 | 1.45% |
> | | process-sockets-110 (seconds) | 14.317833333333333 | -1.64% |
> | | process-sockets-141 (seconds) | 20.8735 | (I) -4.27% |
> | | process-sockets-172 (seconds) | 26.205333333333332 | 0.30% |
> | | process-sockets-203 (seconds) | 31.298000000000002 | -1.71% |
> | | process-sockets-234 (seconds) | 36.104000000000006 | -1.94% |
> | | process-sockets-256 (seconds) | 39.44016666666667 | -0.71% |
> | | thread-pipes-1 (seconds) | 0.17550000000000002 | 0.66% |
> | | thread-pipes-4 (seconds) | 0.44716666666666666 | 1.66% |
> | | thread-pipes-7 (seconds) | 0.7345 | -0.17% |
> | | thread-pipes-12 (seconds) | 1.405833333333333 | (I) -4.12% |
> | | thread-pipes-21 (seconds) | 2.0113333333333334 | (I) -2.13% |
> | | thread-pipes-30 (seconds) | 2.6648333333333336 | (I) -3.78% |
> | | thread-pipes-48 (seconds) | 3.6341666666666668 | (I) -5.77% |
> | | thread-pipes-79 (seconds) | 4.4085 | (I) -5.31% |
> | | thread-pipes-110 (seconds) | 5.374666666666666 | (I) -6.12% |
> | | thread-pipes-141 (seconds) | 6.385666666666666 | (I) -4.00% |
> | | thread-pipes-172 (seconds) | 7.403000000000001 | (I) -3.01% |
> | | thread-pipes-203 (seconds) | 8.570333333333332 | (I) -2.62% |
> | | thread-pipes-234 (seconds) | 9.719166666666666 | (I) -2.00% |
> | | thread-pipes-256 (seconds) | 10.552833333333334 | (I) -2.30% |
> | | thread-sockets-1 (seconds) | 0.3065 | (R) 2.39% |
> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>
> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
> | mmtests/sysbench-mutex | sysbenchmutex-1 (usec) | 194.38333333333333 | -0.02% |
> | | sysbenchmutex-4 (usec) | 200.875 | -0.02% |
> | | sysbenchmutex-7 (usec) | 201.23000000000002 | 0.00% |
> | | sysbenchmutex-12 (usec) | 201.77666666666664 | 0.12% |
> | | sysbenchmutex-21 (usec) | 203.03 | -0.40% |
> | | sysbenchmutex-30 (usec) | 203.285 | 0.08% |
> | | sysbenchmutex-48 (usec) | 231.30000000000004 | 2.59% |
> | | sysbenchmutex-79 (usec) | 362.075 | -0.80% |
> | | sysbenchmutex-110 (usec) | 516.8233333333334 | -3.87% |
> | | sysbenchmutex-128 (usec) | 593.3533333333334 | (I) -4.46% |
> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>
> No regressions were observed with mm-selftests.
>
> [1] https://lore.kernel.org/all/679861e2e81b32a0ae08.1264054854@v2.random/
> [2] https://lore.kernel.org/all/1421999256-3881-1-git-send-email-ebru.akagunduz@gmail.com/
> [3] https://lore.kernel.org/all/20150123113701.GB5975@node.dhcp.inet.fi/
> [4] https://lore.kernel.org/all/20150123155802.GA7011@node.dhcp.inet.fi/
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> Based on mm-new.
>
> Not very sure of the tracing parts which this patch changes. I have kept
> the writable portion for the tracing to maintain backward compat, just
> dropped it as a collapse condition.
>
> include/trace/events/huge_memory.h | 2 +-
> mm/khugepaged.c | 11 +++--------
> 2 files changed, 4 insertions(+), 9 deletions(-)
>
> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
> index 2305df6cb485..f2472c1c132a 100644
> --- a/include/trace/events/huge_memory.h
> +++ b/include/trace/events/huge_memory.h
> @@ -19,7 +19,7 @@
> EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \
> EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \
> EM( SCAN_PTE_MAPPED_HUGEPAGE, "pte_mapped_hugepage") \
> - EM( SCAN_PAGE_RO, "no_writable_page") \
> + EM( SCAN_PAGE_RO, "no_writable_page") /* deprecated */ \
> EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \
> EM( SCAN_PAGE_NULL, "page_null") \
> EM( SCAN_SCAN_ABORT, "scan_aborted") \
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4ec324a4c1fe..5ef8482597a9 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -39,7 +39,7 @@ enum scan_result {
> SCAN_PTE_NON_PRESENT,
> SCAN_PTE_UFFD_WP,
> SCAN_PTE_MAPPED_HUGEPAGE,
> - SCAN_PAGE_RO,
> + SCAN_PAGE_RO, /* deprecated */
Why can't we remove that completely.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH] mm: Enable khugepaged to operate on non-writable VMAs
2025-09-01 8:23 ` Kiryl Shutsemau
@ 2025-09-01 8:32 ` David Hildenbrand
2025-09-01 13:15 ` Kiryl Shutsemau
2025-09-01 8:39 ` Dev Jain
1 sibling, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-09-01 8:32 UTC (permalink / raw)
To: Kiryl Shutsemau, Dev Jain
Cc: akpm, willy, hughd, ziy, baolin.wang, lorenzo.stoakes,
Liam.Howlett, npache, ryan.roberts, baohua, linux-mm,
linux-kernel
>> @@ -676,9 +676,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>> writable = true;
>> }
>>
>> - if (unlikely(!writable)) {
>> - result = SCAN_PAGE_RO;
>> - } else if (unlikely(cc->is_khugepaged && !referenced)) {
>> + if (unlikely(cc->is_khugepaged && !referenced)) {
>> result = SCAN_LACK_REFERENCED_PAGE;
>> } else {
>> result = SCAN_SUCCEED;
>> @@ -1421,9 +1419,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> mmu_notifier_test_young(vma->vm_mm, _address)))
>> referenced++;
>> }
>> - if (!writable) {
>> - result = SCAN_PAGE_RO;
>> - } else if (cc->is_khugepaged &&
>> + if (cc->is_khugepaged &&
>
> The only practical use of the writable is gone. The only other usage is
> tracing which can be dropped to as it is not actionable anymore.
>
> Could you drop writable? Maybe as a separate commit.
I think we should just do it in the same patch.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH] mm: Enable khugepaged to operate on non-writable VMAs
2025-09-01 8:23 ` Kiryl Shutsemau
2025-09-01 8:32 ` David Hildenbrand
@ 2025-09-01 8:39 ` Dev Jain
1 sibling, 0 replies; 10+ messages in thread
From: Dev Jain @ 2025-09-01 8:39 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: akpm, david, willy, hughd, ziy, baolin.wang, lorenzo.stoakes,
Liam.Howlett, npache, ryan.roberts, baohua, linux-mm,
linux-kernel
On 01/09/25 1:53 pm, Kiryl Shutsemau wrote:
> On Mon, Sep 01, 2025 at 01:18:17PM +0530, Dev Jain wrote:
>> Currently khugepaged does not collapse a region which does not have a
>> single writable page. This is wasteful since, apart from any non-writable
>> memory mapped by the application, there are a lot of non-writable VMAs
>> which will benefit from collapsing - the VMAs of the executable, those
>> of the glibc, vvar and vdso, which won't be unmapped during the lifetime
>> of the process, as opposed to other VMAs which maybe unmapped. Therefore,
>> remove this restriction and allow khugepaged to collapse a VMA with
>> arbitrary protections.
>>
>> Along with this, currently MADV_COLLAPSE does not perform a collapse on a
>> non-writable VMA, and this restriction is nowhere to be found on the
>> manpage - the restriction itself sounds wrong to me since the user knows
>> the protection of the memory it has mapped, so collapsing read-only
>> memory via madvise() should be a choice of the user which shouldn't
>> be overriden by the kernel.
>>
>> I dug into the history of this and couldn't find any concrete reason of
>> the current behaviour - [1] is the v1 of the original khugepaged patch
>> which required all ptes to be writable. [2] is the v1 of the patch which
>> changed this behaviour to require at least one pte to be writable. The
>> closest thing I could find was: in response to [2], Kirill says in [3] -
>> "As a side effect it will effectively allow collapse in PROT_READ vmas,
>> right? I'm not convinced it's a good idea." (Although Kirill realizes in
>> [4] that this was not the intention of the patch).
> Hm. I don't see a justification for only collapsing writable pages.
Glad we are on the same page :)
>
>> I can see performance improvements on mmtests run on an arm64 machine
>> comparing with 6.17-rc2. (I) denotes statistically significant improvement,
>> (R) denotes statistically significant regression (Please ignore the
>> numbers in the middle column):
> Could you give a summary instead of raw data? It is too much for commit
> message. Raw data can be put under "---" for reference.
Okay.
>
> BTW, why did you pick hackbench as a benchmark? Seems random.
I didn't actually run mmtests myself (in fact I have consistently failed to run mmtests - someday
I'll gather the courage to figure it out again :)) - I just start a test pipeline which runs tests
for me and gets me results. The pipeline ran sysbench-cpu, sysbench-mutex, sysbench-thread,
kernbench and hackbench for me. For the others not mentioned in the below table, I didn't get any
statistically significant results.
>
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>> | mmtests/hackbench | process-pipes-1 (seconds) | 0.145 | -0.06% |
>> | | process-pipes-4 (seconds) | 0.4335 | -0.27% |
>> | | process-pipes-7 (seconds) | 0.823 | (I) -12.13% |
>> | | process-pipes-12 (seconds) | 1.3538333333333334 | (I) -5.32% |
>> | | process-pipes-21 (seconds) | 1.8971666666666664 | (I) -2.87% |
>> | | process-pipes-30 (seconds) | 2.5023333333333335 | (I) -3.39% |
>> | | process-pipes-48 (seconds) | 3.4305 | (I) -5.65% |
>> | | process-pipes-79 (seconds) | 4.245833333333334 | (I) -6.74% |
>> | | process-pipes-110 (seconds) | 5.114833333333333 | (I) -6.26% |
>> | | process-pipes-141 (seconds) | 6.1885 | (I) -4.99% |
>> | | process-pipes-172 (seconds) | 7.231833333333334 | (I) -4.45% |
>> | | process-pipes-203 (seconds) | 8.393166666666668 | (I) -3.65% |
>> | | process-pipes-234 (seconds) | 9.487499999999999 | (I) -3.45% |
>> | | process-pipes-256 (seconds) | 10.316166666666666 | (I) -3.47% |
>> | | process-sockets-1 (seconds) | 0.289 | 2.13% |
>> | | process-sockets-4 (seconds) | 0.7596666666666666 | 1.02% |
>> | | process-sockets-7 (seconds) | 1.1663333333333334 | -0.26% |
>> | | process-sockets-12 (seconds) | 1.8641666666666665 | -1.24% |
>> | | process-sockets-21 (seconds) | 3.0773333333333333 | 0.01% |
>> | | process-sockets-30 (seconds) | 4.2405 | -0.15% |
>> | | process-sockets-48 (seconds) | 6.459666666666666 | 0.15% |
>> | | process-sockets-79 (seconds) | 10.156833333333333 | 1.45% |
>> | | process-sockets-110 (seconds) | 14.317833333333333 | -1.64% |
>> | | process-sockets-141 (seconds) | 20.8735 | (I) -4.27% |
>> | | process-sockets-172 (seconds) | 26.205333333333332 | 0.30% |
>> | | process-sockets-203 (seconds) | 31.298000000000002 | -1.71% |
>> | | process-sockets-234 (seconds) | 36.104000000000006 | -1.94% |
>> | | process-sockets-256 (seconds) | 39.44016666666667 | -0.71% |
>> | | thread-pipes-1 (seconds) | 0.17550000000000002 | 0.66% |
>> | | thread-pipes-4 (seconds) | 0.44716666666666666 | 1.66% |
>> | | thread-pipes-7 (seconds) | 0.7345 | -0.17% |
>> | | thread-pipes-12 (seconds) | 1.405833333333333 | (I) -4.12% |
>> | | thread-pipes-21 (seconds) | 2.0113333333333334 | (I) -2.13% |
>> | | thread-pipes-30 (seconds) | 2.6648333333333336 | (I) -3.78% |
>> | | thread-pipes-48 (seconds) | 3.6341666666666668 | (I) -5.77% |
>> | | thread-pipes-79 (seconds) | 4.4085 | (I) -5.31% |
>> | | thread-pipes-110 (seconds) | 5.374666666666666 | (I) -6.12% |
>> | | thread-pipes-141 (seconds) | 6.385666666666666 | (I) -4.00% |
>> | | thread-pipes-172 (seconds) | 7.403000000000001 | (I) -3.01% |
>> | | thread-pipes-203 (seconds) | 8.570333333333332 | (I) -2.62% |
>> | | thread-pipes-234 (seconds) | 9.719166666666666 | (I) -2.00% |
>> | | thread-pipes-256 (seconds) | 10.552833333333334 | (I) -2.30% |
>> | | thread-sockets-1 (seconds) | 0.3065 | (R) 2.39% |
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>>
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>> | mmtests/sysbench-mutex | sysbenchmutex-1 (usec) | 194.38333333333333 | -0.02% |
>> | | sysbenchmutex-4 (usec) | 200.875 | -0.02% |
>> | | sysbenchmutex-7 (usec) | 201.23000000000002 | 0.00% |
>> | | sysbenchmutex-12 (usec) | 201.77666666666664 | 0.12% |
>> | | sysbenchmutex-21 (usec) | 203.03 | -0.40% |
>> | | sysbenchmutex-30 (usec) | 203.285 | 0.08% |
>> | | sysbenchmutex-48 (usec) | 231.30000000000004 | 2.59% |
>> | | sysbenchmutex-79 (usec) | 362.075 | -0.80% |
>> | | sysbenchmutex-110 (usec) | 516.8233333333334 | -3.87% |
>> | | sysbenchmutex-128 (usec) | 593.3533333333334 | (I) -4.46% |
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>>
>> No regressions were observed with mm-selftests.
>>
>> [1] https://lore.kernel.org/all/679861e2e81b32a0ae08.1264054854@v2.random/
>> [2] https://lore.kernel.org/all/1421999256-3881-1-git-send-email-ebru.akagunduz@gmail.com/
>> [3] https://lore.kernel.org/all/20150123113701.GB5975@node.dhcp.inet.fi/
>> [4] https://lore.kernel.org/all/20150123155802.GA7011@node.dhcp.inet.fi/
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> Based on mm-new.
>>
>> Not very sure of the tracing parts which this patch changes. I have kept
>> the writable portion for the tracing to maintain backward compat, just
>> dropped it as a collapse condition.
>>
>> include/trace/events/huge_memory.h | 2 +-
>> mm/khugepaged.c | 11 +++--------
>> 2 files changed, 4 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
>> index 2305df6cb485..f2472c1c132a 100644
>> --- a/include/trace/events/huge_memory.h
>> +++ b/include/trace/events/huge_memory.h
>> @@ -19,7 +19,7 @@
>> EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \
>> EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \
>> EM( SCAN_PTE_MAPPED_HUGEPAGE, "pte_mapped_hugepage") \
>> - EM( SCAN_PAGE_RO, "no_writable_page") \
>> + EM( SCAN_PAGE_RO, "no_writable_page") /* deprecated */ \
> Why not remove SCAN_PAGE_RO?
I was wondering whether any userspace script could fail due to using this metric already?
>
>> EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \
>> EM( SCAN_PAGE_NULL, "page_null") \
>> EM( SCAN_SCAN_ABORT, "scan_aborted") \
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 4ec324a4c1fe..5ef8482597a9 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -39,7 +39,7 @@ enum scan_result {
>> SCAN_PTE_NON_PRESENT,
>> SCAN_PTE_UFFD_WP,
>> SCAN_PTE_MAPPED_HUGEPAGE,
>> - SCAN_PAGE_RO,
>> + SCAN_PAGE_RO, /* deprecated */
>> SCAN_LACK_REFERENCED_PAGE,
>> SCAN_PAGE_NULL,
>> SCAN_SCAN_ABORT,
>> @@ -676,9 +676,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>> writable = true;
>> }
>>
>> - if (unlikely(!writable)) {
>> - result = SCAN_PAGE_RO;
>> - } else if (unlikely(cc->is_khugepaged && !referenced)) {
>> + if (unlikely(cc->is_khugepaged && !referenced)) {
>> result = SCAN_LACK_REFERENCED_PAGE;
>> } else {
>> result = SCAN_SUCCEED;
>> @@ -1421,9 +1419,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>> mmu_notifier_test_young(vma->vm_mm, _address)))
>> referenced++;
>> }
>> - if (!writable) {
>> - result = SCAN_PAGE_RO;
>> - } else if (cc->is_khugepaged &&
>> + if (cc->is_khugepaged &&
> The only practical use of the writable is gone. The only other usage is
> tracing which can be dropped to as it is not actionable anymore.
>
> Could you drop writable? Maybe as a separate commit.
>
>> (!referenced ||
>> (unmapped && referenced < HPAGE_PMD_NR / 2))) {
>> result = SCAN_LACK_REFERENCED_PAGE;
>> @@ -2830,7 +2826,6 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>> case SCAN_PMD_NULL:
>> case SCAN_PTE_NON_PRESENT:
>> case SCAN_PTE_UFFD_WP:
>> - case SCAN_PAGE_RO:
>> case SCAN_LACK_REFERENCED_PAGE:
>> case SCAN_PAGE_NULL:
>> case SCAN_PAGE_COUNT:
>> --
>> 2.30.2
>>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH] mm: Enable khugepaged to operate on non-writable VMAs
2025-09-01 8:32 ` David Hildenbrand
@ 2025-09-01 8:50 ` Dev Jain
2025-09-01 15:26 ` Dev Jain
1 sibling, 0 replies; 10+ messages in thread
From: Dev Jain @ 2025-09-01 8:50 UTC (permalink / raw)
To: David Hildenbrand, akpm, kas, willy, hughd
Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, baohua, linux-mm, linux-kernel
On 01/09/25 2:02 pm, David Hildenbrand wrote:
> On 01.09.25 09:48, Dev Jain wrote:
>> Currently khugepaged does not collapse a region which does not have a
>> single writable page. This is wasteful since, apart from any
>> non-writable
>> memory mapped by the application, there are a lot of non-writable VMAs
>> which will benefit from collapsing - the VMAs of the executable, those
>> of the glibc, vvar and vdso, which won't be unmapped during the lifetime
>> of the process, as opposed to other VMAs which maybe unmapped.
>
> Are these anonymous folios? ("VMAs of the executable"), or is you
> description
> misleading?
Oops. I dropped writable everywhere and failed to notice that the
callsites are
all for anon collapse. So I'll only mention vvar, vdso and other
non-writable
VMAs in the description.
>
>> Therefore,
>> remove this restriction and allow khugepaged to collapse a VMA with
>> arbitrary protections.
>>
>> Along with this, currently MADV_COLLAPSE does not perform a collapse
>> on a
>> non-writable VMA, and this restriction is nowhere to be found on the
>> manpage - the restriction itself sounds wrong to me since the user knows
>> the protection of the memory it has mapped, so collapsing read-only
>> memory via madvise() should be a choice of the user which shouldn't
>> be overriden by the kernel.
>>
>> I dug into the history of this and couldn't find any concrete reason of
>> the current behaviour - [1] is the v1 of the original khugepaged patch
>> which required all ptes to be writable. [2] is the v1 of the patch which
>> changed this behaviour to require at least one pte to be writable. The
>> closest thing I could find was: in response to [2], Kirill says in [3] -
>> "As a side effect it will effectively allow collapse in PROT_READ vmas,
>> right? I'm not convinced it's a good idea." (Although Kirill realizes in
>> [4] that this was not the intention of the patch).
>>
>> I can see performance improvements on mmtests run on an arm64 machine
>> comparing with 6.17-rc2. (I) denotes statistically significant
>> improvement,
>> (R) denotes statistically significant regression (Please ignore the
>> numbers in the middle column):
>
> I once dug into that myself as well as part of
>
> commit 1bafe96e89f056cb6e25d47451fb16aee2c7c4d0
> Author: David Hildenbrand <david@redhat.com>
> Date: Wed Apr 24 14:26:30 2024 +0200
>
> mm/khugepaged: replace page_mapcount() check by
> folio_likely_mapped_shared()
>
> where I noted:
>
> Interestingly, khugepaged will only collapse an anonymous THP if
> at least
> one PTE is writable. After fork(), that means that something
> (usually a
> page fault) populated at least a single exclusive anonymous THP in
> that
> PMD range.
> The problem I was concerned with (also documented in that patch)
> should no
> longer apply ever since we changed how folio_maybe_mapped_shared()
> operates.
>
> So yes, I don't see a good reason to fail on R/O PTEs
>
>>
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>>
>> | mmtests/hackbench | process-pipes-1
>> (seconds) | 0.145
>> | -0.06% |
>> | | process-pipes-4
>> (seconds) | 0.4335
>> | -0.27% |
>> | | process-pipes-7
>> (seconds) | 0.823
>> | (I) -12.13% |
>> | | process-pipes-12
>> (seconds) | 1.3538333333333334
>> | (I) -5.32% |
>> | | process-pipes-21
>> (seconds) | 1.8971666666666664
>> | (I) -2.87% |
>> | | process-pipes-30
>> (seconds) | 2.5023333333333335
>> | (I) -3.39% |
>> | | process-pipes-48
>> (seconds) | 3.4305
>> | (I) -5.65% |
>> | | process-pipes-79
>> (seconds) | 4.245833333333334
>> | (I) -6.74% |
>> | | process-pipes-110
>> (seconds) | 5.114833333333333
>> | (I) -6.26% |
>> | | process-pipes-141
>> (seconds) | 6.1885
>> | (I) -4.99% |
>> | | process-pipes-172
>> (seconds) | 7.231833333333334
>> | (I) -4.45% |
>> | | process-pipes-203
>> (seconds) | 8.393166666666668
>> | (I) -3.65% |
>> | | process-pipes-234
>> (seconds) | 9.487499999999999
>> | (I) -3.45% |
>> | | process-pipes-256
>> (seconds) | 10.316166666666666
>> | (I) -3.47% |
>> | | process-sockets-1
>> (seconds) | 0.289
>> | 2.13% |
>> | | process-sockets-4
>> (seconds) | 0.7596666666666666
>> | 1.02% |
>> | | process-sockets-7
>> (seconds) | 1.1663333333333334
>> | -0.26% |
>> | | process-sockets-12
>> (seconds) | 1.8641666666666665
>> | -1.24% |
>> | | process-sockets-21
>> (seconds) | 3.0773333333333333
>> | 0.01% |
>> | | process-sockets-30
>> (seconds) | 4.2405
>> | -0.15% |
>> | | process-sockets-48
>> (seconds) | 6.459666666666666
>> | 0.15% |
>> | | process-sockets-79
>> (seconds) | 10.156833333333333
>> | 1.45% |
>> | | process-sockets-110
>> (seconds) | 14.317833333333333
>> | -1.64% |
>> | | process-sockets-141
>> (seconds) | 20.8735
>> | (I) -4.27% |
>> | | process-sockets-172
>> (seconds) | 26.205333333333332
>> | 0.30% |
>> | | process-sockets-203
>> (seconds) | 31.298000000000002
>> | -1.71% |
>> | | process-sockets-234
>> (seconds) | 36.104000000000006
>> | -1.94% |
>> | | process-sockets-256
>> (seconds) | 39.44016666666667
>> | -0.71% |
>> | | thread-pipes-1
>> (seconds) | 0.17550000000000002
>> | 0.66% |
>> | | thread-pipes-4
>> (seconds) | 0.44716666666666666
>> | 1.66% |
>> | | thread-pipes-7
>> (seconds) | 0.7345
>> | -0.17% |
>> | | thread-pipes-12
>> (seconds) | 1.405833333333333
>> | (I) -4.12% |
>> | | thread-pipes-21
>> (seconds) | 2.0113333333333334
>> | (I) -2.13% |
>> | | thread-pipes-30
>> (seconds) | 2.6648333333333336
>> | (I) -3.78% |
>> | | thread-pipes-48
>> (seconds) | 3.6341666666666668
>> | (I) -5.77% |
>> | | thread-pipes-79
>> (seconds) | 4.4085
>> | (I) -5.31% |
>> | | thread-pipes-110
>> (seconds) | 5.374666666666666
>> | (I) -6.12% |
>> | | thread-pipes-141
>> (seconds) | 6.385666666666666
>> | (I) -4.00% |
>> | | thread-pipes-172
>> (seconds) | 7.403000000000001
>> | (I) -3.01% |
>> | | thread-pipes-203
>> (seconds) | 8.570333333333332
>> | (I) -2.62% |
>> | | thread-pipes-234
>> (seconds) | 9.719166666666666
>> | (I) -2.00% |
>> | | thread-pipes-256
>> (seconds) | 10.552833333333334
>> | (I) -2.30% |
>> | | thread-sockets-1
>> (seconds) | 0.3065
>> | (R) 2.39% |
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>>
>>
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>>
>> | mmtests/sysbench-mutex | sysbenchmutex-1
>> (usec) | 194.38333333333333
>> | -0.02% |
>> | | sysbenchmutex-4
>> (usec) | 200.875
>> | -0.02% |
>> | | sysbenchmutex-7
>> (usec) | 201.23000000000002
>> | 0.00% |
>> | | sysbenchmutex-12
>> (usec) | 201.77666666666664
>> | 0.12% |
>> | | sysbenchmutex-21
>> (usec) | 203.03
>> | -0.40% |
>> | | sysbenchmutex-30
>> (usec) | 203.285
>> | 0.08% |
>> | | sysbenchmutex-48
>> (usec) | 231.30000000000004
>> | 2.59% |
>> | | sysbenchmutex-79
>> (usec) | 362.075
>> | -0.80% |
>> | | sysbenchmutex-110
>> (usec) | 516.8233333333334
>> | -3.87% |
>> | | sysbenchmutex-128
>> (usec) | 593.3533333333334
>> | (I) -4.46% |
>> +------------------------------------+----------------------------------------------------------+-----------------------+--------------------------+
>>
>>
>> No regressions were observed with mm-selftests.
>>
>> [1]
>> https://lore.kernel.org/all/679861e2e81b32a0ae08.1264054854@v2.random/
>> [2]
>> https://lore.kernel.org/all/1421999256-3881-1-git-send-email-ebru.akagunduz@gmail.com/
>> [3] https://lore.kernel.org/all/20150123113701.GB5975@node.dhcp.inet.fi/
>> [4] https://lore.kernel.org/all/20150123155802.GA7011@node.dhcp.inet.fi/
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> Based on mm-new.
>>
>> Not very sure of the tracing parts which this patch changes. I have kept
>> the writable portion for the tracing to maintain backward compat, just
>> dropped it as a collapse condition.
>>
>> include/trace/events/huge_memory.h | 2 +-
>> mm/khugepaged.c | 11 +++--------
>> 2 files changed, 4 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/trace/events/huge_memory.h
>> b/include/trace/events/huge_memory.h
>> index 2305df6cb485..f2472c1c132a 100644
>> --- a/include/trace/events/huge_memory.h
>> +++ b/include/trace/events/huge_memory.h
>> @@ -19,7 +19,7 @@
>> EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \
>> EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \
>> EM( SCAN_PTE_MAPPED_HUGEPAGE, "pte_mapped_hugepage") \
>> - EM( SCAN_PAGE_RO, "no_writable_page") \
>> + EM( SCAN_PAGE_RO, "no_writable_page") /* deprecated */ \
>> EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \
>> EM( SCAN_PAGE_NULL, "page_null") \
>> EM( SCAN_SCAN_ABORT, "scan_aborted") \
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 4ec324a4c1fe..5ef8482597a9 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -39,7 +39,7 @@ enum scan_result {
>> SCAN_PTE_NON_PRESENT,
>> SCAN_PTE_UFFD_WP,
>> SCAN_PTE_MAPPED_HUGEPAGE,
>> - SCAN_PAGE_RO,
>> + SCAN_PAGE_RO, /* deprecated */
>
> Why can't we remove that completely.
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH] mm: Enable khugepaged to operate on non-writable VMAs
2025-09-01 8:32 ` David Hildenbrand
@ 2025-09-01 13:15 ` Kiryl Shutsemau
2025-09-01 13:43 ` David Hildenbrand
0 siblings, 1 reply; 10+ messages in thread
From: Kiryl Shutsemau @ 2025-09-01 13:15 UTC (permalink / raw)
To: David Hildenbrand
Cc: Dev Jain, akpm, willy, hughd, ziy, baolin.wang, lorenzo.stoakes,
Liam.Howlett, npache, ryan.roberts, baohua, linux-mm,
linux-kernel
On Mon, Sep 01, 2025 at 10:32:34AM +0200, David Hildenbrand wrote:
>
> > > @@ -676,9 +676,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > > writable = true;
> > > }
> > > - if (unlikely(!writable)) {
> > > - result = SCAN_PAGE_RO;
> > > - } else if (unlikely(cc->is_khugepaged && !referenced)) {
> > > + if (unlikely(cc->is_khugepaged && !referenced)) {
> > > result = SCAN_LACK_REFERENCED_PAGE;
> > > } else {
> > > result = SCAN_SUCCEED;
> > > @@ -1421,9 +1419,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > > mmu_notifier_test_young(vma->vm_mm, _address)))
> > > referenced++;
> > > }
> > > - if (!writable) {
> > > - result = SCAN_PAGE_RO;
> > > - } else if (cc->is_khugepaged &&
> > > + if (cc->is_khugepaged &&
> >
> > The only practical use of the writable is gone. The only other usage is
> > tracing which can be dropped to as it is not actionable anymore.
> >
> > Could you drop writable? Maybe as a separate commit.
>
> I think we should just do it in the same patch.
Change in trace_mm_collapse_huge_page_isolate() interface doesn't belong
to the same patch in my view.
--
Kiryl Shutsemau / Kirill A. Shutemov
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH] mm: Enable khugepaged to operate on non-writable VMAs
2025-09-01 13:15 ` Kiryl Shutsemau
@ 2025-09-01 13:43 ` David Hildenbrand
0 siblings, 0 replies; 10+ messages in thread
From: David Hildenbrand @ 2025-09-01 13:43 UTC (permalink / raw)
To: Kiryl Shutsemau
Cc: Dev Jain, akpm, willy, hughd, ziy, baolin.wang, lorenzo.stoakes,
Liam.Howlett, npache, ryan.roberts, baohua, linux-mm,
linux-kernel
On 01.09.25 15:15, Kiryl Shutsemau wrote:
> On Mon, Sep 01, 2025 at 10:32:34AM +0200, David Hildenbrand wrote:
>>
>>>> @@ -676,9 +676,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>> writable = true;
>>>> }
>>>> - if (unlikely(!writable)) {
>>>> - result = SCAN_PAGE_RO;
>>>> - } else if (unlikely(cc->is_khugepaged && !referenced)) {
>>>> + if (unlikely(cc->is_khugepaged && !referenced)) {
>>>> result = SCAN_LACK_REFERENCED_PAGE;
>>>> } else {
>>>> result = SCAN_SUCCEED;
>>>> @@ -1421,9 +1419,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>> mmu_notifier_test_young(vma->vm_mm, _address)))
>>>> referenced++;
>>>> }
>>>> - if (!writable) {
>>>> - result = SCAN_PAGE_RO;
>>>> - } else if (cc->is_khugepaged &&
>>>> + if (cc->is_khugepaged &&
>>>
>>> The only practical use of the writable is gone. The only other usage is
>>> tracing which can be dropped to as it is not actionable anymore.
>>>
>>> Could you drop writable? Maybe as a separate commit.
>>
>> I think we should just do it in the same patch.
>
> Change in trace_mm_collapse_huge_page_isolate() interface doesn't belong
> to the same patch in my view.
We frequently adjust tracing code (which is specific to khugepaged in
this case) in the same patch.
Anyhow, no strong opinion.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH] mm: Enable khugepaged to operate on non-writable VMAs
2025-09-01 8:32 ` David Hildenbrand
2025-09-01 8:50 ` Dev Jain
@ 2025-09-01 15:26 ` Dev Jain
2025-09-01 15:35 ` David Hildenbrand
1 sibling, 1 reply; 10+ messages in thread
From: Dev Jain @ 2025-09-01 15:26 UTC (permalink / raw)
To: David Hildenbrand, akpm, kas, willy, hughd
Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, baohua, linux-mm, linux-kernel
On 01/09/25 2:02 pm, David Hildenbrand wrote:
> On 01.09.25 09:48, Dev Jain wrote:
>>
>> @@ -19,7 +19,7 @@
>> EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \
>> EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \
>> EM( SCAN_PTE_MAPPED_HUGEPAGE, "pte_mapped_hugepage") \
>> - EM( SCAN_PAGE_RO, "no_writable_page") \
>> + EM( SCAN_PAGE_RO, "no_writable_page") /* deprecated */ \
>> EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \
>> EM( SCAN_PAGE_NULL, "page_null") \
>> EM( SCAN_SCAN_ABORT, "scan_aborted") \
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 4ec324a4c1fe..5ef8482597a9 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -39,7 +39,7 @@ enum scan_result {
>> SCAN_PTE_NON_PRESENT,
>> SCAN_PTE_UFFD_WP,
>> SCAN_PTE_MAPPED_HUGEPAGE,
>> - SCAN_PAGE_RO,
>> + SCAN_PAGE_RO, /* deprecated */
>
> Why can't we remove that completely.
(I raised this query in the other mail but due to not snipping stuff in
between,
it may have been glossed over)
I was wondering whether a userspace script could break which assumes
scan_page_ro
metric is there? I played with tracing long time back so I don't remember.
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH] mm: Enable khugepaged to operate on non-writable VMAs
2025-09-01 15:26 ` Dev Jain
@ 2025-09-01 15:35 ` David Hildenbrand
0 siblings, 0 replies; 10+ messages in thread
From: David Hildenbrand @ 2025-09-01 15:35 UTC (permalink / raw)
To: Dev Jain, akpm, kas, willy, hughd
Cc: ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, baohua, linux-mm, linux-kernel
On 01.09.25 17:26, Dev Jain wrote:
>
> On 01/09/25 2:02 pm, David Hildenbrand wrote:
>> On 01.09.25 09:48, Dev Jain wrote:
>>>
>>> @@ -19,7 +19,7 @@
>>> EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \
>>> EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \
>>> EM( SCAN_PTE_MAPPED_HUGEPAGE, "pte_mapped_hugepage") \
>>> - EM( SCAN_PAGE_RO, "no_writable_page") \
>>> + EM( SCAN_PAGE_RO, "no_writable_page") /* deprecated */ \
>>> EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \
>>> EM( SCAN_PAGE_NULL, "page_null") \
>>> EM( SCAN_SCAN_ABORT, "scan_aborted") \
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 4ec324a4c1fe..5ef8482597a9 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -39,7 +39,7 @@ enum scan_result {
>>> SCAN_PTE_NON_PRESENT,
>>> SCAN_PTE_UFFD_WP,
>>> SCAN_PTE_MAPPED_HUGEPAGE,
>>> - SCAN_PAGE_RO,
>>> + SCAN_PAGE_RO, /* deprecated */
>>
>> Why can't we remove that completely.
>
> (I raised this query in the other mail but due to not snipping stuff in
> between,
>
I saw it but forgot to reply :)
> it may have been glossed over)
>
> I was wondering whether a userspace script could break which assumes
> scan_page_ro
>
> metric is there? I played with tracing long time back so I don't remember.
If it's part of a stable abi/api, then it could be a problem indeed.
I don't think trace events are in general not stable abi/api. In
practice I think some are (e.g., in scheduler).
I assume we can safely drop it, just like we can add new stuff.
After all, if "SCAN_PAGE_RO" never happens anymore, we would never be
running into the __print_symbolic() with that value.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2025-09-01 15:36 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-01 7:48 [RFC PATCH] mm: Enable khugepaged to operate on non-writable VMAs Dev Jain
2025-09-01 8:23 ` Kiryl Shutsemau
2025-09-01 8:32 ` David Hildenbrand
2025-09-01 13:15 ` Kiryl Shutsemau
2025-09-01 13:43 ` David Hildenbrand
2025-09-01 8:39 ` Dev Jain
2025-09-01 8:32 ` David Hildenbrand
2025-09-01 8:50 ` Dev Jain
2025-09-01 15:26 ` Dev Jain
2025-09-01 15:35 ` David Hildenbrand
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).