* [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes
@ 2022-10-21 10:11 David Hildenbrand
2022-10-21 10:11 ` [PATCH v2 1/9] selftests/vm: add test to measure MADV_UNMERGEABLE performance David Hildenbrand
` (9 more replies)
0 siblings, 10 replies; 12+ messages in thread
From: David Hildenbrand @ 2022-10-21 10:11 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, David Hildenbrand, Andrew Morton, Shuah Khan,
Hugh Dickins, Vlastimil Babka, Peter Xu, Andrea Arcangeli,
Matthew Wilcox (Oracle), Jason Gunthorpe, John Hubbard
This series cleans up and fixes break_ksm(). In summary, we no longer
use fake write faults to break COW but instead use FAULT_FLAG_UNSHARE.
Further, we move away from using follow_page() -- which we can hopefully
remove completely at some point -- and use the new walk_page_range_vma()
instead. Fortunately, we can now get rid of VM_FAULT_WRITE and
FOLL_MIGRATION in common code.
Extend the existing ksm tests by an unmerge benchmark and some new
unmerge tests.
Add a selftest to measure MADV_UNMERGEABLE performance. In my setup
(AMD Ryzen 9 3900X), running the KSM selftest to test unmerge performance
on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this series results in a
performance degradation of ~6% -- 7% (old: ~5250 MiB/s, new: ~4900 MiB/s).
I don't think we particularly care for now, but it's good to be aware
of the implication.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Xu <peterx@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
v1 -> v2:
* "selftests/vm: add KSM unmerge tests"
-> Add new unmerge tests
* "mm/ksm: fix KSM COW breaking with userfaultfd-wp via FAULT_FLAG_UNSHARE"
-> Simplify patch description now that we have a selftest
* "mm/pagewalk: don't trigger test_walk() in walk_page_vma()"
-> Added
* "mm/pagewalk: add walk_page_range_vma()"
-> Don't call test_walk()
* "mm/ksm: convert break_ksm() to use walk_page_range_vma()"
-> Simplify and fix missing unlock, fix missing "static"
David Hildenbrand (9):
selftests/vm: add test to measure MADV_UNMERGEABLE performance
mm/ksm: simplify break_ksm() to not rely on VM_FAULT_WRITE
mm: remove VM_FAULT_WRITE
selftests/vm: add KSM unmerge tests
mm/ksm: fix KSM COW breaking with userfaultfd-wp via
FAULT_FLAG_UNSHARE
mm/pagewalk: don't trigger test_walk() in walk_page_vma()
mm/pagewalk: add walk_page_range_vma()
mm/ksm: convert break_ksm() to use walk_page_range_vma()
mm/gup: remove FOLL_MIGRATION
include/linux/mm.h | 1 -
include/linux/mm_types.h | 3 -
include/linux/pagewalk.h | 5 +
mm/gup.c | 55 +---
mm/huge_memory.c | 2 +-
mm/ksm.c | 78 +++--
mm/memory.c | 9 +-
mm/pagewalk.c | 27 +-
tools/testing/selftests/vm/Makefile | 2 +
.../selftests/vm/ksm_functional_tests.c | 279 ++++++++++++++++++
tools/testing/selftests/vm/ksm_tests.c | 76 ++++-
tools/testing/selftests/vm/run_vmtests.sh | 2 +
tools/testing/selftests/vm/vm_util.c | 10 +
tools/testing/selftests/vm/vm_util.h | 1 +
14 files changed, 456 insertions(+), 94 deletions(-)
create mode 100644 tools/testing/selftests/vm/ksm_functional_tests.c
base-commit: 9abf2313adc1ca1b6180c508c25f22f9395cc780
--
2.37.3
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v2 1/9] selftests/vm: add test to measure MADV_UNMERGEABLE performance
2022-10-21 10:11 [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes David Hildenbrand
@ 2022-10-21 10:11 ` David Hildenbrand
2022-10-21 10:11 ` [PATCH v2 2/9] mm/ksm: simplify break_ksm() to not rely on VM_FAULT_WRITE David Hildenbrand
` (8 subsequent siblings)
9 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2022-10-21 10:11 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, David Hildenbrand, Andrew Morton, Shuah Khan,
Hugh Dickins, Vlastimil Babka, Peter Xu, Andrea Arcangeli,
Matthew Wilcox (Oracle), Jason Gunthorpe, John Hubbard
Let's add a test to measure the performance of KSM breaking that is not
triggered via COW, but by disabling KSM on an area filled with KSM pages
via MADV_UNMERGEABLE.
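Conceptually, the benchmark boils down to the following sequence (a minimal
userspace sketch with setup and error handling omitted; the actual selftest
code is in the diff below):
	/* Assumes 'map' is a MAP_PRIVATE|MAP_ANONYMOUS mapping of 'size' bytes. */
	struct timespec start, end;
	memset(map, 0xcf, size);		/* identical content, so ksmd can merge */
	madvise(map, size, MADV_MERGEABLE);
	/* ... wait for ksmd to complete a couple of full scans ... */
	clock_gettime(CLOCK_MONOTONIC_RAW, &start);
	madvise(map, size, MADV_UNMERGEABLE);	/* the operation being timed */
	clock_gettime(CLOCK_MONOTONIC_RAW, &end);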
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
tools/testing/selftests/vm/ksm_tests.c | 76 +++++++++++++++++++++++++-
1 file changed, 74 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/vm/ksm_tests.c b/tools/testing/selftests/vm/ksm_tests.c
index 0d85be2350fa..f9eb4d67e0dd 100644
--- a/tools/testing/selftests/vm/ksm_tests.c
+++ b/tools/testing/selftests/vm/ksm_tests.c
@@ -40,6 +40,7 @@ enum ksm_test_name {
CHECK_KSM_NUMA_MERGE,
KSM_MERGE_TIME,
KSM_MERGE_TIME_HUGE_PAGES,
+ KSM_UNMERGE_TIME,
KSM_COW_TIME
};
@@ -108,7 +109,10 @@ static void print_help(void)
" -P evaluate merging time and speed.\n"
" For this test, the size of duplicated memory area (in MiB)\n"
" must be provided using -s option\n"
- " -H evaluate merging time and speed of area allocated mostly with huge pages\n"
+ " -H evaluate merging time and speed of area allocated mostly with huge pages\n"
+ " For this test, the size of duplicated memory area (in MiB)\n"
+ " must be provided using -s option\n"
+ " -D evaluate unmerging time and speed when disabling KSM.\n"
" For this test, the size of duplicated memory area (in MiB)\n"
" must be provided using -s option\n"
" -C evaluate the time required to break COW of merged pages.\n\n");
@@ -188,6 +192,16 @@ static int ksm_merge_pages(void *addr, size_t size, struct timespec start_time,
return 0;
}
+static int ksm_unmerge_pages(void *addr, size_t size,
+ struct timespec start_time, int timeout)
+{
+ if (madvise(addr, size, MADV_UNMERGEABLE)) {
+ perror("madvise");
+ return 1;
+ }
+ return 0;
+}
+
static bool assert_ksm_pages_count(long dupl_page_count)
{
unsigned long max_page_sharing, pages_sharing, pages_shared;
@@ -560,6 +574,53 @@ static int ksm_merge_time(int mapping, int prot, int timeout, size_t map_size)
return KSFT_FAIL;
}
+static int ksm_unmerge_time(int mapping, int prot, int timeout, size_t map_size)
+{
+ void *map_ptr;
+ struct timespec start_time, end_time;
+ unsigned long scan_time_ns;
+
+ map_size *= MB;
+
+ map_ptr = allocate_memory(NULL, prot, mapping, '*', map_size);
+ if (!map_ptr)
+ return KSFT_FAIL;
+ if (clock_gettime(CLOCK_MONOTONIC_RAW, &start_time)) {
+ perror("clock_gettime");
+ goto err_out;
+ }
+ if (ksm_merge_pages(map_ptr, map_size, start_time, timeout))
+ goto err_out;
+
+ if (clock_gettime(CLOCK_MONOTONIC_RAW, &start_time)) {
+ perror("clock_gettime");
+ goto err_out;
+ }
+ if (ksm_unmerge_pages(map_ptr, map_size, start_time, timeout))
+ goto err_out;
+ if (clock_gettime(CLOCK_MONOTONIC_RAW, &end_time)) {
+ perror("clock_gettime");
+ goto err_out;
+ }
+
+ scan_time_ns = (end_time.tv_sec - start_time.tv_sec) * NSEC_PER_SEC +
+ (end_time.tv_nsec - start_time.tv_nsec);
+
+ printf("Total size: %lu MiB\n", map_size / MB);
+ printf("Total time: %ld.%09ld s\n", scan_time_ns / NSEC_PER_SEC,
+ scan_time_ns % NSEC_PER_SEC);
+ printf("Average speed: %.3f MiB/s\n", (map_size / MB) /
+ ((double)scan_time_ns / NSEC_PER_SEC));
+
+ munmap(map_ptr, map_size);
+ return KSFT_PASS;
+
+err_out:
+ printf("Not OK\n");
+ munmap(map_ptr, map_size);
+ return KSFT_FAIL;
+}
+
static int ksm_cow_time(int mapping, int prot, int timeout, size_t page_size)
{
void *map_ptr;
@@ -644,7 +705,7 @@ int main(int argc, char *argv[])
bool merge_across_nodes = KSM_MERGE_ACROSS_NODES_DEFAULT;
long size_MB = 0;
- while ((opt = getopt(argc, argv, "ha:p:l:z:m:s:MUZNPCH")) != -1) {
+ while ((opt = getopt(argc, argv, "ha:p:l:z:m:s:MUZNPCHD")) != -1) {
switch (opt) {
case 'a':
prot = str_to_prot(optarg);
@@ -701,6 +762,9 @@ int main(int argc, char *argv[])
case 'H':
test_name = KSM_MERGE_TIME_HUGE_PAGES;
break;
+ case 'D':
+ test_name = KSM_UNMERGE_TIME;
+ break;
case 'C':
test_name = KSM_COW_TIME;
break;
@@ -762,6 +826,14 @@ int main(int argc, char *argv[])
ret = ksm_merge_hugepages_time(MAP_PRIVATE | MAP_ANONYMOUS, prot,
ksm_scan_limit_sec, size_MB);
break;
+ case KSM_UNMERGE_TIME:
+ if (size_MB == 0) {
+ printf("Option '-s' is required.\n");
+ return KSFT_FAIL;
+ }
+ ret = ksm_unmerge_time(MAP_PRIVATE | MAP_ANONYMOUS, prot,
+ ksm_scan_limit_sec, size_MB);
+ break;
case KSM_COW_TIME:
ret = ksm_cow_time(MAP_PRIVATE | MAP_ANONYMOUS, prot, ksm_scan_limit_sec,
page_size);
--
2.37.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 2/9] mm/ksm: simplify break_ksm() to not rely on VM_FAULT_WRITE
2022-10-21 10:11 [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes David Hildenbrand
2022-10-21 10:11 ` [PATCH v2 1/9] selftests/vm: add test to measure MADV_UNMERGEABLE performance David Hildenbrand
@ 2022-10-21 10:11 ` David Hildenbrand
2022-10-21 10:11 ` [PATCH v2 3/9] mm: remove VM_FAULT_WRITE David Hildenbrand
` (7 subsequent siblings)
9 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2022-10-21 10:11 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, David Hildenbrand, Andrew Morton, Shuah Khan,
Hugh Dickins, Vlastimil Babka, Peter Xu, Andrea Arcangeli,
Matthew Wilcox (Oracle), Jason Gunthorpe, John Hubbard
Now that GUP no longer requires VM_FAULT_WRITE, break_ksm() is the sole
remaining user of VM_FAULT_WRITE. As we also want to stop triggering a
fake write fault and instead use FAULT_FLAG_UNSHARE -- similar to
GUP-triggered unsharing when taking a R/O pin on a shared anonymous page
(including KSM pages) -- let's stop relying on VM_FAULT_WRITE.
Let's rework break_ksm() to no longer rely on the return value of
handle_mm_fault() to figure out whether COW-breaking was
successful. Instead, simply perform another follow_page() lookup to verify
the result.
While this makes break_ksm() slightly less efficient, we can simplify
handle_mm_fault() a little and easily switch to FAULT_FLAG_UNSHARE
without introducing similar KSM-specific behavior for
FAULT_FLAG_UNSHARE.
In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test
unmerge performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this
results in a performance degradation of ~4% -- 5% (old: ~5250 MiB/s,
new: ~5010 MiB/s).
I don't think that we particularly care about that performance drop when
unmerging. If it ever turns out to be an actual performance issue, we can
think about a better alternative for FAULT_FLAG_UNSHARE -- let's just keep
it simple for now.
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/ksm.c | 25 +++++++++++++------------
1 file changed, 13 insertions(+), 12 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index c19fcca9bc03..b884a22f3c3c 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -440,26 +440,27 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
vm_fault_t ret = 0;
do {
+ bool ksm_page = false;
+
cond_resched();
page = follow_page(vma, addr,
FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
if (IS_ERR_OR_NULL(page))
break;
if (PageKsm(page))
- ret = handle_mm_fault(vma, addr,
- FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE,
- NULL);
- else
- ret = VM_FAULT_WRITE;
+ ksm_page = true;
put_page(page);
- } while (!(ret & (VM_FAULT_WRITE | VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | VM_FAULT_OOM)));
+
+ if (!ksm_page)
+ return 0;
+ ret = handle_mm_fault(vma, addr,
+ FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE,
+ NULL);
+ } while (!(ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | VM_FAULT_OOM)));
/*
- * We must loop because handle_mm_fault() may back out if there's
- * any difficulty e.g. if pte accessed bit gets updated concurrently.
- *
- * VM_FAULT_WRITE is what we have been hoping for: it indicates that
- * COW has been broken, even if the vma does not permit VM_WRITE;
- * but note that a concurrent fault might break PageKsm for us.
+ * We must loop until we no longer find a KSM page because
+ * handle_mm_fault() may back out if there's any difficulty e.g. if
+ * pte accessed bit gets updated concurrently.
*
* VM_FAULT_SIGBUS could occur if we race with truncation of the
* backing file, which also invalidates anonymous pages: that's
--
2.37.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 3/9] mm: remove VM_FAULT_WRITE
2022-10-21 10:11 [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes David Hildenbrand
2022-10-21 10:11 ` [PATCH v2 1/9] selftests/vm: add test to measure MADV_UNMERGEABLE performance David Hildenbrand
2022-10-21 10:11 ` [PATCH v2 2/9] mm/ksm: simplify break_ksm() to not rely on VM_FAULT_WRITE David Hildenbrand
@ 2022-10-21 10:11 ` David Hildenbrand
2022-10-21 10:11 ` [PATCH v2 4/9] selftests/vm: add KSM unmerge tests David Hildenbrand
` (6 subsequent siblings)
9 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2022-10-21 10:11 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, David Hildenbrand, Andrew Morton, Shuah Khan,
Hugh Dickins, Vlastimil Babka, Peter Xu, Andrea Arcangeli,
Matthew Wilcox (Oracle), Jason Gunthorpe, John Hubbard
All users -- GUP and KSM -- are gone, so let's just remove it.
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/mm_types.h | 3 ---
mm/huge_memory.c | 2 +-
mm/memory.c | 9 ++++-----
3 files changed, 5 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..6bc3baced3e3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -847,7 +847,6 @@ typedef __bitwise unsigned int vm_fault_t;
* @VM_FAULT_OOM: Out Of Memory
* @VM_FAULT_SIGBUS: Bad access
* @VM_FAULT_MAJOR: Page read from storage
- * @VM_FAULT_WRITE: Special case for get_user_pages
* @VM_FAULT_HWPOISON: Hit poisoned small page
* @VM_FAULT_HWPOISON_LARGE: Hit poisoned large page. Index encoded
* in upper bits
@@ -868,7 +867,6 @@ enum vm_fault_reason {
VM_FAULT_OOM = (__force vm_fault_t)0x000001,
VM_FAULT_SIGBUS = (__force vm_fault_t)0x000002,
VM_FAULT_MAJOR = (__force vm_fault_t)0x000004,
- VM_FAULT_WRITE = (__force vm_fault_t)0x000008,
VM_FAULT_HWPOISON = (__force vm_fault_t)0x000010,
VM_FAULT_HWPOISON_LARGE = (__force vm_fault_t)0x000020,
VM_FAULT_SIGSEGV = (__force vm_fault_t)0x000040,
@@ -894,7 +892,6 @@ enum vm_fault_reason {
{ VM_FAULT_OOM, "OOM" }, \
{ VM_FAULT_SIGBUS, "SIGBUS" }, \
{ VM_FAULT_MAJOR, "MAJOR" }, \
- { VM_FAULT_WRITE, "WRITE" }, \
{ VM_FAULT_HWPOISON, "HWPOISON" }, \
{ VM_FAULT_HWPOISON_LARGE, "HWPOISON_LARGE" }, \
{ VM_FAULT_SIGSEGV, "SIGSEGV" }, \
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1cc4a5f4791e..be13fe55b798 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1379,7 +1379,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
spin_unlock(vmf->ptl);
- return VM_FAULT_WRITE;
+ return 0;
}
unlock_fallback:
diff --git a/mm/memory.c b/mm/memory.c
index f88c351aecd4..8e72f703ed99 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3242,7 +3242,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
}
delayacct_wpcopy_end();
- return (page_copied && !unshare) ? VM_FAULT_WRITE : 0;
+ return 0;
oom_free_new:
put_page(new_page);
oom:
@@ -3306,14 +3306,14 @@ static vm_fault_t wp_pfn_shared(struct vm_fault *vmf)
return finish_mkwrite_fault(vmf);
}
wp_page_reuse(vmf);
- return VM_FAULT_WRITE;
+ return 0;
}
static vm_fault_t wp_page_shared(struct vm_fault *vmf)
__releases(vmf->ptl)
{
struct vm_area_struct *vma = vmf->vma;
- vm_fault_t ret = VM_FAULT_WRITE;
+ vm_fault_t ret = 0;
get_page(vmf->page);
@@ -3464,7 +3464,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
return 0;
}
wp_page_reuse(vmf);
- return VM_FAULT_WRITE;
+ return 0;
} else if (unshare) {
/* No anonymous page -> nothing to do. */
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -3983,7 +3983,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (vmf->flags & FAULT_FLAG_WRITE) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
vmf->flags &= ~FAULT_FLAG_WRITE;
- ret |= VM_FAULT_WRITE;
}
rmap_flags |= RMAP_EXCLUSIVE;
}
--
2.37.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 4/9] selftests/vm: add KSM unmerge tests
2022-10-21 10:11 [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes David Hildenbrand
` (2 preceding siblings ...)
2022-10-21 10:11 ` [PATCH v2 3/9] mm: remove VM_FAULT_WRITE David Hildenbrand
@ 2022-10-21 10:11 ` David Hildenbrand
2022-10-21 10:11 ` [PATCH v2 5/9] mm/ksm: fix KSM COW breaking with userfaultfd-wp via FAULT_FLAG_UNSHARE David Hildenbrand
` (5 subsequent siblings)
9 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2022-10-21 10:11 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, David Hildenbrand, Andrew Morton, Shuah Khan,
Hugh Dickins, Vlastimil Babka, Peter Xu, Andrea Arcangeli,
Matthew Wilcox (Oracle), Jason Gunthorpe, John Hubbard
Let's add three unmerge tests (MADV_UNMERGEABLE unmerging all pages in the
range).
test_unmerge(): basic unmerge tests
test_unmerge_discarded(): have some pte_none() entries in the range
test_unmerge_uffd_wp(): protect the merged pages using uffd-wp
ksm_tests.c currently contains a mixture of benchmarks and tests,
whereby each test is carried out by executing the ksm_tests binary with
specific parameters. Let's add a new ksm_functional_tests.c that performs
multiple, smaller functional tests all at once.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
tools/testing/selftests/vm/Makefile | 2 +
.../selftests/vm/ksm_functional_tests.c | 279 ++++++++++++++++++
tools/testing/selftests/vm/run_vmtests.sh | 2 +
tools/testing/selftests/vm/vm_util.c | 10 +
tools/testing/selftests/vm/vm_util.h | 1 +
5 files changed, 294 insertions(+)
create mode 100644 tools/testing/selftests/vm/ksm_functional_tests.c
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 163c2fde3cb3..2d640a48255c 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -52,6 +52,7 @@ TEST_GEN_FILES += userfaultfd
TEST_GEN_PROGS += soft-dirty
TEST_GEN_PROGS += split_huge_page_test
TEST_GEN_FILES += ksm_tests
+TEST_GEN_PROGS += ksm_functional_tests
ifeq ($(MACHINE),x86_64)
CAN_BUILD_I386 := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_32bit_program.c -m32)
@@ -96,6 +97,7 @@ TEST_FILES += va_128TBswitch.sh
include ../lib.mk
$(OUTPUT)/khugepaged: vm_util.c
+$(OUTPUT)/ksm_functional_tests: vm_util.c
$(OUTPUT)/madv_populate: vm_util.c
$(OUTPUT)/soft-dirty: vm_util.c
$(OUTPUT)/split_huge_page_test: vm_util.c
diff --git a/tools/testing/selftests/vm/ksm_functional_tests.c b/tools/testing/selftests/vm/ksm_functional_tests.c
new file mode 100644
index 000000000000..96644be68962
--- /dev/null
+++ b/tools/testing/selftests/vm/ksm_functional_tests.c
@@ -0,0 +1,279 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * KSM functional tests
+ *
+ * Copyright 2022, Red Hat, Inc.
+ *
+ * Author(s): David Hildenbrand <david@redhat.com>
+ */
+#define _GNU_SOURCE
+#include <stdlib.h>
+#include <string.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <linux/userfaultfd.h>
+
+#include "../kselftest.h"
+#include "vm_util.h"
+
+#define KiB 1024u
+#define MiB (1024 * KiB)
+
+static int ksm_fd;
+static int ksm_full_scans_fd;
+static int pagemap_fd;
+static size_t pagesize;
+
+static bool range_maps_duplicates(char *addr, unsigned long size)
+{
+ unsigned long offs_a, offs_b, pfn_a, pfn_b;
+
+ /*
+ * There is no easy way to check if there are KSM pages mapped into
+ * this range. We only check that the range does not map the same PFN
+ * twice by comparing each pair of mapped pages.
+ */
+ for (offs_a = 0; offs_a < size; offs_a += pagesize) {
+ pfn_a = pagemap_get_pfn(pagemap_fd, addr + offs_a);
+ /* Page not present or PFN not exposed by the kernel. */
+ if (pfn_a == -1ull || !pfn_a)
+ continue;
+
+ for (offs_b = offs_a + pagesize; offs_b < size;
+ offs_b += pagesize) {
+ pfn_b = pagemap_get_pfn(pagemap_fd, addr + offs_b);
+ if (pfn_b == -1ull || !pfn_b)
+ continue;
+ if (pfn_a == pfn_b)
+ return true;
+ }
+ }
+ return false;
+}
+
+static long ksm_get_full_scans(void)
+{
+ char buf[10];
+ ssize_t ret;
+
+ ret = pread(ksm_full_scans_fd, buf, sizeof(buf) - 1, 0);
+ if (ret <= 0)
+ return -errno;
+ buf[ret] = 0;
+
+ return strtol(buf, NULL, 10);
+}
+
+static int ksm_merge(void)
+{
+ long start_scans, end_scans;
+
+ /* Wait for two full scans such that any possible merging happened. */
+ start_scans = ksm_get_full_scans();
+ if (start_scans < 0)
+ return start_scans;
+ if (write(ksm_fd, "1", 1) != 1)
+ return -errno;
+ do {
+ end_scans = ksm_get_full_scans();
+ if (end_scans < 0)
+ return end_scans;
+ } while (end_scans < start_scans + 2);
+
+ return 0;
+}
+
+static char *mmap_and_merge_range(char val, unsigned long size)
+{
+ char *map;
+
+ map = mmap(NULL, size, PROT_READ|PROT_WRITE,
+ MAP_PRIVATE|MAP_ANON, -1, 0);
+ if (map == MAP_FAILED) {
+ ksft_test_result_fail("mmap() failed\n");
+ return MAP_FAILED;
+ }
+
+ /* Don't use THP. Ignore if THP are not around on a kernel. */
+ if (madvise(map, size, MADV_NOHUGEPAGE) && errno != EINVAL) {
+ ksft_test_result_fail("MADV_NOHUGEPAGE failed\n");
+ goto unmap;
+ }
+
+ /* Make sure each page contains the same values to merge them. */
+ memset(map, val, size);
+ if (madvise(map, size, MADV_MERGEABLE)) {
+ ksft_test_result_fail("MADV_MERGEABLE failed\n");
+ goto unmap;
+ }
+
+ /* Run KSM to trigger merging and wait. */
+ if (ksm_merge()) {
+ ksft_test_result_fail("Running KSM failed\n");
+ goto unmap;
+ }
+ return map;
+unmap:
+ munmap(map, size);
+ return MAP_FAILED;
+}
+
+static void test_unmerge(void)
+{
+ const unsigned int size = 2 * MiB;
+ char *map;
+
+ ksft_print_msg("[RUN] %s\n", __func__);
+
+ map = mmap_and_merge_range(0xcf, size);
+ if (map == MAP_FAILED)
+ return;
+
+ if (madvise(map, size, MADV_UNMERGEABLE)) {
+ ksft_test_result_fail("MADV_UNMERGEABLE failed\n");
+ goto unmap;
+ }
+
+ ksft_test_result(!range_maps_duplicates(map, size),
+ "Pages were unmerged\n");
+unmap:
+ munmap(map, size);
+}
+
+static void test_unmerge_discarded(void)
+{
+ const unsigned int size = 2 * MiB;
+ char *map;
+
+ ksft_print_msg("[RUN] %s\n", __func__);
+
+ map = mmap_and_merge_range(0xcf, size);
+ if (map == MAP_FAILED)
+ return;
+
+ /* Discard half of all mapped pages so we have pte_none() entries. */
+ if (madvise(map, size / 2, MADV_DONTNEED)) {
+ ksft_test_result_fail("MADV_DONTNEED failed\n");
+ goto unmap;
+ }
+
+ if (madvise(map, size, MADV_UNMERGEABLE)) {
+ ksft_test_result_fail("MADV_UNMERGEABLE failed\n");
+ goto unmap;
+ }
+
+ ksft_test_result(!range_maps_duplicates(map, size),
+ "Pages were unmerged\n");
+unmap:
+ munmap(map, size);
+}
+
+#ifdef __NR_userfaultfd
+static void test_unmerge_uffd_wp(void)
+{
+ struct uffdio_writeprotect uffd_writeprotect;
+ struct uffdio_register uffdio_register;
+ const unsigned int size = 2 * MiB;
+ struct uffdio_api uffdio_api;
+ char *map;
+ int uffd;
+
+ ksft_print_msg("[RUN] %s\n", __func__);
+
+ map = mmap_and_merge_range(0xcf, size);
+ if (map == MAP_FAILED)
+ return;
+
+ /* See if UFFD is around. */
+ uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+ if (uffd < 0) {
+ ksft_test_result_skip("__NR_userfaultfd failed\n");
+ goto unmap;
+ }
+
+ /* See if UFFD-WP is around. */
+ uffdio_api.api = UFFD_API;
+ uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
+ if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
+ ksft_test_result_fail("UFFDIO_API failed\n");
+ goto close_uffd;
+ }
+ if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
+ ksft_test_result_skip("UFFD_FEATURE_PAGEFAULT_FLAG_WP not available\n");
+ goto close_uffd;
+ }
+
+ /* Register UFFD-WP, no need for an actual handler. */
+ uffdio_register.range.start = (unsigned long) map;
+ uffdio_register.range.len = size;
+ uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
+ if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) {
+ ksft_test_result_fail("UFFDIO_REGISTER_MODE_WP failed\n");
+ goto close_uffd;
+ }
+
+ /* Write-protect the range using UFFD-WP. */
+ uffd_writeprotect.range.start = (unsigned long) map;
+ uffd_writeprotect.range.len = size;
+ uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP;
+ if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) {
+ ksft_test_result_fail("UFFDIO_WRITEPROTECT failed\n");
+ goto close_uffd;
+ }
+
+ if (madvise(map, size, MADV_UNMERGEABLE)) {
+ ksft_test_result_fail("MADV_UNMERGEABLE failed\n");
+ goto close_uffd;
+ }
+
+ ksft_test_result(!range_maps_duplicates(map, size),
+ "Pages were unmerged\n");
+close_uffd:
+ close(uffd);
+unmap:
+ munmap(map, size);
+}
+#endif
+
+int main(int argc, char **argv)
+{
+ unsigned int tests = 2;
+ int err;
+
+#ifdef __NR_userfaultfd
+ tests++;
+#endif
+
+ ksft_print_header();
+ ksft_set_plan(tests);
+
+ pagesize = getpagesize();
+
+ ksm_fd = open("/sys/kernel/mm/ksm/run", O_RDWR);
+ if (ksm_fd < 0)
+ ksft_exit_skip("open(\"/sys/kernel/mm/ksm/run\") failed\n");
+ ksm_full_scans_fd = open("/sys/kernel/mm/ksm/full_scans", O_RDONLY);
+ if (ksm_full_scans_fd < 0)
+ ksft_exit_skip("open(\"/sys/kernel/mm/ksm/full_scans\") failed\n");
+ pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+ if (pagemap_fd < 0)
+ ksft_exit_skip("open(\"/proc/self/pagemap\") failed\n");
+
+ test_unmerge();
+ test_unmerge_discarded();
+#ifdef __NR_userfaultfd
+ test_unmerge_uffd_wp();
+#endif
+
+ err = ksft_get_fail_cnt();
+ if (err)
+ ksft_exit_fail_msg("%d out of %d tests failed\n",
+ err, ksft_test_num());
+ return ksft_exit_pass();
+}
diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh
index e780e76c26b8..b8950891259b 100755
--- a/tools/testing/selftests/vm/run_vmtests.sh
+++ b/tools/testing/selftests/vm/run_vmtests.sh
@@ -184,6 +184,8 @@ run_test ./ksm_tests -N -m 1
# KSM test with 2 NUMA nodes and merge_across_nodes = 0
run_test ./ksm_tests -N -m 0
+run_test ./ksm_functional_tests
+
# protection_keys tests
if [ -x ./protection_keys_32 ]
then
diff --git a/tools/testing/selftests/vm/vm_util.c b/tools/testing/selftests/vm/vm_util.c
index f11f8adda521..dbd8889324e6 100644
--- a/tools/testing/selftests/vm/vm_util.c
+++ b/tools/testing/selftests/vm/vm_util.c
@@ -28,6 +28,16 @@ bool pagemap_is_softdirty(int fd, char *start)
return entry & 0x0080000000000000ull;
}
+unsigned long pagemap_get_pfn(int fd, char *start)
+{
+ uint64_t entry = pagemap_get_entry(fd, start);
+
+ /* If present (bit 63), the PFN is stored in bits 0 -- 54. */
+ if (entry & 0x8000000000000000ull)
+ return entry & 0x007fffffffffffffull;
+ return -1ull;
+}
+
void clear_softdirty(void)
{
int ret;
diff --git a/tools/testing/selftests/vm/vm_util.h b/tools/testing/selftests/vm/vm_util.h
index 5c35de454e08..acecb5b6f8ca 100644
--- a/tools/testing/selftests/vm/vm_util.h
+++ b/tools/testing/selftests/vm/vm_util.h
@@ -4,6 +4,7 @@
uint64_t pagemap_get_entry(int fd, char *start);
bool pagemap_is_softdirty(int fd, char *start);
+unsigned long pagemap_get_pfn(int fd, char *start);
void clear_softdirty(void);
bool check_for_pattern(FILE *fp, const char *pattern, char *buf, size_t len);
uint64_t read_pmd_pagesize(void);
--
2.37.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 5/9] mm/ksm: fix KSM COW breaking with userfaultfd-wp via FAULT_FLAG_UNSHARE
2022-10-21 10:11 [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes David Hildenbrand
` (3 preceding siblings ...)
2022-10-21 10:11 ` [PATCH v2 4/9] selftests/vm: add KSM unmerge tests David Hildenbrand
@ 2022-10-21 10:11 ` David Hildenbrand
2022-10-21 10:11 ` [PATCH v2 6/9] mm/pagewalk: don't trigger test_walk() in walk_page_vma() David Hildenbrand
` (4 subsequent siblings)
9 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2022-10-21 10:11 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, David Hildenbrand, Andrew Morton, Shuah Khan,
Hugh Dickins, Vlastimil Babka, Peter Xu, Andrea Arcangeli,
Matthew Wilcox (Oracle), Jason Gunthorpe, John Hubbard
Let's stop breaking COW via a fake write fault and let's use
FAULT_FLAG_UNSHARE instead. This avoids any wrong side effects of the fake
write fault, such as mapping the PTE writable and marking the PTE
dirty/softdirty.
Consequently, we will no longer trigger a fake write fault, and we will
break COW without any such side effects.
Also, this fixes KSM interaction with userfaultfd-wp: when we have a KSM
page that's write-protected by userfaultfd, break_ksm()->handle_mm_fault()
will fail with VM_FAULT_SIGBUS, and break_ksm() will simply return 0
instead of actually breaking COW.
For now, the KSM unmerge tests can trigger that:
$ sudo ./ksm_functional_tests
TAP version 13
1..3
# [RUN] test_unmerge
ok 1 Pages were unmerged
# [RUN] test_unmerge_discarded
ok 2 Pages were unmerged
# [RUN] test_unmerge_uffd_wp
not ok 3 Pages were unmerged
Bail out! 1 out of 3 tests failed
# Planned tests != run tests (2 != 3)
# Totals: pass:2 fail:1 xfail:0 xpass:0 skip:0 error:0
The warning in dmesg also indicates this wrong handling:
[ 230.096368] FAULT_FLAG_ALLOW_RETRY missing 881
[ 230.100822] CPU: 1 PID: 1643 Comm: ksm-uffd-wp [...]
[ 230.110124] Hardware name: [...]
[ 230.117775] Call Trace:
[ 230.120227] <TASK>
[ 230.122334] dump_stack_lvl+0x44/0x5c
[ 230.126010] handle_userfault.cold+0x14/0x19
[ 230.130281] ? tlb_finish_mmu+0x65/0x170
[ 230.134207] ? uffd_wp_range+0x65/0xa0
[ 230.137959] ? _raw_spin_unlock+0x15/0x30
[ 230.141972] ? do_wp_page+0x50/0x590
[ 230.145551] __handle_mm_fault+0x9f5/0xf50
[ 230.149652] ? mmput+0x1f/0x40
[ 230.152712] handle_mm_fault+0xb9/0x2a0
[ 230.156550] break_ksm+0x141/0x180
[ 230.159964] unmerge_ksm_pages+0x60/0x90
[ 230.163890] ksm_madvise+0x3c/0xb0
[ 230.167295] do_madvise.part.0+0x10c/0xeb0
[ 230.171396] ? do_syscall_64+0x67/0x80
[ 230.175157] __x64_sys_madvise+0x5a/0x70
[ 230.179082] do_syscall_64+0x58/0x80
[ 230.182661] ? do_syscall_64+0x67/0x80
[ 230.186413] entry_SYSCALL_64_after_hwframe+0x63/0xcd
This is primarily a fix for KSM+userfaultfd-wp; however, the fake write
fault was always questionable. As this fix is not easy to backport and it's
not very critical, let's not cc stable.
Fixes: 529b930b87d9 ("userfaultfd: wp: hook userfault handler to write protection fault")
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/ksm.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index b884a22f3c3c..c6f58aa6e731 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -420,17 +420,15 @@ static inline bool ksm_test_exit(struct mm_struct *mm)
}
/*
- * We use break_ksm to break COW on a ksm page: it's a stripped down
+ * We use break_ksm to break COW on a ksm page by triggering unsharing,
+ * such that the ksm page will get replaced by an exclusive anonymous page.
*
- * if (get_user_pages(addr, 1, FOLL_WRITE, &page, NULL) == 1)
- * put_page(page);
- *
- * but taking great care only to touch a ksm page, in a VM_MERGEABLE vma,
+ * We take great care only to touch a ksm page, in a VM_MERGEABLE vma,
* in case the application has unmapped and remapped mm,addr meanwhile.
* Could a ksm page appear anywhere else? Actually yes, in a VM_PFNMAP
* mmap of /dev/mem, where we would not want to touch it.
*
- * FAULT_FLAG/FOLL_REMOTE are because we do this outside the context
+ * FAULT_FLAG_REMOTE/FOLL_REMOTE are because we do this outside the context
* of the process that owns 'vma'. We also do not want to enforce
* protection keys here anyway.
*/
@@ -454,7 +452,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
if (!ksm_page)
return 0;
ret = handle_mm_fault(vma, addr,
- FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE,
+ FAULT_FLAG_UNSHARE | FAULT_FLAG_REMOTE,
NULL);
} while (!(ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | VM_FAULT_OOM)));
/*
--
2.37.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 6/9] mm/pagewalk: don't trigger test_walk() in walk_page_vma()
2022-10-21 10:11 [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes David Hildenbrand
` (4 preceding siblings ...)
2022-10-21 10:11 ` [PATCH v2 5/9] mm/ksm: fix KSM COW breaking with userfaultfd-wp via FAULT_FLAG_UNSHARE David Hildenbrand
@ 2022-10-21 10:11 ` David Hildenbrand
2022-10-21 10:11 ` [PATCH v2 7/9] mm/pagewalk: add walk_page_range_vma() David Hildenbrand
` (3 subsequent siblings)
9 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2022-10-21 10:11 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, David Hildenbrand, Andrew Morton, Shuah Khan,
Hugh Dickins, Vlastimil Babka, Peter Xu, Andrea Arcangeli,
Matthew Wilcox (Oracle), Jason Gunthorpe, John Hubbard
As Peter points out, the caller passes a single VMA and can just perform
the test_walk() check itself.
And in fact, no existing users rely on test_walk() getting called. So let's
just remove it and make the implementation slightly more efficient.
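For illustration, a caller that previously relied on test_walk() filtering
can now check the VMA itself before calling walk_page_vma() (a hypothetical
sketch; 'my_ops' is a made-up ops table without a test_walk callback):
	if (vma->vm_flags & VM_PFNMAP)
		return 0;			/* caller skips the VMA up front */
	return walk_page_vma(vma, &my_ops, NULL);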
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/pagewalk.h | 2 ++
mm/pagewalk.c | 7 -------
2 files changed, 2 insertions(+), 7 deletions(-)
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index f3fafb731ffd..37dc0208862d 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -27,6 +27,8 @@ struct mm_walk;
* "do page table walk over the current vma", returning
* a negative value means "abort current page table walk
* right now" and returning 1 means "skip the current vma"
+ * Note that this callback is not called when the caller
+ * passes in a single VMA as for walk_page_vma().
* @pre_vma: if set, called before starting walk on a non-null vma.
* @post_vma: if set, called after a walk on a non-null vma, provided
* that @pre_vma and the vma walk succeeded.
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 2ff3a5bebceb..0a5d71aaf9c7 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -526,18 +526,11 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
.vma = vma,
.private = private,
};
- int err;
if (!walk.mm)
return -EINVAL;
mmap_assert_locked(walk.mm);
-
- err = walk_page_test(vma->vm_start, vma->vm_end, &walk);
- if (err > 0)
- return 0;
- if (err < 0)
- return err;
return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
}
--
2.37.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 7/9] mm/pagewalk: add walk_page_range_vma()
2022-10-21 10:11 [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes David Hildenbrand
` (5 preceding siblings ...)
2022-10-21 10:11 ` [PATCH v2 6/9] mm/pagewalk: don't trigger test_walk() in walk_page_vma() David Hildenbrand
@ 2022-10-21 10:11 ` David Hildenbrand
2022-10-21 10:11 ` [PATCH v2 8/9] mm/ksm: convert break_ksm() to use walk_page_range_vma() David Hildenbrand
` (2 subsequent siblings)
9 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2022-10-21 10:11 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, David Hildenbrand, Andrew Morton, Shuah Khan,
Hugh Dickins, Vlastimil Babka, Peter Xu, Andrea Arcangeli,
Matthew Wilcox (Oracle), Jason Gunthorpe, John Hubbard
Let's add walk_page_range_vma(), which is similar to walk_page_vma() but
is only interested in a subset of the VMA range.
It will be used in KSM code next, to stop using follow_page().
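A rough usage sketch (hypothetical callback and call site; assumes the
caller holds the mmap lock in read mode):
	static int my_pmd_entry(pmd_t *pmd, unsigned long addr,
				unsigned long next, struct mm_walk *walk)
	{
		/* Inspect [addr, next) of walk->vma; non-zero stops the walk. */
		return 0;
	}
	static const struct mm_walk_ops my_ops = {
		.pmd_entry = my_pmd_entry,
	};
	/* Walk only a single page of the VMA instead of the whole VMA range: */
	int ret = walk_page_range_vma(vma, addr, addr + PAGE_SIZE, &my_ops, NULL);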
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/pagewalk.h | 3 +++
mm/pagewalk.c | 20 ++++++++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 37dc0208862d..959f52e5867d 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -101,6 +101,9 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
unsigned long end, const struct mm_walk_ops *ops,
pgd_t *pgd,
void *private);
+int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, const struct mm_walk_ops *ops,
+ void *private);
int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
void *private);
int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 0a5d71aaf9c7..7f1c9b274906 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -517,6 +517,26 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
return walk_pgd_range(start, end, &walk);
}
+int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, const struct mm_walk_ops *ops,
+ void *private)
+{
+ struct mm_walk walk = {
+ .ops = ops,
+ .mm = vma->vm_mm,
+ .vma = vma,
+ .private = private,
+ };
+
+ if (start >= end || !walk.mm)
+ return -EINVAL;
+ if (start < vma->vm_start || end > vma->vm_end)
+ return -EINVAL;
+
+ mmap_assert_locked(walk.mm);
+ return __walk_page_range(start, end, &walk);
+}
+
int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
void *private)
{
--
2.37.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 8/9] mm/ksm: convert break_ksm() to use walk_page_range_vma()
2022-10-21 10:11 [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes David Hildenbrand
` (6 preceding siblings ...)
2022-10-21 10:11 ` [PATCH v2 7/9] mm/pagewalk: add walk_page_range_vma() David Hildenbrand
@ 2022-10-21 10:11 ` David Hildenbrand
2022-10-21 10:11 ` [PATCH v2 9/9] mm/gup: remove FOLL_MIGRATION David Hildenbrand
2022-10-21 19:57 ` [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes Andrew Morton
9 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2022-10-21 10:11 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, David Hildenbrand, Andrew Morton, Shuah Khan,
Hugh Dickins, Vlastimil Babka, Peter Xu, Andrea Arcangeli,
Matthew Wilcox (Oracle), Jason Gunthorpe, John Hubbard
FOLL_MIGRATION exists only for the purpose of break_ksm(), and actually,
there is not even the need to wait for the migration to finish; we only
want to know whether we're dealing with a KSM page.
Using follow_page() just to identify a KSM page overcomplicates GUP
code. Let's use walk_page_range_vma() instead, because we don't actually
care about the page itself; we only need to know a single property -- no
need to even grab a reference.
So, get rid of the follow_page() usage such that we can get rid of
FOLL_MIGRATION now and eventually be able to get rid of follow_page() in
the future.
In my setup (AMD Ryzen 9 3900X), running the KSM selftest to test unmerge
performance on 2 GiB (taskset 0x8 ./ksm_tests -D -s 2048), this results in
a performance degradation of ~2% (old: ~5010 MiB/s, new: ~4900 MiB/s).
I don't think we particularly care for now.
Interestingly, the benchmark reduction stems from the single callback alone:
adding a second callback (e.g., pud_entry()) reduces the benchmark by
another 100-200 MiB/s.
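For example (a hypothetical variant, NOT what this patch adds), extending
the ops table as follows would trigger an additional indirect call per walk:
	static const struct mm_walk_ops break_ksm_ops = {
		.pud_entry = break_ksm_pud_entry,	/* hypothetical extra callback */
		.pmd_entry = break_ksm_pmd_entry,
	};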
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/ksm.c | 49 +++++++++++++++++++++++++++++++++++++++----------
1 file changed, 39 insertions(+), 10 deletions(-)
diff --git a/mm/ksm.c b/mm/ksm.c
index c6f58aa6e731..5cdb852ff132 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -39,6 +39,7 @@
#include <linux/freezer.h>
#include <linux/oom.h>
#include <linux/numa.h>
+#include <linux/pagewalk.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -419,6 +420,39 @@ static inline bool ksm_test_exit(struct mm_struct *mm)
return atomic_read(&mm->mm_users) == 0;
}
+static int break_ksm_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long next,
+ struct mm_walk *walk)
+{
+ struct page *page = NULL;
+ spinlock_t *ptl;
+ pte_t *pte;
+ int ret;
+
+ if (pmd_leaf(*pmd) || !pmd_present(*pmd))
+ return 0;
+
+ pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (pte_present(*pte)) {
+ page = vm_normal_page(walk->vma, addr, *pte);
+ } else if (!pte_none(*pte)) {
+ swp_entry_t entry = pte_to_swp_entry(*pte);
+
+ /*
+ * As KSM pages remain KSM pages until freed, no need to wait
+ * here for migration to end.
+ */
+ if (is_migration_entry(entry))
+ page = pfn_swap_entry_to_page(entry);
+ }
+ ret = page && PageKsm(page);
+ pte_unmap_unlock(pte, ptl);
+ return ret;
+}
+
+static const struct mm_walk_ops break_ksm_ops = {
+ .pmd_entry = break_ksm_pmd_entry,
+};
+
/*
* We use break_ksm to break COW on a ksm page by triggering unsharing,
* such that the ksm page will get replaced by an exclusive anonymous page.
@@ -434,21 +468,16 @@ static inline bool ksm_test_exit(struct mm_struct *mm)
*/
static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
{
- struct page *page;
vm_fault_t ret = 0;
do {
- bool ksm_page = false;
+ int ksm_page;
cond_resched();
- page = follow_page(vma, addr,
- FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
- if (IS_ERR_OR_NULL(page))
- break;
- if (PageKsm(page))
- ksm_page = true;
- put_page(page);
-
+ ksm_page = walk_page_range_vma(vma, addr, addr + 1,
+ &break_ksm_ops, NULL);
+ if (WARN_ON_ONCE(ksm_page < 0))
+ return ksm_page;
if (!ksm_page)
return 0;
ret = handle_mm_fault(vma, addr,
--
2.37.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 9/9] mm/gup: remove FOLL_MIGRATION
2022-10-21 10:11 [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes David Hildenbrand
` (7 preceding siblings ...)
2022-10-21 10:11 ` [PATCH v2 8/9] mm/ksm: convert break_ksm() to use walk_page_range_vma() David Hildenbrand
@ 2022-10-21 10:11 ` David Hildenbrand
2022-10-21 19:57 ` [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes Andrew Morton
9 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2022-10-21 10:11 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, David Hildenbrand, Andrew Morton, Shuah Khan,
Hugh Dickins, Vlastimil Babka, Peter Xu, Andrea Arcangeli,
Matthew Wilcox (Oracle), Jason Gunthorpe, John Hubbard
Fortunately, the last user (KSM) is gone, so let's just remove this
rather special code from generic GUP handling -- especially because KSM
never required the PMD handling as KSM only deals with individual base
pages.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/mm.h | 1 -
mm/gup.c | 55 +++++-----------------------------------------
2 files changed, 5 insertions(+), 51 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..a63415ac9dc2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2950,7 +2950,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
* and return without waiting upon it */
#define FOLL_NOFAULT 0x80 /* do not fault in pages */
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
-#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
#define FOLL_ANON 0x8000 /* don't do file mappings */
diff --git a/mm/gup.c b/mm/gup.c
index fe195d47de74..bcb46e9d496e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -549,30 +549,13 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
return no_page_table(vma, flags);
}
-retry:
if (unlikely(pmd_bad(*pmd)))
return no_page_table(vma, flags);
ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
pte = *ptep;
- if (!pte_present(pte)) {
- swp_entry_t entry;
- /*
- * KSM's break_ksm() relies upon recognizing a ksm page
- * even while it is being migrated, so for that case we
- * need migration_entry_wait().
- */
- if (likely(!(flags & FOLL_MIGRATION)))
- goto no_page;
- if (pte_none(pte))
- goto no_page;
- entry = pte_to_swp_entry(pte);
- if (!is_migration_entry(entry))
- goto no_page;
- pte_unmap_unlock(ptep, ptl);
- migration_entry_wait(mm, pmd, address);
- goto retry;
- }
+ if (!pte_present(pte))
+ goto no_page;
if (pte_protnone(pte) && !gup_can_follow_protnone(flags))
goto no_page;
@@ -694,28 +677,8 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
return page;
return no_page_table(vma, flags);
}
-retry:
- if (!pmd_present(pmdval)) {
- /*
- * Should never reach here, if thp migration is not supported;
- * Otherwise, it must be a thp migration entry.
- */
- VM_BUG_ON(!thp_migration_supported() ||
- !is_pmd_migration_entry(pmdval));
-
- if (likely(!(flags & FOLL_MIGRATION)))
- return no_page_table(vma, flags);
-
- pmd_migration_entry_wait(mm, pmd);
- pmdval = READ_ONCE(*pmd);
- /*
- * MADV_DONTNEED may convert the pmd to null because
- * mmap_lock is held in read mode
- */
- if (pmd_none(pmdval))
- return no_page_table(vma, flags);
- goto retry;
- }
+ if (!pmd_present(pmdval))
+ return no_page_table(vma, flags);
if (pmd_devmap(pmdval)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
@@ -729,18 +692,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
if (pmd_protnone(pmdval) && !gup_can_follow_protnone(flags))
return no_page_table(vma, flags);
-retry_locked:
ptl = pmd_lock(mm, pmd);
- if (unlikely(pmd_none(*pmd))) {
- spin_unlock(ptl);
- return no_page_table(vma, flags);
- }
if (unlikely(!pmd_present(*pmd))) {
spin_unlock(ptl);
- if (likely(!(flags & FOLL_MIGRATION)))
- return no_page_table(vma, flags);
- pmd_migration_entry_wait(mm, pmd);
- goto retry_locked;
+ return no_page_table(vma, flags);
}
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(ptl);
--
2.37.3
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes
2022-10-21 10:11 [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes David Hildenbrand
` (8 preceding siblings ...)
2022-10-21 10:11 ` [PATCH v2 9/9] mm/gup: remove FOLL_MIGRATION David Hildenbrand
@ 2022-10-21 19:57 ` Andrew Morton
2022-10-24 13:32 ` David Hildenbrand
9 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2022-10-21 19:57 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, Shuah Khan, Hugh Dickins, Vlastimil Babka,
Peter Xu, Andrea Arcangeli, Matthew Wilcox (Oracle),
Jason Gunthorpe, John Hubbard
On Fri, 21 Oct 2022 12:11:32 +0200 David Hildenbrand <david@redhat.com> wrote:
> This series cleans up and fixes break_ksm().
Quite a lot of fixups were needed merging this. I guess you couldn't
develop against mm-unstable because the v1 series was already in there.
For this reason I'll henceforth be more inclined to drop series when
I know a full resend is coming out.
So please do let me know when a full resend is coming out. Or, of
course, send little fixes against the current version.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes
2022-10-21 19:57 ` [PATCH v2 0/9] mm/ksm: break_ksm() cleanups and fixes Andrew Morton
@ 2022-10-24 13:32 ` David Hildenbrand
0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2022-10-24 13:32 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, Shuah Khan, Hugh Dickins, Vlastimil Babka,
Peter Xu, Andrea Arcangeli, Matthew Wilcox (Oracle),
Jason Gunthorpe, John Hubbard
On 21.10.22 21:57, Andrew Morton wrote:
> On Fri, 21 Oct 2022 12:11:32 +0200 David Hildenbrand <david@redhat.com> wrote:
>
>> This series cleans up and fixes break_ksm().
>
> Quite a lot of fixups were needed merging this. I guess you couldn't
> develop against mm-unstable because the v1 series was already in there.
Nowadays, I tend to send against mm-stable (I remember that was the
suggestion). Usually it works because there are no conflicts -- this
time there are probably quite a few kvm unit test conflicts.
Feel free to ask me next time to rebase on XYZ so I can make your life
easier ;)
>
> For this reason I'll henceforth be more inclined to drop serieses when
> I know a full resend is coming out.
Yes, good idea. While the fixup-patch process works for small
adjustments, it's not a good fit for bigger changes, especially once new
patches are involved.
>
> So please do let me know when a full resend is coming out.
Will do; I kind-of did that [1] but I should have been more clear
("Please drop the current series, I'll send a new version.").
> Or, of
> course, send little fixes against the current version.
I prefer a full resend when there are bigger changes that involve
modifications to multiple patches ... which also makes life easier for
reviewers.
[1]
https://lore.kernel.org/all/87104912-6615-4917-eae1-6ae0a80677e1@redhat.com/T/#u
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 12+ messages in thread