* [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
@ 2025-02-28 2:30 Mathieu Desnoyers
2025-02-28 2:30 ` [RFC PATCH 1/2] mm: Introduce " Mathieu Desnoyers
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28 2:30 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Mathieu Desnoyers, Linus Torvalds, Matthew Wilcox,
Olivier Dion, linux-mm
This series introduces SKSM, a new page deduplication ABI,
aiming to fix the limitations inherent to the KSM ABI.
The implementation is simple enough: SKSM is implemented in about 100
LOC compared to 2.5k LOC for KSM (on top of the common KSM helpers).
This is sent as a proof of concept. It applies on top of v6.13.
Feedback is welcome!
Mathieu
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Olivier Dion <odion@efficios.com>
Cc: linux-mm@kvack.org
Mathieu Desnoyers (2):
mm: Introduce SKSM: Synchronous Kernel Samepage Merging
selftests/sksm: Introduce SKSM basic test
include/linux/ksm.h | 4 +
include/linux/mm_types.h | 7 +
include/linux/page-flags.h | 42 ++++
include/linux/sksm.h | 27 +++
include/uapi/asm-generic/mman-common.h | 2 +
mm/Kconfig | 5 +
mm/Makefile | 1 +
mm/ksm-common.h | 228 ++++++++++++++++++++++
mm/ksm.c | 219 +--------------------
mm/madvise.c | 6 +
mm/memory.c | 2 +
mm/page_alloc.c | 3 +
mm/sksm.c | 190 ++++++++++++++++++
tools/testing/selftests/sksm/.gitignore | 2 +
tools/testing/selftests/sksm/Makefile | 14 ++
tools/testing/selftests/sksm/basic_test.c | 217 ++++++++++++++++++++
16 files changed, 751 insertions(+), 218 deletions(-)
create mode 100644 include/linux/sksm.h
create mode 100644 mm/ksm-common.h
create mode 100644 mm/sksm.c
create mode 100644 tools/testing/selftests/sksm/.gitignore
create mode 100644 tools/testing/selftests/sksm/Makefile
create mode 100644 tools/testing/selftests/sksm/basic_test.c
--
2.39.5
* [RFC PATCH 1/2] mm: Introduce SKSM: Synchronous Kernel Samepage Merging
2025-02-28 2:30 [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Mathieu Desnoyers
@ 2025-02-28 2:30 ` Mathieu Desnoyers
2025-02-28 2:30 ` [RFC PATCH 2/2] selftests/sksm: Introduce SKSM basic test Mathieu Desnoyers
2025-02-28 2:51 ` [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Linus Torvalds
2 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28 2:30 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Mathieu Desnoyers, Linus Torvalds, Matthew Wilcox,
Olivier Dion, linux-mm
* Main use-case targeted by SKSM: Code patching
The main use-case targeted by SKSM is deduplication of anonymous
pages created by COW (Copy-On-Write) triggered by patching executable
and library code for a user-space implementation of "static keys" and
"alternative" code patching. Code patching improves:
- Runtime feature detection, where a constructor can dynamically
enable a feature by turning a no-op into a jump.
- Instrumentation activation at runtime (e.g. tracepoints), patching
a dormant no-op instrumentation site into a jump.
- Runtime assembler specialisation, where a constructor can dynamically
modify assembler instructions to select the best alternative for the
detected hardware and software environment (e.g. CPU features, rseq
availability).
The main distinction between doing code patching at kernel-level and at
user-space level is that in user-space, executable and library code is
shared across all processes mapping the same executable or library
files. This reduces memory use and improves cache locality by sharing
executable pages across processes.
Writing to those private mappings triggers COW, which allocates anonymous
pages within each process and thus loses the benefit of sharing the
same pages from the backing storage.
Without memory deduplication, this increases memory use and therefore
degrades cache locality: populating the patched content into separate
COW pages within each process ends up using distinct CPU cache lines,
thus thrashing the CPU instruction and data caches.
* Why not use KSM ?
The KSM mechanism has the following downsides which the SKSM ABI aims to
overcome:
- KSM requires careful tuning of scan parameters for the workload by the
system administrator.
A) This makes KSM mostly useless with a standard distro config.
B) KSM is workload-specific.
C) Scanning pages adds overhead to the system, which is the reason
why the scan parameters must be tuned for the workload.
- KSM has security implications, because it allows a process to
confirm that an unrelated process has a page with known content.
A) The documentation of madvise(2) MADV_MERGEABLE would benefit from
advising against targeting memory that contains secret data,
due to the risk of discovery through side-channel timing attacks.
B) prctl(2) PR_SET_MEMORY_MERGE inherently marks the entire process
memory as mergeable, which makes it incompatible with security
oriented use-cases.
* SKSM Overview
SKSM enables synchronous dynamic sharing of identical pages found in
different memory areas, even if they are not shared by fork().
Userspace must explicitly request that pages within specific address
ranges be merged with madvise MADV_MERGE. Those ranges should *not*
contain secrets, as side-channel timing attacks can allow a process to
learn that a page with known content exists in another process.
Merging is performed synchronously within madvise: there is no global
scan and no need for a background daemon.
The anonymous pages targeted for merge are write-protected and
checksummed. They are then compared to other pages targeted for merge.
The mergeable pages are added to a hash table indexed by checksum of
their content. The hash value is derived from the page content checksum,
and its comparison function is based on comparison of the page content.
If a page is written to after being targeted for merge, a COW will be
triggered, and thus a new page will be populated in its stead.
* Expected Use
User-space is expected to perform code patching, e.g. from a library
constructor, and then when the text pages are expected to stay invariant
for a long time, issue madvise(2) MADV_MERGE on those pages. At this
point, the pages will be write-protected, and merged with identical SKSM
pages. Further stores to those pages will trigger COW again.
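As a rough illustration, a minimal user-space sketch of this flow (the
patch_text() helper and the text/len variables are hypothetical; the
range passed to madvise(2) must be page-aligned, and MADV_MERGE uses
the value proposed by this series):

        #include <sys/mman.h>

        #ifndef MADV_MERGE
        # define MADV_MERGE 26  /* value proposed by this series */
        #endif

        /* Patch the private text mapping (triggers COW), e.g. from a
         * library constructor, after making it temporarily writable. */
        mprotect(text, len, PROT_READ | PROT_WRITE);
        patch_text(text, len);
        mprotect(text, len, PROT_READ | PROT_EXEC);

        /* Once the pages are expected to stay invariant, ask the kernel
         * to write-protect them and merge them with identical SKSM
         * pages. */
        if (madvise(text, len, MADV_MERGE))
                perror("madvise(MADV_MERGE)");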
* Results
Output of "cat /proc/vmstat | grep nr_anon_pages" while running 1000
instances of "sleep 500":
- Baseline (no preload): nr_anon_pages 39721
- COW each executable page from libc: nr_anon_pages 419927
- madvise MADV_MERGE after COW of libc: nr_anon_pages 45525
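That is, COWing the libc text in each process adds roughly 380k
anonymous pages (419927 - 39721) across the 1000 instances, while
MADV_MERGE brings the overhead back down to roughly 5.8k pages
(45525 - 39721).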
* Limitations
- This is a proof of concept!
- It is incompatible with KSM (depends on !KSM) for now.
- Swap behavior under memory pressure is untested.
- The size of the hash table is static (65536 buckets) for now.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Olivier Dion <odion@efficios.com>
Cc: linux-mm@kvack.org
---
include/linux/ksm.h | 4 +
include/linux/mm_types.h | 7 +
include/linux/page-flags.h | 42 +++++
include/linux/sksm.h | 27 +++
include/uapi/asm-generic/mman-common.h | 2 +
mm/Kconfig | 5 +
mm/Makefile | 1 +
mm/ksm-common.h | 228 +++++++++++++++++++++++++
mm/ksm.c | 219 +-----------------------
mm/madvise.c | 6 +
mm/memory.c | 2 +
mm/page_alloc.c | 3 +
mm/sksm.c | 190 +++++++++++++++++++++
13 files changed, 518 insertions(+), 218 deletions(-)
create mode 100644 include/linux/sksm.h
create mode 100644 mm/ksm-common.h
create mode 100644 mm/sksm.c
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 6a53ac4885bb..dc3ce855863c 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -118,6 +118,10 @@ static inline void ksm_exit(struct mm_struct *mm)
{
}
+static inline void ksm_map_zero_page(struct mm_struct *mm)
+{
+}
+
static inline void ksm_might_unmap_zero_page(struct mm_struct *mm, pte_t pte)
{
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 332cee285662..e4940562cb81 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -19,6 +19,7 @@
#include <linux/workqueue.h>
#include <linux/seqlock.h>
#include <linux/percpu_counter.h>
+#include <linux/types.h>
#include <asm/mmu.h>
@@ -216,6 +217,12 @@ struct page {
struct page *kmsan_shadow;
struct page *kmsan_origin;
#endif
+
+#ifdef CONFIG_SKSM
+ /* TODO: move those fields into unused union fields instead. */
+ struct hlist_node sksm_node;
+ u32 checksum;
+#endif
} _struct_page_alignment;
/*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 691506bdf2c5..4e96437ab94e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -701,6 +701,48 @@ static __always_inline bool PageAnon(const struct page *page)
return folio_test_anon(page_folio(page));
}
+#ifdef CONFIG_SKSM
+static __always_inline bool folio_test_sksm(const struct folio *folio)
+{
+ return !hlist_unhashed_lockless(&folio->page.sksm_node);
+}
+#else
+static __always_inline bool folio_test_sksm(const struct folio *folio)
+{
+ return false;
+}
+#endif
+
+static __always_inline bool PageSKSM(const struct page *page)
+{
+ return folio_test_sksm(page_folio(page));
+}
+
+#ifdef CONFIG_SKSM
+static inline void set_page_checksum(struct page *page, u32 checksum)
+{
+ page->checksum = checksum;
+}
+
+static inline void init_page_sksm_node(struct page *page)
+{
+ INIT_HLIST_NODE(&page->sksm_node);
+}
+
+void __sksm_page_remove(struct page *page);
+
+static inline void sksm_page_remove(struct page *page)
+{
+ if (!PageSKSM(page))
+ return;
+ __sksm_page_remove(page);
+}
+#else
+static inline void set_page_checksum(struct page *page, u32 checksum) { }
+static inline void init_page_sksm_node(struct page *page) { }
+static inline void sksm_page_remove(struct page *page) { }
+#endif
+
static __always_inline bool __folio_test_movable(const struct folio *folio)
{
return ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS) ==
diff --git a/include/linux/sksm.h b/include/linux/sksm.h
new file mode 100644
index 000000000000..4f3aaec512df
--- /dev/null
+++ b/include/linux/sksm.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_SKSM_H
+#define __LINUX_SKSM_H
+/*
+ * Synchronous memory merging support.
+ *
+ * This code enables synchronous dynamic sharing of identical pages
+ * found in different memory areas, even if they are not shared by
+ * fork().
+ */
+
+#ifdef CONFIG_SKSM
+
+int sksm_merge(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end);
+
+#else /* !CONFIG_SKSM */
+
+static inline int sksm_merge(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end)
+{
+ return 0;
+}
+
+#endif /* !CONFIG_SKSM */
+
+#endif /* __LINUX_SKSM_H */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 1ea2c4c33b86..8bd57eb21c12 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,8 @@
#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
+#define MADV_MERGE 26 /* Synchronously merge identical pages */
+
#define MADV_GUARD_INSTALL 102 /* fatal signal on access to range */
#define MADV_GUARD_REMOVE 103 /* unguard range */
diff --git a/mm/Kconfig b/mm/Kconfig
index 84000b016808..067d4c3aa21c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -740,6 +740,11 @@ config KSM
until a program has madvised that an area is MADV_MERGEABLE, and
root has set /sys/kernel/mm/ksm/run to 1 (if CONFIG_SYSFS is set).
+config SKSM
+ bool "Enable Synchronous KSM for page merging"
+ depends on MMU && !KSM
+ select XXHASH
+
config DEFAULT_MMAP_MIN_ADDR
int "Low address space to protect from user allocation"
depends on MMU
diff --git a/mm/Makefile b/mm/Makefile
index dba52bb0da8a..8722c3ea572c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
obj-$(CONFIG_KSM) += ksm.o
+obj-$(CONFIG_SKSM) += sksm.o
obj-$(CONFIG_PAGE_POISONING) += page_poison.o
obj-$(CONFIG_KASAN) += kasan/
obj-$(CONFIG_KFENCE) += kfence/
diff --git a/mm/ksm-common.h b/mm/ksm-common.h
new file mode 100644
index 000000000000..b676f1f5c10f
--- /dev/null
+++ b/mm/ksm-common.h
@@ -0,0 +1,228 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Memory merging support, common code.
+ */
+#ifndef _KSM_COMMON_H
+#define _KSM_COMMON_H
+
+#include <linux/ksm.h>
+
+static bool vma_ksm_compatible(struct vm_area_struct *vma)
+{
+ if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE | VM_PFNMAP |
+ VM_IO | VM_DONTEXPAND | VM_HUGETLB |
+ VM_MIXEDMAP| VM_DROPPABLE))
+ return false; /* just ignore the advice */
+
+ if (vma_is_dax(vma))
+ return false;
+
+#ifdef VM_SAO
+ if (vma->vm_flags & VM_SAO)
+ return false;
+#endif
+#ifdef VM_SPARC_ADI
+ if (vma->vm_flags & VM_SPARC_ADI)
+ return false;
+#endif
+
+ return true;
+}
+
+static u32 calc_checksum(struct page *page)
+{
+ u32 checksum;
+ void *addr = kmap_local_page(page);
+ checksum = xxhash(addr, PAGE_SIZE, 0);
+ kunmap_local(addr);
+ return checksum;
+}
+
+static int write_protect_page(struct vm_area_struct *vma, struct folio *folio,
+ pte_t *orig_pte)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, 0, 0);
+ int swapped;
+ int err = -EFAULT;
+ struct mmu_notifier_range range;
+ bool anon_exclusive;
+ pte_t entry;
+
+ if (WARN_ON_ONCE(folio_test_large(folio)))
+ return err;
+
+ pvmw.address = page_address_in_vma(folio, folio_page(folio, 0), vma);
+ if (pvmw.address == -EFAULT)
+ goto out;
+
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, pvmw.address,
+ pvmw.address + PAGE_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+
+ if (!page_vma_mapped_walk(&pvmw))
+ goto out_mn;
+ if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?"))
+ goto out_unlock;
+
+ anon_exclusive = PageAnonExclusive(&folio->page);
+ entry = ptep_get(pvmw.pte);
+ if (pte_write(entry) || pte_dirty(entry) ||
+ anon_exclusive || mm_tlb_flush_pending(mm)) {
+ swapped = folio_test_swapcache(folio);
+ flush_cache_page(vma, pvmw.address, folio_pfn(folio));
+ /*
+ * Ok this is tricky, when get_user_pages_fast() run it doesn't
+ * take any lock, therefore the check that we are going to make
+ * with the pagecount against the mapcount is racy and
+ * O_DIRECT can happen right after the check.
+ * So we clear the pte and flush the tlb before the check
+ * this assure us that no O_DIRECT can happen after the check
+ * or in the middle of the check.
+ *
+ * No need to notify as we are downgrading page table to read
+ * only not changing it to point to a new page.
+ *
+ * See Documentation/mm/mmu_notifier.rst
+ */
+ entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
+ /*
+ * Check that no O_DIRECT or similar I/O is in progress on the
+ * page
+ */
+ if (folio_mapcount(folio) + 1 + swapped != folio_ref_count(folio)) {
+ set_pte_at(mm, pvmw.address, pvmw.pte, entry);
+ goto out_unlock;
+ }
+
+ /* See folio_try_share_anon_rmap_pte(): clear PTE first. */
+ if (anon_exclusive &&
+ folio_try_share_anon_rmap_pte(folio, &folio->page)) {
+ set_pte_at(mm, pvmw.address, pvmw.pte, entry);
+ goto out_unlock;
+ }
+
+ if (pte_dirty(entry))
+ folio_mark_dirty(folio);
+ entry = pte_mkclean(entry);
+
+ if (pte_write(entry))
+ entry = pte_wrprotect(entry);
+
+ set_pte_at(mm, pvmw.address, pvmw.pte, entry);
+ }
+ *orig_pte = entry;
+ err = 0;
+
+out_unlock:
+ page_vma_mapped_walk_done(&pvmw);
+out_mn:
+ mmu_notifier_invalidate_range_end(&range);
+out:
+ return err;
+}
+
+/**
+ * replace_page - replace page in vma by new ksm page
+ * @vma: vma that holds the pte pointing to page
+ * @page: the page we are replacing by kpage
+ * @kpage: the ksm page we replace page by
+ * @orig_pte: the original value of the pte
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ */
+static int replace_page(struct vm_area_struct *vma, struct page *page,
+ struct page *kpage, pte_t orig_pte)
+{
+ struct folio *kfolio = page_folio(kpage);
+ struct mm_struct *mm = vma->vm_mm;
+ struct folio *folio = page_folio(page);
+ pmd_t *pmd;
+ pmd_t pmde;
+ pte_t *ptep;
+ pte_t newpte;
+ spinlock_t *ptl;
+ unsigned long addr;
+ int err = -EFAULT;
+ struct mmu_notifier_range range;
+
+ addr = page_address_in_vma(folio, page, vma);
+ if (addr == -EFAULT)
+ goto out;
+
+ pmd = mm_find_pmd(mm, addr);
+ if (!pmd)
+ goto out;
+ /*
+ * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
+ * without holding anon_vma lock for write. So when looking for a
+ * genuine pmde (in which to find pte), test present and !THP together.
+ */
+ pmde = pmdp_get_lockless(pmd);
+ if (!pmd_present(pmde) || pmd_trans_huge(pmde))
+ goto out;
+
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
+ addr + PAGE_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+
+ ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!ptep)
+ goto out_mn;
+ if (!pte_same(ptep_get(ptep), orig_pte)) {
+ pte_unmap_unlock(ptep, ptl);
+ goto out_mn;
+ }
+ VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
+ VM_BUG_ON_FOLIO(folio_test_anon(kfolio) && PageAnonExclusive(kpage),
+ kfolio);
+
+ /*
+ * No need to check ksm_use_zero_pages here: we can only have a
+ * zero_page here if ksm_use_zero_pages was enabled already.
+ */
+ if (!is_zero_pfn(page_to_pfn(kpage))) {
+ folio_get(kfolio);
+ folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
+ newpte = mk_pte(kpage, vma->vm_page_prot);
+ } else {
+ /*
+ * Use pte_mkdirty to mark the zero page mapped by KSM, and then
+ * we can easily track all KSM-placed zero pages by checking if
+ * the dirty bit in zero page's PTE is set.
+ */
+ newpte = pte_mkdirty(pte_mkspecial(pfn_pte(page_to_pfn(kpage), vma->vm_page_prot)));
+ ksm_map_zero_page(mm);
+ /*
+ * We're replacing an anonymous page with a zero page, which is
+ * not anonymous. We need to do proper accounting otherwise we
+ * will get wrong values in /proc, and a BUG message in dmesg
+ * when tearing down the mm.
+ */
+ dec_mm_counter(mm, MM_ANONPAGES);
+ }
+
+ flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep)));
+ /*
+ * No need to notify as we are replacing a read only page with another
+ * read only page with the same content.
+ *
+ * See Documentation/mm/mmu_notifier.rst
+ */
+ ptep_clear_flush(vma, addr, ptep);
+ set_pte_at(mm, addr, ptep, newpte);
+
+ folio_remove_rmap_pte(folio, page, vma);
+ if (!folio_mapped(folio))
+ folio_free_swap(folio);
+ folio_put(folio);
+
+ pte_unmap_unlock(ptep, ptl);
+ err = 0;
+out_mn:
+ mmu_notifier_invalidate_range_end(&range);
+out:
+ return err;
+}
+
+#endif /* _KSM_COMMON_H */
diff --git a/mm/ksm.c b/mm/ksm.c
index 31a9bc365437..c495469a8329 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -44,6 +44,7 @@
#include <asm/tlbflush.h>
#include "internal.h"
#include "mm_slot.h"
+#include "ksm-common.h"
#define CREATE_TRACE_POINTS
#include <trace/events/ksm.h>
@@ -677,28 +678,6 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr, bool lock_v
return (ret & VM_FAULT_OOM) ? -ENOMEM : 0;
}
-static bool vma_ksm_compatible(struct vm_area_struct *vma)
-{
- if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE | VM_PFNMAP |
- VM_IO | VM_DONTEXPAND | VM_HUGETLB |
- VM_MIXEDMAP| VM_DROPPABLE))
- return false; /* just ignore the advice */
-
- if (vma_is_dax(vma))
- return false;
-
-#ifdef VM_SAO
- if (vma->vm_flags & VM_SAO)
- return false;
-#endif
-#ifdef VM_SPARC_ADI
- if (vma->vm_flags & VM_SPARC_ADI)
- return false;
-#endif
-
- return true;
-}
-
static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
unsigned long addr)
{
@@ -1234,202 +1213,6 @@ static int unmerge_and_remove_all_rmap_items(void)
}
#endif /* CONFIG_SYSFS */
-static u32 calc_checksum(struct page *page)
-{
- u32 checksum;
- void *addr = kmap_local_page(page);
- checksum = xxhash(addr, PAGE_SIZE, 0);
- kunmap_local(addr);
- return checksum;
-}
-
-static int write_protect_page(struct vm_area_struct *vma, struct folio *folio,
- pte_t *orig_pte)
-{
- struct mm_struct *mm = vma->vm_mm;
- DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, 0, 0);
- int swapped;
- int err = -EFAULT;
- struct mmu_notifier_range range;
- bool anon_exclusive;
- pte_t entry;
-
- if (WARN_ON_ONCE(folio_test_large(folio)))
- return err;
-
- pvmw.address = page_address_in_vma(folio, folio_page(folio, 0), vma);
- if (pvmw.address == -EFAULT)
- goto out;
-
- mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, pvmw.address,
- pvmw.address + PAGE_SIZE);
- mmu_notifier_invalidate_range_start(&range);
-
- if (!page_vma_mapped_walk(&pvmw))
- goto out_mn;
- if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?"))
- goto out_unlock;
-
- anon_exclusive = PageAnonExclusive(&folio->page);
- entry = ptep_get(pvmw.pte);
- if (pte_write(entry) || pte_dirty(entry) ||
- anon_exclusive || mm_tlb_flush_pending(mm)) {
- swapped = folio_test_swapcache(folio);
- flush_cache_page(vma, pvmw.address, folio_pfn(folio));
- /*
- * Ok this is tricky, when get_user_pages_fast() run it doesn't
- * take any lock, therefore the check that we are going to make
- * with the pagecount against the mapcount is racy and
- * O_DIRECT can happen right after the check.
- * So we clear the pte and flush the tlb before the check
- * this assure us that no O_DIRECT can happen after the check
- * or in the middle of the check.
- *
- * No need to notify as we are downgrading page table to read
- * only not changing it to point to a new page.
- *
- * See Documentation/mm/mmu_notifier.rst
- */
- entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
- /*
- * Check that no O_DIRECT or similar I/O is in progress on the
- * page
- */
- if (folio_mapcount(folio) + 1 + swapped != folio_ref_count(folio)) {
- set_pte_at(mm, pvmw.address, pvmw.pte, entry);
- goto out_unlock;
- }
-
- /* See folio_try_share_anon_rmap_pte(): clear PTE first. */
- if (anon_exclusive &&
- folio_try_share_anon_rmap_pte(folio, &folio->page)) {
- set_pte_at(mm, pvmw.address, pvmw.pte, entry);
- goto out_unlock;
- }
-
- if (pte_dirty(entry))
- folio_mark_dirty(folio);
- entry = pte_mkclean(entry);
-
- if (pte_write(entry))
- entry = pte_wrprotect(entry);
-
- set_pte_at(mm, pvmw.address, pvmw.pte, entry);
- }
- *orig_pte = entry;
- err = 0;
-
-out_unlock:
- page_vma_mapped_walk_done(&pvmw);
-out_mn:
- mmu_notifier_invalidate_range_end(&range);
-out:
- return err;
-}
-
-/**
- * replace_page - replace page in vma by new ksm page
- * @vma: vma that holds the pte pointing to page
- * @page: the page we are replacing by kpage
- * @kpage: the ksm page we replace page by
- * @orig_pte: the original value of the pte
- *
- * Returns 0 on success, -EFAULT on failure.
- */
-static int replace_page(struct vm_area_struct *vma, struct page *page,
- struct page *kpage, pte_t orig_pte)
-{
- struct folio *kfolio = page_folio(kpage);
- struct mm_struct *mm = vma->vm_mm;
- struct folio *folio = page_folio(page);
- pmd_t *pmd;
- pmd_t pmde;
- pte_t *ptep;
- pte_t newpte;
- spinlock_t *ptl;
- unsigned long addr;
- int err = -EFAULT;
- struct mmu_notifier_range range;
-
- addr = page_address_in_vma(folio, page, vma);
- if (addr == -EFAULT)
- goto out;
-
- pmd = mm_find_pmd(mm, addr);
- if (!pmd)
- goto out;
- /*
- * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
- * without holding anon_vma lock for write. So when looking for a
- * genuine pmde (in which to find pte), test present and !THP together.
- */
- pmde = pmdp_get_lockless(pmd);
- if (!pmd_present(pmde) || pmd_trans_huge(pmde))
- goto out;
-
- mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
- addr + PAGE_SIZE);
- mmu_notifier_invalidate_range_start(&range);
-
- ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
- if (!ptep)
- goto out_mn;
- if (!pte_same(ptep_get(ptep), orig_pte)) {
- pte_unmap_unlock(ptep, ptl);
- goto out_mn;
- }
- VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
- VM_BUG_ON_FOLIO(folio_test_anon(kfolio) && PageAnonExclusive(kpage),
- kfolio);
-
- /*
- * No need to check ksm_use_zero_pages here: we can only have a
- * zero_page here if ksm_use_zero_pages was enabled already.
- */
- if (!is_zero_pfn(page_to_pfn(kpage))) {
- folio_get(kfolio);
- folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
- newpte = mk_pte(kpage, vma->vm_page_prot);
- } else {
- /*
- * Use pte_mkdirty to mark the zero page mapped by KSM, and then
- * we can easily track all KSM-placed zero pages by checking if
- * the dirty bit in zero page's PTE is set.
- */
- newpte = pte_mkdirty(pte_mkspecial(pfn_pte(page_to_pfn(kpage), vma->vm_page_prot)));
- ksm_map_zero_page(mm);
- /*
- * We're replacing an anonymous page with a zero page, which is
- * not anonymous. We need to do proper accounting otherwise we
- * will get wrong values in /proc, and a BUG message in dmesg
- * when tearing down the mm.
- */
- dec_mm_counter(mm, MM_ANONPAGES);
- }
-
- flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep)));
- /*
- * No need to notify as we are replacing a read only page with another
- * read only page with the same content.
- *
- * See Documentation/mm/mmu_notifier.rst
- */
- ptep_clear_flush(vma, addr, ptep);
- set_pte_at(mm, addr, ptep, newpte);
-
- folio_remove_rmap_pte(folio, page, vma);
- if (!folio_mapped(folio))
- folio_free_swap(folio);
- folio_put(folio);
-
- pte_unmap_unlock(ptep, ptl);
- err = 0;
-out_mn:
- mmu_notifier_invalidate_range_end(&range);
-out:
- return err;
-}
-
/*
* try_to_merge_one_page - take two pages and merge them into one
* @vma: the vma that holds the pte pointing to page
diff --git a/mm/madvise.c b/mm/madvise.c
index 0ceae57da7da..d9d678053ca2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -22,6 +22,7 @@
#include <linux/string.h>
#include <linux/uio.h>
#include <linux/ksm.h>
+#include <linux/sksm.h>
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/blkdev.h>
@@ -1318,6 +1319,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
return madvise_guard_install(vma, prev, start, end);
case MADV_GUARD_REMOVE:
return madvise_guard_remove(vma, prev, start, end);
+ case MADV_MERGE:
+ return sksm_merge(vma, start, end);
}
anon_name = anon_vma_name(vma);
@@ -1422,6 +1425,9 @@ madvise_behavior_valid(int behavior)
#ifdef CONFIG_MEMORY_FAILURE
case MADV_SOFT_OFFLINE:
case MADV_HWPOISON:
+#endif
+#ifdef CONFIG_SKSM
+ case MADV_MERGE:
#endif
return true;
diff --git a/mm/memory.c b/mm/memory.c
index 398c031be9ba..782363315b31 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3618,6 +3618,8 @@ static bool wp_can_reuse_anon_folio(struct folio *folio,
*/
if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
return false;
+ if (folio_test_sksm(folio))
+ return false;
if (!folio_test_lru(folio))
/*
* We cannot easily detect+handle references from
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 01eab25edf89..0bb9755896ce 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1122,6 +1122,7 @@ __always_inline bool free_pages_prepare(struct page *page,
return false;
}
+ sksm_page_remove(page);
page_cpupid_reset_last(page);
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
reset_page_owner(page, order);
@@ -1509,6 +1510,8 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
set_page_private(page, 0);
set_page_refcounted(page);
+ set_page_checksum(page, 0);
+ init_page_sksm_node(page);
arch_alloc_page(page, order);
debug_pagealloc_map_pages(page, 1 << order);
diff --git a/mm/sksm.c b/mm/sksm.c
new file mode 100644
index 000000000000..190f6bc05f2d
--- /dev/null
+++ b/mm/sksm.c
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Synchronous memory merging support.
+ *
+ * This code enables synchronous dynamic sharing of identical pages
+ * found in different memory areas, even if they are not shared by
+ * fork().
+ *
+ * Userspace must explicitly request that pages within specific address
+ * ranges be merged with madvise MADV_MERGE. Those ranges should *not*
+ * contain secrets, as side-channel timing attacks can allow a process
+ * to learn that a page with known content exists in another process.
+ *
+ * Merging is performed synchronously, directly within the madvise
+ * call: there is no global scan and there is no need for a
+ * background daemon.
+ *
+ * The anonymous pages targeted for merge are write-protected and
+ * checksummed. They are then compared to other pages targeted for
+ * merge.
+ *
+ * The mergeable pages are added to a hash table indexed by checksum of
+ * their content. The hash value is derived from the page content
+ * checksum, and its comparison function is based on comparison of
+ * the page content.
+ *
+ * If a page is written to after being targeted for merge, a COW will be
+ * triggered, and thus a new page will be populated in its stead.
+ *
+ * The typical usage pattern expected from userspace is:
+ *
+ * 1) Userspace writes non-secret content to a MAP_PRIVATE page, thus
+ * triggering COW.
+ *
+ * 2) After userspace has completed writing to the page, it issues
+ * madvise MADV_MERGE on a range containing the page, which
+ * write-protect, checksum, and add the page to the sksm hash
+ * table. It then merges this page with other mergeable pages
+ * that have the same content.
+ *
+ * 3) It is typically expected that this page's content stays invariant
+ * for a long time. If userspace issues writes to the page after
+ * madvise MADV_MERGE, another COW will be triggered, which will
+ * populate a new page copy into the process page table and release
+ * the reference to the old page.
+ */
+
+#include <linux/mutex.h>
+#include <linux/cleanup.h>
+#include <linux/mm_types.h>
+#include <linux/hashtable.h>
+#include <linux/highmem.h>
+#include <linux/xxhash.h>
+#include <linux/rmap.h>
+#include <linux/mm.h>
+#include <linux/pagewalk.h>
+#include <linux/sksm.h>
+#include <linux/swap.h>
+#include <linux/mm_inline.h>
+
+#include "internal.h"
+#include "ksm-common.h"
+
+#define SKSM_HT_BITS 16
+
+static DEFINE_MUTEX(sksm_lock);
+
+/*
+ * The hash is derived from the page checksum.
+ */
+static DEFINE_HASHTABLE(sksm_ht, SKSM_HT_BITS);
+
+void __sksm_page_remove(struct page *page)
+{
+ guard(mutex)(&sksm_lock);
+ hash_del(&page->sksm_node);
+}
+
+static int sksm_merge_page(struct vm_area_struct *vma, struct page *page)
+{
+ struct folio *folio = page_folio(page);
+ pte_t orig_pte = __pte(0);
+ struct page *kpage;
+ int err = 0;
+
+ folio_lock(folio);
+
+ if (folio_test_large(folio)) {
+ if (split_huge_page(page))
+ goto out_unlock;
+ folio = page_folio(page);
+ }
+
+ /* Write protect page. */
+ if (write_protect_page(vma, folio, &orig_pte) != 0)
+ goto out_unlock;
+
+ /* Checksum page. */
+ page->checksum = calc_checksum(page);
+
+ guard(mutex)(&sksm_lock);
+
+ /* Merge page with duplicates. */
+ hash_for_each_possible(sksm_ht, kpage, sksm_node, page->checksum) {
+ if (page->checksum != kpage->checksum || !pages_identical(page, kpage))
+ continue;
+ if (!get_page_unless_zero(kpage))
+ continue;
+ err = replace_page(vma, page, kpage, orig_pte);
+ put_page(kpage);
+ if (!err)
+ goto out_unlock;
+ }
+
+ /*
+ * This page is not linked to its address_space anymore because it
+ * can be shared with other processes and replace pages originally
+ * associated with other address spaces.
+ */
+ page->mapping = (void *) PAGE_MAPPING_ANON;
+
+ /* Add page to hash table. */
+ hash_add(sksm_ht, &page->sksm_node, page->checksum);
+out_unlock:
+ folio_unlock(folio);
+ return err;
+}
+
+static struct page *get_vma_page_from_addr(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct page *page = NULL;
+ struct folio_walk fw;
+ struct folio *folio;
+
+ folio = folio_walk_start(&fw, vma, addr, 0);
+ if (folio) {
+ if (!folio_is_zone_device(folio) &&
+ folio_test_anon(folio)) {
+ folio_get(folio);
+ page = fw.page;
+ }
+ folio_walk_end(&fw, vma);
+ }
+ if (page) {
+ flush_anon_page(vma, page, addr);
+ flush_dcache_page(page);
+ }
+ return page;
+}
+
+/* Called with mmap write lock held. */
+int sksm_merge(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end)
+{
+ unsigned long addr;
+ int err = 0;
+
+ if (!PAGE_ALIGNED(start) || !PAGE_ALIGNED(end))
+ return -EINVAL;
+ if (!vma_ksm_compatible(vma))
+ return 0;
+
+ /*
+ * A number of pages can hang around indefinitely in per-cpu
+ * LRU cache, raised page count preventing write_protect_page
+ * from merging them.
+ */
+ lru_add_drain_all();
+
+ for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
+ struct page *page = get_vma_page_from_addr(vma, addr);
+
+ if (!page)
+ continue;
+ err = sksm_merge_page(vma, page);
+ put_page(page);
+ }
+ return err;
+}
+
+static int __init sksm_init(void)
+{
+ struct page *zero_page = ZERO_PAGE(0);
+
+ zero_page->checksum = calc_checksum(zero_page);
+ /* Add page to hash table. */
+ hash_add(sksm_ht, &zero_page->sksm_node, zero_page->checksum);
+ return 0;
+}
+subsys_initcall(sksm_init);
--
2.39.5
* [RFC PATCH 2/2] selftests/sksm: Introduce SKSM basic test
2025-02-28 2:30 [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Mathieu Desnoyers
2025-02-28 2:30 ` [RFC PATCH 1/2] mm: Introduce " Mathieu Desnoyers
@ 2025-02-28 2:30 ` Mathieu Desnoyers
2025-02-28 2:51 ` [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Linus Torvalds
2 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28 2:30 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Mathieu Desnoyers, Linus Torvalds, Matthew Wilcox,
Olivier Dion, linux-mm
Introduce a basic selftest for SKSM. See ./basic_test -h for
options.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Olivier Dion <odion@efficios.com>
Cc: linux-mm@kvack.org
---
tools/testing/selftests/sksm/.gitignore | 2 +
tools/testing/selftests/sksm/Makefile | 14 ++
tools/testing/selftests/sksm/basic_test.c | 217 ++++++++++++++++++++++
3 files changed, 233 insertions(+)
create mode 100644 tools/testing/selftests/sksm/.gitignore
create mode 100644 tools/testing/selftests/sksm/Makefile
create mode 100644 tools/testing/selftests/sksm/basic_test.c
diff --git a/tools/testing/selftests/sksm/.gitignore b/tools/testing/selftests/sksm/.gitignore
new file mode 100644
index 000000000000..0f5b0baa91e7
--- /dev/null
+++ b/tools/testing/selftests/sksm/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+basic_test
diff --git a/tools/testing/selftests/sksm/Makefile b/tools/testing/selftests/sksm/Makefile
new file mode 100644
index 000000000000..ec1a10783bda
--- /dev/null
+++ b/tools/testing/selftests/sksm/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+ OR MIT
+
+top_srcdir = ../../../..
+
+CFLAGS += -O2 -Wall -g -I./ $(KHDR_INCLUDES) -L$(OUTPUT) -Wl,-rpath=./ \
+ $(CLANG_FLAGS) -I$(top_srcdir)/tools/include
+LDLIBS += -lpthread
+
+TEST_GEN_PROGS = basic_test
+
+include ../lib.mk
+
+$(OUTPUT)/%: %.c
+ $(CC) $(CFLAGS) $< $(LDLIBS) -o $@
diff --git a/tools/testing/selftests/sksm/basic_test.c b/tools/testing/selftests/sksm/basic_test.c
new file mode 100644
index 000000000000..1a7571a999d2
--- /dev/null
+++ b/tools/testing/selftests/sksm/basic_test.c
@@ -0,0 +1,217 @@
+// SPDX-License-Identifier: LGPL-2.1
+/*
+ * Basic test for SKSM.
+ */
+
+#include <stdlib.h>
+#include <sys/mman.h>
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+#include <unistd.h>
+#include <poll.h>
+
+#ifndef MADV_MERGE
+#define MADV_MERGE 26
+#endif
+
+#define PAGE_SIZE 4096
+
+#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *) &(x)) = (val))
+
+static int opt_stop_at = 0, opt_pause = 0;
+
+struct test_page {
+ char array[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));
+};
+
+struct test_page2 {
+ char array[2 * PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));
+};
+
+/* identical to zero page. */
+static struct test_page zero;
+
+/* a1 and a2 are identical. */
+static struct test_page a1 = {
+ .array[0] = 0x42,
+ .array[1] = 0x42,
+};
+
+static struct test_page a2 = {
+ .array[0] = 0x42,
+ .array[1] = 0x42,
+};
+
+/* b1 and b2 are identical. */
+static struct test_page2 b1 = {
+ .array[0] = 0x43,
+ .array[1] = 0x43,
+ .array[PAGE_SIZE] = 0x44,
+ .array[PAGE_SIZE + 1] = 0x44,
+};
+
+static struct test_page2 b2 = {
+ .array[0] = 0x43,
+ .array[1] = 0x43,
+ .array[PAGE_SIZE] = 0x44,
+ .array[PAGE_SIZE + 1] = 0x44,
+};
+
+static void touch_pages(void *p, size_t len)
+{
+ size_t i;
+
+ for (i = 0; i < len; i += PAGE_SIZE)
+ WRITE_ONCE(((char *)p)[i], ((char *)p)[i]);
+}
+
+static void test_step(char step)
+{
+ printf("\nTest step: <%c>\n", step);
+ if (opt_pause) {
+ printf("Press ENTER to continue...\n");
+ getchar();
+ }
+ if (opt_stop_at == step) {
+ poll(NULL, 0, -1);
+ exit(0);
+ }
+}
+
+static void show_usage(int argc, char **argv)
+{
+ printf("Usage : %s <OPTIONS>\n",
+ argv[0]);
+ printf("OPTIONS:\n");
+ printf(" [-s stop_at] Stop test at step A, B, C, D, E, or F and wait forever.\n");
+ printf(" [-p] Pause test between steps (await newline from the console).\n");
+ printf(" [-h] Show this help.\n");
+ printf("\n");
+}
+
+int main(int argc, char **argv)
+{
+ int i;
+
+ for (i = 1; i < argc; i++) {
+ if (argv[i][0] != '-')
+ continue;
+ switch (argv[i][1]) {
+ case 's':
+ if (argc < i + 2) {
+ show_usage(argc, argv);
+ return -1;
+ }
+ opt_stop_at = *argv[i + 1];
+ switch (opt_stop_at) {
+ case 'A':
+ case 'B':
+ case 'C':
+ case 'D':
+ case 'E':
+ case 'F':
+ break;
+ default:
+ show_usage(argc, argv);
+ return -1;
+ }
+ i++;
+ break;
+ case 'p':
+ opt_pause = 1;
+ break;
+ case 'h':
+ show_usage(argc, argv);
+ return 0;
+ default:
+ show_usage(argc, argv);
+ return -1;
+ }
+ }
+
+
+ printf("PID: %d\n", getpid());
+ printf("Shared mapping (write-protected)\n");
+
+ test_step('A');
+
+ printf("madvise MADV_MERGE a1\n");
+ if (madvise(&a1, sizeof(a1), MADV_MERGE))
+ goto error;
+ printf("madvise MADV_MERGE a2\n");
+ if (madvise(&a2, sizeof(a2), MADV_MERGE))
+ goto error;
+ printf("madvise MADV_MERGE b1\n");
+ if (madvise(&b1, sizeof(b1), MADV_MERGE))
+ goto error;
+ printf("madvise MADV_MERGE b2\n");
+ if (madvise(&b2, sizeof(b2), MADV_MERGE))
+ goto error;
+ printf("madvise MADV_MERGE zero\n");
+ if (madvise(&zero, sizeof(zero), MADV_MERGE))
+ goto error;
+
+ test_step('B');
+
+ printf("Trigger COW\n");
+ touch_pages(&zero, sizeof(zero));
+ touch_pages(&a1, sizeof(a1));
+ touch_pages(&a2, sizeof(a2));
+ touch_pages(&b1, sizeof(b1));
+ touch_pages(&b2, sizeof(b2));
+
+ test_step('C');
+
+ printf("madvise MADV_MERGE a1\n");
+ if (madvise(&a1, sizeof(a1), MADV_MERGE))
+ goto error;
+ printf("madvise MADV_MERGE a2\n");
+ if (madvise(&a2, sizeof(a2), MADV_MERGE))
+ goto error;
+ printf("madvise MADV_MERGE b1\n");
+ if (madvise(&b1, sizeof(b1), MADV_MERGE))
+ goto error;
+ printf("madvise MADV_MERGE b2\n");
+ if (madvise(&b2, sizeof(b2), MADV_MERGE))
+ goto error;
+ printf("madvise MADV_MERGE zero\n");
+ if (madvise(&zero, sizeof(zero), MADV_MERGE))
+ goto error;
+
+ test_step('D');
+
+ printf("Trigger COW\n");
+ touch_pages(&zero, sizeof(zero));
+ touch_pages(&a1, sizeof(a1));
+ touch_pages(&a2, sizeof(a2));
+ touch_pages(&b1, sizeof(b1));
+ touch_pages(&b2, sizeof(b2));
+
+ test_step('E');
+
+ printf("madvise MADV_MERGE a1\n");
+ if (madvise(&a1, sizeof(a1), MADV_MERGE))
+ goto error;
+ printf("madvise MADV_MERGE a2\n");
+ if (madvise(&a2, sizeof(a2), MADV_MERGE))
+ goto error;
+ printf("madvise MADV_MERGE b1\n");
+ if (madvise(&b1, sizeof(b1), MADV_MERGE))
+ goto error;
+ printf("madvise MADV_MERGE b2\n");
+ if (madvise(&b2, sizeof(b2), MADV_MERGE))
+ goto error;
+ printf("madvise MADV_MERGE zero\n");
+ if (madvise(&zero, sizeof(zero), MADV_MERGE))
+ goto error;
+
+ test_step('F');
+
+ return 0;
+
+error:
+ perror("madvise");
+ return -1;
+}
--
2.39.5
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 2:30 [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Mathieu Desnoyers
2025-02-28 2:30 ` [RFC PATCH 1/2] mm: Introduce " Mathieu Desnoyers
2025-02-28 2:30 ` [RFC PATCH 2/2] selftests/sksm: Introduce SKSM basic test Mathieu Desnoyers
@ 2025-02-28 2:51 ` Linus Torvalds
2025-02-28 3:03 ` Mathieu Desnoyers
2025-02-28 15:34 ` David Hildenbrand
2 siblings, 2 replies; 29+ messages in thread
From: Linus Torvalds @ 2025-02-28 2:51 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
linux-mm
On Thu, 27 Feb 2025 at 18:31, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> This series introduces SKSM, a new page deduplication ABI,
> aiming to fix the limitations inherent to the KSM ABI.
So I'm not interested in seeing *another* KSM version.
Because I absolutely do *NOT* want a new chapter in the saga of SLUB
vs SLAB vs SLOB.
However, if the feeling is that this can *replace* the current horror
that is KSM, I'm a lot more interested. I suspect our current KSM
model has largely been a failure, and this might be "good enough".
Linus
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 2:51 ` [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Linus Torvalds
@ 2025-02-28 3:03 ` Mathieu Desnoyers
2025-02-28 5:17 ` Linus Torvalds
2025-02-28 15:34 ` David Hildenbrand
1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28 3:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
linux-mm
On 2025-02-27 21:51, Linus Torvalds wrote:
> On Thu, 27 Feb 2025 at 18:31, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> This series introduces SKSM, a new page deduplication ABI,
>> aiming to fix the limitations inherent to the KSM ABI.
>
> So I'm not interested in seeing *another* KSM version.
>
> Because I absolutely do *NOT* want a new chapter in the saga of SLUB
> vs SLAB vs SLOB.
>
> However, if the feeling is that this can *replace* the current horror
> that is KSM, I'm a lot more interested. I suspect our current KSM
> model has largely been a failure, and this might be "good enough".
I'd be fine with SKSM replacing KSM entirely. However, I don't
think we should try to re-implement the existing KSM userspace ABIs
over SKSM. I suspect that many of the problems KSM has today are
caused by the semantics of the ABI it exposes, which were targeted
solely at the use-case of a host deduplicating guest VM memory.
KSM tracks memory meant to be mergeable on an ongoing
basis with a worker thread:
madvise(2) MADV_{UN,}MERGEABLE
prctl(2) PR_{SET,GET}_MEMORY_MERGE (security concern)
~2.5k LOC excluding ksm-common code
requires parameter fine-tuning from sysadmin
SKSM gets the hint from userspace that memory is a good
candidate for merging in its current state and is expected
to stay invariant:
madvise(2) MADV_MERGE
~100 LOC excluding ksm-common code
The main reason why SKSM could be implemented without all the
scanning complexity is because of this simpler ABI.
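To make the contrast concrete (addr and len are placeholders), the
user-visible difference boils down to:

        /* KSM: mark the range mergeable; ksmd may merge it at some
         * later point, depending on how the scan parameters are tuned. */
        madvise(addr, len, MADV_MERGEABLE);

        /* SKSM: write-protect, checksum and merge the range right now. */
        madvise(addr, len, MADV_MERGE);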
Thanks for the feedback!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 3:03 ` Mathieu Desnoyers
@ 2025-02-28 5:17 ` Linus Torvalds
2025-02-28 13:59 ` David Hildenbrand
2025-02-28 14:59 ` Mathieu Desnoyers
0 siblings, 2 replies; 29+ messages in thread
From: Linus Torvalds @ 2025-02-28 5:17 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
linux-mm
On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> I'd be fine with SKSM replacing KSM entirely. However, I don't
> think we should try to re-implement the existing KSM userspace ABIs
> over SKSM.
No, absolutely. The only point (for me) for your new synchronous one
would be if it replaced the kernel thread async scanning, which would
make the old user space interface basically pointless.
But I don't actually know who uses KSM right now. My reaction really
comes from a "it's not nice code in the kernel", not from any actual
knowledge of the users.
Maybe it works really well in some cloud VM environment, and we're
stuck with it forever.
In which case I don't want to see some second different interface that
just makes it all worse.
Linus
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 5:17 ` Linus Torvalds
@ 2025-02-28 13:59 ` David Hildenbrand
2025-02-28 14:59 ` Sean Christopherson
2025-02-28 15:01 ` Mathieu Desnoyers
2025-02-28 14:59 ` Mathieu Desnoyers
1 sibling, 2 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-02-28 13:59 UTC (permalink / raw)
To: Linus Torvalds, Mathieu Desnoyers
Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
linux-mm
On 28.02.25 06:17, Linus Torvalds wrote:
> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>> think we should try to re-implement the existing KSM userspace ABIs
>> over SKSM.
>
> No, absolutely. The only point (for me) for your new synchronous one
> would be if it replaced the kernel thread async scanning, which would
> make the old user space interface basically pointless.
>
> But I don't actually know who uses KSM right now. My reaction really
> comes from a "it's not nice code in the kernel", not from any actual
> knowledge of the users.
>
> Maybe it works really well in some cloud VM environment, and we're
> stuck with it forever.
Exactly that; and besides the VM use-case, lately people started using it
in the context of interpreters (IIRC inside Meta) quite successfully as
well.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 5:17 ` Linus Torvalds
2025-02-28 13:59 ` David Hildenbrand
@ 2025-02-28 14:59 ` Mathieu Desnoyers
2025-02-28 16:32 ` Peter Xu
1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28 14:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
linux-mm
On 2025-02-28 00:17, Linus Torvalds wrote:
> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>> think we should try to re-implement the existing KSM userspace ABIs
>> over SKSM.
>
> No, absolutely. The only point (for me) for your new synchronous one
> would be if it replaced the kernel thread async scanning, which would
> make the old user space interface basically pointless.
>
> But I don't actually know who uses KSM right now. My reaction really
> comes from a "it's not nice code in the kernel", not from any actual
> knowledge of the users.
>
> Maybe it works really well in some cloud VM environment, and we're
> stuck with it forever.
>
For the VM use-case, I wonder if we could just add a userfaultfd
"COW" event that would notify userspace when a COW happens ?
This would allow userspace to replace ksmd by tracking the age of
those anonymous pages and issuing madvise MADV_MERGE on them to
write-protect and merge them when deemed useful.
With both a new userfaultfd COW event and madvise MADV_MERGE,
is there anything else that is fundamentally missing to move
all the scanning complexity of KSM to userspace for the VM
deduplication use-case ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 13:59 ` David Hildenbrand
@ 2025-02-28 14:59 ` Sean Christopherson
2025-02-28 15:10 ` David Hildenbrand
2025-02-28 15:01 ` Mathieu Desnoyers
1 sibling, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2025-02-28 14:59 UTC (permalink / raw)
To: David Hildenbrand
Cc: Linus Torvalds, Mathieu Desnoyers, Andrew Morton, linux-kernel,
Matthew Wilcox, Olivier Dion, linux-mm
On Fri, Feb 28, 2025, David Hildenbrand wrote:
> On 28.02.25 06:17, Linus Torvalds wrote:
> > On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> > >
> > > I'd be fine with SKSM replacing KSM entirely. However, I don't
> > > think we should try to re-implement the existing KSM userspace ABIs
> > > over SKSM.
> >
> > No, absolutely. The only point (for me) for your new synchronous one
> > would be if it replaced the kernel thread async scanning, which would
> > make the old user space interface basically pointless.
> >
> > But I don't actually know who uses KSM right now. My reaction really
> > comes from a "it's not nice code in the kernel", not from any actual
> > knowledge of the users.
> >
> > Maybe it works really well in some cloud VM environment, and we're
> > stuck with it forever.
>
> Exactly that; and besides the VM use-case, lately people started using it in
> the context of interpreters (IIRC inside Meta) quite successfully as well.
Does Red Hat (or any other KVM supporters) actually recommend using KSM for VMs
in cloud environments?
The security implications of scanning guest memory and having co-tenant VMs share
mappings (should) make it a complete non-starter for any scenario where VMs and/or
their workloads are owned by third parties.
I can imagine there might be first-party use cases, but I would expect many/most
of those to be able to explicitly share mappings, which would provide far, far
better power and performance characteristics.
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 13:59 ` David Hildenbrand
2025-02-28 14:59 ` Sean Christopherson
@ 2025-02-28 15:01 ` Mathieu Desnoyers
2025-02-28 15:18 ` David Hildenbrand
1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28 15:01 UTC (permalink / raw)
To: David Hildenbrand, Linus Torvalds
Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
linux-mm
On 2025-02-28 08:59, David Hildenbrand wrote:
> On 28.02.25 06:17, Linus Torvalds wrote:
>> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
>> <mathieu.desnoyers@efficios.com> wrote:
>>>
>>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>>> think we should try to re-implement the existing KSM userspace ABIs
>>> over SKSM.
>>
>> No, absolutely. The only point (for me) for your new synchronous one
>> would be if it replaced the kernel thread async scanning, which would
>> make the old user space interface basically pointless.
>>
>> But I don't actually know who uses KSM right now. My reaction really
>> comes from a "it's not nice code in the kernel", not from any actual
>> knowledge of the users.
>>
>> Maybe it works really well in some cloud VM environment, and we're
>> stuck with it forever.
>
> Exactly that; and besides the VM use-case, lately people started using it
> in the context of interpreters (IIRC inside Meta) quite successfully as
> well.
>
I suspect that SKSM is a better fit for JIT and code patching than KSM,
because user-space knows better when a set of pages is going to become
invariant for a long time and thus benefit from merging. This removes
the background scanning from the picture.
Does the interpreter use-case require background scanning, or does
it know when a set of pages are meant to become invariant for a long
time ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 14:59 ` Sean Christopherson
@ 2025-02-28 15:10 ` David Hildenbrand
2025-02-28 15:19 ` David Hildenbrand
2025-02-28 21:38 ` Mathieu Desnoyers
0 siblings, 2 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-02-28 15:10 UTC (permalink / raw)
To: Sean Christopherson
Cc: Linus Torvalds, Mathieu Desnoyers, Andrew Morton, linux-kernel,
Matthew Wilcox, Olivier Dion, linux-mm
On 28.02.25 15:59, Sean Christopherson wrote:
> On Fri, Feb 28, 2025, David Hildenbrand wrote:
>> On 28.02.25 06:17, Linus Torvalds wrote:
>>> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>
>>>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>>>> think we should try to re-implement the existing KSM userspace ABIs
>>>> over SKSM.
>>>
>>> No, absolutely. The only point (for me) for your new synchronous one
>>> would be if it replaced the kernel thread async scanning, which would
>>> make the old user space interface basically pointless.
>>>
>>> But I don't actually know who uses KSM right now. My reaction really
>>> comes from a "it's not nice code in the kernel", not from any actual
>>> knowledge of the users.
>>>
>>> Maybe it works really well in some cloud VM environment, and we're
>>> stuck with it forever.
>>
>> Exactly that; and besides the VM use-case, lately people started using it in
>> the context of interpreters (IIRC inside Meta) quite successfully as well.
>
> Does Red Hat (or any other KVM supporters) actually recommend using KSM for VMs
> in cloud environments?
Private clouds yes, that's where it is most commonly used for. I would
assume that nobody for
For example, there is some older documentation here:
https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/virtualization_administration_guide/chap-ksm#chap-KSM
which touches on the security aspects:
"The page deduplication technology (used also by the KSM implementation)
may introduce side channels that could potentially be used to leak
information across multiple guests. In case this is a concern, KSM can
be disabled on a per-guest basis."
>
> The security implications of scanning guest memory and having co-tenant VMs share
> mappings (should) make it a complete non-starter for any scenario where VMs and/or
> their workloads are owned by third parties.
Jep.
>
> I can imagine there might be first-party use cases, but I would expect many/most
> of those to be able to explicitly share mappings, which would provide far, far
> better power and performance characteristics.
Note that KSM can be very efficient when you have multiple VMs running
the same kernel, executables, libraries, etc. If my memory serves me
right, that's precisely what it was originally invented for, and how it
is being used today in the context of VMs.
For example, QEMU will mark all guest memory as mergeable using
madvise MADV_MERGEABLE, to limit the deduplication to guest RAM only.
--
Cheers,
David / dhildenb
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 15:01 ` Mathieu Desnoyers
@ 2025-02-28 15:18 ` David Hildenbrand
0 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-02-28 15:18 UTC (permalink / raw)
To: Mathieu Desnoyers, Linus Torvalds
Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
linux-mm
On 28.02.25 16:01, Mathieu Desnoyers wrote:
> On 2025-02-28 08:59, David Hildenbrand wrote:
>> On 28.02.25 06:17, Linus Torvalds wrote:
>>> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>
>>>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>>>> think we should try to re-implement the existing KSM userspace ABIs
>>>> over SKSM.
>>>
>>> No, absolutely. The only point (for me) for your new synchronous one
>>> would be if it replaced the kernel thread async scanning, which would
>>> make the old user space interface basically pointless.
>>>
>>> But I don't actually know who uses KSM right now. My reaction really
>>> comes from a "it's not nice code in the kernel", not from any actual
>>> knowledge of the users.
>>>
>>> Maybe it works really well in some cloud VM environment, and we're
>>> stuck with it forever.
>>
>> Exactly that; and besides the VM use-case, lately people started using it
>> in the context of interpreters (IIRC inside Meta) quite successfully as
>> well.
>>
>
> I suspect that SKSM is a better fit for JIT and code patching than KSM,
> because user-space knows better when a set of pages is going to become
> invariant for a long time and thus benefit from merging. This removes
> the background scanning from the picture.
>
> Does the interpreter use-case require background scanning, or does
> it know when a set of pages are meant to become invariant for a long
> time ?
To make the JIT/interpreter use case happy, people wanted ways to
*force* KSM on for *the whole process*, not just individual VMAs like
the traditional VM use case would have done.
I recall one of the reasons being that you don't really want to modify
your JIT/interpreter to just make KSM work.
See [1] "KSM at Meta" for some details, and in general, optimization
work to adapt KSM to new use cases.
Regarding some concerns you raised, Stefan did a lot of optimization
work like "smart scanning" (slide "Optimization - Smart Scan (6.7)") to
reduce the scanning overhead and make it much more efficient.
So people started optimizing for that already and got pretty good results.
[1]
https://lpc.events/event/17/contributions/1625/attachments/1320/2649/KSM.pdf
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 15:10 ` David Hildenbrand
@ 2025-02-28 15:19 ` David Hildenbrand
2025-02-28 21:38 ` Mathieu Desnoyers
1 sibling, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-02-28 15:19 UTC (permalink / raw)
To: Sean Christopherson
Cc: Linus Torvalds, Mathieu Desnoyers, Andrew Morton, linux-kernel,
Matthew Wilcox, Olivier Dion, linux-mm
On 28.02.25 16:10, David Hildenbrand wrote:
> On 28.02.25 15:59, Sean Christopherson wrote:
>> On Fri, Feb 28, 2025, David Hildenbrand wrote:
>>> On 28.02.25 06:17, Linus Torvalds wrote:
>>>> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
>>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>>
>>>>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>>>>> think we should try to re-implement the existing KSM userspace ABIs
>>>>> over SKSM.
>>>>
>>>> No, absolutely. The only point (for me) for your new synchronous one
>>>> would be if it replaced the kernel thread async scanning, which would
>>>> make the old user space interface basically pointless.
>>>>
>>>> But I don't actually know who uses KSM right now. My reaction really
>>>> comes from a "it's not nice code in the kernel", not from any actual
>>>> knowledge of the users.
>>>>
>>>> Maybe it works really well in some cloud VM environment, and we're
>>>> stuck with it forever.
>>>
>>> Exactly that; and besides the VM use-case, lately people stated using it in
>>> the context of interpreters (IIRC inside Meta) quite successfully as well.
>>
>> Does Red Hat (or any other KVM supporters) actually recommend using KSM for VMs
>> in cloud environments?
>
> Private clouds yes, that's where it is most commonly used for. I would
> assume that nobody for
forgot to complete that sentence: "... nobody really should be using
that in public clouds."
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 2:51 ` [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Linus Torvalds
2025-02-28 3:03 ` Mathieu Desnoyers
@ 2025-02-28 15:34 ` David Hildenbrand
2025-02-28 15:38 ` Matthew Wilcox
1 sibling, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-02-28 15:34 UTC (permalink / raw)
To: Linus Torvalds, Mathieu Desnoyers
Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
linux-mm
On 28.02.25 03:51, Linus Torvalds wrote:
> On Thu, 27 Feb 2025 at 18:31, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> This series introduces SKSM, a new page deduplication ABI,
>> aiming to fix the limitations inherent to the KSM ABI.
>
> So I'm not interested in seeing *another* KSM version.
>
> Because I absolutely do *NOT* want a new chapter in the saga of SLUB
> vs SLAB vs SLOB.
>
> However, if the feeling is that this can *replace* the current horror
> that is KSM, I'm a lot more interested. I suspect our current KSM
> model has largely been a failure, and this might be "good enough".
Maybe it would be comparable to khugepaged vs. MADV_COLLAPSE?
Many/most use cases just leave THP scanning+collapsing to khugepaged;
selected ones might "know better" what to do, so they effectively
disable khugepaged, and manually collapse THPs using MADV_COLLAPSE.
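(For comparison, the manual path is essentially a single call; buf/len
are placeholders for a range the application knows should be backed by
THPs right now:)

	/* Synchronously collapse the range into THPs instead of waiting
	 * for khugepaged to eventually get to it. */
	if (madvise(buf, len, MADV_COLLAPSE))
		perror("madvise(MADV_COLLAPSE)");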
If it were similar to that, it would not be a completely different KSM
version, just a different way to trigger merging: background scanning
vs. user-space triggered ("synchronous").
I could see use cases for such a synchronous interface, but I doubt it
could replace the background scanning that is actively getting used for
existing use cases; I have similar thoughts about khugepaged vs.
MADV_COLLAPSE.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 15:34 ` David Hildenbrand
@ 2025-02-28 15:38 ` Matthew Wilcox
0 siblings, 0 replies; 29+ messages in thread
From: Matthew Wilcox @ 2025-02-28 15:38 UTC (permalink / raw)
To: David Hildenbrand
Cc: Linus Torvalds, Mathieu Desnoyers, Andrew Morton, linux-kernel,
Olivier Dion, linux-mm
On Fri, Feb 28, 2025 at 04:34:50PM +0100, David Hildenbrand wrote:
> Maybe it would be comparable to khugepaged vs. MADV_COLLAPSE?
I think it is comparable ... because many people find khugepaged
unacceptable and there are proposals to move that to userspace.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 14:59 ` Mathieu Desnoyers
@ 2025-02-28 16:32 ` Peter Xu
2025-02-28 17:53 ` Mathieu Desnoyers
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2025-02-28 16:32 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
Olivier Dion, linux-mm
On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
> For the VM use-case, I wonder if we could just add a userfaultfd
> "COW" event that would notify userspace when a COW happens ?
I don't know what's best for KSM and how well this will work, but we
have had such an event for years. See UFFDIO_REGISTER_MODE_WP:
https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>
> This would allow userspace to replace ksmd by tracking the age of
> those anonymous pages, and issue madvise MADV_MERGE on them to
> write-protect+merge them when it is deemed useful.
>
> With both a new userfaultfd COW event and madvise MADV_MERGE,
> is there anything else that is fundamentally missing to move
> all the scanning complexity of KSM to userspace for the VM
> deduplication use-case ?
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 16:32 ` Peter Xu
@ 2025-02-28 17:53 ` Mathieu Desnoyers
2025-02-28 22:32 ` Peter Xu
0 siblings, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28 17:53 UTC (permalink / raw)
To: Peter Xu
Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
Olivier Dion, linux-mm
On 2025-02-28 11:32, Peter Xu wrote:
> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>> For the VM use-case, I wonder if we could just add a userfaultfd
>> "COW" event that would notify userspace when a COW happens ?
>
> I don't know what's the best for KSM and how well this will work, but we
> have such event for years.. See UFFDIO_REGISTER_MODE_WP:
>
> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
within an anonymous mmap() mapping, but returns EINVAL if I pass a
page-aligned address which sits within a private file mapping
(e.g. executable data).
Also, I notice that do_wp_page() only calls handle_userfault() for
VM_UFFD_WP when the vm_fault flags do not have FAULT_FLAG_UNSHARE
set.
AFAIU, as it stands now userfaultfd would not help tracking COW faults
caused by stores to private file mappings. Am I missing something ?
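For reference, this is roughly the kind of registration I tried (a
minimal sketch, not my exact test code; "dummy" just stands for a
variable that lives in the executable's private data mapping):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int dummy = 42;	/* sits in a MAP_PRIVATE file mapping of the ELF */

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = {
			.start = (uintptr_t)&dummy & ~((uintptr_t)psz - 1),
			.len = psz,
		},
		.mode = UFFDIO_REGISTER_MODE_WP,
	};

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
		perror("userfaultfd/UFFDIO_API");
	else if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		perror("UFFDIO_REGISTER");	/* EINVAL on the private file mapping */
	return 0;
}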
Thanks,
Mathieu
>
>>
>> This would allow userspace to replace ksmd by tracking the age of
>> those anonymous pages, and issue madvise MADV_MERGE on them to
>> write-protect+merge them when it is deemed useful.
>>
>> With both a new userfaultfd COW event and madvise MADV_MERGE,
>> is there anything else that is fundamentally missing to move
>> all the scanning complexity of KSM to userspace for the VM
>> deduplication use-case ?
>
> Thanks,
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 15:10 ` David Hildenbrand
2025-02-28 15:19 ` David Hildenbrand
@ 2025-02-28 21:38 ` Mathieu Desnoyers
2025-02-28 21:45 ` David Hildenbrand
1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28 21:38 UTC (permalink / raw)
To: David Hildenbrand, Sean Christopherson
Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
Olivier Dion, linux-mm
On 2025-02-28 10:10, David Hildenbrand wrote:
[...]
> For example, QEMU will mark all guest memory is mergeable using MADV, to
> limit the deduplicaton to guest RAM only.
>
On a related note, I think the madvise(2) documentation is inaccurate.
It states:
MADV_MERGEABLE (since Linux 2.6.32)
Enable Kernel Samepage Merging (KSM) for the pages in the range
specified by addr and length. [...]
AFAIU, based on code review of ksm_madvise(), this is not strictly true.
The KSM implementation enables KSM for pages in the entire vma containing the range.
So if it so happens that two mmap areas with identical protection flags are merged,
both will be considered mergeable by KSM as soon as at least one page from any of
those areas is made mergeable.
This does not appear to be an issue in qemu because guard pages with different
protection are placed between distinct mappings, which should prevent combining
the vmas.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 21:38 ` Mathieu Desnoyers
@ 2025-02-28 21:45 ` David Hildenbrand
2025-02-28 21:49 ` Mathieu Desnoyers
0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-02-28 21:45 UTC (permalink / raw)
To: Mathieu Desnoyers, Sean Christopherson
Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
Olivier Dion, linux-mm
On 28.02.25 22:38, Mathieu Desnoyers wrote:
> On 2025-02-28 10:10, David Hildenbrand wrote:
> [...]
>> For example, QEMU will mark all guest memory is mergeable using MADV, to
>> limit the deduplicaton to guest RAM only.
>>
>
> On a related note, I think the madvise(2) documentation is inaccurate.
>
> It states:
>
> MADV_MERGEABLE (since Linux 2.6.32)
> Enable Kernel Samepage Merging (KSM) for the pages in the range
> specified by addr and length. [...]
>
> AFAIU, based on code review of ksm_madvise(), this is not strictly true.
>
> The KSM implementation enables KSM for pages in the entire vma containing the range.
> So if it so happens that two mmap areas with identical protection flags are merged,
> both will be considered mergeable by KSM as soon as at least one page from any of
> those areas is made mergeable.
I *think* it does what is documented. In madvise_vma_behavior(),
ksm_madvise() will update "new_flags".
Then we call madvise_update_vma() to split the VMA if required and set
new_flags only on the split VMA. The handling is similar to other MADV
operations that end up modifying vm_flags.
If I am missing something and this is indeed broken, we should
definitely write a selftest for it and fix it.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 21:45 ` David Hildenbrand
@ 2025-02-28 21:49 ` Mathieu Desnoyers
0 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28 21:49 UTC (permalink / raw)
To: David Hildenbrand, Sean Christopherson
Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
Olivier Dion, linux-mm
On 2025-02-28 16:45, David Hildenbrand wrote:
> On 28.02.25 22:38, Mathieu Desnoyers wrote:
>> On 2025-02-28 10:10, David Hildenbrand wrote:
>> [...]
>>> For example, QEMU will mark all guest memory is mergeable using MADV, to
>>> limit the deduplicaton to guest RAM only.
>>>
>>
>> On a related note, I think the madvise(2) documentation is inaccurate.
>>
>> It states:
>>
>> MADV_MERGEABLE (since Linux 2.6.32)
>> Enable Kernel Samepage Merging (KSM) for the pages in
>> the range
>> specified by addr and length. [...]
>>
>> AFAIU, based on code review of ksm_madvise(), this is not strictly true.
>>
>> The KSM implementation enables KSM for pages in the entire vma
>> containing the range.
>> So if it so happens that two mmap areas with identical protection
>> flags are merged,
>> both will be considered mergeable by KSM as soon as at least one page
>> from any of
>> those areas is made mergeable.
>
> I *think* it does what is documented. In madvise_vma_behavior(),
> ksm_madvise() will update "new_flags".
>
> Then we call madvise_update_vma() to split the VMA if required and set
> new_flags only on the split VMA. The handling is similar to other MADV
> operations that end up modifying vm_flags.
>
> If I am missing something and this is indeed broken, we should
> definitely write a selftest for it and fix it.
>
You are correct, I missed that part. Thanks for the clarification!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 17:53 ` Mathieu Desnoyers
@ 2025-02-28 22:32 ` Peter Xu
2025-03-01 15:44 ` Mathieu Desnoyers
2025-03-03 20:01 ` Mathieu Desnoyers
0 siblings, 2 replies; 29+ messages in thread
From: Peter Xu @ 2025-02-28 22:32 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
Olivier Dion, linux-mm
On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
> On 2025-02-28 11:32, Peter Xu wrote:
> > On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
> > > For the VM use-case, I wonder if we could just add a userfaultfd
> > > "COW" event that would notify userspace when a COW happens ?
> >
> > I don't know what's the best for KSM and how well this will work, but we
> > have such event for years.. See UFFDIO_REGISTER_MODE_WP:
> >
> > https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>
> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
> resulting from a mmap mapping, but returns EINVAL if I pass a
> page-aligned address which sits within a private file mapping
> (e.g. executable data).
Yes, so far sync traps only support RAM-based file systems, or anonymous
memory. Generic private file mappings (the ones that store executables
and libraries) are not yet supported.
>
> Also, I notice that do_wp_page() only calls handle_userfault
> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
> set.
AFAICT that's expected, unshare should only be set on reads, never writes.
So uffd-wp shouldn't trap any of those.
>
> AFAIU, as it stands now userfaultfd would not help tracking COW faults
> caused by stores to private file mappings. Am I missing something ?
I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should work on
most mappings. That one is async, though, so more like soft-dirty. It
might be doable to try making it sync too without a lot of changes based on
how async tracking works.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 22:32 ` Peter Xu
@ 2025-03-01 15:44 ` Mathieu Desnoyers
2025-03-03 15:01 ` Peter Xu
2025-03-03 20:01 ` Mathieu Desnoyers
1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-03-01 15:44 UTC (permalink / raw)
To: Peter Xu
Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
Olivier Dion, linux-mm
On 2025-02-28 17:32, Peter Xu wrote:
> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>> On 2025-02-28 11:32, Peter Xu wrote:
>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>> "COW" event that would notify userspace when a COW happens ?
>>>
>>> I don't know what's the best for KSM and how well this will work, but we
>>> have such event for years.. See UFFDIO_REGISTER_MODE_WP:
>>>
>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>
>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>> resulting from a mmap mapping, but returns EINVAL if I pass a
>> page-aligned address which sits within a private file mapping
>> (e.g. executable data).
>
> Yes, so far sync traps only supports RAM-based file systems, or anonymous.
> Generic private file mappings (that stores executables and libraries) are
> not yet supported.
OK, this confirms my observations.
>
>>
>> Also, I notice that do_wp_page() only calls handle_userfault
>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>> set.
>
> AFAICT that's expected, unshare should only be set on reads, never writes.
> So uffd-wp shouldn't trap any of those.
I'm confused by your comment. I thought unshare only applies to
*write* faults. What am I missing ?
>
>>
>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>> caused by stores to private file mappings. Am I missing something ?
>
> I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should work on
> most mappings. That one is async, though, so more like soft-dirty. It
> might be doable to try making it sync too without a lot of changes based on
> how async tracking works.
I'll try this out. It may not matter that it's async, given that the
use-case is tracking the age since the WP fault on the COW pages. We
don't need to react to the event in place to alter its behavior; just
a notification should be fine AFAIU.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-03-01 15:44 ` Mathieu Desnoyers
@ 2025-03-03 15:01 ` Peter Xu
2025-03-03 16:36 ` David Hildenbrand
0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2025-03-03 15:01 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
Olivier Dion, linux-mm
On Sat, Mar 01, 2025 at 10:44:22AM -0500, Mathieu Desnoyers wrote:
> > > Also, I notice that do_wp_page() only calls handle_userfault
> > > VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
> > > set.
> >
> > AFAICT that's expected, unshare should only be set on reads, never writes.
> > So uffd-wp shouldn't trap any of those.
>
> I'm confused by your comment. I thought unshare only applies to
> *write* faults. What am I missing ?
The major path so far to set unshare is here in GUP (ignoring two corner
cases used in either s390 and ksm):
	if (unshare) {
		fault_flags |= FAULT_FLAG_UNSHARE;
		/* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
		VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE);
	}
See the VM_BUG_ON() - if it's write it'll crash already.
"unshare", in its earliest form of patch, used to be called COR
(Copy-On-Read), which might be more straightforward in this case.. so it's
the counterpart of COW but for read cases where a copy is required. The
patchset that introduced it has more information (e.g. a7f2266041).
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-03-03 15:01 ` Peter Xu
@ 2025-03-03 16:36 ` David Hildenbrand
0 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-03-03 16:36 UTC (permalink / raw)
To: Peter Xu, Mathieu Desnoyers
Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
Olivier Dion, linux-mm
On 03.03.25 16:01, Peter Xu wrote:
> On Sat, Mar 01, 2025 at 10:44:22AM -0500, Mathieu Desnoyers wrote:
>>>> Also, I notice that do_wp_page() only calls handle_userfault
>>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>>>> set.
>>>
>>> AFAICT that's expected, unshare should only be set on reads, never writes.
>>> So uffd-wp shouldn't trap any of those.
>>
>> I'm confused by your comment. I thought unshare only applies to
>> *write* faults. What am I missing ?
>
> The major path so far to set unshare is here in GUP (ignoring two corner
> cases used in either s390 and ksm):
"unshare" fault, in contrast to a write fault, will not turn the PTE
writable.
That's why it does not trigger userfaultfd-wp: there is no write access,
write-protection is left unchanged.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-02-28 22:32 ` Peter Xu
2025-03-01 15:44 ` Mathieu Desnoyers
@ 2025-03-03 20:01 ` Mathieu Desnoyers
2025-03-03 20:45 ` Peter Xu
2025-03-03 20:49 ` David Hildenbrand
1 sibling, 2 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-03-03 20:01 UTC (permalink / raw)
To: Peter Xu
Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
Olivier Dion, linux-mm
On 2025-02-28 17:32, Peter Xu wrote:
> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>> On 2025-02-28 11:32, Peter Xu wrote:
>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>> "COW" event that would notify userspace when a COW happens ?
>>>
>>> I don't know what's the best for KSM and how well this will work, but we
>>> have such event for years.. See UFFDIO_REGISTER_MODE_WP:
>>>
>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>
>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>> resulting from a mmap mapping, but returns EINVAL if I pass a
>> page-aligned address which sits within a private file mapping
>> (e.g. executable data).
>
> Yes, so far sync traps only supports RAM-based file systems, or anonymous.
> Generic private file mappings (that stores executables and libraries) are
> not yet supported.
>
>>
>> Also, I notice that do_wp_page() only calls handle_userfault
>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>> set.
>
> AFAICT that's expected, unshare should only be set on reads, never writes.
> So uffd-wp shouldn't trap any of those.
>
>>
>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>> caused by stores to private file mappings. Am I missing something ?
>
> I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should work on
> most mappings. That one is async, though, so more like soft-dirty. It
> might be doable to try making it sync too without a lot of changes based on
> how async tracking works.
I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
be a good fit. Here is what I have in mind to replace the ksmd scanning
thread for the VM use-case with purely user-space driven scanning (a
rough sketch follows the steps below):
Within qemu or similar user-space process:
1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC feature and
UFFDIO_REGISTER_MODE_WP mode.
2) Protect user-space memory with the PAGEMAP_SCAN ioctl PM_SCAN_WP_MATCHING flag
to detect memory which stays invariant for a long time.
3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which pages are written to.
Keep track of memory which is frequently modified, so it can be left alone and
not write-protected nor merged anymore.
4) Whenever pages stay invariant for a given lapse of time, merge them with the new
madvise(2) KSM_MERGE behavior.
Let me know if that makes sense.
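To make that more concrete, here is a rough, untested sketch of steps 1)
to 4) (error handling mostly elided). The userfaultfd and PAGEMAP_SCAN
parts use existing uAPI as I understand it (recent kernel headers are
needed); MADV_MERGE stands in for the new madvise behavior proposed in
this series, so treat that call as a placeholder:

#include <fcntl.h>
#include <linux/fs.h>		/* PAGEMAP_SCAN, pm_scan_arg, PAGE_IS_WRITTEN */
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 1) Arm asynchronous write-protect tracking on the guest RAM range. */
static int track_guest_ram(void *ram, size_t len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_WP_ASYNC | UFFD_FEATURE_WP_UNPOPULATED,
	};
	struct uffdio_register reg = {
		.range = { .start = (uintptr_t)ram, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg))
		return -1;
	return uffd;
}

/*
 * 2) + 3) One scan pass over /proc/self/pagemap: report which pages were
 * written since the previous pass and write-protect them again
 * (PM_SCAN_WP_MATCHING), so the next pass only sees new writes. Returns
 * the number of page_region entries filled in vec, as I understand the
 * ioctl.
 */
static long scan_written(int pagemap_fd, void *ram, size_t len,
			 struct page_region *vec, size_t vec_len)
{
	struct pm_scan_arg arg = {
		.size = sizeof(arg),
		.flags = PM_SCAN_WP_MATCHING,
		.start = (uintptr_t)ram,
		.end = (uintptr_t)ram + len,
		.vec = (uintptr_t)vec,
		.vec_len = vec_len,
		.category_mask = PAGE_IS_WRITTEN,
		.return_mask = PAGE_IS_WRITTEN,
	};

	return ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
}

/*
 * 4) Policy, run periodically: bump an "idle" counter for pages that did
 * not show up in any returned region, reset it for pages that did, and
 * once a page has stayed idle long enough, hand it to the kernel for
 * synchronous merging:
 *
 *	madvise(page, page_size, MADV_MERGE);
 *
 * Pages that keep getting reported as written are simply left alone.
 */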
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-03-03 20:01 ` Mathieu Desnoyers
@ 2025-03-03 20:45 ` Peter Xu
2025-03-03 20:49 ` David Hildenbrand
1 sibling, 0 replies; 29+ messages in thread
From: Peter Xu @ 2025-03-03 20:45 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
Olivier Dion, linux-mm
On Mon, Mar 03, 2025 at 03:01:38PM -0500, Mathieu Desnoyers wrote:
> On 2025-02-28 17:32, Peter Xu wrote:
> > On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
> > > On 2025-02-28 11:32, Peter Xu wrote:
> > > > On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
> > > > > For the VM use-case, I wonder if we could just add a userfaultfd
> > > > > "COW" event that would notify userspace when a COW happens ?
> > > >
> > > > I don't know what's the best for KSM and how well this will work, but we
> > > > have such event for years.. See UFFDIO_REGISTER_MODE_WP:
> > > >
> > > > https://man7.org/linux/man-pages/man2/userfaultfd.2.html
> > >
> > > userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
> > > resulting from a mmap mapping, but returns EINVAL if I pass a
> > > page-aligned address which sits within a private file mapping
> > > (e.g. executable data).
> >
> > Yes, so far sync traps only supports RAM-based file systems, or anonymous.
> > Generic private file mappings (that stores executables and libraries) are
> > not yet supported.
> >
> > >
> > > Also, I notice that do_wp_page() only calls handle_userfault
> > > VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
> > > set.
> >
> > AFAICT that's expected, unshare should only be set on reads, never writes.
> > So uffd-wp shouldn't trap any of those.
> >
> > >
> > > AFAIU, as it stands now userfaultfd would not help tracking COW faults
> > > caused by stores to private file mappings. Am I missing something ?
> >
> > I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should work on
> > most mappings. That one is async, though, so more like soft-dirty. It
> > might be doable to try making it sync too without a lot of changes based on
> > how async tracking works.
>
> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
> be a good fit. Here is what I have in mind to replace the ksmd scanning
> thread for the VM use-case by a purely user-space driven scanning:
>
> Within qemu or similar user-space process:
>
> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC feature and
> UFFDIO_REGISTER_MODE_WP mode.
>
> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl PM_SCAN_WP_MATCHING flag
> to detect memory which stays invariant for a long time.
>
> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which pages are written to.
> Keep track of memory which is frequently modified, so it can be left alone and
> not write-protected nor merged anymore.
>
> 4) Whenever pages stay invariant for a given lapse of time, merge them with the new
> madvise(2) KSM_MERGE behavior.
>
> Let me know if that makes sense.
I can't speak of how KSM should go from there, but from userfault tracking
POV, that makes sense to me.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-03-03 20:01 ` Mathieu Desnoyers
2025-03-03 20:45 ` Peter Xu
@ 2025-03-03 20:49 ` David Hildenbrand
2025-03-05 14:06 ` Mathieu Desnoyers
1 sibling, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-03-03 20:49 UTC (permalink / raw)
To: Mathieu Desnoyers, Peter Xu
Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
Olivier Dion, linux-mm
On 03.03.25 21:01, Mathieu Desnoyers wrote:
> On 2025-02-28 17:32, Peter Xu wrote:
>> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>>> On 2025-02-28 11:32, Peter Xu wrote:
>>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>>> "COW" event that would notify userspace when a COW happens ?
>>>>
>>>> I don't know what's the best for KSM and how well this will work, but we
>>>> have such event for years.. See UFFDIO_REGISTER_MODE_WP:
>>>>
>>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>>
>>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>>> resulting from a mmap mapping, but returns EINVAL if I pass a
>>> page-aligned address which sits within a private file mapping
>>> (e.g. executable data).
>>
>> Yes, so far sync traps only supports RAM-based file systems, or anonymous.
>> Generic private file mappings (that stores executables and libraries) are
>> not yet supported.
>>
>>>
>>> Also, I notice that do_wp_page() only calls handle_userfault
>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>>> set.
>>
>> AFAICT that's expected, unshare should only be set on reads, never writes.
>> So uffd-wp shouldn't trap any of those.
>>
>>>
>>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>>> caused by stores to private file mappings. Am I missing something ?
>>
>> I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should work on
>> most mappings. That one is async, though, so more like soft-dirty. It
>> might be doable to try making it sync too without a lot of changes based on
>> how async tracking works.
>
> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
> be a good fit. Here is what I have in mind to replace the ksmd scanning
> thread for the VM use-case by a purely user-space driven scanning:
>
> Within qemu or similar user-space process:
>
> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC feature and
> UFFDIO_REGISTER_MODE_WP mode.
>
> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl PM_SCAN_WP_MATCHING flag
> to detect memory which stays invariant for a long time.
>
> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which pages are written to.
> Keep track of memory which is frequently modified, so it can be left alone and
> not write-protected nor merged anymore.
>
> 4) Whenever pages stay invariant for a given lapse of time, merge them with the new
> madvise(2) KSM_MERGE behavior.
>
> Let me know if that makes sense.
Note that one of the strengths of ksm in the kernel right now is that we
write-protect + try-deduplicate only when we are fairly sure that we can
deduplicate (unstable tree), and that the interaction with THPs / large
folios is fairly well thought-through.
Also note that, just because data hasn't been written in some time
interval, doesn't mean that it should be deduplicated and result in CoW
on next write access.
One probably would have to mimic what the KSM implementation in the
kernel does, and build something like the unstable tree, to find
candidates where we can actually deduplicate. Then, have a way to
not deduplicate if the content changed.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-03-03 20:49 ` David Hildenbrand
@ 2025-03-05 14:06 ` Mathieu Desnoyers
2025-03-05 19:22 ` David Hildenbrand
0 siblings, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-03-05 14:06 UTC (permalink / raw)
To: David Hildenbrand, Peter Xu
Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
Olivier Dion, linux-mm
On 2025-03-03 15:49, David Hildenbrand wrote:
> On 03.03.25 21:01, Mathieu Desnoyers wrote:
>> On 2025-02-28 17:32, Peter Xu wrote:
>>> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>>>> On 2025-02-28 11:32, Peter Xu wrote:
>>>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>>>> "COW" event that would notify userspace when a COW happens ?
>>>>>
>>>>> I don't know what's the best for KSM and how well this will work,
>>>>> but we
>>>>> have such event for years.. See UFFDIO_REGISTER_MODE_WP:
>>>>>
>>>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>>>
>>>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>>>> resulting from a mmap mapping, but returns EINVAL if I pass a
>>>> page-aligned address which sits within a private file mapping
>>>> (e.g. executable data).
>>>
>>> Yes, so far sync traps only supports RAM-based file systems, or
>>> anonymous.
>>> Generic private file mappings (that stores executables and libraries)
>>> are
>>> not yet supported.
>>>
>>>>
>>>> Also, I notice that do_wp_page() only calls handle_userfault
>>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>>>> set.
>>>
>>> AFAICT that's expected, unshare should only be set on reads, never
>>> writes.
>>> So uffd-wp shouldn't trap any of those.
>>>
>>>>
>>>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>>>> caused by stores to private file mappings. Am I missing something ?
>>>
>>> I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should
>>> work on
>>> most mappings. That one is async, though, so more like soft-dirty. It
>>> might be doable to try making it sync too without a lot of changes
>>> based on
>>> how async tracking works.
>>
>> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
>> be a good fit. Here is what I have in mind to replace the ksmd scanning
>> thread for the VM use-case by a purely user-space driven scanning:
>>
>> Within qemu or similar user-space process:
>>
>> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC
>> feature and
>> UFFDIO_REGISTER_MODE_WP mode.
>>
>> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl
>> PM_SCAN_WP_MATCHING flag
>> to detect memory which stays invariant for a long time.
>>
>> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which
>> pages are written to.
>> Keep track of memory which is frequently modified, so it can be
>> left alone and
>> not write-protected nor merged anymore.
>>
>> 4) Whenever pages stay invariant for a given lapse of time, merge them
>> with the new
>> madvise(2) KSM_MERGE behavior.
>>
>> Let me know if that makes sense.
>
> Note that one of the strengths of ksm in the kernel right now is that we
> write-protect + try-deduplicate only when we are fairly sure that we can
> deduplicate (unstable tree), and that the interaction with THPs / large
> folios is fairly well thought-through.
>
> Also note that, just because data hasn't been written in some time
> interval, doesn't mean that it should be deduplicated and result in CoW
> on next write access.
Right. This tracking of address range access pattern would have to be
implemented in user-space.
> One probably would have to mimic what the KSM implementation in the
> kernel does, and built something like the unstable tree, to find
> candidates where we can actually deduplciate. Then, have a way to not-
> deduplicate if the content changed.
With madvise MADV_MERGE, there is no need to "unmerge". At the time of
the MADV_MERGE, the page is write-protected, its content is merged with
exact duplicates, and that write-protected page is kept in a global hash
table indexed by checksum.
However, unlike KSM, it won't track that range on an ongoing basis.
"Unmerging" the page is done naturally by writing to the merged address
range. Because it is write-protected, this will trigger COW, and will
therefore provide a new anonymous page to the process, thus "unmerging"
that page.
It's really just up to userspace to track COW faults and figure out
that it really should not try to merge that range anymore, based on the
access pattern monitored through write-protection faults.
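To illustrate that lifecycle, a minimal sketch (headers and error
handling elided; it assumes the MADV_MERGE behavior proposed in this
series, which is not upstream):

	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(p, 0x55, 4096);

	/* Synchronously write-protect + merge with any identical page
	 * already tracked in the global hash table. */
	if (madvise(p, 4096, MADV_MERGE))
		perror("madvise(MADV_MERGE)");

	/* No explicit unmerge: the first store triggers COW and hands
	 * the process back a private anonymous page. */
	p[0] = 0x56;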
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
2025-03-05 14:06 ` Mathieu Desnoyers
@ 2025-03-05 19:22 ` David Hildenbrand
0 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-03-05 19:22 UTC (permalink / raw)
To: Mathieu Desnoyers, Peter Xu
Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
Olivier Dion, linux-mm
On 05.03.25 15:06, Mathieu Desnoyers wrote:
> On 2025-03-03 15:49, David Hildenbrand wrote:
>> On 03.03.25 21:01, Mathieu Desnoyers wrote:
>>> On 2025-02-28 17:32, Peter Xu wrote:
>>>> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>>>>> On 2025-02-28 11:32, Peter Xu wrote:
>>>>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>>>>> "COW" event that would notify userspace when a COW happens ?
>>>>>>
>>>>>> I don't know what's the best for KSM and how well this will work,
>>>>>> but we
>>>>>> have such event for years.. See UFFDIO_REGISTER_MODE_WP:
>>>>>>
>>>>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>>>>
>>>>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>>>>> resulting from a mmap mapping, but returns EINVAL if I pass a
>>>>> page-aligned address which sits within a private file mapping
>>>>> (e.g. executable data).
>>>>
>>>> Yes, so far sync traps only supports RAM-based file systems, or
>>>> anonymous.
>>>> Generic private file mappings (that stores executables and libraries)
>>>> are
>>>> not yet supported.
>>>>
>>>>>
>>>>> Also, I notice that do_wp_page() only calls handle_userfault
>>>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>>>>> set.
>>>>
>>>> AFAICT that's expected, unshare should only be set on reads, never
>>>> writes.
>>>> So uffd-wp shouldn't trap any of those.
>>>>
>>>>>
>>>>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>>>>> caused by stores to private file mappings. Am I missing something ?
>>>>
>>>> I think you're right. So we have UFFD_FEATURE_WP_ASYNC that should
>>>> work on
>>>> most mappings. That one is async, though, so more like soft-dirty. It
>>>> might be doable to try making it sync too without a lot of changes
>>>> based on
>>>> how async tracking works.
>>>
>>> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
>>> be a good fit. Here is what I have in mind to replace the ksmd scanning
>>> thread for the VM use-case by a purely user-space driven scanning:
>>>
>>> Within qemu or similar user-space process:
>>>
>>> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC
>>> feature and
>>> UFFDIO_REGISTER_MODE_WP mode.
>>>
>>> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl
>>> PM_SCAN_WP_MATCHING flag
>>> to detect memory which stays invariant for a long time.
>>>
>>> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which
>>> pages are written to.
>>> Keep track of memory which is frequently modified, so it can be
>>> left alone and
>>> not write-protected nor merged anymore.
>>>
>>> 4) Whenever pages stay invariant for a given lapse of time, merge them
>>> with the new
>>> madvise(2) KSM_MERGE behavior.
>>>
>>> Let me know if that makes sense.
>>
>> Note that one of the strengths of ksm in the kernel right now is that we
>> write-protect + try-deduplicate only when we are fairly sure that we can
>> deduplicate (unstable tree), and that the interaction with THPs / large
>> folios is fairly well thought-through.
>>
>> Also note that, just because data hasn't been written in some time
>> interval, doesn't mean that it should be deduplicated and result in CoW
>> on next write access.
>
> Right. This tracking of address range access pattern would have to be
> implemented in user-space.
>
>> One probably would have to mimic what the KSM implementation in the
>> kernel does, and built something like the unstable tree, to find
>> candidates where we can actually deduplciate. Then, have a way to not-
>> deduplicate if the content changed.
>
> With madvise MADV_MERGE, there is no need to "unmerge". The merge
> write-protects the page and merges its content at the time of the
> MADV_MERGE with exact duplicates, and keeps that write protected page in
> a global hash table indexed by checksum.
Right, and that's a real problem.
>
> However, unlike KSM, it won't track that range on an ongoing basis.
>
> "Unmerging" the page is done naturally by writing to the merged address
> range. Because it is write-protected, this will trigger COW, and will
> therefore provide a new anonymous page to the process, thus "unmerging"
> that page.
>
> It's really just up to userspace to track COW faults and figure out
> that it really should not try to merge that range anymore, based on the
> the access pattern monitored through write-protection faults.
>
Just to be clear, what you described here is, performance-wise, very
likely not a feasible replacement for the in-tree KSM for the VM use
case (again, the use case KSM was primarily invented for).
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2025-03-05 19:22 UTC | newest]
Thread overview: 29+ messages -- links below jump to the message on this page --
2025-02-28 2:30 [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Mathieu Desnoyers
2025-02-28 2:30 ` [RFC PATCH 1/2] mm: Introduce " Mathieu Desnoyers
2025-02-28 2:30 ` [RFC PATCH 2/2] selftests/kskm: Introduce SKSM basic test Mathieu Desnoyers
2025-02-28 2:51 ` [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Linus Torvalds
2025-02-28 3:03 ` Mathieu Desnoyers
2025-02-28 5:17 ` Linus Torvalds
2025-02-28 13:59 ` David Hildenbrand
2025-02-28 14:59 ` Sean Christopherson
2025-02-28 15:10 ` David Hildenbrand
2025-02-28 15:19 ` David Hildenbrand
2025-02-28 21:38 ` Mathieu Desnoyers
2025-02-28 21:45 ` David Hildenbrand
2025-02-28 21:49 ` Mathieu Desnoyers
2025-02-28 15:01 ` Mathieu Desnoyers
2025-02-28 15:18 ` David Hildenbrand
2025-02-28 14:59 ` Mathieu Desnoyers
2025-02-28 16:32 ` Peter Xu
2025-02-28 17:53 ` Mathieu Desnoyers
2025-02-28 22:32 ` Peter Xu
2025-03-01 15:44 ` Mathieu Desnoyers
2025-03-03 15:01 ` Peter Xu
2025-03-03 16:36 ` David Hildenbrand
2025-03-03 20:01 ` Mathieu Desnoyers
2025-03-03 20:45 ` Peter Xu
2025-03-03 20:49 ` David Hildenbrand
2025-03-05 14:06 ` Mathieu Desnoyers
2025-03-05 19:22 ` David Hildenbrand
2025-02-28 15:34 ` David Hildenbrand
2025-02-28 15:38 ` Matthew Wilcox