linux-mm.kvack.org archive mirror
* [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
@ 2025-02-28  2:30 Mathieu Desnoyers
  2025-02-28  2:30 ` [RFC PATCH 1/2] mm: Introduce " Mathieu Desnoyers
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28  2:30 UTC
  To: Andrew Morton
  Cc: linux-kernel, Mathieu Desnoyers, Linus Torvalds, Matthew Wilcox,
	Olivier Dion, linux-mm

This series introduces SKSM, a new page deduplication ABI,
aiming to fix the limitations inherent to the KSM ABI.

The implementation is simple enough: SKSM is implemented in about 100
LOC compared to 2.5k LOC for KSM (on top of the common KSM helpers).

This is sent as a proof of concept. It applies on top of v6.13.

Feedback is welcome!

Mathieu

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Olivier Dion <odion@efficios.com>
Cc: linux-mm@kvack.org

Mathieu Desnoyers (2):
  mm: Introduce SKSM: Synchronous Kernel Samepage Merging
  selftests/sksm: Introduce SKSM basic test

 include/linux/ksm.h                       |   4 +
 include/linux/mm_types.h                  |   7 +
 include/linux/page-flags.h                |  42 ++++
 include/linux/sksm.h                      |  27 +++
 include/uapi/asm-generic/mman-common.h    |   2 +
 mm/Kconfig                                |   5 +
 mm/Makefile                               |   1 +
 mm/ksm-common.h                           | 228 ++++++++++++++++++++++
 mm/ksm.c                                  | 219 +--------------------
 mm/madvise.c                              |   6 +
 mm/memory.c                               |   2 +
 mm/page_alloc.c                           |   3 +
 mm/sksm.c                                 | 190 ++++++++++++++++++
 tools/testing/selftests/sksm/.gitignore   |   2 +
 tools/testing/selftests/sksm/Makefile     |  14 ++
 tools/testing/selftests/sksm/basic_test.c | 217 ++++++++++++++++++++
 16 files changed, 751 insertions(+), 218 deletions(-)
 create mode 100644 include/linux/sksm.h
 create mode 100644 mm/ksm-common.h
 create mode 100644 mm/sksm.c
 create mode 100644 tools/testing/selftests/sksm/.gitignore
 create mode 100644 tools/testing/selftests/sksm/Makefile
 create mode 100644 tools/testing/selftests/sksm/basic_test.c

-- 
2.39.5



* [RFC PATCH 1/2] mm: Introduce SKSM: Synchronous Kernel Samepage Merging
  2025-02-28  2:30 [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Mathieu Desnoyers
@ 2025-02-28  2:30 ` Mathieu Desnoyers
  2025-02-28  2:30 ` [RFC PATCH 2/2] selftests/sksm: Introduce SKSM basic test Mathieu Desnoyers
  2025-02-28  2:51 ` [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Linus Torvalds
  2 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28  2:30 UTC
  To: Andrew Morton
  Cc: linux-kernel, Mathieu Desnoyers, Linus Torvalds, Matthew Wilcox,
	Olivier Dion, linux-mm

* Main use-case targeted by SKSM: Code patching

The main use-case targeted by SKSM is deduplication of anonymous
pages created by COW (Copy-On-Write) triggered by patching executable
and library code for a user-space implementation of "static keys" and
"alternative" code patching. Code patching improves:

- Runtime feature detection, where a constructor can dynamically
  enable a feature by turning a no-op into a jump.

- Instrumentation activation at runtime (e.g. tracepoints), by patching
  a dormant no-op instrumentation site into a jump.

- Runtime assembler specialisation, where a constructor can dynamically
  modify assembler instructions to select the best alternative for the
  detected hardware and software environment (e.g. CPU features, rseq
  availability).

The main distinction between doing code patching at kernel-level and at
user-space level is that in user-space, executable and library code is
shared across all processes mapping the same executable or library
files. This reduces memory use and improves cache locality by sharing
executable pages across processes.

Writing to those private mappings triggers COW, which allocates
anonymous pages within each process, and thus loses the benefit of
sharing the same pages from the backing storage.

Without memory deduplication, this increases memory use, and therefore
degrades cache locality: populating the patched content into separate
COW pages within each process ends up using distinct CPU cache lines,
thus thrashing the CPU instruction and data caches.
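
As a rough illustration of the COW behavior described above (not part
of this series), the sketch below stores to a MAP_PRIVATE file mapping
and checks that the backing file is unchanged, which means the store
landed in a process-private anonymous page. The helper name
private_store_is_cow() and the /tmp path are made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a store through a MAP_PRIVATE file
 * mapping left the backing file untouched (the store went to a private
 * COW page), 0 if the file content changed, -1 on error.
 */
int private_store_is_cow(void)
{
	char path[] = "/tmp/cow-demo-XXXXXX";
	long page_size = sysconf(_SC_PAGESIZE);
	char from_file = 0;
	int fd, ret = -1;
	char *map;

	assert(page_size > 0);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* the file lives on until fd is closed */

	if (ftruncate(fd, page_size) != 0)
		goto out_close;
	if (pwrite(fd, "A", 1, 0) != 1)
		goto out_close;

	map = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	map[0] = 'B';		/* store to a private mapping: triggers COW */

	/* The file still holds 'A'; only this process sees 'B'. */
	if (pread(fd, &from_file, 1, 0) == 1)
		ret = (from_file == 'A' && map[0] == 'B');

	munmap(map, (size_t)page_size);
out_close:
	close(fd);
	return ret;
}
```

Every process that performs the same store gets its own copy of the
page, which is the duplication SKSM aims to undo.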

* Why not use KSM?

The KSM mechanism has the following downsides which the SKSM ABI aims to
overcome:

- KSM requires careful tuning of scan parameters for the workload by the
  system administrator.

  A) This makes KSM mostly useless with a standard distro config.

  B) KSM is workload-specific.

  C) Scanning pages adds overhead to the system, which is the reason
     why the scan parameters must be tuned for the workload.

- KSM has security implications, because it allows processes to
  confirm that an unrelated process has a page with known content.

  A) The documentation of madvise(2) MADV_MERGEABLE would benefit from
     advising against targeting memory that contains secret data,
     due to the risk of discovery through a side-channel timing attack.

  B) prctl(2) PR_SET_MEMORY_MERGE inherently marks the entire process
     memory as mergeable, which makes it incompatible with
     security-oriented use-cases.

* SKSM Overview

SKSM enables synchronous dynamic sharing of identical pages found in
different memory areas, even if they are not shared by fork().

Userspace must explicitly request that pages within specific address
ranges be merged, using madvise MADV_MERGE. Those pages should *not*
contain secrets, as side-channel timing attacks can allow a process to
learn that known content exists within another process.

Merging is performed synchronously within madvise. There is no global
scan and no need for a background daemon.

The anonymous pages targeted for merge are write-protected and
checksummed. They are then compared to other pages targeted for merge.

The mergeable pages are added to a hash table. The hash value is
derived from the checksum of the page content, and candidates that hash
to the same bucket are compared by page content.

If a page is written to after being targeted for merge, a COW will be
triggered, and thus a new page will be populated in its stead.
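
The lookup-then-insert logic described above can be modelled in
user-space as a rough sketch. This is a toy model, not the kernel code:
the toy_* names are hypothetical, FNV-1a stands in for the xxhash()
call used by the patch, the table has 16 buckets rather than 65536, and
write-protection, reference counting and locking are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE	4096
#define TOY_HT_BITS	4
#define TOY_HT_SIZE	(1u << TOY_HT_BITS)

struct toy_page {
	unsigned char data[TOY_PAGE_SIZE];
	uint32_t checksum;
	struct toy_page *next;		/* bucket chain */
};

struct toy_table {
	struct toy_page *buckets[TOY_HT_SIZE];
};

/* FNV-1a: a stand-in for xxhash(), chosen only to keep this short. */
static uint32_t toy_checksum(const unsigned char *data, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Return the page that should back the content of @page: an already
 * registered page with identical content if one exists, otherwise
 * @page itself after adding it to the table.
 */
struct toy_page *toy_merge(struct toy_table *t, struct toy_page *page)
{
	struct toy_page *k;
	unsigned int bucket;

	assert(t && page);

	page->checksum = toy_checksum(page->data, TOY_PAGE_SIZE);
	bucket = page->checksum & (TOY_HT_SIZE - 1);

	for (k = t->buckets[bucket]; k; k = k->next) {
		/* Cheap checksum test first, full content compare second. */
		if (k->checksum != page->checksum)
			continue;
		if (memcmp(k->data, page->data, TOY_PAGE_SIZE) == 0)
			return k;
	}

	page->next = t->buckets[bucket];
	t->buckets[bucket] = page;
	return page;
}
```

The checksum only narrows down the candidates; the byte-wise compare is
what decides whether two pages can share a backing page, so a checksum
collision costs time but cannot cause a wrong merge.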

* Expected Use

User-space is expected to perform code patching, e.g. from a library
constructor. Once the text pages are expected to stay invariant for a
long time, it issues madvise(2) MADV_MERGE on those pages. At this
point, the pages are write-protected and merged with identical SKSM
pages. Further stores to those pages trigger COW again.
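
From the caller's side, the request could look like the sketch below.
It assumes a kernel with this RFC applied: MADV_MERGE is not in
mainline, and the value 26 is taken from the asm-generic
mman-common.h hunk of this patch (architectures with their own mman.h
values are not covered). The wrapper name sksm_merge_range() is
hypothetical. A kernel without the patch is expected to reject the
advice with EINVAL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_MERGE
/* Value from this RFC's mman-common.h hunk; not a mainline constant. */
#define MADV_MERGE	26
#endif

/*
 * Hypothetical wrapper: ask the kernel to merge the pages in
 * [addr, addr + len) with identical SKSM pages. @addr must be
 * page-aligned. Returns 0 on success or a negative errno value,
 * e.g. -EINVAL when the kernel does not know MADV_MERGE.
 */
int sksm_merge_range(void *addr, size_t len)
{
	assert(addr != NULL);

	if (madvise(addr, len, MADV_MERGE) == 0)
		return 0;
	return -errno;
}
```

A library constructor would call this once after its last store to the
patched text pages, and treat -EINVAL as "no SKSM support, keep the
private copies".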

* Results

Output of "cat /proc/vmstat | grep nr_anon_pages" while running 1000
instances of "sleep 500":

- Baseline (no preload):                nr_anon_pages  39721
- COW each executable page from libc:   nr_anon_pages 419927
- madvise MADV_MERGE after COW of libc: nr_anon_pages  45525
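
To make the deltas explicit, the small sketch below redoes the
arithmetic on the three figures above. The constant and function names
are made up for this example; the numbers are copied from the vmstat
output. Relative to the baseline, COW adds 380206 anonymous pages
(about 380 per instance), and after MADV_MERGE only 5804 extra pages
remain.

```c
#include <assert.h>

/* nr_anon_pages figures copied from the results above. */
enum {
	ANON_BASELINE	= 39721,	/* no preload */
	ANON_COW	= 419927,	/* COW each executable page of libc */
	ANON_MERGED	= 45525,	/* MADV_MERGE after COW of libc */
};

/* Anonymous pages added on top of the baseline by a given run. */
long extra_pages(long measured)
{
	assert(measured >= ANON_BASELINE);
	return measured - ANON_BASELINE;
}

/* Anonymous pages no longer allocated once MADV_MERGE is used. */
long pages_saved_by_merge(void)
{
	return ANON_COW - ANON_MERGED;
}
```

The merged run is 374402 pages below the COW-only run, which means
about 98% of the pages duplicated by COW are deduplicated.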

* Limitations

- This is a proof of concept!
- It is incompatible with KSM (depends on !KSM) for now.
- Swap behavior under memory pressure is untested.
- The size of the hash table is static (65536 buckets) for now.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Olivier Dion <odion@efficios.com>
Cc: linux-mm@kvack.org
---
 include/linux/ksm.h                    |   4 +
 include/linux/mm_types.h               |   7 +
 include/linux/page-flags.h             |  42 +++++
 include/linux/sksm.h                   |  27 +++
 include/uapi/asm-generic/mman-common.h |   2 +
 mm/Kconfig                             |   5 +
 mm/Makefile                            |   1 +
 mm/ksm-common.h                        | 228 +++++++++++++++++++++++++
 mm/ksm.c                               | 219 +-----------------------
 mm/madvise.c                           |   6 +
 mm/memory.c                            |   2 +
 mm/page_alloc.c                        |   3 +
 mm/sksm.c                              | 190 +++++++++++++++++++++
 13 files changed, 518 insertions(+), 218 deletions(-)
 create mode 100644 include/linux/sksm.h
 create mode 100644 mm/ksm-common.h
 create mode 100644 mm/sksm.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 6a53ac4885bb..dc3ce855863c 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -118,6 +118,10 @@ static inline void ksm_exit(struct mm_struct *mm)
 {
 }
 
+static inline void ksm_map_zero_page(struct mm_struct *mm)
+{
+}
+
 static inline void ksm_might_unmap_zero_page(struct mm_struct *mm, pte_t pte)
 {
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 332cee285662..e4940562cb81 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -19,6 +19,7 @@
 #include <linux/workqueue.h>
 #include <linux/seqlock.h>
 #include <linux/percpu_counter.h>
+#include <linux/types.h>
 
 #include <asm/mmu.h>
 
@@ -216,6 +217,12 @@ struct page {
 	struct page *kmsan_shadow;
 	struct page *kmsan_origin;
 #endif
+
+#ifdef CONFIG_SKSM
+	/* TODO: move those fields into unused union fields instead. */
+	struct hlist_node sksm_node;
+	u32 checksum;
+#endif
 } _struct_page_alignment;
 
 /*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 691506bdf2c5..4e96437ab94e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -701,6 +701,48 @@ static __always_inline bool PageAnon(const struct page *page)
 	return folio_test_anon(page_folio(page));
 }
 
+#ifdef CONFIG_SKSM
+static __always_inline bool folio_test_sksm(const struct folio *folio)
+{
+	return !hlist_unhashed_lockless(&folio->page.sksm_node);
+}
+#else
+static __always_inline bool folio_test_sksm(const struct folio *folio)
+{
+	return false;
+}
+#endif
+
+static __always_inline bool PageSKSM(const struct page *page)
+{
+	return folio_test_sksm(page_folio(page));
+}
+
+#ifdef CONFIG_SKSM
+static inline void set_page_checksum(struct page *page, u32 checksum)
+{
+	page->checksum = checksum;
+}
+
+static inline void init_page_sksm_node(struct page *page)
+{
+	INIT_HLIST_NODE(&page->sksm_node);
+}
+
+void __sksm_page_remove(struct page *page);
+
+static inline void sksm_page_remove(struct page *page)
+{
+	if (!PageSKSM(page))
+		return;
+	__sksm_page_remove(page);
+}
+#else
+static inline void set_page_checksum(struct page *page, u32 checksum) { }
+static inline void init_page_sksm_node(struct page *page) { }
+static inline void sksm_page_remove(struct page *page) { }
+#endif
+
 static __always_inline bool __folio_test_movable(const struct folio *folio)
 {
 	return ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS) ==
diff --git a/include/linux/sksm.h b/include/linux/sksm.h
new file mode 100644
index 000000000000..4f3aaec512df
--- /dev/null
+++ b/include/linux/sksm.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_SKSM_H
+#define __LINUX_SKSM_H
+/*
+ * Synchronous memory merging support.
+ *
+ * This code enables synchronous dynamic sharing of identical pages
+ * found in different memory areas, even if they are not shared by
+ * fork().
+ */
+
+#ifdef CONFIG_SKSM
+
+int sksm_merge(struct vm_area_struct *vma, unsigned long start,
+	       unsigned long end);
+
+#else  /* !CONFIG_SKSM */
+
+static inline int sksm_merge(struct vm_area_struct *vma, unsigned long start,
+			     unsigned long end)
+{
+	return 0;
+}
+
+#endif	/* !CONFIG_SKSM */
+
+#endif /* __LINUX_SKSM_H */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 1ea2c4c33b86..8bd57eb21c12 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -79,6 +79,8 @@
 
 #define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
 
+#define MADV_MERGE   	26		/* Synchronously merge identical pages */
+
 #define MADV_GUARD_INSTALL 102		/* fatal signal on access to range */
 #define MADV_GUARD_REMOVE 103		/* unguard range */
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 84000b016808..067d4c3aa21c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -740,6 +740,11 @@ config KSM
 	  until a program has madvised that an area is MADV_MERGEABLE, and
 	  root has set /sys/kernel/mm/ksm/run to 1 (if CONFIG_SYSFS is set).
 
+config SKSM
+	bool "Enable Synchronous KSM for page merging"
+	depends on MMU && !KSM
+	select XXHASH
+
 config DEFAULT_MMAP_MIN_ADDR
 	int "Low address space to protect from user allocation"
 	depends on MMU
diff --git a/mm/Makefile b/mm/Makefile
index dba52bb0da8a..8722c3ea572c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -85,6 +85,7 @@ obj-$(CONFIG_SPARSEMEM)	+= sparse.o
 obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
 obj-$(CONFIG_KSM) += ksm.o
+obj-$(CONFIG_SKSM) += sksm.o
 obj-$(CONFIG_PAGE_POISONING) += page_poison.o
 obj-$(CONFIG_KASAN)	+= kasan/
 obj-$(CONFIG_KFENCE) += kfence/
diff --git a/mm/ksm-common.h b/mm/ksm-common.h
new file mode 100644
index 000000000000..b676f1f5c10f
--- /dev/null
+++ b/mm/ksm-common.h
@@ -0,0 +1,228 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Memory merging support, common code.
+ */
+#ifndef _KSM_COMMON_H
+#define _KSM_COMMON_H
+
+#include <linux/ksm.h>
+
+static bool vma_ksm_compatible(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & (VM_SHARED  | VM_MAYSHARE   | VM_PFNMAP  |
+			     VM_IO      | VM_DONTEXPAND | VM_HUGETLB |
+			     VM_MIXEDMAP| VM_DROPPABLE))
+		return false;		/* just ignore the advice */
+
+	if (vma_is_dax(vma))
+		return false;
+
+#ifdef VM_SAO
+	if (vma->vm_flags & VM_SAO)
+		return false;
+#endif
+#ifdef VM_SPARC_ADI
+	if (vma->vm_flags & VM_SPARC_ADI)
+		return false;
+#endif
+
+	return true;
+}
+
+static u32 calc_checksum(struct page *page)
+{
+	u32 checksum;
+	void *addr = kmap_local_page(page);
+	checksum = xxhash(addr, PAGE_SIZE, 0);
+	kunmap_local(addr);
+	return checksum;
+}
+
+static int write_protect_page(struct vm_area_struct *vma, struct folio *folio,
+			      pte_t *orig_pte)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, 0, 0);
+	int swapped;
+	int err = -EFAULT;
+	struct mmu_notifier_range range;
+	bool anon_exclusive;
+	pte_t entry;
+
+	if (WARN_ON_ONCE(folio_test_large(folio)))
+		return err;
+
+	pvmw.address = page_address_in_vma(folio, folio_page(folio, 0), vma);
+	if (pvmw.address == -EFAULT)
+		goto out;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, pvmw.address,
+				pvmw.address + PAGE_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	if (!page_vma_mapped_walk(&pvmw))
+		goto out_mn;
+	if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?"))
+		goto out_unlock;
+
+	anon_exclusive = PageAnonExclusive(&folio->page);
+	entry = ptep_get(pvmw.pte);
+	if (pte_write(entry) || pte_dirty(entry) ||
+	    anon_exclusive || mm_tlb_flush_pending(mm)) {
+		swapped = folio_test_swapcache(folio);
+		flush_cache_page(vma, pvmw.address, folio_pfn(folio));
+		/*
+		 * Ok this is tricky, when get_user_pages_fast() run it doesn't
+		 * take any lock, therefore the check that we are going to make
+		 * with the pagecount against the mapcount is racy and
+		 * O_DIRECT can happen right after the check.
+		 * So we clear the pte and flush the tlb before the check
+		 * this assure us that no O_DIRECT can happen after the check
+		 * or in the middle of the check.
+		 *
+		 * No need to notify as we are downgrading page table to read
+		 * only not changing it to point to a new page.
+		 *
+		 * See Documentation/mm/mmu_notifier.rst
+		 */
+		entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
+		/*
+		 * Check that no O_DIRECT or similar I/O is in progress on the
+		 * page
+		 */
+		if (folio_mapcount(folio) + 1 + swapped != folio_ref_count(folio)) {
+			set_pte_at(mm, pvmw.address, pvmw.pte, entry);
+			goto out_unlock;
+		}
+
+		/* See folio_try_share_anon_rmap_pte(): clear PTE first. */
+		if (anon_exclusive &&
+		    folio_try_share_anon_rmap_pte(folio, &folio->page)) {
+			set_pte_at(mm, pvmw.address, pvmw.pte, entry);
+			goto out_unlock;
+		}
+
+		if (pte_dirty(entry))
+			folio_mark_dirty(folio);
+		entry = pte_mkclean(entry);
+
+		if (pte_write(entry))
+			entry = pte_wrprotect(entry);
+
+		set_pte_at(mm, pvmw.address, pvmw.pte, entry);
+	}
+	*orig_pte = entry;
+	err = 0;
+
+out_unlock:
+	page_vma_mapped_walk_done(&pvmw);
+out_mn:
+	mmu_notifier_invalidate_range_end(&range);
+out:
+	return err;
+}
+
+/**
+ * replace_page - replace page in vma by new ksm page
+ * @vma:      vma that holds the pte pointing to page
+ * @page:     the page we are replacing by kpage
+ * @kpage:    the ksm page we replace page by
+ * @orig_pte: the original value of the pte
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ */
+static int replace_page(struct vm_area_struct *vma, struct page *page,
+			struct page *kpage, pte_t orig_pte)
+{
+	struct folio *kfolio = page_folio(kpage);
+	struct mm_struct *mm = vma->vm_mm;
+	struct folio *folio = page_folio(page);
+	pmd_t *pmd;
+	pmd_t pmde;
+	pte_t *ptep;
+	pte_t newpte;
+	spinlock_t *ptl;
+	unsigned long addr;
+	int err = -EFAULT;
+	struct mmu_notifier_range range;
+
+	addr = page_address_in_vma(folio, page, vma);
+	if (addr == -EFAULT)
+		goto out;
+
+	pmd = mm_find_pmd(mm, addr);
+	if (!pmd)
+		goto out;
+	/*
+	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
+	 * without holding anon_vma lock for write.  So when looking for a
+	 * genuine pmde (in which to find pte), test present and !THP together.
+	 */
+	pmde = pmdp_get_lockless(pmd);
+	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
+		goto out;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
+				addr + PAGE_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto out_mn;
+	if (!pte_same(ptep_get(ptep), orig_pte)) {
+		pte_unmap_unlock(ptep, ptl);
+		goto out_mn;
+	}
+	VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
+	VM_BUG_ON_FOLIO(folio_test_anon(kfolio) && PageAnonExclusive(kpage),
+			kfolio);
+
+	/*
+	 * No need to check ksm_use_zero_pages here: we can only have a
+	 * zero_page here if ksm_use_zero_pages was enabled already.
+	 */
+	if (!is_zero_pfn(page_to_pfn(kpage))) {
+		folio_get(kfolio);
+		folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
+		newpte = mk_pte(kpage, vma->vm_page_prot);
+	} else {
+		/*
+		 * Use pte_mkdirty to mark the zero page mapped by KSM, and then
+		 * we can easily track all KSM-placed zero pages by checking if
+		 * the dirty bit in zero page's PTE is set.
+		 */
+		newpte = pte_mkdirty(pte_mkspecial(pfn_pte(page_to_pfn(kpage), vma->vm_page_prot)));
+		ksm_map_zero_page(mm);
+		/*
+		 * We're replacing an anonymous page with a zero page, which is
+		 * not anonymous. We need to do proper accounting otherwise we
+		 * will get wrong values in /proc, and a BUG message in dmesg
+		 * when tearing down the mm.
+		 */
+		dec_mm_counter(mm, MM_ANONPAGES);
+	}
+
+	flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep)));
+	/*
+	 * No need to notify as we are replacing a read only page with another
+	 * read only page with the same content.
+	 *
+	 * See Documentation/mm/mmu_notifier.rst
+	 */
+	ptep_clear_flush(vma, addr, ptep);
+	set_pte_at(mm, addr, ptep, newpte);
+
+	folio_remove_rmap_pte(folio, page, vma);
+	if (!folio_mapped(folio))
+		folio_free_swap(folio);
+	folio_put(folio);
+
+	pte_unmap_unlock(ptep, ptl);
+	err = 0;
+out_mn:
+	mmu_notifier_invalidate_range_end(&range);
+out:
+	return err;
+}
+
+#endif /* _KSM_COMMON_H */
diff --git a/mm/ksm.c b/mm/ksm.c
index 31a9bc365437..c495469a8329 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -44,6 +44,7 @@
 #include <asm/tlbflush.h>
 #include "internal.h"
 #include "mm_slot.h"
+#include "ksm-common.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/ksm.h>
@@ -677,28 +678,6 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr, bool lock_v
 	return (ret & VM_FAULT_OOM) ? -ENOMEM : 0;
 }
 
-static bool vma_ksm_compatible(struct vm_area_struct *vma)
-{
-	if (vma->vm_flags & (VM_SHARED  | VM_MAYSHARE   | VM_PFNMAP  |
-			     VM_IO      | VM_DONTEXPAND | VM_HUGETLB |
-			     VM_MIXEDMAP| VM_DROPPABLE))
-		return false;		/* just ignore the advice */
-
-	if (vma_is_dax(vma))
-		return false;
-
-#ifdef VM_SAO
-	if (vma->vm_flags & VM_SAO)
-		return false;
-#endif
-#ifdef VM_SPARC_ADI
-	if (vma->vm_flags & VM_SPARC_ADI)
-		return false;
-#endif
-
-	return true;
-}
-
 static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
 		unsigned long addr)
 {
@@ -1234,202 +1213,6 @@ static int unmerge_and_remove_all_rmap_items(void)
 }
 #endif /* CONFIG_SYSFS */
 
-static u32 calc_checksum(struct page *page)
-{
-	u32 checksum;
-	void *addr = kmap_local_page(page);
-	checksum = xxhash(addr, PAGE_SIZE, 0);
-	kunmap_local(addr);
-	return checksum;
-}
-
-static int write_protect_page(struct vm_area_struct *vma, struct folio *folio,
-			      pte_t *orig_pte)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, 0, 0);
-	int swapped;
-	int err = -EFAULT;
-	struct mmu_notifier_range range;
-	bool anon_exclusive;
-	pte_t entry;
-
-	if (WARN_ON_ONCE(folio_test_large(folio)))
-		return err;
-
-	pvmw.address = page_address_in_vma(folio, folio_page(folio, 0), vma);
-	if (pvmw.address == -EFAULT)
-		goto out;
-
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, pvmw.address,
-				pvmw.address + PAGE_SIZE);
-	mmu_notifier_invalidate_range_start(&range);
-
-	if (!page_vma_mapped_walk(&pvmw))
-		goto out_mn;
-	if (WARN_ONCE(!pvmw.pte, "Unexpected PMD mapping?"))
-		goto out_unlock;
-
-	anon_exclusive = PageAnonExclusive(&folio->page);
-	entry = ptep_get(pvmw.pte);
-	if (pte_write(entry) || pte_dirty(entry) ||
-	    anon_exclusive || mm_tlb_flush_pending(mm)) {
-		swapped = folio_test_swapcache(folio);
-		flush_cache_page(vma, pvmw.address, folio_pfn(folio));
-		/*
-		 * Ok this is tricky, when get_user_pages_fast() run it doesn't
-		 * take any lock, therefore the check that we are going to make
-		 * with the pagecount against the mapcount is racy and
-		 * O_DIRECT can happen right after the check.
-		 * So we clear the pte and flush the tlb before the check
-		 * this assure us that no O_DIRECT can happen after the check
-		 * or in the middle of the check.
-		 *
-		 * No need to notify as we are downgrading page table to read
-		 * only not changing it to point to a new page.
-		 *
-		 * See Documentation/mm/mmu_notifier.rst
-		 */
-		entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
-		/*
-		 * Check that no O_DIRECT or similar I/O is in progress on the
-		 * page
-		 */
-		if (folio_mapcount(folio) + 1 + swapped != folio_ref_count(folio)) {
-			set_pte_at(mm, pvmw.address, pvmw.pte, entry);
-			goto out_unlock;
-		}
-
-		/* See folio_try_share_anon_rmap_pte(): clear PTE first. */
-		if (anon_exclusive &&
-		    folio_try_share_anon_rmap_pte(folio, &folio->page)) {
-			set_pte_at(mm, pvmw.address, pvmw.pte, entry);
-			goto out_unlock;
-		}
-
-		if (pte_dirty(entry))
-			folio_mark_dirty(folio);
-		entry = pte_mkclean(entry);
-
-		if (pte_write(entry))
-			entry = pte_wrprotect(entry);
-
-		set_pte_at(mm, pvmw.address, pvmw.pte, entry);
-	}
-	*orig_pte = entry;
-	err = 0;
-
-out_unlock:
-	page_vma_mapped_walk_done(&pvmw);
-out_mn:
-	mmu_notifier_invalidate_range_end(&range);
-out:
-	return err;
-}
-
-/**
- * replace_page - replace page in vma by new ksm page
- * @vma:      vma that holds the pte pointing to page
- * @page:     the page we are replacing by kpage
- * @kpage:    the ksm page we replace page by
- * @orig_pte: the original value of the pte
- *
- * Returns 0 on success, -EFAULT on failure.
- */
-static int replace_page(struct vm_area_struct *vma, struct page *page,
-			struct page *kpage, pte_t orig_pte)
-{
-	struct folio *kfolio = page_folio(kpage);
-	struct mm_struct *mm = vma->vm_mm;
-	struct folio *folio = page_folio(page);
-	pmd_t *pmd;
-	pmd_t pmde;
-	pte_t *ptep;
-	pte_t newpte;
-	spinlock_t *ptl;
-	unsigned long addr;
-	int err = -EFAULT;
-	struct mmu_notifier_range range;
-
-	addr = page_address_in_vma(folio, page, vma);
-	if (addr == -EFAULT)
-		goto out;
-
-	pmd = mm_find_pmd(mm, addr);
-	if (!pmd)
-		goto out;
-	/*
-	 * Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
-	 * without holding anon_vma lock for write.  So when looking for a
-	 * genuine pmde (in which to find pte), test present and !THP together.
-	 */
-	pmde = pmdp_get_lockless(pmd);
-	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
-		goto out;
-
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
-				addr + PAGE_SIZE);
-	mmu_notifier_invalidate_range_start(&range);
-
-	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
-	if (!ptep)
-		goto out_mn;
-	if (!pte_same(ptep_get(ptep), orig_pte)) {
-		pte_unmap_unlock(ptep, ptl);
-		goto out_mn;
-	}
-	VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
-	VM_BUG_ON_FOLIO(folio_test_anon(kfolio) && PageAnonExclusive(kpage),
-			kfolio);
-
-	/*
-	 * No need to check ksm_use_zero_pages here: we can only have a
-	 * zero_page here if ksm_use_zero_pages was enabled already.
-	 */
-	if (!is_zero_pfn(page_to_pfn(kpage))) {
-		folio_get(kfolio);
-		folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
-		newpte = mk_pte(kpage, vma->vm_page_prot);
-	} else {
-		/*
-		 * Use pte_mkdirty to mark the zero page mapped by KSM, and then
-		 * we can easily track all KSM-placed zero pages by checking if
-		 * the dirty bit in zero page's PTE is set.
-		 */
-		newpte = pte_mkdirty(pte_mkspecial(pfn_pte(page_to_pfn(kpage), vma->vm_page_prot)));
-		ksm_map_zero_page(mm);
-		/*
-		 * We're replacing an anonymous page with a zero page, which is
-		 * not anonymous. We need to do proper accounting otherwise we
-		 * will get wrong values in /proc, and a BUG message in dmesg
-		 * when tearing down the mm.
-		 */
-		dec_mm_counter(mm, MM_ANONPAGES);
-	}
-
-	flush_cache_page(vma, addr, pte_pfn(ptep_get(ptep)));
-	/*
-	 * No need to notify as we are replacing a read only page with another
-	 * read only page with the same content.
-	 *
-	 * See Documentation/mm/mmu_notifier.rst
-	 */
-	ptep_clear_flush(vma, addr, ptep);
-	set_pte_at(mm, addr, ptep, newpte);
-
-	folio_remove_rmap_pte(folio, page, vma);
-	if (!folio_mapped(folio))
-		folio_free_swap(folio);
-	folio_put(folio);
-
-	pte_unmap_unlock(ptep, ptl);
-	err = 0;
-out_mn:
-	mmu_notifier_invalidate_range_end(&range);
-out:
-	return err;
-}
-
 /*
  * try_to_merge_one_page - take two pages and merge them into one
  * @vma: the vma that holds the pte pointing to page
diff --git a/mm/madvise.c b/mm/madvise.c
index 0ceae57da7da..d9d678053ca2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -22,6 +22,7 @@
 #include <linux/string.h>
 #include <linux/uio.h>
 #include <linux/ksm.h>
+#include <linux/sksm.h>
 #include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/blkdev.h>
@@ -1318,6 +1319,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		return madvise_guard_install(vma, prev, start, end);
 	case MADV_GUARD_REMOVE:
 		return madvise_guard_remove(vma, prev, start, end);
+	case MADV_MERGE:
+		return sksm_merge(vma, start, end);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1422,6 +1425,9 @@ madvise_behavior_valid(int behavior)
 #ifdef CONFIG_MEMORY_FAILURE
 	case MADV_SOFT_OFFLINE:
 	case MADV_HWPOISON:
+#endif
+#ifdef CONFIG_SKSM
+	case MADV_MERGE:
 #endif
 		return true;
 
diff --git a/mm/memory.c b/mm/memory.c
index 398c031be9ba..782363315b31 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3618,6 +3618,8 @@ static bool wp_can_reuse_anon_folio(struct folio *folio,
 	 */
 	if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
 		return false;
+	if (folio_test_sksm(folio))
+		return false;
 	if (!folio_test_lru(folio))
 		/*
 		 * We cannot easily detect+handle references from
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 01eab25edf89..0bb9755896ce 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1122,6 +1122,7 @@ __always_inline bool free_pages_prepare(struct page *page,
 			return false;
 	}
 
+	sksm_page_remove(page);
 	page_cpupid_reset_last(page);
 	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	reset_page_owner(page, order);
@@ -1509,6 +1510,8 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
 
 	set_page_private(page, 0);
 	set_page_refcounted(page);
+	set_page_checksum(page, 0);
+	init_page_sksm_node(page);
 
 	arch_alloc_page(page, order);
 	debug_pagealloc_map_pages(page, 1 << order);
diff --git a/mm/sksm.c b/mm/sksm.c
new file mode 100644
index 000000000000..190f6bc05f2d
--- /dev/null
+++ b/mm/sksm.c
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Synchronous memory merging support.
+ *
+ * This code enables synchronous dynamic sharing of identical pages
+ * found in different memory areas, even if they are not shared by
+ * fork().
+ *
+ * Userspace must explicitly request for pages within specific address
+ * ranges to be merged with madvise MADV_MERGE. Those should *not*
+ * contain secrets, as side-channel timing attacks can allow a process
+ * to learn the existence of a known content within another process.
+ *
+ * The synchronous memory merging performs the memory merging
+ * synchronously within madvise. There is no global scan and no need
+ * for background daemon.
+ *
+ * The anonymous pages targeted for merge are write-protected and
+ * checksummed. They are then compared to other pages targeted for
+ * merge.
+ *
+ * The mergeable pages are added to a hash table indexed by checksum of
+ * their content. The hash value is derived from the page content
+ * checksum, and its comparison function is based on comparison of
+ * the page content.
+ *
+ * If a page is written to after being targeted for merge, a COW will be
+ * triggered, and thus a new page will be populated in its stead.
+ *
+ * The typical usage pattern expected from userspace is:
+ *
+ * 1) Userspace writes non-secret content to a MAP_PRIVATE page, thus
+ *    triggering COW.
+ *
+ * 2) After userspace has completed writing to the page, it issues
+ *    madvise MADV_MERGE on a range containing the page, which
+ *    write-protect, checksum, and add the page to the sksm hash
+ *    table. It then merges this page with other mergeable pages
+ *    that have the same content.
+ *
+ * 3) It is typically expected that this page's content stays invariant
+ *    for a long time. If userspace issues writes to the page after
+ *    madvise MADV_MERGE, another COW will be triggered, which will
+ *    populate a new page copy into the process page table and release
+ *    the reference to the old page.
+ */
+
+#include <linux/mutex.h>
+#include <linux/cleanup.h>
+#include <linux/mm_types.h>
+#include <linux/hashtable.h>
+#include <linux/highmem.h>
+#include <linux/xxhash.h>
+#include <linux/rmap.h>
+#include <linux/mm.h>
+#include <linux/pagewalk.h>
+#include <linux/sksm.h>
+#include <linux/swap.h>
+#include <linux/mm_inline.h>
+
+#include "internal.h"
+#include "ksm-common.h"
+
+#define SKSM_HT_BITS	16
+
+static DEFINE_MUTEX(sksm_lock);
+
+/*
+ * The hash is derived from the page checksum.
+ */
+static DEFINE_HASHTABLE(sksm_ht, SKSM_HT_BITS);
+
+void __sksm_page_remove(struct page *page)
+{
+	guard(mutex)(&sksm_lock);
+	hash_del(&page->sksm_node);
+}
+
+static int sksm_merge_page(struct vm_area_struct *vma, struct page *page)
+{
+	struct folio *folio = page_folio(page);
+	pte_t orig_pte = __pte(0);
+	struct page *kpage;
+	int err = 0;
+
+	folio_lock(folio);
+
+	if (folio_test_large(folio)) {
+		if (split_huge_page(page))
+			goto out_unlock;
+		folio = page_folio(page);
+	}
+
+	/* Write protect page. */
+	if (write_protect_page(vma, folio, &orig_pte) != 0)
+		goto out_unlock;
+
+	/* Checksum page. */
+	page->checksum = calc_checksum(page);
+
+	guard(mutex)(&sksm_lock);
+
+	/* Merge page with duplicates. */
+	hash_for_each_possible(sksm_ht, kpage, sksm_node, page->checksum) {
+		if (page->checksum != kpage->checksum || !pages_identical(page, kpage))
+			continue;
+		if (!get_page_unless_zero(kpage))
+			continue;
+		err = replace_page(vma, page, kpage, orig_pte);
+		put_page(kpage);
+		if (!err)
+			goto out_unlock;
+	}
+
+	/*
+	 * This page is not linked to its address_space anymore because it
+	 * can be shared with other processes and replace pages originally
+	 * associated with other address spaces.
+	 */
+	page->mapping = (void *) PAGE_MAPPING_ANON;
+
+	/* Add page to hash table. */
+	hash_add(sksm_ht, &page->sksm_node, page->checksum);
+out_unlock:
+	folio_unlock(folio);
+	return err;
+}
+
+static struct page *get_vma_page_from_addr(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct page *page = NULL;
+	struct folio_walk fw;
+	struct folio *folio;
+
+	folio = folio_walk_start(&fw, vma, addr, 0);
+	if (folio) {
+		if (!folio_is_zone_device(folio) &&
+		    folio_test_anon(folio)) {
+			folio_get(folio);
+			page = fw.page;
+		}
+		folio_walk_end(&fw, vma);
+	}
+	if (page) {
+		flush_anon_page(vma, page, addr);
+		flush_dcache_page(page);
+	}
+	return page;
+}
+
+/* Called with mmap write lock held. */
+int sksm_merge(struct vm_area_struct *vma, unsigned long start,
+	       unsigned long end)
+{
+	unsigned long addr;
+	int err = 0;
+
+	if (!PAGE_ALIGNED(start) || !PAGE_ALIGNED(end))
+		return -EINVAL;
+	if (!vma_ksm_compatible(vma))
+		return 0;
+
+	/*
+	 * A number of pages can hang around indefinitely in per-cpu
+	 * LRU cache, raised page count preventing write_protect_page
+	 * from merging them.
+	 */
+	lru_add_drain_all();
+
+	for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
+		struct page *page = get_vma_page_from_addr(vma, addr);
+
+		if (!page)
+			continue;
+		err = sksm_merge_page(vma, page);
+		put_page(page);
+	}
+	return err;
+}
+
+static int __init sksm_init(void)
+{
+	struct page *zero_page = ZERO_PAGE(0);
+
+	zero_page->checksum = calc_checksum(zero_page);
+	/* Add page to hash table. */
+	hash_add(sksm_ht, &zero_page->sksm_node, zero_page->checksum);
+	return 0;
+}
+subsys_initcall(sksm_init);
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 2/2] selftests/kskm: Introduce SKSM basic test
  2025-02-28  2:30 [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Mathieu Desnoyers
  2025-02-28  2:30 ` [RFC PATCH 1/2] mm: Introduce " Mathieu Desnoyers
@ 2025-02-28  2:30 ` Mathieu Desnoyers
  2025-02-28  2:51 ` [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Linus Torvalds
  2 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28  2:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Mathieu Desnoyers, Linus Torvalds, Matthew Wilcox,
	Olivier Dion, linux-mm

Introduce a basic selftest for SKSM. See ./basic_test -h for
options.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Olivier Dion <odion@efficios.com>
Cc: linux-mm@kvack.org
---
 tools/testing/selftests/sksm/.gitignore   |   2 +
 tools/testing/selftests/sksm/Makefile     |  14 ++
 tools/testing/selftests/sksm/basic_test.c | 217 ++++++++++++++++++++++
 3 files changed, 233 insertions(+)
 create mode 100644 tools/testing/selftests/sksm/.gitignore
 create mode 100644 tools/testing/selftests/sksm/Makefile
 create mode 100644 tools/testing/selftests/sksm/basic_test.c

diff --git a/tools/testing/selftests/sksm/.gitignore b/tools/testing/selftests/sksm/.gitignore
new file mode 100644
index 000000000000..0f5b0baa91e7
--- /dev/null
+++ b/tools/testing/selftests/sksm/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+basic_test
diff --git a/tools/testing/selftests/sksm/Makefile b/tools/testing/selftests/sksm/Makefile
new file mode 100644
index 000000000000..ec1a10783bda
--- /dev/null
+++ b/tools/testing/selftests/sksm/Makefile
@@ -0,0 +1,14 @@
+# SPDX-License-Identifier: GPL-2.0+ OR MIT
+
+top_srcdir = ../../../..
+
+CFLAGS += -O2 -Wall -g -I./ $(KHDR_INCLUDES) -L$(OUTPUT) -Wl,-rpath=./ \
+	  $(CLANG_FLAGS) -I$(top_srcdir)/tools/include
+LDLIBS += -lpthread
+
+TEST_GEN_PROGS = basic_test
+
+include ../lib.mk
+
+$(OUTPUT)/%: %.c
+	$(CC) $(CFLAGS) $< $(LDLIBS) -o $@
diff --git a/tools/testing/selftests/sksm/basic_test.c b/tools/testing/selftests/sksm/basic_test.c
new file mode 100644
index 000000000000..1a7571a999d2
--- /dev/null
+++ b/tools/testing/selftests/sksm/basic_test.c
@@ -0,0 +1,217 @@
+// SPDX-License-Identifier: LGPL-2.1
+/*
+ * Basic test for SKSM.
+ */
+
+#include <stdlib.h>
+#include <sys/mman.h>
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+#include <unistd.h>
+#include <poll.h>
+
+#ifndef MADV_MERGE
+#define MADV_MERGE	26
+#endif
+
+#define PAGE_SIZE	4096
+
+#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *) &(x)) = (val))
+
+static int opt_stop_at = 0, opt_pause = 0;
+
+struct test_page {
+	char array[PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));
+};
+
+struct test_page2 {
+	char array[2 * PAGE_SIZE] __attribute__((aligned(PAGE_SIZE)));
+};
+
+/* identical to zero page. */
+static struct test_page zero;
+
+/* a1 and a2 are identical. */
+static struct test_page a1 = {
+	.array[0] = 0x42,
+	.array[1] = 0x42,
+};
+
+static struct test_page a2 = {
+	.array[0] = 0x42,
+	.array[1] = 0x42,
+};
+
+/* b1 and b2 are identical. */
+static struct test_page2 b1 = {
+	.array[0] = 0x43,
+	.array[1] = 0x43,
+	.array[PAGE_SIZE] = 0x44,
+	.array[PAGE_SIZE + 1] = 0x44,
+};
+
+static struct test_page2 b2 = {
+	.array[0] = 0x43,
+	.array[1] = 0x43,
+	.array[PAGE_SIZE] = 0x44,
+	.array[PAGE_SIZE + 1] = 0x44,
+};
+
+static void touch_pages(void *p, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i += PAGE_SIZE)
+		WRITE_ONCE(((char *)p)[i], ((char *)p)[i]);
+}
+
+static void test_step(char step)
+{
+	printf("\nTest step: <%c>\n", step);
+	if (opt_pause) {
+		printf("Press ENTER to continue...\n");
+		getchar();
+	}
+	if (opt_stop_at == step) {
+		poll(NULL, 0, -1);
+		exit(0);
+	}
+}
+
+static void show_usage(int argc, char **argv)
+{
+	printf("Usage : %s <OPTIONS>\n",
+		argv[0]);
+	printf("OPTIONS:\n");
+	printf("	[-s stop_at] Stop test at step A, B, C, D, E, or F and wait forever.\n");
+	printf("	[-p] Pause test between steps (await newline from the console).\n");
+	printf("	[-h] Show this help.\n");
+	printf("\n");
+}
+
+int main(int argc, char **argv)
+{
+	int i;
+
+	for (i = 1; i < argc; i++) {
+		if (argv[i][0] != '-')
+			continue;
+		switch (argv[i][1]) {
+		case 's':
+			if (argc < i + 2) {
+				show_usage(argc, argv);
+				return -1;
+			}
+			opt_stop_at = *argv[i + 1];
+			switch (opt_stop_at) {
+			case 'A':
+			case 'B':
+			case 'C':
+			case 'D':
+			case 'E':
+			case 'F':
+				break;
+			default:
+				show_usage(argc, argv);
+				return -1;
+			}
+			i++;
+			break;
+		case 'p':
+			opt_pause = 1;
+			/* -p takes no argument: do not skip the next one. */
+			break;
+		case 'h':
+			show_usage(argc, argv);
+			return 0;
+		default:
+			show_usage(argc, argv);
+			return -1;
+		}
+	}
+
+
+	printf("PID: %d\n", getpid());
+	printf("Shared mapping (write-protected)\n");
+
+	test_step('A');
+
+	printf("madvise MADV_MERGE a1\n");
+	if (madvise(&a1, sizeof(a1), MADV_MERGE))
+		goto error;
+	printf("madvise MADV_MERGE a2\n");
+	if (madvise(&a2, sizeof(a2), MADV_MERGE))
+		goto error;
+	printf("madvise MADV_MERGE b1\n");
+	if (madvise(&b1, sizeof(b1), MADV_MERGE))
+		goto error;
+	printf("madvise MADV_MERGE b2\n");
+	if (madvise(&b2, sizeof(b2), MADV_MERGE))
+		goto error;
+	printf("madvise MADV_MERGE zero\n");
+	if (madvise(&zero, sizeof(zero), MADV_MERGE))
+		goto error;
+
+	test_step('B');
+
+	printf("Trigger COW\n");
+	touch_pages(&zero, sizeof(zero));
+	touch_pages(&a1, sizeof(a1));
+	touch_pages(&a2, sizeof(a2));
+	touch_pages(&b1, sizeof(b1));
+	touch_pages(&b2, sizeof(b2));
+
+	test_step('C');
+
+	printf("madvise MADV_MERGE a1\n");
+	if (madvise(&a1, sizeof(a1), MADV_MERGE))
+		goto error;
+	printf("madvise MADV_MERGE a2\n");
+	if (madvise(&a2, sizeof(a2), MADV_MERGE))
+		goto error;
+	printf("madvise MADV_MERGE b1\n");
+	if (madvise(&b1, sizeof(b1), MADV_MERGE))
+		goto error;
+	printf("madvise MADV_MERGE b2\n");
+	if (madvise(&b2, sizeof(b2), MADV_MERGE))
+		goto error;
+	printf("madvise MADV_MERGE zero\n");
+	if (madvise(&zero, sizeof(zero), MADV_MERGE))
+		goto error;
+
+	test_step('D');
+
+	printf("Trigger COW\n");
+	touch_pages(&zero, sizeof(zero));
+	touch_pages(&a1, sizeof(a1));
+	touch_pages(&a2, sizeof(a2));
+	touch_pages(&b1, sizeof(b1));
+	touch_pages(&b2, sizeof(b2));
+
+	test_step('E');
+
+	printf("madvise MADV_MERGE a1\n");
+	if (madvise(&a1, sizeof(a1), MADV_MERGE))
+		goto error;
+	printf("madvise MADV_MERGE a2\n");
+	if (madvise(&a2, sizeof(a2), MADV_MERGE))
+		goto error;
+	printf("madvise MADV_MERGE b1\n");
+	if (madvise(&b1, sizeof(b1), MADV_MERGE))
+		goto error;
+	printf("madvise MADV_MERGE b2\n");
+	if (madvise(&b2, sizeof(b2), MADV_MERGE))
+		goto error;
+	printf("madvise MADV_MERGE zero\n");
+	if (madvise(&zero, sizeof(zero), MADV_MERGE))
+		goto error;
+
+	test_step('F');
+
+	return 0;
+
+error:
+	perror("madvise");
+	return -1;
+}
-- 
2.39.5



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28  2:30 [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Mathieu Desnoyers
  2025-02-28  2:30 ` [RFC PATCH 1/2] mm: Introduce " Mathieu Desnoyers
  2025-02-28  2:30 ` [RFC PATCH 2/2] selftests/kskm: Introduce SKSM basic test Mathieu Desnoyers
@ 2025-02-28  2:51 ` Linus Torvalds
  2025-02-28  3:03   ` Mathieu Desnoyers
  2025-02-28 15:34   ` David Hildenbrand
  2 siblings, 2 replies; 29+ messages in thread
From: Linus Torvalds @ 2025-02-28  2:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
	linux-mm

On Thu, 27 Feb 2025 at 18:31, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> This series introduces SKSM, a new page deduplication ABI,
> aiming to fix the limitations inherent to the KSM ABI.

So I'm not interested in seeing *another* KSM version.

Because I absolutely do *NOT* want a new chapter in the saga of SLUB
vs SLAB vs SLOB.

However, if the feeling is that this can *replace* the current horror
that is KSM, I'm a lot more interested. I suspect our current KSM
model has largely been a failure, and this might be "good enough".

             Linus


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28  2:51 ` [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Linus Torvalds
@ 2025-02-28  3:03   ` Mathieu Desnoyers
  2025-02-28  5:17     ` Linus Torvalds
  2025-02-28 15:34   ` David Hildenbrand
  1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28  3:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
	linux-mm

On 2025-02-27 21:51, Linus Torvalds wrote:
> On Thu, 27 Feb 2025 at 18:31, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> This series introduces SKSM, a new page deduplication ABI,
>> aiming to fix the limitations inherent to the KSM ABI.
> 
> So I'm not interested in seeing *another* KSM version.
> 
> Because I absolutely do *NOT* want a new chapter in the saga of SLUB
> vs SLAB vs SLOB.
> 
> However, if the feeling is that this can *replace* the current horror
> that is KSM, I'm a lot more interested. I suspect our current KSM
> model has largely been a failure, and this might be "good enough".

I'd be fine with SKSM replacing KSM entirely. However, I don't
think we should try to re-implement the existing KSM userspace ABIs
over SKSM. I suspect that many of the problems KSM has today are
caused by the semantics of the ABI it exposes, which was designed
solely for the use case of a host deduplicating guest VM memory.

KSM tracks memory meant to be mergeable on an ongoing
basis with a worker thread:

   madvise(2) MADV_{UN,}MERGEABLE
   prctl(2) PR_{SET,GET}_MEMORY_MERGE (security concern)
   ~2.5k LOC excluding ksm-common code
   requires parameter fine-tuning from sysadmin

SKSM gets the hint from userspace that memory is a good
candidate for merging in its current state and is expected
to stay invariant:

   madvise(2) MADV_MERGE
   ~100 LOC excluding ksm-common code

The main reason why SKSM could be implemented without all the
scanning complexity is because of this simpler ABI.

Thanks for the feedback!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28  3:03   ` Mathieu Desnoyers
@ 2025-02-28  5:17     ` Linus Torvalds
  2025-02-28 13:59       ` David Hildenbrand
  2025-02-28 14:59       ` Mathieu Desnoyers
  0 siblings, 2 replies; 29+ messages in thread
From: Linus Torvalds @ 2025-02-28  5:17 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
	linux-mm

On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> I'd be fine with SKSM replacing KSM entirely. However, I don't
> think we should try to re-implement the existing KSM userspace ABIs
> over SKSM.

No, absolutely. The only point (for me) for your new synchronous one
would be if it replaced the kernel thread async scanning, which would
make the old user space interface basically pointless.

But I don't actually know who uses KSM right now. My reaction really
comes from a "it's not nice code in the kernel", not from any actual
knowledge of the users.

Maybe it works really well in some cloud VM environment, and we're
stuck with it forever.

In which case I don't want to see some second different interface that
just makes it all worse.

                 Linus


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28  5:17     ` Linus Torvalds
@ 2025-02-28 13:59       ` David Hildenbrand
  2025-02-28 14:59         ` Sean Christopherson
  2025-02-28 15:01         ` Mathieu Desnoyers
  2025-02-28 14:59       ` Mathieu Desnoyers
  1 sibling, 2 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-02-28 13:59 UTC (permalink / raw)
  To: Linus Torvalds, Mathieu Desnoyers
  Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
	linux-mm

On 28.02.25 06:17, Linus Torvalds wrote:
> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>> think we should try to re-implement the existing KSM userspace ABIs
>> over SKSM.
> 
> No, absolutely. The only point (for me) for your new synchronous one
> would be if it replaced the kernel thread async scanning, which would
> make the old user space interface basically pointless.
> 
> But I don't actually know who uses KSM right now. My reaction really
> comes from a "it's not nice code in the kernel", not from any actual
> knowledge of the users.
> 
> Maybe it works really well in some cloud VM environment, and we're
> stuck with it forever.

Exactly that; and besides the VM use-case, lately people started using it 
in the context of interpreters (IIRC inside Meta) quite successfully as 
well.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28  5:17     ` Linus Torvalds
  2025-02-28 13:59       ` David Hildenbrand
@ 2025-02-28 14:59       ` Mathieu Desnoyers
  2025-02-28 16:32         ` Peter Xu
  1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28 14:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
	linux-mm

On 2025-02-28 00:17, Linus Torvalds wrote:
> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>> think we should try to re-implement the existing KSM userspace ABIs
>> over SKSM.
> 
> No, absolutely. The only point (for me) for your new synchronous one
> would be if it replaced the kernel thread async scanning, which would
> make the old user space interface basically pointless.
> 
> But I don't actually know who uses KSM right now. My reaction really
> comes from a "it's not nice code in the kernel", not from any actual
> knowledge of the users.
> 
> Maybe it works really well in some cloud VM environment, and we're
> stuck with it forever.
> 

For the VM use-case, I wonder if we could just add a userfaultfd
"COW" event that would notify userspace when a COW happens ?

This would allow userspace to replace ksmd by tracking the age of
those anonymous pages, and issue madvise MADV_MERGE on them to
write-protect+merge them when it is deemed useful.

With both a new userfaultfd COW event and madvise MADV_MERGE,
is there anything else that is fundamentally missing to move
all the scanning complexity of KSM to userspace for the VM
deduplication use-case ?
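
As a rough sketch of what such a userspace policy could look like, the
bookkeeping below records the time of the last observed write per page
and selects pages that have stayed unmodified long enough to be worth
merging. All names (struct page_tracker, tracker_note_write,
tracker_collect) are hypothetical, and the source of the write
notifications is left abstract, since the event proposed above does
not exist under that name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-range state for a userspace merge policy. */
struct page_tracker {
	uint64_t *last_write;	/* time of last observed write, per page */
	unsigned char *merged;	/* non-zero once a merge was requested */
	size_t npages;
};

/* Record a write to page idx, e.g. reported by a write-fault event. */
static void tracker_note_write(struct page_tracker *t, size_t idx,
			       uint64_t now)
{
	if (idx >= t->npages)
		return;
	t->last_write[idx] = now;
	t->merged[idx] = 0;	/* the write broke any earlier merge (COW) */
}

/*
 * Store in out[] up to max indexes of pages that have been unmodified
 * for at least min_age and have no merge request pending, and mark them
 * as merged. Assumes now >= every recorded timestamp.
 * Returns the number of indexes stored.
 */
static size_t tracker_collect(struct page_tracker *t, uint64_t now,
			      uint64_t min_age, size_t *out, size_t max)
{
	size_t i, n = 0;

	for (i = 0; i < t->npages && n < max; i++) {
		if (t->merged[i])
			continue;
		if (now - t->last_write[i] < min_age)
			continue;
		t->merged[i] = 1;
		out[n++] = i;
	}
	return n;
}
```

For each returned index, the caller would then issue
madvise(base + idx * page_size, page_size, MADV_MERGE).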

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28 13:59       ` David Hildenbrand
@ 2025-02-28 14:59         ` Sean Christopherson
  2025-02-28 15:10           ` David Hildenbrand
  2025-02-28 15:01         ` Mathieu Desnoyers
  1 sibling, 1 reply; 29+ messages in thread
From: Sean Christopherson @ 2025-02-28 14:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, Mathieu Desnoyers, Andrew Morton, linux-kernel,
	Matthew Wilcox, Olivier Dion, linux-mm

On Fri, Feb 28, 2025, David Hildenbrand wrote:
> On 28.02.25 06:17, Linus Torvalds wrote:
> > On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> > > 
> > > I'd be fine with SKSM replacing KSM entirely. However, I don't
> > > think we should try to re-implement the existing KSM userspace ABIs
> > > over SKSM.
> > 
> > No, absolutely. The only point (for me) for your new synchronous one
> > would be if it replaced the kernel thread async scanning, which would
> > make the old user space interface basically pointless.
> > 
> > But I don't actually know who uses KSM right now. My reaction really
> > comes from a "it's not nice code in the kernel", not from any actual
> > knowledge of the users.
> > 
> > Maybe it works really well in some cloud VM environment, and we're
> > stuck with it forever.
> 
> Exactly that; and besides the VM use-case, lately people started using it in
> the context of interpreters (IIRC inside Meta) quite successfully as well.

Does Red Hat (or any other KVM supporters) actually recommend using KSM for VMs
in cloud environments?

The security implications of scanning guest memory and having co-tenant VMs share
mappings (should) make it a complete non-starter for any scenario where VMs and/or
their workloads are owned by third parties.

I can imagine there might be first-party use cases, but I would expect many/most
of those to be able to explicitly share mappings, which would provide far, far
better power and performance characteristics.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28 13:59       ` David Hildenbrand
  2025-02-28 14:59         ` Sean Christopherson
@ 2025-02-28 15:01         ` Mathieu Desnoyers
  2025-02-28 15:18           ` David Hildenbrand
  1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28 15:01 UTC (permalink / raw)
  To: David Hildenbrand, Linus Torvalds
  Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
	linux-mm

On 2025-02-28 08:59, David Hildenbrand wrote:
> On 28.02.25 06:17, Linus Torvalds wrote:
>> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
>> <mathieu.desnoyers@efficios.com> wrote:
>>>
>>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>>> think we should try to re-implement the existing KSM userspace ABIs
>>> over SKSM.
>>
>> No, absolutely. The only point (for me) for your new synchronous one
>> would be if it replaced the kernel thread async scanning, which would
>> make the old user space interface basically pointless.
>>
>> But I don't actually know who uses KSM right now. My reaction really
>> comes from a "it's not nice code in the kernel", not from any actual
>> knowledge of the users.
>>
>> Maybe it works really well in some cloud VM environment, and we're
>> stuck with it forever.
> 
> Exactly that; and besides the VM use-case, lately people started using it 
> in the context of interpreters (IIRC inside Meta) quite successfully as 
> well.
> 

I suspect that SKSM is a better fit for JIT and code patching than KSM,
because user-space knows better when a set of pages is going to become
invariant for a long time and thus benefit from merging. This removes
the background scanning from the picture.

Does the interpreter use-case require background scanning, or does
it know when a set of pages are meant to become invariant for a long
time ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28 14:59         ` Sean Christopherson
@ 2025-02-28 15:10           ` David Hildenbrand
  2025-02-28 15:19             ` David Hildenbrand
  2025-02-28 21:38             ` Mathieu Desnoyers
  0 siblings, 2 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-02-28 15:10 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Linus Torvalds, Mathieu Desnoyers, Andrew Morton, linux-kernel,
	Matthew Wilcox, Olivier Dion, linux-mm

On 28.02.25 15:59, Sean Christopherson wrote:
> On Fri, Feb 28, 2025, David Hildenbrand wrote:
>> On 28.02.25 06:17, Linus Torvalds wrote:
>>> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>
>>>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>>>> think we should try to re-implement the existing KSM userspace ABIs
>>>> over SKSM.
>>>
>>> No, absolutely. The only point (for me) for your new synchronous one
>>> would be if it replaced the kernel thread async scanning, which would
>>> make the old user space interface basically pointless.
>>>
>>> But I don't actually know who uses KSM right now. My reaction really
>>> comes from a "it's not nice code in the kernel", not from any actual
>>> knowledge of the users.
>>>
>>> Maybe it works really well in some cloud VM environment, and we're
>>> stuck with it forever.
>>
>> Exactly that; and besides the VM use-case, lately people started using it in
>> the context of interpreters (IIRC inside Meta) quite successfully as well.
> 
> Does Red Hat (or any other KVM supporters) actually recommend using KSM for VMs
> in cloud environments?

Private clouds yes, that's where it is most commonly used. I would 
assume that nobody for

For example, there is some older documentation here:

https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/virtualization_administration_guide/chap-ksm#chap-KSM

which touches on the security aspects:

"The page deduplication technology (used also by the KSM implementation) 
may introduce side channels that could potentially be used to leak 
information across multiple guests. In case this is a concern, KSM can 
be disabled on a per-guest basis."

> 
> The security implications of scanning guest memory and having co-tenant VMs share
> mappings (should) make it a complete non-starter for any scenario where VMs and/or
> their workloads are owned by third parties.

Jep.

> 
> I can imagine there might be first-party use cases, but I would expect many/most
> of those to be able to explicitly share mappings, which would provide far, far
> better power and performance characteristics.

Note that KSM can be very efficient when you have multiple VMs running 
the same kernel, executables, libraries, etc. If my memory serves, 
that is precisely what it was originally invented for, and how it is 
used today in the context of VMs.

For example, QEMU will mark all guest memory as mergeable using 
madvise(MADV_MERGEABLE), to limit the deduplication to guest RAM only.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28 15:01         ` Mathieu Desnoyers
@ 2025-02-28 15:18           ` David Hildenbrand
  0 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-02-28 15:18 UTC (permalink / raw)
  To: Mathieu Desnoyers, Linus Torvalds
  Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
	linux-mm

On 28.02.25 16:01, Mathieu Desnoyers wrote:
> On 2025-02-28 08:59, David Hildenbrand wrote:
>> On 28.02.25 06:17, Linus Torvalds wrote:
>>> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>
>>>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>>>> think we should try to re-implement the existing KSM userspace ABIs
>>>> over SKSM.
>>>
>>> No, absolutely. The only point (for me) for your new synchronous one
>>> would be if it replaced the kernel thread async scanning, which would
>>> make the old user space interface basically pointless.
>>>
>>> But I don't actually know who uses KSM right now. My reaction really
>>> comes from a "it's not nice code in the kernel", not from any actual
>>> knowledge of the users.
>>>
>>> Maybe it works really well in some cloud VM environment, and we're
>>> stuck with it forever.
>>
>> Exactly that; and besides the VM use-case, lately people started using it
>> in the context of interpreters (IIRC inside Meta) quite successfully as
>> well.
>>
> 
> I suspect that SKSM is a better fit for JIT and code patching than KSM,
> because user-space knows better when a set of pages is going to become
> invariant for a long time and thus benefit from merging. This removes
> the background scanning from the picture.
> 
> Does the interpreter use-case require background scanning, or does
> it know when a set of pages are meant to become invariant for a long
> time ?

To make the JIT/interpreter use case happy, people wanted ways to 
*force* KSM on for *the whole process*, not just individual VMAs like 
the traditional VM use case would have done.

I recall one of the reasons being that you don't really want to modify 
your JIT/interpreter to just make KSM work.

See [1] "KSM at Meta" for some details, and in general, optimization 
work to adapt KSM to new use cases.

Regarding some concerns you raised, Stefan did a lot of optimization 
work like "smart scanning" (slide "Optimization - Smart Scan (6.7)") to 
reduce the scanning overhead and make it much more efficient.

So people started optimizing for that already and got pretty good results.

[1] 
https://lpc.events/event/17/contributions/1625/attachments/1320/2649/KSM.pdf

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28 15:10           ` David Hildenbrand
@ 2025-02-28 15:19             ` David Hildenbrand
  2025-02-28 21:38             ` Mathieu Desnoyers
  1 sibling, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-02-28 15:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Linus Torvalds, Mathieu Desnoyers, Andrew Morton, linux-kernel,
	Matthew Wilcox, Olivier Dion, linux-mm

On 28.02.25 16:10, David Hildenbrand wrote:
> On 28.02.25 15:59, Sean Christopherson wrote:
>> On Fri, Feb 28, 2025, David Hildenbrand wrote:
>>> On 28.02.25 06:17, Linus Torvalds wrote:
>>>> On Thu, 27 Feb 2025 at 19:03, Mathieu Desnoyers
>>>> <mathieu.desnoyers@efficios.com> wrote:
>>>>>
>>>>> I'd be fine with SKSM replacing KSM entirely. However, I don't
>>>>> think we should try to re-implement the existing KSM userspace ABIs
>>>>> over SKSM.
>>>>
>>>> No, absolutely. The only point (for me) for your new synchronous one
>>>> would be if it replaced the kernel thread async scanning, which would
>>>> make the old user space interface basically pointless.
>>>>
>>>> But I don't actually know who uses KSM right now. My reaction really
>>>> comes from a "it's not nice code in the kernel", not from any actual
>>>> knowledge of the users.
>>>>
>>>> Maybe it works really well in some cloud VM environment, and we're
>>>> stuck with it forever.
>>>
>>> Exactly that; and besides the VM use-case, lately people started using it in
>>> the context of interpreters (IIRC inside Meta) quite successfully as well.
>>
>> Does Red Hat (or any other KVM supporters) actually recommend using KSM for VMs
>> in cloud environments?
> 
> Private clouds yes, that's where it is most commonly used. I would
> assume that nobody for

forgot to complete that sentence: "... nobody really should be using 
that in public clouds."

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28  2:51 ` [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Linus Torvalds
  2025-02-28  3:03   ` Mathieu Desnoyers
@ 2025-02-28 15:34   ` David Hildenbrand
  2025-02-28 15:38     ` Matthew Wilcox
  1 sibling, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-02-28 15:34 UTC (permalink / raw)
  To: Linus Torvalds, Mathieu Desnoyers
  Cc: Andrew Morton, linux-kernel, Matthew Wilcox, Olivier Dion,
	linux-mm

On 28.02.25 03:51, Linus Torvalds wrote:
> On Thu, 27 Feb 2025 at 18:31, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> This series introduces SKSM, a new page deduplication ABI,
>> aiming to fix the limitations inherent to the KSM ABI.
> 
> So I'm not interested in seeing *another* KSM version.
> 
> Because I absolutely do *NOT* want a new chapter in the saga of SLUB
> vs SLAB vs SLOB.
> 
> However, if the feeling is that this can *replace* the current horror
> that is KSM, I'm a lot more interested. I suspect our current KSM
> model has largely been a failure, and this might be "good enough".

Maybe it would be comparable to khugepaged vs. MADV_COLLAPSE?

Many/most use cases just leave THP scanning+collapsing to khugepaged; 
selected ones might "know better" what to do, so they effectively 
disable khugepaged, and manually collapse THPs using MADV_COLLAPSE.

If it were similar to that, it would not be a completely different KSM 
version, just a different way to trigger merging: background scanning 
vs. user-space triggered ("synchronous").

I could see use cases for such a synchronous interface, but I doubt it 
could replace the background scanning that is actively getting used for 
existing use cases; I have similar thoughts about khugepaged vs. 
MADV_COLLAPSE.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28 15:34   ` David Hildenbrand
@ 2025-02-28 15:38     ` Matthew Wilcox
  0 siblings, 0 replies; 29+ messages in thread
From: Matthew Wilcox @ 2025-02-28 15:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Linus Torvalds, Mathieu Desnoyers, Andrew Morton, linux-kernel,
	Olivier Dion, linux-mm

On Fri, Feb 28, 2025 at 04:34:50PM +0100, David Hildenbrand wrote:
> Maybe it would be comparable to khugepaged vs. MADV_COLLAPSE?

I think it is comparable ... because many people find khugepaged
unacceptable and there are proposals to move that to userspace.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28 14:59       ` Mathieu Desnoyers
@ 2025-02-28 16:32         ` Peter Xu
  2025-02-28 17:53           ` Mathieu Desnoyers
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2025-02-28 16:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
	Olivier Dion, linux-mm

On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
> For the VM use-case, I wonder if we could just add a userfaultfd
> "COW" event that would notify userspace when a COW happens ?

I don't know what's best for KSM or how well this will work, but we
have had such an event for years.  See UFFDIO_REGISTER_MODE_WP:

https://man7.org/linux/man-pages/man2/userfaultfd.2.html

> 
> This would allow userspace to replace ksmd by tracking the age of
> those anonymous pages, and issue madvise MADV_MERGE on them to
> write-protect+merge them when it is deemed useful.
> 
> With both a new userfaultfd COW event and madvise MADV_MERGE,
> is there anything else that is fundamentally missing to move
> all the scanning complexity of KSM to userspace for the VM
> deduplication use-case ?

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28 16:32         ` Peter Xu
@ 2025-02-28 17:53           ` Mathieu Desnoyers
  2025-02-28 22:32             ` Peter Xu
  0 siblings, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28 17:53 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
	Olivier Dion, linux-mm

On 2025-02-28 11:32, Peter Xu wrote:
> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>> For the VM use-case, I wonder if we could just add a userfaultfd
>> "COW" event that would notify userspace when a COW happens ?
> 
> I don't know what's the best for KSM and how well this will work, but we
> have such event for years..  See UFFDIO_REGISTER_MODE_WP:
> 
> https://man7.org/linux/man-pages/man2/userfaultfd.2.html

userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
resulting from a mmap mapping, but returns EINVAL if I pass a
page-aligned address which sits within a private file mapping
(e.g. executable data).

Also, I notice that do_wp_page() only calls handle_userfault() with
VM_UFFD_WP when the vm_fault flags do not have FAULT_FLAG_UNSHARE
set.

AFAIU, as it stands now userfaultfd would not help tracking COW faults
caused by stores to private file mappings. Am I missing something ?

Thanks,

Mathieu

> 
>>
>> This would allow userspace to replace ksmd by tracking the age of
>> those anonymous pages, and issue madvise MADV_MERGE on them to
>> write-protect+merge them when it is deemed useful.
>>
>> With both a new userfaultfd COW event and madvise MADV_MERGE,
>> is there anything else that is fundamentally missing to move
>> all the scanning complexity of KSM to userspace for the VM
>> deduplication use-case ?
> 
> Thanks,
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28 15:10           ` David Hildenbrand
  2025-02-28 15:19             ` David Hildenbrand
@ 2025-02-28 21:38             ` Mathieu Desnoyers
  2025-02-28 21:45               ` David Hildenbrand
  1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28 21:38 UTC (permalink / raw)
  To: David Hildenbrand, Sean Christopherson
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
	Olivier Dion, linux-mm

On 2025-02-28 10:10, David Hildenbrand wrote:
[...]
> For example, QEMU will mark all guest memory is mergeable using MADV, to 
> limit the deduplicaton to guest RAM only.
> 

On a related note, I think the madvise(2) documentation is inaccurate.

It states:

        MADV_MERGEABLE (since Linux 2.6.32)
               Enable  Kernel Samepage Merging (KSM) for the pages in the range
               specified by addr and length. [...]

AFAIU, based on code review of ksm_madvise(), this is not strictly true.

The KSM implementation enables KSM for pages in the entire vma containing the range.
So if it so happens that two mmap areas with identical protection flags are merged,
both will be considered mergeable by KSM as soon as at least one page from any of
those areas is made mergeable.

This does not appear to be an issue in qemu because guard pages with different
protection are placed between distinct mappings, which should prevent combining
the vmas.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28 21:38             ` Mathieu Desnoyers
@ 2025-02-28 21:45               ` David Hildenbrand
  2025-02-28 21:49                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-02-28 21:45 UTC (permalink / raw)
  To: Mathieu Desnoyers, Sean Christopherson
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
	Olivier Dion, linux-mm

On 28.02.25 22:38, Mathieu Desnoyers wrote:
> On 2025-02-28 10:10, David Hildenbrand wrote:
> [...]
>> For example, QEMU will mark all guest memory is mergeable using MADV, to
>> limit the deduplicaton to guest RAM only.
>>
> 
> On a related note, I think the madvise(2) documentation is inaccurate.
> 
> It states:
> 
>          MADV_MERGEABLE (since Linux 2.6.32)
>                 Enable  Kernel Samepage Merging (KSM) for the pages in the range
>                 specified by addr and length. [...]
> 
> AFAIU, based on code review of ksm_madvise(), this is not strictly true.
> 
> The KSM implementation enables KSM for pages in the entire vma containing the range.
> So if it so happens that two mmap areas with identical protection flags are merged,
> both will be considered mergeable by KSM as soon as at least one page from any of
> those areas is made mergeable.

I *think* it does what is documented. In madvise_vma_behavior(), 
ksm_madvise() will update "new_flags".

Then we call madvise_update_vma() to split the VMA if required and set 
new_flags only on the split VMA. The handling is similar to other MADV 
operations that end up modifying vm_flags.

If I am missing something and this is indeed broken, we should 
definitely write a selftest for it and fix it.
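
The split described above can be pictured with a toy user-space model.
This is only a sketch under stated assumptions: struct toy_vma,
TOY_VM_MERGEABLE and toy_madvise_set_flag() are hypothetical names made
up for illustration, not the kernel's madvise_update_vma() code. It
shows a flag being applied to a sub-range, with the head and tail
pieces keeping their old flags.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a VMA: a half-open address range plus a flags word. */
struct toy_vma {
	unsigned long start;
	unsigned long end;
	unsigned long flags;
};

#define TOY_VM_MERGEABLE 0x1UL	/* illustrative bit, not the kernel's value */

/*
 * Apply `flag` to [start, end) inside `vma`, splitting as needed.
 * Writes up to three pieces to `out` and returns how many were written.
 * Returns 0 if the range is empty or not fully inside the VMA.
 */
static size_t toy_madvise_set_flag(const struct toy_vma *vma,
				   unsigned long start, unsigned long end,
				   unsigned long flag, struct toy_vma out[3])
{
	size_t n = 0;

	assert(vma && out);

	if (start >= end || start < vma->start || end > vma->end)
		return 0;

	/* Head piece before the range keeps the old flags. */
	if (start > vma->start)
		out[n++] = (struct toy_vma){ vma->start, start, vma->flags };

	/* Only the requested range gets the new flag. */
	out[n++] = (struct toy_vma){ start, end, vma->flags | flag };

	/* Tail piece after the range keeps the old flags. */
	if (end < vma->end)
		out[n++] = (struct toy_vma){ end, vma->end, vma->flags };

	return n;
}
```

In this model, flagging one page in the middle of a four-page VMA
yields three pieces, and only the middle one carries the flag, which
matches the documented per-range behaviour.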

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28 21:45               ` David Hildenbrand
@ 2025-02-28 21:49                 ` Mathieu Desnoyers
  0 siblings, 0 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-02-28 21:49 UTC (permalink / raw)
  To: David Hildenbrand, Sean Christopherson
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
	Olivier Dion, linux-mm

On 2025-02-28 16:45, David Hildenbrand wrote:
> On 28.02.25 22:38, Mathieu Desnoyers wrote:
>> On 2025-02-28 10:10, David Hildenbrand wrote:
>> [...]
>>> For example, QEMU will mark all guest memory is mergeable using MADV, to
>>> limit the deduplicaton to guest RAM only.
>>>
>>
>> On a related note, I think the madvise(2) documentation is inaccurate.
>>
>> It states:
>>
>>          MADV_MERGEABLE (since Linux 2.6.32)
>>                 Enable  Kernel Samepage Merging (KSM) for the pages in 
>> the range
>>                 specified by addr and length. [...]
>>
>> AFAIU, based on code review of ksm_madvise(), this is not strictly true.
>>
>> The KSM implementation enables KSM for pages in the entire vma 
>> containing the range.
>> So if it so happens that two mmap areas with identical protection 
>> flags are merged,
>> both will be considered mergeable by KSM as soon as at least one page 
>> from any of
>> those areas is made mergeable.
> 
> I *think* it does what is documented. In madvise_vma_behavior(), 
> ksm_madvise() will update "new_flags".
> 
> Then we call madvise_update_vma() to split the VMA if required and set 
> new_flags only on the split VMA. The handling is similar to other MADV 
> operations that end up modifying vm_flags.
> 
> If I am missing something and this is indeed broken, we should 
> definitely write a selftest for it and fix it.
> 

You are correct, I missed that part. Thanks for the clarification!

Mathieu



-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28 17:53           ` Mathieu Desnoyers
@ 2025-02-28 22:32             ` Peter Xu
  2025-03-01 15:44               ` Mathieu Desnoyers
  2025-03-03 20:01               ` Mathieu Desnoyers
  0 siblings, 2 replies; 29+ messages in thread
From: Peter Xu @ 2025-02-28 22:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
	Olivier Dion, linux-mm

On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
> On 2025-02-28 11:32, Peter Xu wrote:
> > On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
> > > For the VM use-case, I wonder if we could just add a userfaultfd
> > > "COW" event that would notify userspace when a COW happens ?
> > 
> > I don't know what's the best for KSM and how well this will work, but we
> > have such event for years..  See UFFDIO_REGISTER_MODE_WP:
> > 
> > https://man7.org/linux/man-pages/man2/userfaultfd.2.html
> 
> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
> resulting from a mmap mapping, but returns EINVAL if I pass a
> page-aligned address which sits within a private file mapping
> (e.g. executable data).

Yes, so far sync traps only support RAM-based file systems or anonymous
memory.  Generic private file mappings (which store executables and
libraries) are not yet supported.

> 
> Also, I notice that do_wp_page() only calls handle_userfault
> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
> set.

AFAICT that's expected, unshare should only be set on reads, never writes.
So uffd-wp shouldn't trap any of those.

> 
> AFAIU, as it stands now userfaultfd would not help tracking COW faults
> caused by stores to private file mappings. Am I missing something ?

I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should work on
most mappings.  That one is async, though, so more like soft-dirty.  It
might be doable to try making it sync too without a lot of changes based on
how async tracking works.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28 22:32             ` Peter Xu
@ 2025-03-01 15:44               ` Mathieu Desnoyers
  2025-03-03 15:01                 ` Peter Xu
  2025-03-03 20:01               ` Mathieu Desnoyers
  1 sibling, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-03-01 15:44 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
	Olivier Dion, linux-mm

On 2025-02-28 17:32, Peter Xu wrote:
> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>> On 2025-02-28 11:32, Peter Xu wrote:
>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>> "COW" event that would notify userspace when a COW happens ?
>>>
>>> I don't know what's the best for KSM and how well this will work, but we
>>> have such event for years..  See UFFDIO_REGISTER_MODE_WP:
>>>
>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>
>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>> resulting from a mmap mapping, but returns EINVAL if I pass a
>> page-aligned address which sits within a private file mapping
>> (e.g. executable data).
> 
> Yes, so far sync traps only supports RAM-based file systems, or anonymous.
> Generic private file mappings (that stores executables and libraries) are
> not yet supported.

OK, this confirms my observations.

> 
>>
>> Also, I notice that do_wp_page() only calls handle_userfault
>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>> set.
> 
> AFAICT that's expected, unshare should only be set on reads, never writes.
> So uffd-wp shouldn't trap any of those.

I'm confused by your comment. I thought unshare only applies to
*write* faults. What am I missing ?

> 
>>
>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>> caused by stores to private file mappings. Am I missing something ?
> 
> I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should work on
> most mappings.  That one is async, though, so more like soft-dirty.  It
> might be doable to try making it sync too without a lot of changes based on
> how async tracking works.

I'll try this out. It may not matter that it's async given the use-case
of tracking the age since the WP fault on the COW pages. We don't need
to react to the event in place to alter its behavior; a notification
should be fine AFAIU.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-03-01 15:44               ` Mathieu Desnoyers
@ 2025-03-03 15:01                 ` Peter Xu
  2025-03-03 16:36                   ` David Hildenbrand
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Xu @ 2025-03-03 15:01 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
	Olivier Dion, linux-mm

On Sat, Mar 01, 2025 at 10:44:22AM -0500, Mathieu Desnoyers wrote:
> > > Also, I notice that do_wp_page() only calls handle_userfault
> > > VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
> > > set.
> > 
> > AFAICT that's expected, unshare should only be set on reads, never writes.
> > So uffd-wp shouldn't trap any of those.
> 
> I'm confused by your comment. I thought unshare only applies to
> *write* faults. What am I missing ?

The major path so far to set unshare is here in GUP (ignoring two corner
cases used in s390 and ksm):

	if (unshare) {
		fault_flags |= FAULT_FLAG_UNSHARE;
		/* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */
		VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE);
	}

See the VM_BUG_ON() - if it's a write it'll crash already.

"unshare", in its earliest form of patch, used to be called COR
(Copy-On-Read), which might be more straightforward in this case.. so it's
the counterpart of COW but for read cases where a copy is required. The
patchset that introduced it has more information (e.g. a7f2266041).

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-03-03 15:01                 ` Peter Xu
@ 2025-03-03 16:36                   ` David Hildenbrand
  0 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-03-03 16:36 UTC (permalink / raw)
  To: Peter Xu, Mathieu Desnoyers
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
	Olivier Dion, linux-mm

On 03.03.25 16:01, Peter Xu wrote:
> On Sat, Mar 01, 2025 at 10:44:22AM -0500, Mathieu Desnoyers wrote:
>>>> Also, I notice that do_wp_page() only calls handle_userfault
>>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>>>> set.
>>>
>>> AFAICT that's expected, unshare should only be set on reads, never writes.
>>> So uffd-wp shouldn't trap any of those.
>>
>> I'm confused by your comment. I thought unshare only applies to
>> *write* faults. What am I missing ?
> 
> The major path so far to set unshare is here in GUP (ignoring two corner
> cases used in either s390 and ksm):

"unshare" fault, in contrast to a write fault, will not turn the PTE 
writable.

That's why it does not trigger userfaultfd-wp: there is no write access, 
write-protection is left unchanged.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-02-28 22:32             ` Peter Xu
  2025-03-01 15:44               ` Mathieu Desnoyers
@ 2025-03-03 20:01               ` Mathieu Desnoyers
  2025-03-03 20:45                 ` Peter Xu
  2025-03-03 20:49                 ` David Hildenbrand
  1 sibling, 2 replies; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-03-03 20:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
	Olivier Dion, linux-mm

On 2025-02-28 17:32, Peter Xu wrote:
> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>> On 2025-02-28 11:32, Peter Xu wrote:
>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>> "COW" event that would notify userspace when a COW happens ?
>>>
>>> I don't know what's the best for KSM and how well this will work, but we
>>> have such event for years..  See UFFDIO_REGISTER_MODE_WP:
>>>
>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>
>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>> resulting from a mmap mapping, but returns EINVAL if I pass a
>> page-aligned address which sits within a private file mapping
>> (e.g. executable data).
> 
> Yes, so far sync traps only supports RAM-based file systems, or anonymous.
> Generic private file mappings (that stores executables and libraries) are
> not yet supported.
> 
>>
>> Also, I notice that do_wp_page() only calls handle_userfault
>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>> set.
> 
> AFAICT that's expected, unshare should only be set on reads, never writes.
> So uffd-wp shouldn't trap any of those.
> 
>>
>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>> caused by stores to private file mappings. Am I missing something ?
> 
> I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should work on
> most mappings.  That one is async, though, so more like soft-dirty.  It
> might be doable to try making it sync too without a lot of changes based on
> how async tracking works.

I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
be a good fit. Here is what I have in mind to replace the ksmd scanning
thread for the VM use-case by a purely user-space driven scanning:

Within qemu or similar user-space process:

1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC feature and
    UFFDIO_REGISTER_MODE_WP mode.

2) Protect user-space memory with the PAGEMAP_SCAN ioctl PM_SCAN_WP_MATCHING flag
    to detect memory which stays invariant for a long time.

3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which pages are written to.
    Keep track of memory which is frequently modified, so it can be left alone and
    not write-protected nor merged anymore.

4) Whenever pages stay invariant for a given lapse of time, merge them with the new
    madvise(2) KSM_MERGE behavior.

Let me know if that makes sense.
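
The aging part of the steps above could be sketched as follows. This is
a minimal user-space model under stated assumptions, not a working
scanner: struct page_age, MERGE_AGE and scan_once() are hypothetical
names, the written[] array stands in for what a PAGEMAP_SCAN call with
PAGE_IS_WRITTEN would report, and the actual merge request is left to
the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MERGE_AGE 3	/* hypothetical: clean scans required before merging */

/* Per-page state kept by the hypothetical user-space scanner. */
struct page_age {
	unsigned int clean_scans;	/* consecutive scans without a write */
	bool merged;			/* already handed to the merge request */
};

/*
 * Process one scan. written[i] is true if page i was reported as written
 * since the previous scan. Indices of pages that just reached MERGE_AGE
 * are stored in to_merge[]; the return value is how many were stored.
 */
static size_t scan_once(struct page_age *pages, const bool *written,
			size_t nr_pages, size_t *to_merge)
{
	size_t n = 0;

	assert(pages && written && to_merge);

	for (size_t i = 0; i < nr_pages; i++) {
		if (written[i]) {
			/* A write restarts aging; a merged page was unmerged by COW. */
			pages[i].clean_scans = 0;
			pages[i].merged = false;
			continue;
		}
		if (pages[i].merged)
			continue;
		if (++pages[i].clean_scans >= MERGE_AGE) {
			pages[i].merged = true;
			to_merge[n++] = i;	/* caller would issue the merge here */
		}
	}
	return n;
}
```

The sketch leaves out the "frequently modified, leave it alone" part of
step 3; a real scanner would presumably also count resets per page and
stop re-protecting pages that keep getting written.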

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-03-03 20:01               ` Mathieu Desnoyers
@ 2025-03-03 20:45                 ` Peter Xu
  2025-03-03 20:49                 ` David Hildenbrand
  1 sibling, 0 replies; 29+ messages in thread
From: Peter Xu @ 2025-03-03 20:45 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
	Olivier Dion, linux-mm

On Mon, Mar 03, 2025 at 03:01:38PM -0500, Mathieu Desnoyers wrote:
> On 2025-02-28 17:32, Peter Xu wrote:
> > On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
> > > On 2025-02-28 11:32, Peter Xu wrote:
> > > > On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
> > > > > For the VM use-case, I wonder if we could just add a userfaultfd
> > > > > "COW" event that would notify userspace when a COW happens ?
> > > > 
> > > > I don't know what's the best for KSM and how well this will work, but we
> > > > have such event for years..  See UFFDIO_REGISTER_MODE_WP:
> > > > 
> > > > https://man7.org/linux/man-pages/man2/userfaultfd.2.html
> > > 
> > > userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
> > > resulting from a mmap mapping, but returns EINVAL if I pass a
> > > page-aligned address which sits within a private file mapping
> > > (e.g. executable data).
> > 
> > Yes, so far sync traps only supports RAM-based file systems, or anonymous.
> > Generic private file mappings (that stores executables and libraries) are
> > not yet supported.
> > 
> > > 
> > > Also, I notice that do_wp_page() only calls handle_userfault
> > > VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
> > > set.
> > 
> > AFAICT that's expected, unshare should only be set on reads, never writes.
> > So uffd-wp shouldn't trap any of those.
> > 
> > > 
> > > AFAIU, as it stands now userfaultfd would not help tracking COW faults
> > > caused by stores to private file mappings. Am I missing something ?
> > 
> > I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should work on
> > most mappings.  That one is async, though, so more like soft-dirty.  It
> > might be doable to try making it sync too without a lot of changes based on
> > how async tracking works.
> 
> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
> be a good fit. Here is what I have in mind to replace the ksmd scanning
> thread for the VM use-case by a purely user-space driven scanning:
> 
> Within qemu or similar user-space process:
> 
> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC feature and
>    UFFDIO_REGISTER_MODE_WP mode.
> 
> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl PM_SCAN_WP_MATCHING flag
>    to detect memory which stays invariant for a long time.
> 
> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which pages are written to.
>    Keep track of memory which is frequently modified, so it can be left alone and
>    not write-protected nor merged anymore.
> 
> 4) Whenever pages stay invariant for a given lapse of time, merge them with the new
>    madvise(2) KSM_MERGE behavior.
> 
> Let me know if that makes sense.

I can't speak of how KSM should go from there, but from userfault tracking
POV, that makes sense to me.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-03-03 20:01               ` Mathieu Desnoyers
  2025-03-03 20:45                 ` Peter Xu
@ 2025-03-03 20:49                 ` David Hildenbrand
  2025-03-05 14:06                   ` Mathieu Desnoyers
  1 sibling, 1 reply; 29+ messages in thread
From: David Hildenbrand @ 2025-03-03 20:49 UTC (permalink / raw)
  To: Mathieu Desnoyers, Peter Xu
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
	Olivier Dion, linux-mm

On 03.03.25 21:01, Mathieu Desnoyers wrote:
> On 2025-02-28 17:32, Peter Xu wrote:
>> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>>> On 2025-02-28 11:32, Peter Xu wrote:
>>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>>> "COW" event that would notify userspace when a COW happens ?
>>>>
>>>> I don't know what's the best for KSM and how well this will work, but we
>>>> have such event for years..  See UFFDIO_REGISTER_MODE_WP:
>>>>
>>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>>
>>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>>> resulting from a mmap mapping, but returns EINVAL if I pass a
>>> page-aligned address which sits within a private file mapping
>>> (e.g. executable data).
>>
>> Yes, so far sync traps only supports RAM-based file systems, or anonymous.
>> Generic private file mappings (that stores executables and libraries) are
>> not yet supported.
>>
>>>
>>> Also, I notice that do_wp_page() only calls handle_userfault
>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>>> set.
>>
>> AFAICT that's expected, unshare should only be set on reads, never writes.
>> So uffd-wp shouldn't trap any of those.
>>
>>>
>>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>>> caused by stores to private file mappings. Am I missing something ?
>>
>> I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should work on
>> most mappings.  That one is async, though, so more like soft-dirty.  It
>> might be doable to try making it sync too without a lot of changes based on
>> how async tracking works.
> 
> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
> be a good fit. Here is what I have in mind to replace the ksmd scanning
> thread for the VM use-case by a purely user-space driven scanning:
> 
> Within qemu or similar user-space process:
> 
> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC feature and
>      UFFDIO_REGISTER_MODE_WP mode.
> 
> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl PM_SCAN_WP_MATCHING flag
>      to detect memory which stays invariant for a long time.
> 
> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which pages are written to.
>      Keep track of memory which is frequently modified, so it can be left alone and
>      not write-protected nor merged anymore.
> 
> 4) Whenever pages stay invariant for a given lapse of time, merge them with the new
>      madvise(2) KSM_MERGE behavior.
> 
> Let me know if that makes sense.

Note that one of the strengths of ksm in the kernel right now is that we 
write-protect + try-deduplicate only when we are fairly sure that we can 
deduplicate (unstable tree), and that the interaction with THPs / large 
folios is fairly well thought-through.

Also note that, just because data hasn't been written in some time 
interval, doesn't mean that it should be deduplicated and result in CoW 
on next write access.

One probably would have to mimic what the KSM implementation in the 
kernel does, and build something like the unstable tree, to find 
candidates where we can actually deduplicate. Then, have a way to 
not-deduplicate if the content changed.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-03-03 20:49                 ` David Hildenbrand
@ 2025-03-05 14:06                   ` Mathieu Desnoyers
  2025-03-05 19:22                     ` David Hildenbrand
  0 siblings, 1 reply; 29+ messages in thread
From: Mathieu Desnoyers @ 2025-03-05 14:06 UTC (permalink / raw)
  To: David Hildenbrand, Peter Xu
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
	Olivier Dion, linux-mm

On 2025-03-03 15:49, David Hildenbrand wrote:
> On 03.03.25 21:01, Mathieu Desnoyers wrote:
>> On 2025-02-28 17:32, Peter Xu wrote:
>>> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>>>> On 2025-02-28 11:32, Peter Xu wrote:
>>>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>>>> "COW" event that would notify userspace when a COW happens ?
>>>>>
>>>>> I don't know what's the best for KSM and how well this will work, 
>>>>> but we
>>>>> have such event for years..  See UFFDIO_REGISTER_MODE_WP:
>>>>>
>>>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>>>
>>>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>>>> resulting from a mmap mapping, but returns EINVAL if I pass a
>>>> page-aligned address which sits within a private file mapping
>>>> (e.g. executable data).
>>>
>>> Yes, so far sync traps only supports RAM-based file systems, or 
>>> anonymous.
>>> Generic private file mappings (that stores executables and libraries) 
>>> are
>>> not yet supported.
>>>
>>>>
>>>> Also, I notice that do_wp_page() only calls handle_userfault
>>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>>>> set.
>>>
>>> AFAICT that's expected, unshare should only be set on reads, never 
>>> writes.
>>> So uffd-wp shouldn't trap any of those.
>>>
>>>>
>>>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>>>> caused by stores to private file mappings. Am I missing something ?
>>>
>>> I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should 
>>> work on
>>> most mappings.  That one is async, though, so more like soft-dirty.  It
>>> might be doable to try making it sync too without a lot of changes 
>>> based on
>>> how async tracking works.
>>
>> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
>> be a good fit. Here is what I have in mind to replace the ksmd scanning
>> thread for the VM use-case by a purely user-space driven scanning:
>>
>> Within qemu or similar user-space process:
>>
>> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC 
>> feature and
>>      UFFDIO_REGISTER_MODE_WP mode.
>>
>> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl 
>> PM_SCAN_WP_MATCHING flag
>>      to detect memory which stays invariant for a long time.
>>
>> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which 
>> pages are written to.
>>      Keep track of memory which is frequently modified, so it can be 
>> left alone and
>>      not write-protected nor merged anymore.
>>
>> 4) Whenever pages stay invariant for a given lapse of time, merge them 
>> with the new
>>      madvise(2) KSM_MERGE behavior.
>>
>> Let me know if that makes sense.
> 
> Note that one of the strengths of ksm in the kernel right now is that we 
> write-protect + try-deduplicate only when we are fairly sure that we can 
> deduplicate (unstable tree), and that the interaction with THPs / large 
> folios is fairly well thought-through.
> 
> Also note that, just because data hasn't been written in some time 
> interval, doesn't mean that it should be deduplicated and result in CoW 
> on next write access.

Right. This tracking of address range access pattern would have to be
implemented in user-space.

> One probably would have to mimic what the KSM implementation in the 
> kernel does, and built something like the unstable tree, to find 
> candidates where we can actually deduplciate. Then, have a way to not- 
> deduplicate if the content changed.

With madvise MADV_MERGE, there is no need to "unmerge". The merge
write-protects the page and merges its content at the time of the
MADV_MERGE with exact duplicates, and keeps that write protected page in
a global hash table indexed by checksum.

However, unlike KSM, it won't track that range on an ongoing basis.

"Unmerging" the page is done naturally by writing to the merged address
range. Because it is write-protected, this will trigger COW, and will 
therefore provide a new anonymous page to the process, thus "unmerging"
that page.
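
The merge-then-COW lifecycle described above can be modelled in plain
user-space C. This is a toy sketch under stated assumptions, not the
SKSM implementation: every toy_* name, the 64-byte page size, and the
FNV-1a checksum are made up for illustration, and table entries are
never reclaimed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 64	/* small on purpose; real pages are larger */
#define TOY_BUCKETS   16

/* A frame in the toy model, shared by `refs` mappings. */
struct toy_frame {
	unsigned char data[TOY_PAGE_SIZE];
	unsigned int refs;
	bool merged;		/* published in the table, so write-protected */
	struct toy_frame *next;	/* hash chain of merged frames */
};

static struct toy_frame *toy_table[TOY_BUCKETS];

/* FNV-1a checksum, chosen only for brevity. */
static uint32_t toy_checksum(const unsigned char *p)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < TOY_PAGE_SIZE; i++)
		h = (h ^ p[i]) * 16777619u;
	return h;
}

/* Allocate a private (unmerged) frame with the given content. */
static struct toy_frame *toy_frame_new(const unsigned char *data)
{
	struct toy_frame *f = calloc(1, sizeof(*f));

	assert(f);
	memcpy(f->data, data, TOY_PAGE_SIZE);
	f->refs = 1;
	return f;
}

/*
 * One-shot merge of the frame behind *slot: share an identical frame
 * if one is already in the table, otherwise publish this one.
 */
static void toy_merge(struct toy_frame **slot)
{
	struct toy_frame *f = *slot, *it;
	struct toy_frame **head;

	if (f->merged)
		return;
	head = &toy_table[toy_checksum(f->data) % TOY_BUCKETS];
	for (it = *head; it; it = it->next) {
		if (!memcmp(it->data, f->data, TOY_PAGE_SIZE)) {
			it->refs++;
			*slot = it;
			free(f);	/* the duplicate frame is released */
			return;
		}
	}
	f->merged = true;
	f->next = *head;
	*head = f;
}

/* Store one byte through *slot, copying first if the frame is merged. */
static void toy_write(struct toy_frame **slot, size_t off, unsigned char val)
{
	struct toy_frame *f = *slot;

	assert(off < TOY_PAGE_SIZE);
	if (f->merged) {
		/* COW: the writer gets a private copy, i.e. it is "unmerged". */
		struct toy_frame *copy = toy_frame_new(f->data);

		f->refs--;
		*slot = copy;
		f = copy;
	}
	f->data[off] = val;
}
```

Note that the model has no unmerge call at all: the only way out of the
merged state is toy_write(), and nothing re-merges the private copy
unless the caller asks for it again.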

It's really just up to userspace to track COW faults and figure out
that it really should not try to merge that range anymore, based on
the access pattern monitored through write-protection faults.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging
  2025-03-05 14:06                   ` Mathieu Desnoyers
@ 2025-03-05 19:22                     ` David Hildenbrand
  0 siblings, 0 replies; 29+ messages in thread
From: David Hildenbrand @ 2025-03-05 19:22 UTC (permalink / raw)
  To: Mathieu Desnoyers, Peter Xu
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Matthew Wilcox,
	Olivier Dion, linux-mm

On 05.03.25 15:06, Mathieu Desnoyers wrote:
> On 2025-03-03 15:49, David Hildenbrand wrote:
>> On 03.03.25 21:01, Mathieu Desnoyers wrote:
>>> On 2025-02-28 17:32, Peter Xu wrote:
>>>> On Fri, Feb 28, 2025 at 12:53:02PM -0500, Mathieu Desnoyers wrote:
>>>>> On 2025-02-28 11:32, Peter Xu wrote:
>>>>>> On Fri, Feb 28, 2025 at 09:59:00AM -0500, Mathieu Desnoyers wrote:
>>>>>>> For the VM use-case, I wonder if we could just add a userfaultfd
>>>>>>> "COW" event that would notify userspace when a COW happens ?
>>>>>>
>>>>>> I don't know what's best for KSM and how well this will work, but
>>>>>> we have had such an event for years.  See UFFDIO_REGISTER_MODE_WP:
>>>>>>
>>>>>> https://man7.org/linux/man-pages/man2/userfaultfd.2.html
>>>>>
>>>>> userfaultfd UFFDIO_REGISTER only seems to work if I pass an address
>>>>> resulting from a mmap mapping, but returns EINVAL if I pass a
>>>>> page-aligned address which sits within a private file mapping
>>>>> (e.g. executable data).
>>>>
>>>> Yes, so far sync traps only support RAM-based file systems, or
>>>> anonymous memory.
>>>> Generic private file mappings (that store executables and libraries)
>>>> are not yet supported.
>>>>
>>>>>
>>>>> Also, I notice that do_wp_page() only calls handle_userfault
>>>>> VM_UFFD_WP when vm_fault flags does not have FAULT_FLAG_UNSHARE
>>>>> set.
>>>>
>>>> AFAICT that's expected, unshare should only be set on reads, never
>>>> writes.
>>>> So uffd-wp shouldn't trap any of those.
>>>>
>>>>>
>>>>> AFAIU, as it stands now userfaultfd would not help tracking COW faults
>>>>> caused by stores to private file mappings. Am I missing something ?
>>>>
>>>> I think you're right.  So we have UFFD_FEATURE_WP_ASYNC that should
>>>> work on most mappings.  That one is async, though, so more like
>>>> soft-dirty.  It might be doable to try making it sync too without a
>>>> lot of changes based on how async tracking works.
>>>
>>> I'm looking more closely at admin-guide/mm/pagemap.rst and it appears to
>>> be a good fit. Here is what I have in mind to replace the ksmd scanning
>>> thread for the VM use-case by a purely user-space driven scanning:
>>>
>>> Within qemu or similar user-space process:
>>>
>>> 1) Track guest memory with the userfaultfd UFFD_FEATURE_WP_ASYNC
>>>       feature and UFFDIO_REGISTER_MODE_WP mode.
>>>
>>> 2) Protect user-space memory with the PAGEMAP_SCAN ioctl
>>>       PM_SCAN_WP_MATCHING flag to detect memory which stays invariant
>>>       for a long time.
>>>
>>> 3) Use the PAGEMAP_SCAN ioctl with PAGE_IS_WRITTEN to detect which
>>>       pages are written to. Keep track of memory which is frequently
>>>       modified, so it can be left alone and neither write-protected
>>>       nor merged anymore.
>>>
>>> 4) Whenever pages stay invariant for a given lapse of time, merge them
>>>       with the new madvise(2) KSM_MERGE behavior.
>>>
>>> Let me know if that makes sense.
>>
>> Note that one of the strengths of ksm in the kernel right now is that we
>> write-protect + try-deduplicate only when we are fairly sure that we can
>> deduplicate (unstable tree), and that the interaction with THPs / large
>> folios is fairly well thought-through.
>>
>> Also note that, just because data hasn't been written in some time
>> interval, doesn't mean that it should be deduplicated and result in CoW
>> on next write access.
> 
> Right. This tracking of address range access pattern would have to be
> implemented in user-space.
> 
>> One probably would have to mimic what the KSM implementation in the
>> kernel does, and build something like the unstable tree, to find
>> candidates where we can actually deduplicate. Then, have a way to
>> not-deduplicate if the content changed.
> 
> With madvise MADV_MERGE, there is no need to "unmerge". The merge
> write-protects the page and merges its content at the time of the
> MADV_MERGE with exact duplicates, and keeps that write protected page in
> a global hash table indexed by checksum.

Right, and that's a real problem.

> 
> However, unlike KSM, it won't track that range on an ongoing basis.
> 
> "Unmerging" the page is done naturally by writing to the merged address
> range. Because it is write-protected, this will trigger COW, and will
> therefore provide a new anonymous page to the process, thus "unmerging"
> that page.
> 
> It's really just up to userspace to track COW faults and figure out
> that it really should not try to merge that range anymore, based on
> the access pattern monitored through write-protection faults.
> 

Just to be clear, what you described here is very likely not a feasible
replacement, performance-wise, for the in-tree ksm in the VM use case
(again, ksm was primarily invented for VMs).

-- 
Cheers,

David / dhildenb




end of thread, other threads:[~2025-03-05 19:22 UTC | newest]

Thread overview: 29+ messages
2025-02-28  2:30 [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Mathieu Desnoyers
2025-02-28  2:30 ` [RFC PATCH 1/2] mm: Introduce " Mathieu Desnoyers
2025-02-28  2:30 ` [RFC PATCH 2/2] selftests/kskm: Introduce SKSM basic test Mathieu Desnoyers
2025-02-28  2:51 ` [RFC PATCH 0/2] SKSM: Synchronous Kernel Samepage Merging Linus Torvalds
2025-02-28  3:03   ` Mathieu Desnoyers
2025-02-28  5:17     ` Linus Torvalds
2025-02-28 13:59       ` David Hildenbrand
2025-02-28 14:59         ` Sean Christopherson
2025-02-28 15:10           ` David Hildenbrand
2025-02-28 15:19             ` David Hildenbrand
2025-02-28 21:38             ` Mathieu Desnoyers
2025-02-28 21:45               ` David Hildenbrand
2025-02-28 21:49                 ` Mathieu Desnoyers
2025-02-28 15:01         ` Mathieu Desnoyers
2025-02-28 15:18           ` David Hildenbrand
2025-02-28 14:59       ` Mathieu Desnoyers
2025-02-28 16:32         ` Peter Xu
2025-02-28 17:53           ` Mathieu Desnoyers
2025-02-28 22:32             ` Peter Xu
2025-03-01 15:44               ` Mathieu Desnoyers
2025-03-03 15:01                 ` Peter Xu
2025-03-03 16:36                   ` David Hildenbrand
2025-03-03 20:01               ` Mathieu Desnoyers
2025-03-03 20:45                 ` Peter Xu
2025-03-03 20:49                 ` David Hildenbrand
2025-03-05 14:06                   ` Mathieu Desnoyers
2025-03-05 19:22                     ` David Hildenbrand
2025-02-28 15:34   ` David Hildenbrand
2025-02-28 15:38     ` Matthew Wilcox
