* [PATCH RFC v2 0/4]  Add support for sharing page tables across processes (Previously mshare)
@ 2023-04-26 16:49 Khalid Aziz
  2023-04-26 16:49 ` [PATCH RFC v2 1/4] mm/ptshare: Add vm flag for shared PTE Khalid Aziz
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Khalid Aziz @ 2023-04-26 16:49 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, mike.kravetz
  Cc: Khalid Aziz, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

Memory pages shared between processes require a page table entry
(PTE) in each process. Each of these PTEs consumes some memory and,
as long as the number of mappings being maintained is small enough,
this space consumed by page tables is not objectionable. When very
few memory pages are shared between processes, the number of page
table entries (PTEs) to maintain is mostly constrained by the
number of pages of memory on the system. As the number of shared
pages and the number of times pages are shared goes up, the amount
of memory consumed by page tables starts to become significant.
This issue does not apply to threads: any number of threads can
share the same pages inside a process while sharing the same PTEs.
Extending this same model to sharing pages across processes can
eliminate this issue for sharing across processes as well.

Some field deployments commonly see memory pages shared across
1000s of processes. On x86_64, each page requires a PTE that is
only 8 bytes long, which is very small compared to the 4K page
size. When 2000 processes map the same page in their address space,
each one of them requires 8 bytes for its PTE, and together that
adds up to 16K of memory just to hold the PTEs for one 4K page. On
a database server with a 300GB SGA, a system crash with an
out-of-memory condition was seen when 1500+ clients tried to share
this SGA even though the system had 512GB of memory. On this
server, the worst case scenario of all 1500 processes mapping every
page from the SGA would have required 878GB+ for just the PTEs. If
these PTEs could be shared, the amount of memory saved would be
very significant.

This patch series adds a new mmap() flag - MAP_SHARED_PT. This flag
can be specified along with MAP_SHARED by a process to hint to the
kernel that it wishes to share page table entries for this file
mapping mmap region with other processes. Any other process that
mmaps the same file with the MAP_SHARED_PT flag can then share the
same page table entries. Besides specifying the MAP_SHARED_PT flag,
the processes must map the file at a PMD aligned address, with a
size that is a multiple of PMD size, and at the same virtual
address. This last requirement of the same virtual address can
possibly be relaxed if that is the consensus.
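
As an illustration, a cooperating process might opt in along these
lines (untested sketch: MAP_SHARED_PT and MAP_FIXED_NOREPLACE are
defined locally with the values from mman-common.h and patch 2 in
case they are not in the installed uapi headers, the file path is a
placeholder, and the 2MB granularity assumes the x86_64 PMD size):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE	0x100000	/* from mman-common.h */
#endif
#ifndef MAP_SHARED_PT
#define MAP_SHARED_PT		0x200000	/* value proposed in patch 2 */
#endif

#define PMD_SZ		(2UL << 20)			/* PMD size on x86_64 */
#define MAP_SZ		(4 * PMD_SZ)			/* multiple of PMD size */
#define MAP_ADDR	((void *)0x600000000000UL)	/* PMD aligned, same in every process */

int main(void)
{
	void *addr;
	int fd = open("/path/to/shared_file", O_RDWR);	/* placeholder path */

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Every process that wants to share page tables maps the same
	 * file at the same PMD aligned address, with a size that is a
	 * multiple of PMD size, and passes MAP_SHARED_PT along with
	 * MAP_SHARED.
	 */
	addr = mmap(MAP_ADDR, MAP_SZ, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_SHARED_PT | MAP_FIXED_NOREPLACE, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* ... use the mapping as usual ... */

	munmap(addr, MAP_SZ);
	close(fd);
	return 0;
}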

When mmap() is called with the MAP_SHARED_PT flag, a new host mm
struct is created to hold the shared page tables. The host mm
struct is not attached to a process. Start and size of the host mm
are set to the start and size of the mmap region, and a VMA
covering this range is also added to the host mm struct. Existing
page table entries from the process that creates the mapping are
copied over to the host mm struct. All processes mapping this
shared region are considered guest processes. When a guest process
mmaps the shared region, the vm flag VM_SHARED_PT is added to the
VMAs in the guest process. Upon a page fault, the VMA is checked
for the presence of the VM_SHARED_PT flag. If the flag is found,
the corresponding PMD is updated with the PMD from the host mm
struct so that it points to the page tables in the host mm struct.
The vm_mm pointer of the VMA is also updated to point to the host
mm struct for the duration of fault handling, to ensure fault
handling happens in the context of the host mm struct. When a new
PTE is created, it is created in the host mm struct's page tables
and the PMD in the guest mm points to the same PTEs.

This is a basic working implementation. It will need to go through
more testing and refinements. Some notes and questions:

- The PMD size alignment and size requirement is currently hard
  coded in. Is there a need or desire to make this more flexible
  and work with other alignments/sizes? PMD size allows for
  adapting this infrastructure to form the basis for hugetlbfs page
  table sharing as well. More work will be needed to make that
  happen.

- Is there a reason to allow a userspace app to query this size and
  alignment requirement for MAP_SHARED_PT in some way? (A possible
  stopgap using an existing sysfs file is sketched after this list.)

- Shared PTEs mean an mprotect() call made by one process affects
  all processes sharing the same mapping, and that behavior will
  need to be documented clearly. The effect of an mprotect call
  being different for processes using shared page tables is the
  primary reason to require an explicit opt-in from userspace
  processes to share page tables. With transparent sharing derived
  from MAP_SHARED alone, the changed effect of mprotect could break
  a significant number of userspace apps. One could work around
  that by unsharing whenever mprotect changes modes on a shared
  mapping, but that introduces complexity, and the capability to
  execute a single mprotect to change modes across 1000s of
  processes sharing a mapped database is a feature explicitly asked
  for by database folks. This capability has a significant
  performance advantage compared to the mechanism of sending a
  message to every process using the shared mapping so it can call
  mprotect and change modes itself, or using traps on permission
  mismatches in each process.

- This implementation does not allow partially unmapping a page
  table shared mapping. Should that be supported in the future?
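
Regarding the question above about a userspace app querying the
size and alignment requirement: on kernels built with
CONFIG_TRANSPARENT_HUGEPAGE, the existing sysfs file
/sys/kernel/mm/transparent_hugepage/hpage_pmd_size reports PMD size
in bytes and could serve as a stopgap. A minimal sketch, assuming
that file is present:

#include <stdio.h>

/* Returns PMD size in bytes, or -1 if it cannot be determined. */
static long get_pmd_size(void)
{
	unsigned long sz = 0;
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");

	if (!f)
		return -1;
	if (fscanf(f, "%lu", &sz) != 1)
		sz = 0;
	fclose(f);
	return sz ? (long)sz : -1;
}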

Some concerns in this RFC:

- When page tables for a process are freed upon process exit,
  pmd_free_tlb() gets called at one point to free all PMDs
  allocated by the process. For a shared page table, shared PMDs
  cannot be released when a guest process exits. These shared PMDs
  are released when the host mm struct is released, at the end of
  the last reference to the page table shared region hosted by this
  mm. For now, to stop these PMDs from being released, this RFC
  introduces the following change in mm/memory.c, which works but
  does not feel like the right approach. Any suggestions for a
  better long term approach will be very much appreciated:

@@ -210,13 +221,19 @@ static inline void free_pmd_range(struct mmu_gather *tlb,
pud_t *pud,

        pmd = pmd_offset(pud, start);
        pud_clear(pud);
-       pmd_free_tlb(tlb, pmd, start);
-       mm_dec_nr_pmds(tlb->mm);
+       if (shared_pte) {
+               tlb_flush_pud_range(tlb, start, PAGE_SIZE);
+               tlb->freed_tables = 1;
+       } else {
+               pmd_free_tlb(tlb, pmd, start);
+               mm_dec_nr_pmds(tlb->mm);
+       }
 }

 static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,

- This implementation requires an additional VM flag. Since all
  lower 32 bits are currently in use, the new VM flag must come
  from the upper 32 bits, which restricts this feature to 64-bit
  processors.

- This feature is implemented for file mappings only. Is there a
  need to support it for anonymous memory as well?

- Accounting for MAP_SHARED_PT mapped file pages in a process and
  for page table bytes is not quite accurate yet in this RFC and
  will be fixed in the non-RFC version of the patches.

I appreciate any feedback on these patches and ideas for
improvements before moving these patches out of the RFC stage.


Changes from RFC v1:
- Broke the patches up into smaller patches
- Fixed a few bugs related to freeing PTEs and PMDs incorrectly
- Cleaned up the code a bit


Khalid Aziz (4):
  mm/ptshare: Add vm flag for shared PTE
  mm/ptshare: Add flag MAP_SHARED_PT to mmap()
  mm/ptshare: Create new mm struct for page table sharing
  mm/ptshare: Add page fault handling for page table shared regions

 include/linux/fs.h                     |   2 +
 include/linux/mm.h                     |   8 +
 include/trace/events/mmflags.h         |   3 +-
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/Makefile                            |   2 +-
 mm/internal.h                          |  21 ++
 mm/memory.c                            | 105 ++++++++--
 mm/mmap.c                              |  88 +++++++++
 mm/ptshare.c                           | 263 +++++++++++++++++++++++++
 9 files changed, 476 insertions(+), 17 deletions(-)
 create mode 100644 mm/ptshare.c

-- 
2.37.2


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH RFC v2 1/4] mm/ptshare: Add vm flag for shared PTE
  2023-04-26 16:49 [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare) Khalid Aziz
@ 2023-04-26 16:49 ` Khalid Aziz
  2023-04-26 16:49 ` [PATCH RFC v2 2/4] mm/ptshare: Add flag MAP_SHARED_PT to mmap() Khalid Aziz
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Khalid Aziz @ 2023-04-26 16:49 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, mike.kravetz
  Cc: Khalid Aziz, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

Add a bit to vm_flags to indicate that a vma shares PTEs with
others. Add a function to determine if a vma shares PTEs by
checking this flag. This is to be used to find the shared page
table entries on page fault for vmas sharing PTEs.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/mm.h             | 8 ++++++++
 include/trace/events/mmflags.h | 3 ++-
 mm/internal.h                  | 5 +++++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f79667824eb..539becab551a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -326,11 +326,13 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -372,6 +374,12 @@ extern unsigned int kobjsize(const void *objp);
 # define VM_MTE_ALLOWED	VM_NONE
 #endif
 
+#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS
+#define VM_SHARED_PT	VM_HIGH_ARCH_5
+#else
+#define VM_SHARED_PT	VM_NONE
+#endif
+
 #ifndef VM_GROWSUP
 # define VM_GROWSUP	VM_NONE
 #endif
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 9db52bc4ce19..26f751e76c46 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -194,7 +194,8 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_MIXEDMAP,			"mixedmap"	},		\
 	{VM_HUGEPAGE,			"hugepage"	},		\
 	{VM_NOHUGEPAGE,			"nohugepage"	},		\
-	{VM_MERGEABLE,			"mergeable"	}		\
+	{VM_MERGEABLE,			"mergeable"	},		\
+	{VM_SHARED_PT,			"sharedpt"	}		\
 
 #define show_vma_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",				\
diff --git a/mm/internal.h b/mm/internal.h
index 7920a8b7982e..4d60d2d5fe19 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1042,4 +1042,9 @@ struct vma_prepare {
 	struct vm_area_struct *remove;
 	struct vm_area_struct *remove2;
 };
+
+static inline bool vma_is_shared(const struct vm_area_struct *vma)
+{
+	return vma->vm_flags & VM_SHARED_PT;
+}
 #endif	/* __MM_INTERNAL_H */
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC v2 2/4] mm/ptshare: Add flag MAP_SHARED_PT to mmap()
  2023-04-26 16:49 [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare) Khalid Aziz
  2023-04-26 16:49 ` [PATCH RFC v2 1/4] mm/ptshare: Add vm flag for shared PTE Khalid Aziz
@ 2023-04-26 16:49 ` Khalid Aziz
  2023-04-26 16:49 ` [PATCH RFC v2 3/4] mm/ptshare: Create new mm struct for page table sharing Khalid Aziz
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Khalid Aziz @ 2023-04-26 16:49 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, mike.kravetz
  Cc: Khalid Aziz, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

When a process mmaps a file with the MAP_SHARED flag, it is
possible that other processes mmaping the same file with the
MAP_SHARED flag and the same permissions could share the page table
entries as well, instead of creating duplicate entries. This patch
introduces a new mmap() flag, MAP_SHARED_PT, which a process can
use to hint that it can share page tables with other processes
using the same mapping.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
---
 include/uapi/asm-generic/mman-common.h |  1 +
 mm/mmap.c                              | 16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6ce1f1ceb432..4d23456b5915 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -29,6 +29,7 @@
 #define MAP_HUGETLB		0x040000	/* create a huge page mapping */
 #define MAP_SYNC		0x080000 /* perform synchronous page faults for the mapping */
 #define MAP_FIXED_NOREPLACE	0x100000	/* MAP_FIXED which doesn't unmap underlying mapping */
+#define MAP_SHARED_PT		0x200000	/* Shared page table mappings */
 
 #define MAP_UNINITIALIZED 0x4000000	/* For anonymous mmap, memory could be
 					 * uninitialized */
diff --git a/mm/mmap.c b/mm/mmap.c
index d5475fbf5729..8b46d465f8d4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1197,6 +1197,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	struct mm_struct *mm = current->mm;
 	vm_flags_t vm_flags;
 	int pkey = 0;
+	int ptshare = 0;
 
 	validate_mm(mm);
 	*populate = 0;
@@ -1234,6 +1235,21 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	if (mm->map_count > sysctl_max_map_count)
 		return -ENOMEM;
 
+	/*
+	 * If MAP_SHARED_PT is set, MAP_SHARED or MAP_SHARED_VALIDATE must
+	 * be set as well
+	 */
+	if (flags & MAP_SHARED_PT) {
+#if VM_SHARED_PT
+		if (flags & (MAP_SHARED | MAP_SHARED_VALIDATE))
+			ptshare = 1;
+		else
+			return -EINVAL;
+#else
+		return -EINVAL;
+#endif
+	}
+
 	/* Obtain the address to map to. we verify (or select) it and ensure
 	 * that it represents a valid section of the address space.
 	 */
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC v2 3/4] mm/ptshare: Create new mm struct for page table sharing
  2023-04-26 16:49 [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare) Khalid Aziz
  2023-04-26 16:49 ` [PATCH RFC v2 1/4] mm/ptshare: Add vm flag for shared PTE Khalid Aziz
  2023-04-26 16:49 ` [PATCH RFC v2 2/4] mm/ptshare: Add flag MAP_SHARED_PT to mmap() Khalid Aziz
@ 2023-04-26 16:49 ` Khalid Aziz
  2023-06-26  8:08   ` Karim Manaouil
  2023-04-26 16:49 ` [PATCH RFC v2 4/4] mm/ptshare: Add page fault handling for page table shared regions Khalid Aziz
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Khalid Aziz @ 2023-04-26 16:49 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, mike.kravetz
  Cc: Khalid Aziz, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

When a process passes the MAP_SHARED_PT flag to mmap(), create a
new mm struct to hold the shareable page table entries for the
newly mapped region.  This new mm is not associated with a task.
Its lifetime lasts until the last shared mapping is deleted.  This
patch also adds a new pointer "ptshare_data" to struct
address_space which points to the data structure that will contain
a pointer to this newly created mm along with properties of the
shared mapping. ptshare_data maintains a refcount for the shared
mapping so that it can be cleaned up upon the last unmap.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/fs.h |   2 +
 mm/Makefile        |   2 +-
 mm/internal.h      |  14 +++++
 mm/mmap.c          |  72 ++++++++++++++++++++++++++
 mm/ptshare.c       | 126 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 215 insertions(+), 1 deletion(-)
 create mode 100644 mm/ptshare.c

diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db..db8d3257c712 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -422,6 +422,7 @@ extern const struct address_space_operations empty_aops;
  * @private_lock: For use by the owner of the address_space.
  * @private_list: For use by the owner of the address_space.
  * @private_data: For use by the owner of the address_space.
+ * @ptshare_data: For shared page table use
  */
 struct address_space {
 	struct inode		*host;
@@ -443,6 +444,7 @@ struct address_space {
 	spinlock_t		private_lock;
 	struct list_head	private_list;
 	void			*private_data;
+	void			*ptshare_data;
 } __attribute__((aligned(sizeof(long)))) __randomize_layout;
 	/*
 	 * On most architectures that alignment is already the case; but
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..d9bb14fdf220 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -40,7 +40,7 @@ mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
 			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
 			   msync.o page_vma_mapped.o pagewalk.o \
-			   pgtable-generic.o rmap.o vmalloc.o
+			   pgtable-generic.o rmap.o vmalloc.o ptshare.o
 
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
diff --git a/mm/internal.h b/mm/internal.h
index 4d60d2d5fe19..3efb8738e26f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1047,4 +1047,18 @@ static inline bool vma_is_shared(const struct vm_area_struct *vma)
 {
 	return vma->vm_flags & VM_SHARED_PT;
 }
+
+/*
+ * mm/ptshare.c
+ */
+struct ptshare_data {
+	struct mm_struct *mm;
+	refcount_t refcnt;
+	unsigned long start;
+	unsigned long size;
+	unsigned long mode;
+};
+int ptshare_new_mm(struct file *file, struct vm_area_struct *vma);
+void ptshare_del_mm(struct vm_area_struct *vm);
+int ptshare_insert_vma(struct mm_struct *mm, struct vm_area_struct *vma);
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/mmap.c b/mm/mmap.c
index 8b46d465f8d4..c5e9b7f6de90 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1382,6 +1382,60 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
 		*populate = len;
+
+#if VM_SHARED_PT
+	/*
+	 * Check if this mapping is a candidate for page table sharing
+	 * at PMD level. It is if following conditions hold:
+	 *	- It is not anonymous mapping
+	 *	- It is not hugetlbfs mapping (for now)
+	 *	- flags contains MAP_SHARED or MAP_SHARED_VALIDATE and
+	 *	  MAP_SHARED_PT
+	 *	- Start address is aligned to PMD size
+	 *	- Mapping size is a multiple of PMD size
+	 */
+	if (ptshare && file && !is_file_hugepages(file)) {
+		struct vm_area_struct *vma;
+
+		vma = find_vma(mm, addr);
+		if (!((vma->vm_start | vma->vm_end) & (PMD_SIZE - 1))) {
+			struct ptshare_data *info = file->f_mapping->ptshare_data;
+			/*
+			 * If this mapping has not been set up for page table
+			 * sharing yet, do so by creating a new mm to hold the
+			 * shared page tables for this mapping
+			 */
+			if (info == NULL) {
+				int ret;
+
+				ret = ptshare_new_mm(file, vma);
+				if (ret < 0)
+					return ret;
+
+				info = file->f_mapping->ptshare_data;
+				ret = ptshare_insert_vma(info->mm, vma);
+				if (ret < 0)
+					addr = ret;
+				else
+					vm_flags_set(vma, VM_SHARED_PT);
+			} else {
+				/*
+				 * Page tables will be shared only if the
+				 * file is mapped in with the same permissions
+				 * across all mappers with same starting
+				 * address and size
+				 */
+				if (((prot & info->mode) == info->mode) &&
+					(addr == info->start) &&
+					(len == info->size)) {
+					vm_flags_set(vma, VM_SHARED_PT);
+					refcount_inc(&info->refcnt);
+				}
+			}
+		}
+	}
+#endif
+
 	return addr;
 }
 
@@ -2495,6 +2549,22 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
 	if (end == start)
 		return -EINVAL;
 
+	/*
+	 * Check if this vma uses shared page tables
+	 */
+	vma = find_vma_intersection(mm, start, end);
+	if (vma && unlikely(vma_is_shared(vma))) {
+		struct ptshare_data *info = NULL;
+
+		if (vma->vm_file && vma->vm_file->f_mapping)
+			info = vma->vm_file->f_mapping->ptshare_data;
+		/* Don't allow partial munmaps */
+		if (info && ((start != info->start) || (len != info->size)))
+			return -EINVAL;
+		ptshare_del_mm(vma);
+	}
+
+
 	 /* arch_unmap() might do unmaps itself.  */
 	arch_unmap(mm, start, end);
 
@@ -2664,6 +2734,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			}
 		}
 
+		if (vm_flags & VM_SHARED_PT)
+			vm_flags_set(vma, VM_SHARED_PT);
 		vm_flags = vma->vm_flags;
 	} else if (vm_flags & VM_SHARED) {
 		error = shmem_zero_setup(vma);
diff --git a/mm/ptshare.c b/mm/ptshare.c
new file mode 100644
index 000000000000..f6784268958c
--- /dev/null
+++ b/mm/ptshare.c
@@ -0,0 +1,126 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Share page table entries when possible to reduce the amount of extra
+ * memory consumed by page tables
+ *
+ * Copyright (C) 2022 Oracle Corp. All rights reserved.
+ * Authors:	Khalid Aziz <khalid.aziz@oracle.com>
+ *		Matthew Wilcox <willy@infradead.org>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <asm/pgalloc.h>
+#include "internal.h"
+
+/*
+ * Create a new mm struct that will hold the shared PTEs. Pointer to
+ * this new mm is stored in the data structure ptshare_data which also
+ * includes a refcount for any current references to PTEs in this new
+ * mm. This refcount is used to determine when the mm struct for shared
+ * PTEs can be deleted.
+ */
+int
+ptshare_new_mm(struct file *file, struct vm_area_struct *vma)
+{
+	struct mm_struct *new_mm;
+	struct ptshare_data *info = NULL;
+	int retval = 0;
+	unsigned long start = vma->vm_start;
+	unsigned long len = vma->vm_end - vma->vm_start;
+
+	new_mm = mm_alloc();
+	if (!new_mm) {
+		retval = -ENOMEM;
+		goto err_free;
+	}
+	new_mm->mmap_base = start;
+	new_mm->task_size = len;
+	if (!new_mm->task_size)
+		new_mm->task_size--;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info) {
+		retval = -ENOMEM;
+		goto err_free;
+	}
+	info->mm = new_mm;
+	info->start = start;
+	info->size = len;
+	refcount_set(&info->refcnt, 1);
+	file->f_mapping->ptshare_data = info;
+
+	return retval;
+
+err_free:
+	if (new_mm)
+		mmput(new_mm);
+	kfree(info);
+	return retval;
+}
+
+/*
+ * insert vma into mm holding shared page tables
+ */
+int
+ptshare_insert_vma(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+	struct vm_area_struct *new_vma;
+	int err = 0;
+
+	new_vma = vm_area_dup(vma);
+	if (!new_vma)
+		return -ENOMEM;
+
+	new_vma->vm_file = NULL;
+	/*
+	 * This new vma belongs to host mm, so clear the VM_SHARED_PT
+	 * flag on this so we know this is the host vma when we clean
+	 * up page tables. Do not use THP for page table shared regions
+	 */
+	vm_flags_clear(new_vma, (VM_SHARED | VM_SHARED_PT));
+	vm_flags_set(new_vma, VM_NOHUGEPAGE);
+	new_vma->vm_mm = mm;
+
+	err = insert_vm_struct(mm, new_vma);
+	if (err)
+		return -ENOMEM;
+
+	return err;
+}
+
+/*
+ * Free the mm struct created to hold shared PTEs and associated data
+ * structures
+ */
+static inline void
+free_ptshare_mm(struct ptshare_data *info)
+{
+	mmput(info->mm);
+	kfree(info);
+}
+
+/*
+ * This function is called when a reference to the shared PTEs in mm
+ * struct is dropped. It updates refcount and checks to see if last
+ * reference to the mm struct holding shared PTEs has been dropped. If
+ * so, it cleans up the mm struct and associated data structures
+ */
+void
+ptshare_del_mm(struct vm_area_struct *vma)
+{
+	struct ptshare_data *info;
+	struct file *file = vma->vm_file;
+
+	if (!file || (!file->f_mapping))
+		return;
+	info = file->f_mapping->ptshare_data;
+	WARN_ON(!info);
+	if (!info)
+		return;
+
+	if (refcount_dec_and_test(&info->refcnt)) {
+		free_ptshare_mm(info);
+		file->f_mapping->ptshare_data = NULL;
+	}
+}
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH RFC v2 4/4] mm/ptshare: Add page fault handling for page table shared regions
  2023-04-26 16:49 [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare) Khalid Aziz
                   ` (2 preceding siblings ...)
  2023-04-26 16:49 ` [PATCH RFC v2 3/4] mm/ptshare: Create new mm struct for page table sharing Khalid Aziz
@ 2023-04-26 16:49 ` Khalid Aziz
  2023-04-26 21:27 ` [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare) Mike Kravetz
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Khalid Aziz @ 2023-04-26 16:49 UTC (permalink / raw)
  To: akpm, willy, markhemm, viro, david, mike.kravetz
  Cc: Khalid Aziz, andreyknvl, dave.hansen, luto, brauner, arnd,
	ebiederm, catalin.marinas, linux-arch, linux-kernel, linux-mm,
	mhiramat, rostedt, vasily.averin, xhao, pcc, neilb, maz

Add support for creating a new set of shared page tables in a new
mm_struct upon mmap of a region that can potentially share page
tables. Add page fault handling for this now-shared region. Modify
the free_pgtables path to make sure page tables in the shared
regions are kept intact when a process using the page table region
exits and there are other mappers of the shared region. Clean up
the mm_struct holding the shared page tables when the last process
sharing the region exits.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/internal.h |   2 +
 mm/memory.c   | 105 ++++++++++++++++++++++++++++++------
 mm/ptshare.c  | 143 ++++++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 232 insertions(+), 18 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3efb8738e26f..924065f721fe 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1061,4 +1061,6 @@ struct ptshare_data {
 int ptshare_new_mm(struct file *file, struct vm_area_struct *vma);
 void ptshare_del_mm(struct vm_area_struct *vm);
 int ptshare_insert_vma(struct mm_struct *mm, struct vm_area_struct *vma);
+extern vm_fault_t find_shared_vma(struct vm_area_struct **vmap,
+				unsigned long *addrp, unsigned int flags);
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memory.c b/mm/memory.c
index 01a23ad48a04..c67318ffd001 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -172,17 +172,28 @@ void mm_trace_rss_stat(struct mm_struct *mm, int member)
  * has been handled earlier when unmapping all the memory regions.
  */
 static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
-			   unsigned long addr)
+			   unsigned long addr, bool shared_pte)
 {
 	pgtable_t token = pmd_pgtable(*pmd);
 	pmd_clear(pmd);
+	/*
+	 * if this address range shares page tables with other processes,
+	 * do not release pte pages. Those pages will be released when
+	 * host mm that hosts these pte pages is released
+	 */
+	if (shared_pte) {
+		tlb_flush_pmd_range(tlb, addr, PAGE_SIZE);
+		tlb->freed_tables = 1;
+		return;
+	}
 	pte_free_tlb(tlb, token, addr);
 	mm_dec_nr_ptes(tlb->mm);
 }
 
 static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 				unsigned long addr, unsigned long end,
-				unsigned long floor, unsigned long ceiling)
+				unsigned long floor, unsigned long ceiling,
+				bool shared_pte)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -194,7 +205,7 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		free_pte_range(tlb, pmd, addr);
+		free_pte_range(tlb, pmd, addr, shared_pte);
 	} while (pmd++, addr = next, addr != end);
 
 	start &= PUD_MASK;
@@ -210,13 +221,19 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 
 	pmd = pmd_offset(pud, start);
 	pud_clear(pud);
-	pmd_free_tlb(tlb, pmd, start);
-	mm_dec_nr_pmds(tlb->mm);
+	if (shared_pte) {
+		tlb_flush_pud_range(tlb, start, PAGE_SIZE);
+		tlb->freed_tables = 1;
+	} else {
+		pmd_free_tlb(tlb, pmd, start);
+		mm_dec_nr_pmds(tlb->mm);
+	}
 }
 
 static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 				unsigned long addr, unsigned long end,
-				unsigned long floor, unsigned long ceiling)
+				unsigned long floor, unsigned long ceiling,
+				bool shared_pte)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -228,7 +245,8 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 		next = pud_addr_end(addr, end);
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		free_pmd_range(tlb, pud, addr, next, floor, ceiling);
+		free_pmd_range(tlb, pud, addr, next, floor, ceiling,
+				shared_pte);
 	} while (pud++, addr = next, addr != end);
 
 	start &= P4D_MASK;
@@ -250,7 +268,8 @@ static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
 
 static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 				unsigned long addr, unsigned long end,
-				unsigned long floor, unsigned long ceiling)
+				unsigned long floor, unsigned long ceiling,
+				bool shared_pte)
 {
 	p4d_t *p4d;
 	unsigned long next;
@@ -262,7 +281,8 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 		next = p4d_addr_end(addr, end);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
-		free_pud_range(tlb, p4d, addr, next, floor, ceiling);
+		free_pud_range(tlb, p4d, addr, next, floor, ceiling,
+				shared_pte);
 	} while (p4d++, addr = next, addr != end);
 
 	start &= PGDIR_MASK;
@@ -284,9 +304,10 @@ static inline void free_p4d_range(struct mmu_gather *tlb, pgd_t *pgd,
 /*
  * This function frees user-level page tables of a process.
  */
-void free_pgd_range(struct mmu_gather *tlb,
+static void _free_pgd_range(struct mmu_gather *tlb,
 			unsigned long addr, unsigned long end,
-			unsigned long floor, unsigned long ceiling)
+			unsigned long floor, unsigned long ceiling,
+			bool shared_pte)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -342,10 +363,18 @@ void free_pgd_range(struct mmu_gather *tlb,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		free_p4d_range(tlb, pgd, addr, next, floor, ceiling);
+		free_p4d_range(tlb, pgd, addr, next, floor, ceiling,
+				shared_pte);
 	} while (pgd++, addr = next, addr != end);
 }
 
+void free_pgd_range(struct mmu_gather *tlb,
+			unsigned long addr, unsigned long end,
+			unsigned long floor, unsigned long ceiling)
+{
+	_free_pgd_range(tlb, addr, end, floor, ceiling, false);
+}
+
 void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
 		   struct vm_area_struct *vma, unsigned long floor,
 		   unsigned long ceiling)
@@ -375,16 +404,20 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
 		} else {
 			/*
 			 * Optimization: gather nearby vmas into one call down
+			 * but make sure vmas not sharing page tables do
+			 * not get combined with vmas sharing page tables
 			 */
 			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
-			       && !is_vm_hugetlb_page(next)) {
+			       && !is_vm_hugetlb_page(next)
+			       && (vma_is_shared(next) == vma_is_shared(vma))) {
 				vma = next;
 				next = mas_find(&mas, ceiling - 1);
 				unlink_anon_vmas(vma);
 				unlink_file_vma(vma);
 			}
-			free_pgd_range(tlb, addr, vma->vm_end,
-				floor, next ? next->vm_start : ceiling);
+			_free_pgd_range(tlb, addr, vma->vm_end,
+				floor, next ? next->vm_start : ceiling,
+				vma_is_shared(vma));
 		}
 		vma = next;
 	} while (vma);
@@ -5181,6 +5214,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			   unsigned int flags, struct pt_regs *regs)
 {
 	vm_fault_t ret;
+	bool shared = false;
+	struct mm_struct *orig_mm;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -5191,6 +5226,16 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	if (ret)
 		return ret;
 
+	orig_mm = vma->vm_mm;
+	if (unlikely(vma_is_shared(vma))) {
+		ret = find_shared_vma(&vma, &address, flags);
+		if (ret)
+			return ret;
+		if (!vma)
+			return VM_FAULT_SIGSEGV;
+		shared = true;
+	}
+
 	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
 					    flags & FAULT_FLAG_INSTRUCTION,
 					    flags & FAULT_FLAG_REMOTE))
@@ -5212,6 +5257,36 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 
 	lru_gen_exit_fault();
 
+	/*
+	 * Release the read lock on shared VMA's parent mm unless
+	 * __handle_mm_fault released the lock already.
+	 * __handle_mm_fault sets VM_FAULT_RETRY in return value if
+	 * it released mmap lock. If lock was released, that implies
+	 * the lock would have been released on task's original mm if
+	 * this were not a shared PTE vma. To keep lock state consistent,
+	 * make sure to release the lock on task's original mm
+	 */
+	if (shared) {
+		int release_mmlock = 1;
+
+		if (!(ret & VM_FAULT_RETRY)) {
+			mmap_read_unlock(vma->vm_mm);
+			release_mmlock = 0;
+		} else if ((flags & FAULT_FLAG_ALLOW_RETRY) &&
+			(flags & FAULT_FLAG_RETRY_NOWAIT)) {
+			mmap_read_unlock(vma->vm_mm);
+			release_mmlock = 0;
+		}
+
+		/*
+		 * Reset guest vma pointers that were set up in
+		 * find_shared_vma() to process this fault.
+		 */
+		vma->vm_mm = orig_mm;
+		if (release_mmlock)
+			mmap_read_unlock(orig_mm);
+	}
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
 		/*
diff --git a/mm/ptshare.c b/mm/ptshare.c
index f6784268958c..e0991a877355 100644
--- a/mm/ptshare.c
+++ b/mm/ptshare.c
@@ -13,6 +13,136 @@
 #include <asm/pgalloc.h>
 #include "internal.h"
 
+/*
+ */
+static pmd_t
+*get_pmd(struct mm_struct *mm, unsigned long addr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	pgd = pgd_offset(mm, addr);
+	if (pgd_none(*pgd))
+		return NULL;
+
+	p4d = p4d_offset(pgd, addr);
+	if (p4d_none(*p4d)) {
+		p4d = p4d_alloc(mm, pgd, addr);
+		if (!p4d)
+			return NULL;
+	}
+
+	pud = pud_offset(p4d, addr);
+	if (pud_none(*pud)) {
+		pud = pud_alloc(mm, p4d, addr);
+		if (!pud)
+			return NULL;
+	}
+
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd)) {
+		pmd = pmd_alloc(mm, pud, addr);
+		if (!pmd)
+			return NULL;
+	}
+
+	return pmd;
+}
+
+/*
+ * Find the shared pmd entries in host mm struct and install them into
+ * guest page tables.
+ */
+static int
+ptshare_copy_pmd(struct mm_struct *host_mm, struct mm_struct *guest_mm,
+			struct vm_area_struct *vma, unsigned long addr)
+{
+	pgd_t *guest_pgd;
+	p4d_t *guest_p4d;
+	pud_t *guest_pud;
+	pmd_t *host_pmd;
+	spinlock_t *host_ptl, *guest_ptl;
+
+	guest_pgd = pgd_offset(guest_mm, addr);
+	guest_p4d = p4d_offset(guest_pgd, addr);
+	if (p4d_none(*guest_p4d)) {
+		guest_p4d = p4d_alloc(guest_mm, guest_pgd, addr);
+		if (!guest_p4d)
+			return 1;
+	}
+
+	guest_pud = pud_offset(guest_p4d, addr);
+	if (pud_none(*guest_pud)) {
+		host_pmd = get_pmd(host_mm, addr);
+		if (!host_pmd)
+			return 1;
+
+		get_page(virt_to_page(host_pmd));
+		host_ptl = pmd_lockptr(host_mm, host_pmd);
+		guest_ptl = pud_lockptr(guest_mm, guest_pud);
+		spin_lock(host_ptl);
+		spin_lock(guest_ptl);
+		pud_populate(guest_mm, guest_pud,
+			(pmd_t *)((unsigned long)host_pmd & PAGE_MASK));
+		put_page(virt_to_page(host_pmd));
+		spin_unlock(guest_ptl);
+		spin_unlock(host_ptl);
+	}
+
+	return 0;
+}
+
+/*
+ * Find the shared page tables in hosting mm struct and install those in
+ * the guest mm struct
+ */
+vm_fault_t
+find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp,
+			unsigned int flags)
+{
+	struct ptshare_data *info;
+	struct mm_struct *host_mm;
+	struct vm_area_struct *host_vma, *guest_vma = *vmap;
+	unsigned long host_addr;
+	pmd_t *guest_pmd, *host_pmd;
+
+	if ((!guest_vma->vm_file) || (!guest_vma->vm_file->f_mapping))
+		return 0;
+	info = guest_vma->vm_file->f_mapping->ptshare_data;
+	if (!info) {
+		pr_warn("VM_SHARED_PT vma with NULL ptshare_data");
+		dump_stack_print_info(KERN_WARNING);
+		return 0;
+	}
+	host_mm = info->mm;
+
+	mmap_read_lock(host_mm);
+	host_addr = *addrp - guest_vma->vm_start + host_mm->mmap_base;
+	host_pmd = get_pmd(host_mm, host_addr);
+	guest_pmd = get_pmd(guest_vma->vm_mm, *addrp);
+	if (!pmd_same(*guest_pmd, *host_pmd)) {
+		set_pmd(guest_pmd, *host_pmd);
+		mmap_read_unlock(host_mm);
+		return VM_FAULT_NOPAGE;
+	}
+
+	*addrp = host_addr;
+	host_vma = find_vma(host_mm, host_addr);
+	if (!host_vma)
+		return VM_FAULT_SIGSEGV;
+
+	/*
+	 * Point vm_mm for the faulting vma to the mm struct holding shared
+	 * page tables so the fault handling will happen in the right
+	 * shared context
+	 */
+	guest_vma->vm_mm = host_mm;
+
+	return 0;
+}
+
 /*
  * Create a new mm struct that will hold the shared PTEs. Pointer to
  * this new mm is stored in the data structure ptshare_data which also
@@ -38,6 +168,7 @@ ptshare_new_mm(struct file *file, struct vm_area_struct *vma)
 	new_mm->task_size = len;
 	if (!new_mm->task_size)
 		new_mm->task_size--;
+	new_mm->owner = NULL;
 
 	info = kzalloc(sizeof(*info), GFP_KERNEL);
 	if (!info) {
@@ -63,7 +194,7 @@ ptshare_new_mm(struct file *file, struct vm_area_struct *vma)
  * insert vma into mm holding shared page tables
  */
 int
-ptshare_insert_vma(struct mm_struct *mm, struct vm_area_struct *vma)
+ptshare_insert_vma(struct mm_struct *host_mm, struct vm_area_struct *vma)
 {
 	struct vm_area_struct *new_vma;
 	int err = 0;
@@ -80,12 +211,18 @@ ptshare_insert_vma(struct mm_struct *mm, struct vm_area_struct *vma)
 	 */
 	vm_flags_clear(new_vma, (VM_SHARED | VM_SHARED_PT));
 	vm_flags_set(new_vma, VM_NOHUGEPAGE);
-	new_vma->vm_mm = mm;
+	new_vma->vm_mm = host_mm;
 
-	err = insert_vm_struct(mm, new_vma);
+	err = insert_vm_struct(host_mm, new_vma);
 	if (err)
 		return -ENOMEM;
 
+	/*
+	 * Copy the PMD entries from host mm to guest so they use the
+	 * same PTEs
+	 */
+	err = ptshare_copy_pmd(host_mm, vma->vm_mm, vma, vma->vm_start);
+
 	return err;
 }
 
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 0/4]  Add support for sharing page tables across processes (Previously mshare)
  2023-04-26 16:49 [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare) Khalid Aziz
                   ` (3 preceding siblings ...)
  2023-04-26 16:49 ` [PATCH RFC v2 4/4] mm/ptshare: Add page fault handling for page table shared regions Khalid Aziz
@ 2023-04-26 21:27 ` Mike Kravetz
  2023-04-27 16:40   ` Khalid Aziz
  2023-06-12 16:25 ` Peter Xu
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Mike Kravetz @ 2023-04-26 21:27 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, willy, markhemm, viro, david, andreyknvl, dave.hansen, luto,
	brauner, arnd, ebiederm, catalin.marinas, linux-arch,
	linux-kernel, linux-mm, mhiramat, rostedt, vasily.averin, xhao,
	pcc, neilb, maz

On 04/26/23 10:49, Khalid Aziz wrote:
> Memory pages shared between processes require a page table entry
> (PTE) for each process. Each of these PTE consumes some of the
> memory and as long as number of mappings being maintained is small
> enough, this space consumed by page tables is not objectionable.
> When very few memory pages are shared between processes, the number
> of page table entries (PTEs) to maintain is mostly constrained by
> the number of pages of memory on the system.  As the number of
> shared pages and the number of times pages are shared goes up,
> amount of memory consumed by page tables starts to become
> significant. This issue does not apply to threads. Any number of
> threads can share the same pages inside a process while sharing the
> same PTEs. Extending this same model to sharing pages across
> processes can eliminate this issue for sharing across processes as
> well.
> 
> Some of the field deployments commonly see memory pages shared
> across 1000s of processes. On x86_64, each page requires a PTE that
> is only 8 bytes long which is very small compared to the 4K page
> size. When 2000 processes map the same page in their address space,
> each one of them requires 8 bytes for its PTE and together that adds
> up to 8K of memory just to hold the PTEs for one 4K page. On a
> database server with 300GB SGA, a system crash was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, amount of memory saved is very significant.

When hugetlb pmd sharing was introduced the motivating factor was performance
as much as memory savings.  See commit,
39dde65c9940 [PATCH] shared page table for hugetlb page

I have not verified the claims in that commit message, but it does sound
reasonable.  My assumption is that the same would apply here.

> 
> This patch series adds a new flag to mmap() call - MAP_SHARED_PT.
> This flag can be specified along with MAP_SHARED by a process to
> hint to kernel that it wishes to share page table entries for this
> file mapping mmap region with other processes. Any other process
> that mmaps the same file with MAP_SHARED_PT flag can then share the
> same page table entries. Besides specifying MAP_SHARED_PT flag, the
> processes must map the files at a PMD aligned address with a size
> that is a multiple of PMD size and at the same virtual addresses.
> This last requirement of same virtual addresses can possibly be
> relaxed if that is the consensus.

I would think the same virtual address requirement is problematic in the
case of ASLR.  hugetlb pmd sharing does not have the 'same virtual
addresses' requirement.  My guess is that requiring the same virtual
address does not make code simpler.

> When mmap() is called with MAP_SHARED_PT flag, a new host mm struct
> is created to hold the shared page tables. Host mm struct is not
> attached to a process. Start and size of host mm are set to the
> start and size of the mmap region and a VMA covering this range is
> also added to host mm struct. Existing page table entries from the
> process that creates the mapping are copied over to the host mm
> struct.

It seems there is one host mm struct per shared mapping.  Is that
correct?  And, the host mm is the size of that single mapping?
Suppose the following:
- process A maps shared file offset 0 to 2xPMD_SIZE
- process B maps shared file offset 0 to 2xPMD_SIZE
It then makes sense A and B would use the same host mm.  Now,
- process C maps shared file offset 0 to 4xPMD_SIZE
Does C share anything with A and B?  Are there multiple host mm structs?

Just wondering at a high level how this would work without looking at
the code.

>         All processes mapping this shared region are considered
> guest processes. When a guest process mmap's the shared region, a vm
> flag VM_SHARED_PT is added to the VMAs in guest process. Upon a page
> fault, VMA is checked for the presence of VM_SHARED_PT flag. If the
> flag is found, its corresponding PMD is updated with the PMD from
> host mm struct so the PMD will point to the page tables in host mm
> struct. vm_mm pointer of the VMA is also updated to point to host mm
> struct for the duration of fault handling to ensure fault handling
> happens in the context of host mm struct. When a new PTE is
> created, it is created in the host mm struct page tables and the PMD
> in guest mm points to the same PTEs.
> 
> This is a basic working implementation. It will need to go through
> more testing and refinements. Some notes and questions:
> 
> - PMD size alignment and size requirement is currently hard coded
>   in. Is there a need or desire to make this more flexible and work
>   with other alignments/sizes? PMD size allows for adapting this
>   infrastructure to form the basis for hugetlbfs page table sharing
>   as well. More work will be needed to make that happen.
> 
> - Is there a reason to allow a userspace app to query this size and
>   alignment requirement for MAP_SHARED_PT in some way?
> 
> - Shared PTEs means mprotect() call made by one process affects all
>   processes sharing the same mapping and that behavior will need to
>   be documented clearly. Effect of mprotect call being different for
>   processes using shared page tables is the primary reason to
>   require an explicit opt-in from userspace processes to share page
>   tables. With a transparent sharing derived from MAP_SHARED alone,
>   changed effect of mprotect can break significant number of
>   userspace apps. One could work around that by unsharing whenever
>   mprotect changes modes on shared mapping but that introduces
>   complexity and the capability to execute a single mprotect to
>   change modes across 1000's of processes sharing a mapped database
>   is a feature explicitly asked for by database folks. This
>   capability has significant performance advantage when compared to
>   mechanism of sending messages to every process using shared
>   mapping to call mprotect and change modes in each process, or
>   using traps on permissions mismatch in each process.

I would guess this is more than just mprotect, and anything that impacts
page tables.  Correct?  For example MADV_DONTNEED, MADV_HUGEPAGE,
MADV_NOHUGEPAGE.

> - This implementation does not allow unmapping page table shared
>   mappings partially. Should that be supported in future?
> 
> Some concerns in this RFC:
> 
> - When page tables for a process are freed upon process exit,
>   pmd_free_tlb() gets called at one point to free all PMDs allocated
>   by the process. For a shared page table, shared PMDs can not be
>   released when a guest process exits. These shared PMDs are
>   released when host mm struct is released upon end of last
>   reference to page table shared region hosted by this mm. For now
>   to stop PMDs being released, this RFC introduces following change
>   in mm/memory.c which works but does not feel like the right
>   approach. Any suggestions for a better long term approach will be
>   very appreciated:
> 
> @@ -210,13 +221,19 @@ static inline void free_pmd_range(struct mmu_gather *tlb,
> pud_t *pud,
> 
>         pmd = pmd_offset(pud, start);
>         pud_clear(pud);
> -       pmd_free_tlb(tlb, pmd, start);
> -       mm_dec_nr_pmds(tlb->mm);
> +       if (shared_pte) {
> +               tlb_flush_pud_range(tlb, start, PAGE_SIZE);
> +               tlb->freed_tables = 1;
> +       } else {
> +               pmd_free_tlb(tlb, pmd, start);
> +               mm_dec_nr_pmds(tlb->mm);
> +       }
>  }
> 
>  static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
> 
> - This implementation requires an additional VM flag. Since all lower
>   32 bits are currently in use, the new VM flag must come from upper
>   32 bits which restricts this feature to 64-bit processors.
> 
> - This feature is implemented for file mappings only. Is there a
>   need to support it for anonymous memory as well?

I have not looked at the implementation, are 'file mappings' only
mappings using the page cache?  Or, do you just need a file descriptor?
Would a file descriptor created via memfd_create work for anonymous
memory?

-- 
Mike Kravetz

> 
> - Accounting for MAP_SHARED_PT mapped filepages in a process and
>   pagetable bytes is not quite accurate yet in this RFC and will be
>   fixed in the non-RFC version of patches.
> 
> I appreciate any feedback on these patches and ideas for
> improvements before moving these patches out of RFC stage.
> 
> 
> Changes from RFC v1:
> - Broken the patches up into smaller patches
> - Fixed a few bugs related to freeing PTEs and PMDs incorrectly
> - Cleaned up the code a bit
> 
> 
> Khalid Aziz (4):
>   mm/ptshare: Add vm flag for shared PTE
>   mm/ptshare: Add flag MAP_SHARED_PT to mmap()
>   mm/ptshare: Create new mm struct for page table sharing
>   mm/ptshare: Add page fault handling for page table shared regions
> 
>  include/linux/fs.h                     |   2 +
>  include/linux/mm.h                     |   8 +
>  include/trace/events/mmflags.h         |   3 +-
>  include/uapi/asm-generic/mman-common.h |   1 +
>  mm/Makefile                            |   2 +-
>  mm/internal.h                          |  21 ++
>  mm/memory.c                            | 105 ++++++++--
>  mm/mmap.c                              |  88 +++++++++
>  mm/ptshare.c                           | 263 +++++++++++++++++++++++++
>  9 files changed, 476 insertions(+), 17 deletions(-)
>  create mode 100644 mm/ptshare.c
> 
> -- 
> 2.37.2
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)
  2023-04-26 21:27 ` [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare) Mike Kravetz
@ 2023-04-27 16:40   ` Khalid Aziz
  0 siblings, 0 replies; 21+ messages in thread
From: Khalid Aziz @ 2023-04-27 16:40 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: akpm, willy, markhemm, viro, david, andreyknvl, dave.hansen, luto,
	brauner, arnd, ebiederm, catalin.marinas, linux-arch,
	linux-kernel, linux-mm, mhiramat, rostedt, vasily.averin, xhao,
	pcc, neilb, maz

On 4/26/23 15:27, Mike Kravetz wrote:
> On 04/26/23 10:49, Khalid Aziz wrote:
>> Memory pages shared between processes require a page table entry
>> (PTE) for each process. Each of these PTE consumes some of the
>> memory and as long as number of mappings being maintained is small
>> enough, this space consumed by page tables is not objectionable.
>> When very few memory pages are shared between processes, the number
>> of page table entries (PTEs) to maintain is mostly constrained by
>> the number of pages of memory on the system.  As the number of
>> shared pages and the number of times pages are shared goes up,
>> amount of memory consumed by page tables starts to become
>> significant. This issue does not apply to threads. Any number of
>> threads can share the same pages inside a process while sharing the
>> same PTEs. Extending this same model to sharing pages across
>> processes can eliminate this issue for sharing across processes as
>> well.
>>
>> Some of the field deployments commonly see memory pages shared
>> across 1000s of processes. On x86_64, each page requires a PTE that
>> is only 8 bytes long which is very small compared to the 4K page
>> size. When 2000 processes map the same page in their address space,
>> each one of them requires 8 bytes for its PTE and together that adds
>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>> database server with 300GB SGA, a system crash was seen with
>> out-of-memory condition when 1500+ clients tried to share this SGA
>> even though the system had 512GB of memory. On this server, in the
>> worst case scenario of all 1500 processes mapping every page from
>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>> could be shared, amount of memory saved is very significant.
> 
> When hugetlb pmd sharing was introduced the motivating factor was performance
> as much as memory savings.  See commit,
> 39dde65c9940 [PATCH] shared page table for hugetlb page
> 
> I have not verified the claims in that commit message, but it does sound
> reasonable.  My assumption is that the same would apply here.

Hi Mike,

Thanks for the feedback. I looked up the commit log for 39dde65c9940. You are right, the same does apply here.

> 
>>
>> This patch series adds a new flag to mmap() call - MAP_SHARED_PT.
>> This flag can be specified along with MAP_SHARED by a process to
>> hint to kernel that it wishes to share page table entries for this
>> file mapping mmap region with other processes. Any other process
>> that mmaps the same file with MAP_SHARED_PT flag can then share the
>> same page table entries. Besides specifying MAP_SHARED_PT flag, the
>> processes must map the files at a PMD aligned address with a size
>> that is a multiple of PMD size and at the same virtual addresses.
>> This last requirement of same virtual addresses can possibly be
>> relaxed if that is the consensus.
> 
> I would think the same virtual address requirement is problematic in the
> case of ASLR.  hugetlb pmd sharing does not have the 'same virtual
> addresses' requirement.  My guess is that requiring the same virtual
> address does not make code simpler.

The same virtual address requirement made the previous mshare implementation quite a bit simpler, but the reimplementation 
in this RFC does not rely on virtual addresses being the same as much. This does open up the possibility of removing this 
requirement. I just haven't looked through all the potential complications yet.

> 
>> When mmap() is called with MAP_SHARED_PT flag, a new host mm struct
>> is created to hold the shared page tables. Host mm struct is not
>> attached to a process. Start and size of host mm are set to the
>> start and size of the mmap region and a VMA covering this range is
>> also added to host mm struct. Existing page table entries from the
>> process that creates the mapping are copied over to the host mm
>> struct.
> 
> It seems there is one host mm struct per shared mapping.  Is that
> correct?  And, the host mm is the size of that single mapping?

There is one host mm per shared file. This was done to allow for multiple VMAs to exist in this mm for a file that can 
potentially represent multiple shared regions of the file. In the RFC, I am creating only one shared region per file for 
now.

> Suppose the following:
> - process A maps shared file offset 0 to 2xPMD_SIZE
> - process B maps shared file offset 0 to 2xPMD_SIZE
> It then makes sense A and B would use the same host mm.  Now,

Yes, that is correct.

> - process C maps shared file offset 0 to 4xPMD_SIZE
> Does C share anything with A and B?  Are there multiple host mm structs?

In the RFC implementation, C does not share anything with A and B. The code is written to allow for expansion to support 
this sharing as well as sharing multiple non-contiguous regions of the file. My goal is to debug and finalize the sharing 
infrastructure for the simple case and then start expanding it. In this specific case, the implementation would expand the 
existing shared VMAs to cover offset 0 to 4xPMD.

> 
> Just wondering at a high level how this would work without looking at
> the code.
> 
>>          All processes mapping this shared region are considered
>> guest processes. When a guest process mmap's the shared region, a vm
>> flag VM_SHARED_PT is added to the VMAs in guest process. Upon a page
>> fault, VMA is checked for the presence of VM_SHARED_PT flag. If the
>> flag is found, its corresponding PMD is updated with the PMD from
>> host mm struct so the PMD will point to the page tables in host mm
>> struct. vm_mm pointer of the VMA is also updated to point to host mm
>> struct for the duration of fault handling to ensure fault handling
>> happens in the context of host mm struct. When a new PTE is
>> created, it is created in the host mm struct page tables and the PMD
>> in guest mm points to the same PTEs.
>>
>> This is a basic working implementation. It will need to go through
>> more testing and refinements. Some notes and questions:
>>
>> - PMD size alignment and size requirement is currently hard coded
>>    in. Is there a need or desire to make this more flexible and work
>>    with other alignments/sizes? PMD size allows for adapting this
>>    infrastructure to form the basis for hugetlbfs page table sharing
>>    as well. More work will be needed to make that happen.
>>
>> - Is there a reason to allow a userspace app to query this size and
>>    alignment requirement for MAP_SHARED_PT in some way?
>>
>> - Shared PTEs means mprotect() call made by one process affects all
>>    processes sharing the same mapping and that behavior will need to
>>    be documented clearly. Effect of mprotect call being different for
>>    processes using shared page tables is the primary reason to
>>    require an explicit opt-in from userspace processes to share page
>>    tables. With a transparent sharing derived from MAP_SHARED alone,
>>    changed effect of mprotect can break significant number of
>>    userspace apps. One could work around that by unsharing whenever
>>    mprotect changes modes on shared mapping but that introduces
>>    complexity and the capability to execute a single mprotect to
>>    change modes across 1000's of processes sharing a mapped database
>>    is a feature explicitly asked for by database folks. This
>>    capability has significant performance advantage when compared to
>>    mechanism of sending messages to every process using shared
>>    mapping to call mprotect and change modes in each process, or
>>    using traps on permissions mismatch in each process.
> 
> I would guess this is more than just mprotect, and anything that impacts
> page tables.  Correct?  For example MADV_DONTNEED, MADV_HUGEPAGE,
> MADV_NOHUGEPAGE.

Yes. For instance, the RFC implementation allows one process to set MADV_DONTNEED and remove PTEs, which removes them for 
all other processes sharing the mapping. That behavior does not quite sound right. A more robust implementation would 
possibly remove PTEs only once all sharing processes have set MADV_DONTNEED. Other flags might need different treatment; 
for instance, MADV_WILLNEED should trigger read-ahead even if set by just one process.


> 
>> - This implementation does not allow unmapping page table shared
>>    mappings partially. Should that be supported in future?
>>
>> Some concerns in this RFC:
>>
>> - When page tables for a process are freed upon process exit,
>>    pmd_free_tlb() gets called at one point to free all PMDs allocated
>>    by the process. For a shared page table, shared PMDs can not be
>>    released when a guest process exits. These shared PMDs are
>>    released when host mm struct is released upon end of last
>>    reference to page table shared region hosted by this mm. For now
>>    to stop PMDs being released, this RFC introduces following change
>>    in mm/memory.c which works but does not feel like the right
>>    approach. Any suggestions for a better long term approach will be
>>    very appreciated:
>>
>> @@ -210,13 +221,19 @@ static inline void free_pmd_range(struct mmu_gather *tlb,
>> pud_t *pud,
>>
>>          pmd = pmd_offset(pud, start);
>>          pud_clear(pud);
>> -       pmd_free_tlb(tlb, pmd, start);
>> -       mm_dec_nr_pmds(tlb->mm);
>> +       if (shared_pte) {
>> +               tlb_flush_pud_range(tlb, start, PAGE_SIZE);
>> +               tlb->freed_tables = 1;
>> +       } else {
>> +               pmd_free_tlb(tlb, pmd, start);
>> +               mm_dec_nr_pmds(tlb->mm);
>> +       }
>>   }
>>
>>   static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
>>
>> - This implementation requires an additional VM flag. Since all lower
>>    32 bits are currently in use, the new VM flag must come from upper
>>    32 bits which restricts this feature to 64-bit processors.
>>
>> - This feature is implemented for file mappings only. Is there a
>>    need to support it for anonymous memory as well?
> 
> I have not looked at the implementation, are 'file mappings' only
> mappings using the page cache?  Or, do you just need a file descriptor?
> Would a file descriptor created via memfd_create work for anonymous
> memory?
> 

The code looks for just a file pointer, so an fd created by memfd_create() would work for anonymous memory.
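
For example, something along these lines should work for anonymous memory (again only a sketch; the MAP_SHARED_PT value 
is a placeholder, and len/addr must still meet the PMD alignment and same-address rules described above):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_PT
#define MAP_SHARED_PT	0x400000		/* placeholder value */
#endif

/* len must be a multiple of PMD_SIZE; addr must be PMD aligned and the
 * same in every process that wants to share the page tables */
static void *anon_shared_pt(void *addr, size_t len)
{
	int fd = memfd_create("ptshare-anon", 0);

	if (fd < 0 || ftruncate(fd, len) < 0)
		return MAP_FAILED;
	return mmap(addr, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_SHARED_PT | MAP_FIXED, fd, 0);
}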

Thanks,
Khalid


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 0/4]  Add support for sharing page tables across processes (Previously mshare)
  2023-04-26 16:49 [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare) Khalid Aziz
                   ` (4 preceding siblings ...)
  2023-04-26 21:27 ` [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare) Mike Kravetz
@ 2023-06-12 16:25 ` Peter Xu
  2023-06-30 11:29 ` Rongwei Wang
  2023-07-31  4:35 ` Rongwei Wang
  7 siblings, 0 replies; 21+ messages in thread
From: Peter Xu @ 2023-06-12 16:25 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, willy, markhemm, viro, david, mike.kravetz, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz

On Wed, Apr 26, 2023 at 10:49:47AM -0600, Khalid Aziz wrote:
> This patch series adds a new flag to mmap() call - MAP_SHARED_PT.

Since hugetlb has this, it'll be very helpful if you can provide a
comparison between this approach and hugetlb's - especially on the
differences - and the reasoning behind those.

Merging anything like this should definitely also pave the way for hugetlb's
future, so it even seems to be a "requirement" of such a patchset, even
though implicitly..  considering the "hotness" that hugetlb recently has
on the refactoring demand (if not a rewrite).

Some high level questions:

  - Why mmap() flag?

    For this one, I agree it should be opt-in - sharing pgtable definitely
    means sharing of a lot of privileges operating on current mm, so one
    should be aware and be prepared to be messed up.

    IIUC hugetlb doesn't do this but instead, when something "racy" happens,
    it just unshares by default.  To me opt-in makes more sense if starting
    from zero, because I don't think by default a process wants to
    leak its mm to others.

    I think you mentioned allowing the pgtable to be shared even for mprotect()
    from one MM so that all MMs can see it; but if so then DONTNEED should really
    do the same - when one MM DONTNEEDs it, it should go away for all.  It
    doesn't make a lot of sense to me to treat it differently with a
    DONTNEED refcount anywhere..

  - Can guest MM opt-out?

    Should we allow a guest MM to opt out when it wants?  It sounds like a
    good thing to have to me; especially for a file, that sounds as
    simple as zapping the pgtable.  But then the mmap flag will stop working
    iiuc, so it goes back to that question (e.g. what about madvise or prctl?).

  - Why mm_struct to represent shared pgtable?

    IIUC hugetlb used the pgtable page itself plus some refcounting (the
    refcounting is racy with gup-fast, as Jann used to point out, but
    that's another story..).  My question is, do you think that won't work?
    Are there reasons to explain why not?  Why is mm_struct the structure you
    chose for representing a shared pgtable?  Why per-file?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 3/4] mm/ptshare: Create new mm struct for page table sharing
  2023-04-26 16:49 ` [PATCH RFC v2 3/4] mm/ptshare: Create new mm struct for page table sharing Khalid Aziz
@ 2023-06-26  8:08   ` Karim Manaouil
  0 siblings, 0 replies; 21+ messages in thread
From: Karim Manaouil @ 2023-06-26  8:08 UTC (permalink / raw)
  To: Khalid Aziz
  Cc: akpm, willy, markhemm, viro, david, mike.kravetz, andreyknvl,
	dave.hansen, luto, brauner, arnd, ebiederm, catalin.marinas,
	linux-arch, linux-kernel, linux-mm, mhiramat, rostedt,
	vasily.averin, xhao, pcc, neilb, maz

On Wed, Apr 26, 2023 at 10:49:50AM -0600, Khalid Aziz wrote:
> When a process passes MAP_SHARED_PT flag to mmap(), create a new mm
> struct to hold the shareable page table entries for the newly mapped
> region.  This new mm is not associated with a task.  Its lifetime is
> until the last shared mapping is deleted.  This patch also adds a
> new pointer "ptshare_data" to struct address_space which points to
> the data structure that will contain pointer to this newly created
> mm along with properties of the shared mapping. ptshare_data
> maintains a refcount for the shared mapping so that it can be
> cleaned up upon last unmap.
>
> Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/fs.h |   2 +
>  mm/Makefile        |   2 +-
>  mm/internal.h      |  14 +++++
>  mm/mmap.c          |  72 ++++++++++++++++++++++++++
>  mm/ptshare.c       | 126 +++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 215 insertions(+), 1 deletion(-)
>  create mode 100644 mm/ptshare.c
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c85916e9f7db..db8d3257c712 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -422,6 +422,7 @@ extern const struct address_space_operations empty_aops;
>   * @private_lock: For use by the owner of the address_space.
>   * @private_list: For use by the owner of the address_space.
>   * @private_data: For use by the owner of the address_space.
> + * @ptshare_data: For shared page table use
>   */
>  struct address_space {
>       struct inode            *host;
> @@ -443,6 +444,7 @@ struct address_space {
>       spinlock_t              private_lock;
>       struct list_head        private_list;
>       void                    *private_data;
> +     void                    *ptshare_data;
>  } __attribute__((aligned(sizeof(long)))) __randomize_layout;
>       /*
>        * On most architectures that alignment is already the case; but
> diff --git a/mm/Makefile b/mm/Makefile
> index 8e105e5b3e29..d9bb14fdf220 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -40,7 +40,7 @@ mmu-y                       := nommu.o
>  mmu-$(CONFIG_MMU)    := highmem.o memory.o mincore.o \
>                          mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
>                          msync.o page_vma_mapped.o pagewalk.o \
> -                        pgtable-generic.o rmap.o vmalloc.o
> +                        pgtable-generic.o rmap.o vmalloc.o ptshare.o
>
>
>  ifdef CONFIG_CROSS_MEMORY_ATTACH
> diff --git a/mm/internal.h b/mm/internal.h
> index 4d60d2d5fe19..3efb8738e26f 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1047,4 +1047,18 @@ static inline bool vma_is_shared(const struct vm_area_struct *vma)
>  {
>       return vma->vm_flags & VM_SHARED_PT;
>  }
> +
> +/*
> + * mm/ptshare.c
> + */
> +struct ptshare_data {
> +     struct mm_struct *mm;
> +     refcount_t refcnt;
> +     unsigned long start;
> +     unsigned long size;
> +     unsigned long mode;
> +};

Why does ptshare_data contain the start address, size and mode of the
mapping? Does it mean ptshare_data can represent only a single mapping
of the file (the one that begins at ptshare_data->start)? What if we
want to share multiple different mappings of the same file (which may
or may not intersect)?

If we choose to use the VMAs in host_mm for that, will this possibly create
a lot of special-cased VMA handling?

> +int ptshare_new_mm(struct file *file, struct vm_area_struct *vma);
> +void ptshare_del_mm(struct vm_area_struct *vm);
> +int ptshare_insert_vma(struct mm_struct *mm, struct vm_area_struct *vma);
>  #endif       /* __MM_INTERNAL_H */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 8b46d465f8d4..c5e9b7f6de90 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1382,6 +1382,60 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>           ((vm_flags & VM_LOCKED) ||
>            (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
>               *populate = len;
> +
> +#if VM_SHARED_PT
> +     /*
> +      * Check if this mapping is a candidate for page table sharing
> +      * at PMD level. It is if following conditions hold:
> +      *      - It is not anonymous mapping
> +      *      - It is not hugetlbfs mapping (for now)
> +      *      - flags contains MAP_SHARED or MAP_SHARED_VALIDATE and
> +      *        MAP_SHARED_PT
> +      *      - Start address is aligned to PMD size
> +      *      - Mapping size is a multiple of PMD size
> +      */
> +     if (ptshare && file && !is_file_hugepages(file)) {
> +             struct vm_area_struct *vma;
> +
> +             vma = find_vma(mm, addr);
> +             if (!((vma->vm_start | vma->vm_end) & (PMD_SIZE - 1))) {
> +                     struct ptshare_data *info = file->f_mapping->ptshare_data;

This is racy with another process trying to share the same mapping of
the file. It's also racy with the removal (this process can get a
pointer to ptshare_data that's currently being freed).

> +                     /*
> +                      * If this mapping has not been set up for page table
> +                      * sharing yet, do so by creating a new mm to hold the
> +                      * shared page tables for this mapping
> +                      */
> +                     if (info == NULL) {
> +                             int ret;
> +
> +                             ret = ptshare_new_mm(file, vma);
> +                             if (ret < 0)
> +                                     return ret;
> +
> +                             info = file->f_mapping->ptshare_data;
> +                             ret = ptshare_insert_vma(info->mm, vma);
> +                             if (ret < 0)
> +                                     addr = ret;
> +                             else
> +                                     vm_flags_set(vma, VM_SHARED_PT);

Creation might race with another process.

> +                     } else {
> +                             /*
> +                              * Page tables will be shared only if the
> +                              * file is mapped in with the same permissions
> +                              * across all mappers with same starting
> +                              * address and size
> +                              */
> +                             if (((prot & info->mode) == info->mode) &&
> +                                     (addr == info->start) &&
> +                                     (len == info->size)) {
> +                                     vm_flags_set(vma, VM_SHARED_PT);
> +                                     refcount_inc(&info->refcnt);
> +                             }
> +                     }
> +             }
> +     }
> +#endif
> +
>       return addr;
>  }
>
> @@ -2495,6 +2549,22 @@ int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
>       if (end == start)
>               return -EINVAL;
>
> +     /*
> +      * Check if this vma uses shared page tables
> +      */
> +     vma = find_vma_intersection(mm, start, end);
> +     if (vma && unlikely(vma_is_shared(vma))) {
> +             struct ptshare_data *info = NULL;
> +
> +             if (vma->vm_file && vma->vm_file->f_mapping)
> +                     info = vma->vm_file->f_mapping->ptshare_data;
> +             /* Don't allow partial munmaps */
> +             if (info && ((start != info->start) || (len != info->size)))
> +                     return -EINVAL;
> +             ptshare_del_mm(vma);
> +     }
> +
> +
>        /* arch_unmap() might do unmaps itself.  */
>       arch_unmap(mm, start, end);
>
> @@ -2664,6 +2734,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>                       }
>               }
>
> +             if (vm_flags & VM_SHARED_PT)
> +                     vm_flags_set(vma, VM_SHARED_PT);
>               vm_flags = vma->vm_flags;
>       } else if (vm_flags & VM_SHARED) {
>               error = shmem_zero_setup(vma);
> diff --git a/mm/ptshare.c b/mm/ptshare.c
> new file mode 100644
> index 000000000000..f6784268958c
> --- /dev/null
> +++ b/mm/ptshare.c
> @@ -0,0 +1,126 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Share page table entries when possible to reduce the amount of extra
> + * memory consumed by page tables
> + *
> + * Copyright (C) 2022 Oracle Corp. All rights reserved.
> + * Authors:  Khalid Aziz <khalid.aziz@oracle.com>
> + *           Matthew Wilcox <willy@infradead.org>
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/fs.h>
> +#include <asm/pgalloc.h>
> +#include "internal.h"
> +
> +/*
> + * Create a new mm struct that will hold the shared PTEs. Pointer to
> + * this new mm is stored in the data structure ptshare_data which also
> + * includes a refcount for any current references to PTEs in this new
> + * mm. This refcount is used to determine when the mm struct for shared
> + * PTEs can be deleted.
> + */
> +int
> +ptshare_new_mm(struct file *file, struct vm_area_struct *vma)
> +{
> +     struct mm_struct *new_mm;
> +     struct ptshare_data *info = NULL;
> +     int retval = 0;
> +     unsigned long start = vma->vm_start;
> +     unsigned long len = vma->vm_end - vma->vm_start;
> +
> +     new_mm = mm_alloc();
> +     if (!new_mm) {
> +             retval = -ENOMEM;
> +             goto err_free;
> +     }
> +     new_mm->mmap_base = start;
> +     new_mm->task_size = len;
> +     if (!new_mm->task_size)
> +             new_mm->task_size--;
> +
> +     info = kzalloc(sizeof(*info), GFP_KERNEL);
> +     if (!info) {
> +             retval = -ENOMEM;
> +             goto err_free;
> +     }
> +     info->mm = new_mm;
> +     info->start = start;
> +     info->size = len;
> +     refcount_set(&info->refcnt, 1);
> +     file->f_mapping->ptshare_data = info;

Racy assignment. It can lead to a memory leak if another process does
the same concurrently and assigns before or after this one; the new_mm
and ptshare_data of one of them will be lost.

I think this whole process needs to be protected with i_mmap lock.
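
Something along these lines is what I have in mind -- a sketch only, meant to live in mm/ptshare.c next to the existing
helpers so the kernel headers come from there; ptshare_get_or_create() is a made-up wrapper name, and the prot/addr/size
checks from this patch are omitted for brevity:

struct ptshare_data *ptshare_get_or_create(struct file *file,
					   struct vm_area_struct *vma)
{
	struct address_space *mapping = file->f_mapping;
	struct ptshare_data *info;

	i_mmap_lock_write(mapping);
	info = mapping->ptshare_data;
	if (!info) {
		/* first mapper: create the host mm while holding the lock */
		if (ptshare_new_mm(file, vma) == 0)
			info = mapping->ptshare_data;
	} else {
		/* later mappers: take a reference under the same lock */
		refcount_inc(&info->refcnt);
	}
	i_mmap_unlock_write(mapping);

	return info;
}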

> +
> +     return retval;
> +
> +err_free:
> +     if (new_mm)
> +             mmput(new_mm);
> +     kfree(info);
> +     return retval;
> +}
> +
> +/*
> + * insert vma into mm holding shared page tables
> + */
> +int
> +ptshare_insert_vma(struct mm_struct *mm, struct vm_area_struct *vma)
> +{
> +     struct vm_area_struct *new_vma;
> +     int err = 0;
> +
> +     new_vma = vm_area_dup(vma);
> +     if (!new_vma)
> +             return -ENOMEM;
> +
> +     new_vma->vm_file = NULL;
> +     /*
> +      * This new vma belongs to host mm, so clear the VM_SHARED_PT
> +      * flag on this so we know this is the host vma when we clean
> +      * up page tables. Do not use THP for page table shared regions
> +      */
> +     vm_flags_clear(new_vma, (VM_SHARED | VM_SHARED_PT));
> +     vm_flags_set(new_vma, VM_NOHUGEPAGE);
> +     new_vma->vm_mm = mm;
> +
> +     err = insert_vm_struct(mm, new_vma);
> +     if (err)
> +             return -ENOMEM;
> +
> +     return err;
> +}
> +
> +/*
> + * Free the mm struct created to hold shared PTEs and associated data
> + * structures
> + */
> +static inline void
> +free_ptshare_mm(struct ptshare_data *info)
> +{
> +     mmput(info->mm);
> +     kfree(info);
> +}
> +
> +/*
> + * This function is called when a reference to the shared PTEs in mm
> + * struct is dropped. It updates refcount and checks to see if last
> + * reference to the mm struct holding shared PTEs has been dropped. If
> + * so, it cleans up the mm struct and associated data structures
> + */
> +void
> +ptshare_del_mm(struct vm_area_struct *vma)
> +{
> +     struct ptshare_data *info;
> +     struct file *file = vma->vm_file;
> +
> +     if (!file || (!file->f_mapping))
> +             return;
> +     info = file->f_mapping->ptshare_data;
> +     WARN_ON(!info);
> +     if (!info)
> +             return;
> +
> +     if (refcount_dec_and_test(&info->refcnt)) {
> +             free_ptshare_mm(info);
> +             file->f_mapping->ptshare_data = NULL;

Maybe those two should be reordered (after keeping a pointer to
ptshare_data). Then setting f_mapping->ptshare_data to NULL can
be performed under a lock, and freeing ptshare_data and the host_mm can be
done without the lock.
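
Roughly like this (sketch only; the i_mmap lock is my suggestion from above and is not something the posted code takes):

void ptshare_del_mm(struct vm_area_struct *vma)
{
	struct file *file = vma->vm_file;
	struct address_space *mapping;
	struct ptshare_data *info;

	if (!file || !file->f_mapping)
		return;
	mapping = file->f_mapping;

	i_mmap_lock_write(mapping);
	info = mapping->ptshare_data;
	if (info && refcount_dec_and_test(&info->refcnt))
		mapping->ptshare_data = NULL;	/* detach while locked */
	else
		info = NULL;			/* still referenced, nothing to free */
	i_mmap_unlock_write(mapping);

	if (info) {				/* free without holding the lock */
		mmput(info->mm);
		kfree(info);
	}
}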

Cheers
Karim

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)
  2023-04-26 16:49 [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare) Khalid Aziz
                   ` (5 preceding siblings ...)
  2023-06-12 16:25 ` Peter Xu
@ 2023-06-30 11:29 ` Rongwei Wang
  2023-07-31  4:35 ` Rongwei Wang
  7 siblings, 0 replies; 21+ messages in thread
From: Rongwei Wang @ 2023-06-30 11:29 UTC (permalink / raw)
  To: Khalid Aziz; +Cc: linux-arch, linux-kernel, linux-mm

Hi Khalid

I see this patch series was sent out in April, and I want to ask about the 
status of this RFC now (IMHO, it seems that the code still has some places to 
fix or finish). This feature is useful for saving a lot of memory on pgtables, so we 
also want to use this optimization in our databases if upstream accepts it.

BTW, in the past few weeks, I made some adjustments on top of your code to 
simplify it and meet our databases' needs, e.g. multiple VMAs sharing the same 
shadow mm, madvise, and memory compaction. If you are interested, I can provide 
the detailed code.


Thanks,

-wrw

On 2023/4/27 00:49, Khalid Aziz wrote:
> [full cover letter snipped]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)
  2023-04-26 16:49 [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare) Khalid Aziz
                   ` (6 preceding siblings ...)
  2023-06-30 11:29 ` Rongwei Wang
@ 2023-07-31  4:35 ` Rongwei Wang
  2023-07-31 12:25   ` Matthew Wilcox
  7 siblings, 1 reply; 21+ messages in thread
From: Rongwei Wang @ 2023-07-31  4:35 UTC (permalink / raw)
  To: willy; +Cc: linux-arch, linux-kernel, linux-mm, xuyu@linux.alibaba.com

Hi Matthew

May I ask you another question about mshare under this RFC? I remember 
you said you would redesign mshare to be per-VMA rather than per-mapping 
(apologies if I remember wrongly) in the last MM alignment session, and I 
also referred to that when re-coding this part in our internal version 
(based on this RFC). It seems that per-VMA can simplify the structure of 
pgtable sharing and does not need to care about the different permissions 
of a file mapping; these are the advantages (maybe) that I can imagine. 
But IMHO, that does not seem like a strong enough reason to switch from 
per-mapping to per-VMA.

And I can't imagine the other considerations upstream may have. Can you 
share the reason for redesigning this in a per-VMA way? Is it due to 
integration with hugetlbfs pgtable sharing or anonymous page sharing?

Thanks for your time.

On 2023/4/27 00:49, Khalid Aziz wrote:
> [full cover letter snipped]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)
  2023-07-31  4:35 ` Rongwei Wang
@ 2023-07-31 12:25   ` Matthew Wilcox
  2023-07-31 12:50     ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2023-07-31 12:25 UTC (permalink / raw)
  To: Rongwei Wang
  Cc: linux-arch, linux-kernel, linux-mm, xuyu@linux.alibaba.com,
	David Hildenbrand

On Mon, Jul 31, 2023 at 12:35:00PM +0800, Rongwei Wang wrote:
> Hi Matthew
> 
> May I ask you another question about mshare under this RFC? I remember you
> said you would redesign mshare to be per-VMA rather than per-mapping
> (apologies if I remember wrongly) in the last MM alignment session, and I also
> referred to that when re-coding this part in our internal version (based on
> this RFC). It seems that per-VMA can simplify the structure of pgtable
> sharing and does not need to care about the different permissions of a file
> mapping; these are the advantages (maybe) that I can imagine. But IMHO, that
> does not seem like a strong enough reason to switch from per-mapping to
> per-VMA.
> 
> And I can't imagine the other considerations upstream may have. Can you
> share the reason for redesigning this in a per-VMA way? Is it due to
> integration with hugetlbfs pgtable sharing or anonymous page sharing?

It was David who wants to make page table sharing be per-VMA.  I think
he is advocating for the wrong approach.  In any case, I don't have time
to work on mshare and Khalid is on leave until September, so I don't
think anybody is actively working on mshare.

> Thanks for your time.
> 
> On 2023/4/27 00:49, Khalid Aziz wrote:
> > [full cover letter snipped]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)
  2023-07-31 12:25   ` Matthew Wilcox
@ 2023-07-31 12:50     ` David Hildenbrand
  2023-07-31 16:19       ` Rongwei Wang
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2023-07-31 12:50 UTC (permalink / raw)
  To: Matthew Wilcox, Rongwei Wang
  Cc: linux-arch, linux-kernel, linux-mm, xuyu@linux.alibaba.com

On 31.07.23 14:25, Matthew Wilcox wrote:
> On Mon, Jul 31, 2023 at 12:35:00PM +0800, Rongwei Wang wrote:
>> Hi Matthew
>>
>> May I ask you another question about mshare under this RFC? I remember you
>> said you will redesign the mshare to per-vma not per-mapping (apologize if
>> remember wrongly) in last time MM alignment session. And I also refer to you
>> to re-code this part in our internal version (based on this RFC). It seems
>> that per VMA will can simplify the structure of pgtable sharing, even
>> doesn't care the different permission of file mapping. these are advantages
>> (maybe) that I can imagine. But IMHO, It seems not a strongly reason to
>> switch per-mapping to per-vma.
>>
>> And I can't imagine other considerations of upstream. Can you share the
>> reason why redesigning in a per-vma way, due to integation with hugetlbfs
>> pgtable sharing or anonymous page sharing?
> 
> It was David who wants to make page table sharing be per-VMA.  I think
> he is advocating for the wrong approach.  In any case, I don't have time
> to work on mshare and Khalid is on leave until September, so I don't
> think anybody is actively working on mshare.

Not that I also don't have any time to look into this, but my comment 
essentially was that we should try decoupling page table sharing (reduce 
memory consumption, shorter rmap walk) from the mprotect(PROT_READ) use 
case.

For page table sharing I was wondering whether there could be ways to 
just have that done semi-automatically. Similar to how it's done for 
hugetlb. There are some clear limitations: mappings < PMD_SIZE won't be 
able to benefit.

It's still unclear whether that is a real limitation. Some use cases 
were raised (put all user space library mappings into a shared area), 
but I realized that these conflict with MAP_PRIVATE requirements of such 
areas. Maybe I'm wrong and this is easily resolved.

At least it's not the primary use case that was raised. For the primary 
use cases (VMs, databases) that map huge areas, it might not be a 
limitation.


Regarding mprotect(PROT_READ), my point was that mprotect() is most 
probably the wrong tool to use (especially, due to signal handling). 
Instead, I was suggesting having a way to essentially protect pages in a 
shmem file -- and get notified whenever someone wants to write to such a page 
either via the page tables or via write() and friends. We do have the 
write-notify infrastructure for filesystems in place that we might 
extend/reuse. That mechanism could benefit from shared page tables by 
having to do less rmap walks.
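
For reference, the hook I mean is the existing ->page_mkwrite() callback: the fault path calls it before a read-only
shared file page becomes writable. A minimal, purely illustrative handler (not something this series adds) looks roughly
like this; a memfd-level protect could refuse or record the write at this point:

#include <linux/mm.h>
#include <linux/pagemap.h>

static vm_fault_t example_page_mkwrite(struct vm_fault *vmf)
{
	struct folio *folio = page_folio(vmf->page);

	folio_lock(folio);
	/* a memfd-level protect could notify or deny the write here */
	folio_mark_dirty(folio);
	folio_unlock(folio);
	return VM_FAULT_LOCKED;		/* page stays locked on return */
}

static const struct vm_operations_struct example_vm_ops = {
	.fault		= filemap_fault,
	.map_pages	= filemap_map_pages,
	.page_mkwrite	= example_page_mkwrite,
};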

Again, I don't have time to look into that (just like everybody else as 
it appears) and might miss something important. Just sharing my thoughts 
that I raised in the call.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)
  2023-07-31 12:50     ` David Hildenbrand
@ 2023-07-31 16:19       ` Rongwei Wang
  2023-07-31 16:30         ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Rongwei Wang @ 2023-07-31 16:19 UTC (permalink / raw)
  To: David Hildenbrand, Matthew Wilcox
  Cc: linux-arch, linux-kernel, linux-mm, xuyu@linux.alibaba.com


On 2023/7/31 20:50, David Hildenbrand wrote:
> On 31.07.23 14:25, Matthew Wilcox wrote:
>> On Mon, Jul 31, 2023 at 12:35:00PM +0800, Rongwei Wang wrote:
>>> Hi Matthew
>>>
>>> May I ask you another question about mshare under this RFC? I 
>>> remember you
>>> said you will redesign the mshare to per-vma not per-mapping 
>>> (apologize if
>>> remember wrongly) in last time MM alignment session. And I also 
>>> refer to you
>>> to re-code this part in our internal version (based on this RFC). It 
>>> seems
>>> that per VMA will can simplify the structure of pgtable sharing, even
>>> doesn't care the different permission of file mapping. these are 
>>> advantages
>>> (maybe) that I can imagine. But IMHO, It seems not a strongly reason to
>>> switch per-mapping to per-vma.
>>>
>>> And I can't imagine other considerations of upstream. Can you share the
>>> reason why redesigning in a per-vma way, due to integation with 
>>> hugetlbfs
>>> pgtable sharing or anonymous page sharing?
>>
>> It was David who wants to make page table sharing be per-VMA.  I think
>> he is advocating for the wrong approach.  In any case, I don't have time
>> to work on mshare and Khalid is on leave until September, so I don't
>> think anybody is actively working on mshare.
>
> Not that I also don't have any time to look into this, but my comment 
> essentially was that we should try decoupling page table sharing 
> (reduce memory consumption, shorter rmap walk) from the 
> mprotect(PROT_READ) use case.

Hi David, Matthew

Thanks for your reply.

Uh, sorry, I can't see the relation between decoupling page table 
sharing and the per-VMA design. And I think mprotect(PROT_READ) has to 
modify all the shared page tables of the related tasks. It seems that I am 
missing something about per-VMA in your words.

BTW, I can imagine a corner case that shows a (possible) defect of 
per-mapping. Suppose we create a range of page table sharing via 
memfd_create(), and a child process also owns this range of page table 
sharing. But that child process cannot create page table sharing based on 
the same fd after it munmap()s this range (same mapping but a different 
VMA area). Of course, per-VMA is the better choice there because it can 
continue to create page table sharing based on the original fd, since a 
new mm struct is created that way. I guess that is a type of decoupling 
you mean?

It's just a corner case. I am not sure how important it is.
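
For clarity, the sequence I have in mind is something like the sketch below (the MAP_SHARED_PT value is only a
placeholder; whether the child's second mmap() can rejoin the shared page tables under a per-mapping design is exactly
the open question here):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MAP_SHARED_PT
#define MAP_SHARED_PT	0x400000		/* placeholder value */
#endif

#define LEN	(4UL << 20)			/* multiple of PMD_SIZE */
#define ADDR	((void *)0x7f4000000000UL)	/* PMD aligned, same in both */

int main(void)
{
	int fd = memfd_create("ptshare", 0);

	ftruncate(fd, LEN);
	/* parent creates the shared-page-table mapping */
	mmap(ADDR, LEN, PROT_READ | PROT_WRITE,
	     MAP_SHARED | MAP_SHARED_PT | MAP_FIXED, fd, 0);

	if (fork() == 0) {
		munmap(ADDR, LEN);		/* child drops the shared range */
		/* ... then maps the same fd again */
		mmap(ADDR, LEN, PROT_READ | PROT_WRITE,
		     MAP_SHARED | MAP_SHARED_PT | MAP_FIXED, fd, 0);
		_exit(0);
	}
	wait(NULL);
	return 0;
}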

>
> For page table sharing I was wondering whether there could be ways to 
> just have that done semi-automatically. Similar to how it's done for 
> hugetlb. There are some clear limitations: mappings < PMD_SIZE won't 
> be able to benefit.
>
> It's still unclear whether that is a real limitation. Some use cases 
> were raised (put all user space library mappings into a shared area), 
> but I realized that these conflict with MAP_PRIVATE requirements of 
> such areas. Maybe I'm wrong and this is easily resolved.
>
> At least it's not the primary use case that was raised. For the 
> primary use cases (VMs, databases) that map huge areas, it might not 
> be a limitation.
>
>
> Regarding mprotect(PROT_READ), my point was that mprotect() is most 
> probably the wrong tool to use (especially, due to signal handling). 
> Instead, I was suggesting having a way to essentially protect pages in 
> a shmem file -- and get notified whenever someone wants to write to such a 
> page either via the page tables or via write() and friends. We do have 
> the write-notify infrastructure for filesystems in place that we might 
> extend/reuse.
I am not that familiar with filesystems. Write-notify sounds like a good 
idea; maybe I need some time to digest this.
> That mechanism could benefit from shared page tables by having to do 
> less rmap walks.
>
> Again, I don't have time to look into that (just like everybody else 
> as it appears) and might miss something important. Just sharing my 
> thoughts that I raised in the call.

Your words are very helpful to me. I will try to design our internal version 
of this feature in the right way.

Thanks again.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)
  2023-07-31 16:19       ` Rongwei Wang
@ 2023-07-31 16:30         ` David Hildenbrand
  2023-07-31 16:38           ` Matthew Wilcox
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2023-07-31 16:30 UTC (permalink / raw)
  To: Rongwei Wang, Matthew Wilcox
  Cc: linux-arch, linux-kernel, linux-mm, xuyu@linux.alibaba.com

On 31.07.23 18:19, Rongwei Wang wrote:
> 
> On 2023/7/31 20:50, David Hildenbrand wrote:
>> On 31.07.23 14:25, Matthew Wilcox wrote:
>>> On Mon, Jul 31, 2023 at 12:35:00PM +0800, Rongwei Wang wrote:
>>>> Hi Matthew
>>>>
>>>> May I ask you another question about mshare under this RFC? I
>>>> remember you
>>>> said you will redesign the mshare to per-vma not per-mapping
>>>> (apologize if
>>>> remember wrongly) in last time MM alignment session. And I also
>>>> refer to you
>>>> to re-code this part in our internal version (based on this RFC). It
>>>> seems
>>>> that per VMA will can simplify the structure of pgtable sharing, even
>>>> doesn't care the different permission of file mapping. these are
>>>> advantages
>>>> (maybe) that I can imagine. But IMHO, It seems not a strongly reason to
>>>> switch per-mapping to per-vma.
>>>>
>>>> And I can't imagine other considerations of upstream. Can you share the
>>>> reason why redesigning in a per-vma way, due to integation with
>>>> hugetlbfs
>>>> pgtable sharing or anonymous page sharing?
>>>
>>> It was David who wants to make page table sharing be per-VMA.  I think
>>> he is advocating for the wrong approach.  In any case, I don't have time
>>> to work on mshare and Khalid is on leave until September, so I don't
>>> think anybody is actively working on mshare.
>>
>> Not that I also don't have any time to look into this, but my comment
>> essentially was that we should try decoupling page table sharing
>> (reduce memory consumption, shorter rmap walk) from the
>> mprotect(PROT_READ) use case.
> 
> Hi David, Matthew
> 
> Thanks for your reply.
> 
> Uh, sorry, I can't see the relation between decoupling page table
> sharing and the per-VMA design. And I think mprotect(PROT_READ) has to
> modify all shared page tables of the related tasks. It seems that I am
> missing something about per-VMA in your words.

Assume we do do the page table sharing at mmap time, if the flags are 
right. Let's focus on the most common:

mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)

And doing the same in each and every process.


With the original design, doing an mprotect(PROT_READ) in each and
every process is just absolutely inefficient as a way to protect a memfd page.

For that case, my thought was that you actually want to write-protect
the pages at the memfd level.

So instead of doing mprotect(PROT_READ) in 999 processes, or doing
mprotect(PROT_READ) on mshare(), you have a memfd feature to protect
pages from any write access -- not using virtual addresses but using an
offset into the memfd.


Assume such a (badly imagined) memfd_protect(PROT_READ) would make sure
that:
(1) Any page table mappings of the page are write-protected,
(2) Any write access through the page table mappings triggers write-notify, and
(3) Any other access -- e.g., write() -- similarly informs the memfd.

Without page table sharing, (1) would have to walk all mappings via the 
rmap. With page table sharing, it would only have to walk one page table.

But the features would be two separate things.

What memfd would do with that write notification (inject a signal, 
something like uffd) would be a different story.
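
Purely to make the idea concrete -- memfd_protect() does not exist,
its name and semantics are invented here for illustration only:

    int fd = memfd_create("buf", 0);
    ftruncate(fd, len);

    /* hypothetical: write-protect [off, off + len) at the memfd level */
    memfd_protect(fd, off, len, PROT_READ);

    /*
     * From here on, a write through any page table mapping of that
     * range would trigger write-notify, and write() and friends would
     * be caught as well -- independent of which process mapped the
     * pages or at which virtual address.
     */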


Again, just an idea and maybe complete garbage.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)
  2023-07-31 16:30         ` David Hildenbrand
@ 2023-07-31 16:38           ` Matthew Wilcox
  2023-07-31 16:48             ` David Hildenbrand
  2023-08-01  6:53             ` Rongwei Wang
  0 siblings, 2 replies; 21+ messages in thread
From: Matthew Wilcox @ 2023-07-31 16:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Rongwei Wang, linux-arch, linux-kernel, linux-mm,
	xuyu@linux.alibaba.com

On Mon, Jul 31, 2023 at 06:30:22PM +0200, David Hildenbrand wrote:
> Assume we do do the page table sharing at mmap time, if the flags are right.
> Let's focus on the most common:
> 
> mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)
> 
> And doing the same in each and every process.

That may be the most common in your usage, but for a database, you're
looking at two usage scenarios.  Postgres calls mmap() on the database
file itself so that all processes share the kernel page cache.
Some Commercial Databases call mmap() on a hugetlbfs file so that all
processes share the same userspace buffer cache.  Other Commercial
Databases call shmget() / shmat() with SHM_HUGETLB for the exact
same reason.
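
For reference, those mappings look roughly like this (paths, key and
sizes are placeholders; the hugetlb sizes must be multiples of the
huge page size):

    /* Postgres-style: mmap() the database file itself */
    int dbfd = open("/path/to/datafile", O_RDWR);
    void *db = mmap(NULL, dbsize, PROT_READ | PROT_WRITE,
                    MAP_SHARED, dbfd, 0);

    /* hugetlbfs-backed userspace buffer cache */
    int hfd = open("/dev/hugepages/bufcache", O_CREAT | O_RDWR, 0600);
    void *buf = mmap(NULL, bufsize, PROT_READ | PROT_WRITE,
                     MAP_SHARED, hfd, 0);

    /* the SysV shared memory equivalent */
    int id = shmget(key, bufsize, IPC_CREAT | SHM_HUGETLB | 0600);
    void *shm = shmat(id, NULL, 0);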

This is why I proposed mshare().  Anyone can use it for anything.
We have such a diverse set of users who want to do stuff with shared
page tables that we should not be tying it to memfd or any other
filesystem.  Not to mention that it's more flexible; you can map
individual 4kB files into it and still get page table sharing.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)
  2023-07-31 16:38           ` Matthew Wilcox
@ 2023-07-31 16:48             ` David Hildenbrand
  2023-07-31 16:54               ` Matthew Wilcox
  2023-08-01  6:53             ` Rongwei Wang
  1 sibling, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2023-07-31 16:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Rongwei Wang, linux-arch, linux-kernel, linux-mm,
	xuyu@linux.alibaba.com

On 31.07.23 18:38, Matthew Wilcox wrote:
> On Mon, Jul 31, 2023 at 06:30:22PM +0200, David Hildenbrand wrote:
>> Assume we do do the page table sharing at mmap time, if the flags are right.
>> Let's focus on the most common:
>>
>> mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)
>>
>> And doing the same in each and every process.
> 
> That may be the most common in your usage, but for a database, you're
> looking at two usage scenarios.  Postgres calls mmap() on the database
> file itself so that all processes share the kernel page cache.
> Some Commercial Databases call mmap() on a hugetlbfs file so that all
> processes share the same userspace buffer cache.  Other Commercial
> Databases call shmget() / shmat() with SHM_HUGETLB for the exact
> same reason.

I remember you said that postgres might be looking into using shmem as 
well, maybe I am wrong.

memfd/hugetlb/shmem could all be handled alike, just "arbitrary 
filesystems" would require more work.

> 
> This is why I proposed mshare().  Anyone can use it for anything.
> We have such a diverse set of users who want to do stuff with shared
> page tables that we should not be tying it to memfd or any other
> filesystem.  Not to mention that it's more flexible; you can map
> individual 4kB files into it and still get page table sharing.

That's not what the current proposal does, or am I wrong?

Also, I'm curious, is that a real requirement in the database world?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)
  2023-07-31 16:48             ` David Hildenbrand
@ 2023-07-31 16:54               ` Matthew Wilcox
  2023-07-31 17:06                 ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2023-07-31 16:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Rongwei Wang, linux-arch, linux-kernel, linux-mm,
	xuyu@linux.alibaba.com

On Mon, Jul 31, 2023 at 06:48:47PM +0200, David Hildenbrand wrote:
> On 31.07.23 18:38, Matthew Wilcox wrote:
> > On Mon, Jul 31, 2023 at 06:30:22PM +0200, David Hildenbrand wrote:
> > > Assume we do do the page table sharing at mmap time, if the flags are right.
> > > Let's focus on the most common:
> > > 
> > > mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)
> > > 
> > > And doing the same in each and every process.
> > 
> > That may be the most common in your usage, but for a database, you're
> > looking at two usage scenarios.  Postgres calls mmap() on the database
> > file itself so that all processes share the kernel page cache.
> > Some Commercial Databases call mmap() on a hugetlbfs file so that all
> > processes share the same userspace buffer cache.  Other Commercial
> > Databases call shmget() / shmat() with SHM_HUGETLB for the exact
> > same reason.
> 
> I remember you said that postgres might be looking into using shmem as well,
> maybe I am wrong.

No, I said that postgres was also interested in sharing page tables.
I don't think they have any use for shmem.

> memfd/hugetlb/shmem could all be handled alike, just "arbitrary filesystems"
> would require more work.

But arbitrary filesystems were one of the original use cases: the
database is stored on a persistent memory filesystem, and neither the
kernel nor userspace has a cache.  The Postgres & Commercial Database
use-cases collapse into the same case, and we want to mmap the files
directly and share the page tables.
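
Roughly (the path is a placeholder; MAP_SYNC applies only to files on
a DAX-capable filesystem):

    int fd = open("/mnt/pmem/db/datafile", O_RDWR);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);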

> > This is why I proposed mshare().  Anyone can use it for anything.
> > We have such a diverse set of users who want to do stuff with shared
> > page tables that we should not be tying it to memfd or any other
> > filesystem.  Not to mention that it's more flexible; you can map
> > individual 4kB files into it and still get page table sharing.
> 
> That's not what the current proposal does, or am I wrong?

I think you're wrong, but I haven't had time to read the latest patches.

> Also, I'm curious, is that a real requirement in the database world?

I don't know.  It's definitely an advantage that falls out of the design
of mshare.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)
  2023-07-31 16:54               ` Matthew Wilcox
@ 2023-07-31 17:06                 ` David Hildenbrand
  0 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2023-07-31 17:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Rongwei Wang, linux-arch, linux-kernel, linux-mm,
	xuyu@linux.alibaba.com

On 31.07.23 18:54, Matthew Wilcox wrote:
> On Mon, Jul 31, 2023 at 06:48:47PM +0200, David Hildenbrand wrote:
>> On 31.07.23 18:38, Matthew Wilcox wrote:
>>> On Mon, Jul 31, 2023 at 06:30:22PM +0200, David Hildenbrand wrote:
>>>> Assume we do do the page table sharing at mmap time, if the flags are right.
>>>> Let's focus on the most common:
>>>>
>>>> mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)
>>>>
>>>> And doing the same in each and every process.
>>>
>>> That may be the most common in your usage, but for a database, you're
>>> looking at two usage scenarios.  Postgres calls mmap() on the database
>>> file itself so that all processes share the kernel page cache.
>>> Some Commercial Databases call mmap() on a hugetlbfs file so that all
>>> processes share the same userspace buffer cache.  Other Commercial
>>> Databases call shmget() / shmat() with SHM_HUGETLB for the exact
>>> same reason.
>>
>> I remember you said that postgres might be looking into using shmem as well,
>> maybe I am wrong.
> 
> No, I said that postgres was also interested in sharing page tables.
> I don't think they have any use for shmem.
> 
>> memfd/hugetlb/shmem could all be handled alike, just "arbitrary filesystems"
>> would require more work.
> 
> But arbitrary filesystems was one of the origin use cases; where the
> database is stored on a persistent memory filesystem, and neither the
> kernel nor userspace has a cache.  The Postgres & Commercial Database
> use-cases collapse into the same case, and we want to mmap the files
> directly and share the page tables.

Yes, and transparent page table sharing can be achieved by other means.

I guess what you imply is that they want to share page tables and have a 
single mprotect(PROT_READ) to modify the shared page tables.

> 
>>> This is why I proposed mshare().  Anyone can use it for anything.
>>> We have such a diverse set of users who want to do stuff with shared
>>> page tables that we should not be tying it to memfd or any other
>>> filesystem.  Not to mention that it's more flexible; you can map
>>> individual 4kB files into it and still get page table sharing.
>>
>> That's not what the current proposal does, or am I wrong?
> 
> I think you're wrong, but I haven't had time to read the latest patches.
> 

Maybe I misunderstood what the MAP_SHARED_PT actually does.

"
This patch series adds a new flag to mmap() call - MAP_SHARED_PT.
This flag can be specified along with MAP_SHARED by a process to
hint to kernel that it wishes to share page table entries for this
file mapping mmap region with other processes. Any other process
that mmaps the same file with MAP_SHARED_PT flag can then share the
same page table entries. Besides specifying MAP_SHARED_PT flag, the
processes must map the files at a PMD aligned address with a size
that is a multiple of PMD size and at the same virtual addresses.
This last requirement of same virtual addresses can possibly be
relaxed if that is the consensus.
"

Reading this, I'm confused about how 4 kB files would interact with the
PMD size requirement.
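
(With 4 KiB base pages on x86-64, PMD_SIZE is 2 MiB, i.e. one
PMD-level page table covers 512 PTEs, so a single 4 kB file on its own
can never satisfy the PMD-aligned, multiple-of-PMD-size requirement.)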

Probably I got it all wrong.

>> Also, I'm curious, is that a real requirement in the database world?
> 
> I don't know.  It's definitely an advantage that falls out of the design
> of mshare.

Okay, just checking whether there is an important use case I'm missing;
I'm also not aware of any.


Anyhow, I have other work to do. Happy to continue the discussion once
someone is actually working on this (again).

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)
  2023-07-31 16:38           ` Matthew Wilcox
  2023-07-31 16:48             ` David Hildenbrand
@ 2023-08-01  6:53             ` Rongwei Wang
  2023-08-01 19:28               ` Matthew Wilcox
  1 sibling, 1 reply; 21+ messages in thread
From: Rongwei Wang @ 2023-08-01  6:53 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-arch, linux-kernel, linux-mm, xuyu@linux.alibaba.com


On 2023/8/1 00:38, Matthew Wilcox wrote:
> On Mon, Jul 31, 2023 at 06:30:22PM +0200, David Hildenbrand wrote:
>> Assume we do do the page table sharing at mmap time, if the flags are right.
>> Let's focus on the most common:
>>
>> mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)
>>
>> And doing the same in each and every process.
> That may be the most common in your usage, but for a database, you're
> looking at two usage scenarios.  Postgres calls mmap() on the database
> file itself so that all processes share the kernel page cache.
> Some Commercial Databases call mmap() on a hugetlbfs file so that all
> processes share the same userspace buffer cache.  Other Commercial
> Databases call shmget() / shmat() with SHM_HUGETLB for the exact
> same reason.
>
> This is why I proposed mshare().  Anyone can use it for anything.

Hi Matthew

I'm a little confused about this mshare(). Which mshare() do you refer
to here, the previous mshare() based on a filesystem, or this RFC v2
posted by Khalid?

IMHO, there are big differences between the previous mshare() and
MAP_SHARED_PT now.

> We have such a diverse set of users who want to do stuff with shared
> page tables that we should not be tying it to memfd or any other
> filesystem.  Not to mention that it's more flexible; you can map
> individual 4kB files into it and still get page table sharing.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare)
  2023-08-01  6:53             ` Rongwei Wang
@ 2023-08-01 19:28               ` Matthew Wilcox
  0 siblings, 0 replies; 21+ messages in thread
From: Matthew Wilcox @ 2023-08-01 19:28 UTC (permalink / raw)
  To: Rongwei Wang; +Cc: linux-arch, linux-kernel, linux-mm, xuyu@linux.alibaba.com

On Tue, Aug 01, 2023 at 02:53:02PM +0800, Rongwei Wang wrote:
> 
> On 2023/8/1 00:38, Matthew Wilcox wrote:
> > On Mon, Jul 31, 2023 at 06:30:22PM +0200, David Hildenbrand wrote:
> > > Assume we do do the page table sharing at mmap time, if the flags are right.
> > > Let's focus on the most common:
> > > 
> > > mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)
> > > 
> > > And doing the same in each and every process.
> > That may be the most common in your usage, but for a database, you're
> > looking at two usage scenarios.  Postgres calls mmap() on the database
> > file itself so that all processes share the kernel page cache.
> > Some Commercial Databases call mmap() on a hugetlbfs file so that all
> > processes share the same userspace buffer cache.  Other Commercial
> > Databases call shmget() / shmat() with SHM_HUGETLB for the exact
> > same reason.
> > 
> > This is why I proposed mshare().  Anyone can use it for anything.
> 
> Hi Matthew
> 
> I'm a little confused about this mshare(). Which one is the mshare() you
> refer to here, previous mshare() based on filesystem or this RFC v2 posted
> by Khalid?
> 
> IMHO, they have much difference between previously mshare() and
> MAP_SHARED_PT now.

I haven't read this version of the patchset.  I'm describing the original
idea, not what it may have turned into.  As far as I'm concerned, we're
still trying to decide what functionality we actually want, not arguing
about whether this exact patchset has the correct number of tab indents
to be merged.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-08-01 19:28 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-26 16:49 [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare) Khalid Aziz
2023-04-26 16:49 ` [PATCH RFC v2 1/4] mm/ptshare: Add vm flag for shared PTE Khalid Aziz
2023-04-26 16:49 ` [PATCH RFC v2 2/4] mm/ptshare: Add flag MAP_SHARED_PT to mmap() Khalid Aziz
2023-04-26 16:49 ` [PATCH RFC v2 3/4] mm/ptshare: Create new mm struct for page table sharing Khalid Aziz
2023-06-26  8:08   ` Karim Manaouil
2023-04-26 16:49 ` [PATCH RFC v2 4/4] mm/ptshare: Add page fault handling for page table shared regions Khalid Aziz
2023-04-26 21:27 ` [PATCH RFC v2 0/4] Add support for sharing page tables across processes (Previously mshare) Mike Kravetz
2023-04-27 16:40   ` Khalid Aziz
2023-06-12 16:25 ` Peter Xu
2023-06-30 11:29 ` Rongwei Wang
2023-07-31  4:35 ` Rongwei Wang
2023-07-31 12:25   ` Matthew Wilcox
2023-07-31 12:50     ` David Hildenbrand
2023-07-31 16:19       ` Rongwei Wang
2023-07-31 16:30         ` David Hildenbrand
2023-07-31 16:38           ` Matthew Wilcox
2023-07-31 16:48             ` David Hildenbrand
2023-07-31 16:54               ` Matthew Wilcox
2023-07-31 17:06                 ` David Hildenbrand
2023-08-01  6:53             ` Rongwei Wang
2023-08-01 19:28               ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).