* [RFC PATCH 0/2] Add support for sharing page tables across processes (Previously mshare)
@ 2022-12-06 15:41 Khalid Aziz
  2022-12-06 15:41 ` [RFC PATCH 1/2] mm/ptshare: Add vm flag for shared PTE Khalid Aziz
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Khalid Aziz @ 2022-12-06 15:41 UTC (permalink / raw)
  To: akpm, willy, djwong, markhemm, viro, david, mike.kravetz
  Cc: Khalid Aziz, andreyknvl, dave.hansen, luto, 21cnbao, arnd,
      ebiederm, elver, linux-arch, linux-kernel, linux-mm, mhiramat,
      rostedt, vasily.averin, khalid.aziz

[Previously mshare patches. After discussion on the mailing list and at
LSF/MM, the mshare implementation has been reworked to eliminate the
mshare API and use mmap instead with a new flag. This eliminates the
need for msharefs. Alignment and size restrictions were changed to PMD.]

Memory pages shared between processes require a page table entry (PTE)
for each process. Each of these PTEs consumes some memory, and as long
as the number of mappings being maintained is small enough, the space
consumed by page tables is not objectionable. When very few memory pages
are shared between processes, the number of page table entries (PTEs) to
maintain is mostly constrained by the number of pages of memory on the
system. As the number of shared pages and the number of times pages are
shared goes up, the amount of memory consumed by page tables starts to
become significant. This issue does not apply to threads. Any number of
threads can share the same pages inside a process while sharing the same
PTEs. Extending this same model to pages shared across processes can
eliminate the issue for cross-process sharing as well.

Some field deployments commonly see memory pages shared across 1000s of
processes. On x86_64, each page requires a PTE that is only 8 bytes
long, which is very small compared to the 4K page size. When 2000
processes map the same page in their address space, each one of them
requires 8 bytes for its PTE, and together that adds up to 16K of memory
just to hold the PTEs for one 4K page. On a database server with a 300GB
SGA, a system crash from an out-of-memory condition was seen when 1500+
clients tried to share this SGA even though the system had 512GB of
memory. On this server, the worst-case scenario of all 1500 processes
mapping every page of the SGA would have required 878GB+ just for the
PTEs. If these PTEs could be shared, the amount of memory saved would be
very significant.

This patch series adds a new flag to the mmap() call - MAP_SHARED_PT.
This flag can be specified along with MAP_SHARED by a process to hint to
the kernel that it wishes to share page table entries for this file
mapping region with other processes. Any other process that mmaps the
same file with the MAP_SHARED_PT flag can then share the same page table
entries. Besides specifying the MAP_SHARED_PT flag, the processes must
map the file at a PMD-aligned address, with a size that is a multiple of
the PMD size, and at the same virtual address. This last requirement of
same virtual addresses can possibly be relaxed if that is the consensus.

When mmap() is called with the MAP_SHARED_PT flag, a new host mm struct
is created to hold the shared page tables. The host mm struct is not
attached to a process. The start and size of the host mm are set to the
start and size of the mmap region, and a VMA covering this range is also
added to the host mm struct. Existing page table entries from the
process that creates the mapping are copied over to the host mm struct.
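
As an illustration of the opt-in described above (not part of the
series), a minimal userspace sketch follows. The MAP_SHARED_PT value
(0x200000) is taken from patch 2 since libc headers would not define it
yet; the 2MB PMD size, the 8MB length, the fixed address, and the
assumption that the file is at least that large are choices made only
for the example, and MAP_FIXED is used here as one way to satisfy the
same-virtual-address requirement.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_SHARED_PT
#define MAP_SHARED_PT	0x200000	/* value proposed in this series */
#endif

#define PMD_SZ		(2UL * 1024 * 1024)		/* assumed 2MB PMD (x86_64, 4K pages) */
#define FIXED_ADDR	((void *)0x7f0000000000UL)	/* hypothetical PMD-aligned address */
#define MAP_LEN		(4 * PMD_SZ)			/* multiple of PMD size; file assumed >= this */

int main(int argc, char **argv)
{
	int fd;
	void *p;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Every process that wants to share page tables maps the same
	 * file with MAP_SHARED | MAP_SHARED_PT, a PMD-aligned address
	 * and length, and (per the cover letter) the same virtual
	 * address in each process.
	 */
	p = mmap(FIXED_ADDR, MAP_LEN, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_SHARED_PT | MAP_FIXED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap(MAP_SHARED_PT)");
		return 1;
	}

	/* Touch a page; the PTE is created in the shared (host) mm. */
	*(volatile char *)p = 1;

	pause();	/* keep the mapping alive while other processes attach */
	return 0;
}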
All processes mapping this shared region are considered guest processes.
When a guest process mmaps the shared region, the vm flag VM_SHARED_PT
is added to the VMAs in the guest process. Upon a page fault, the VMA is
checked for the presence of the VM_SHARED_PT flag. If the flag is found,
the corresponding PMD is updated with the PMD from the host mm struct so
that the PMD points to the page tables in the host mm struct. The vm_mm
pointer of the VMA is also updated to point to the host mm struct for
the duration of fault handling, to ensure fault handling happens in the
context of the host mm struct. When a new PTE is created, it is created
in the host mm struct's page tables, and the PMD in the guest mm points
to the same PTEs.

This is a basic working implementation. It will need to go through more
testing and refinements. Some notes and questions:

- The PMD alignment and size requirement is currently hard coded. Is
  there a need or desire to make this more flexible and work with other
  alignments/sizes? PMD size allows this infrastructure to form the
  basis for hugetlbfs page table sharing as well. More work will be
  needed to make that happen.
- Is there a reason to allow a userspace app to query this size and
  alignment requirement for MAP_SHARED_PT in some way?
- Shared PTEs mean that an mprotect() call made by one process affects
  all processes sharing the same mapping, and that behavior will need to
  be documented clearly. The effect of an mprotect call being different
  for processes using shared page tables is the primary reason to
  require an explicit opt-in from userspace processes to share page
  tables. With transparent sharing derived from MAP_SHARED alone, the
  changed effect of mprotect could break a significant number of
  userspace apps. One could work around that by unsharing whenever
  mprotect changes modes on a shared mapping, but that introduces
  complexity, and the capability to execute a single mprotect to change
  modes across 1000s of processes sharing a mapped database is a feature
  explicitly asked for by database folks (a sketch of this single-call
  usage follows this message). This capability has a significant
  performance advantage compared to sending messages to every process
  using the shared mapping so that each one calls mprotect itself, or to
  using traps on permission mismatches in each process.
- This implementation does not allow partially unmapping a mapping that
  shares page tables. Should that be supported in the future?
- Patch 2 will be broken up into smaller patches after the RFC. It is
  likely easier to review the proposal with all the code in one patch
  for now.

Khalid Aziz (2):
  mm/ptshare: Add vm flag for shared PTE
  mm/ptshare: Create a new mm to share pagetables

 include/linux/fs.h                     |   1 +
 include/linux/mm.h                     |   8 +
 include/trace/events/mmflags.h         |   3 +-
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/Makefile                            |   2 +-
 mm/internal.h                          |  21 ++
 mm/memory.c                            |  52 ++++-
 mm/mmap.c                              |  87 ++++++++
 mm/ptshare.c                           | 262 +++++++++++++++++++++++++
 9 files changed, 433 insertions(+), 4 deletions(-)
 create mode 100644 mm/ptshare.c

-- 
2.34.1

^ permalink raw reply	[flat|nested] 6+ messages in thread
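
The mprotect note above can be made concrete with a small sketch, again
illustrative only and reusing the hypothetical window from the earlier
example: a single call in one sharing process is expected to change the
protection seen by every process sharing the PTEs.

#include <stdio.h>
#include <sys/mman.h>

/* Same hypothetical window as in the mmap sketch above. */
#define SHARED_BASE	((void *)0x7f0000000000UL)
#define SHARED_LEN	(4UL * 2 * 1024 * 1024)

/*
 * Called by any one of the sharing processes (for instance after the
 * mmap in the earlier sketch). With plain MAP_SHARED this would change
 * only the caller's view; with shared page tables the protection change
 * is expected to apply to every process sharing the PTEs, which is the
 * semantics the cover letter asks to have documented.
 */
static int make_shared_window_readonly(void)
{
	if (mprotect(SHARED_BASE, SHARED_LEN, PROT_READ) < 0) {
		perror("mprotect");
		return -1;
	}
	return 0;
}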
* [RFC PATCH 1/2] mm/ptshare: Add vm flag for shared PTE 2022-12-06 15:41 [RFC PATCH 0/2] Add support for sharing page tables across processes (Previously mshare) Khalid Aziz @ 2022-12-06 15:41 ` Khalid Aziz 2023-01-20 16:08 ` [RFC RESEND " Khalid Aziz 2022-12-06 15:41 ` [RFC PATCH 2/2] mm/ptshare: Create a new mm for shared pagetables and add basic page table sharing support Khalid Aziz 2023-01-20 16:08 ` [RFC RESEND PATCH 0/2] Add support for sharing page tables across processes (Previously mshare) Khalid Aziz 2 siblings, 1 reply; 6+ messages in thread From: Khalid Aziz @ 2022-12-06 15:41 UTC (permalink / raw) To: akpm, willy, djwong, markhemm, viro, david, mike.kravetz Cc: Khalid Aziz, andreyknvl, dave.hansen, luto, 21cnbao, arnd, ebiederm, elver, linux-arch, linux-kernel, linux-mm, mhiramat, rostedt, vasily.averin, Khalid Aziz Add a bit to vm_flags to indicate a vma shares PTEs with others. Add a function to determine if a vma shares PTE by checking this flag. This is to be used to find the shared page table entries on page fault for vmas sharing PTE. Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> --- include/linux/mm.h | 8 ++++++++ include/trace/events/mmflags.h | 3 ++- mm/internal.h | 5 +++++ 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 8bbcccbc5565..699323be7502 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -314,11 +314,13 @@ extern unsigned int kobjsize(const void *objp); #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */ +#define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0) #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1) #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2) #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3) #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) +#define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5) #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ #ifdef CONFIG_ARCH_HAS_PKEYS @@ -360,6 +362,12 @@ extern unsigned int kobjsize(const void *objp); # define VM_MTE_ALLOWED VM_NONE #endif +#ifdef CONFIG_ARCH_USES_HIGH_VMA_FLAGS +#define VM_SHARED_PT VM_HIGH_ARCH_5 +#else +#define VM_SHARED_PT VM_NONE +#endif + #ifndef VM_GROWSUP # define VM_GROWSUP VM_NONE #endif diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index e87cb2b80ed3..30e56cbac99b 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -194,7 +194,8 @@ IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY, "softdirty" ) \ {VM_MIXEDMAP, "mixedmap" }, \ {VM_HUGEPAGE, "hugepage" }, \ {VM_NOHUGEPAGE, "nohugepage" }, \ - {VM_MERGEABLE, "mergeable" } \ + {VM_MERGEABLE, "mergeable" }, \ + {VM_SHARED_PT, "sharedpt" } \ #define show_vma_flags(flags) \ (flags) ? __print_flags(flags, "|", \ diff --git a/mm/internal.h b/mm/internal.h index 6b7ef495b56d..16083eca720e 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -856,4 +856,9 @@ static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma) return !(vma->vm_flags & VM_SOFTDIRTY); } +static inline bool vma_is_shared(const struct vm_area_struct *vma) +{ + return vma->vm_flags & VM_SHARED_PT; +} + #endif /* __MM_INTERNAL_H */ -- 2.34.1 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* [RFC PATCH 2/2] mm/ptshare: Create a new mm for shared pagetables and add basic page table sharing support
  2022-12-06 15:41 [RFC PATCH 0/2] Add support for sharing page tables across processes (Previously mshare) Khalid Aziz
  2022-12-06 15:41 ` [RFC PATCH 1/2] mm/ptshare: Add vm flag for shared PTE Khalid Aziz
@ 2022-12-06 15:41 ` Khalid Aziz
  2023-01-20 16:08 ` [RFC RESEND " Khalid Aziz
  2023-01-20 16:08 ` [RFC RESEND PATCH 0/2] Add support for sharing page tables across processes (Previously mshare) Khalid Aziz
  2 siblings, 1 reply; 6+ messages in thread
From: Khalid Aziz @ 2022-12-06 15:41 UTC (permalink / raw)
  To: akpm, willy, djwong, markhemm, viro, david, mike.kravetz
  Cc: Khalid Aziz, andreyknvl, dave.hansen, luto, 21cnbao, arnd,
      ebiederm, elver, linux-arch, linux-kernel, linux-mm, mhiramat,
      rostedt, vasily.averin, Khalid Aziz

When a process mmaps a file with the MAP_SHARED flag, it is possible
that any other process mmapping the same file with the MAP_SHARED flag
and the same permissions could share the page table entries as well
instead of creating duplicate entries. This patch introduces a new flag,
MAP_SHARED_PT, which a process can use to hint that it can share page
tables with other processes using the same mapping. It creates a new mm
struct to hold the shareable page table entries for the newly mapped
region. This new mm is not associated with a task. Its lifetime lasts
until the last shared mapping is deleted. It adds a new pointer
"ptshare_data" to struct address_space which points to the data
structure that contains a pointer to this newly created mm.

Add support for creating a new set of shared page tables in a new
mm_struct upon mmap of a region that can potentially share page tables.
Add page fault handling for this now-shared region. Modify the
free_pgtables path to make sure page tables in the shared regions are
kept intact when a process using the shared page table region exits and
there are other mappers for the shared region. Clean up the mm_struct
holding the shared page tables when the last process sharing the region
exits.
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> --- include/linux/fs.h | 1 + include/uapi/asm-generic/mman-common.h | 1 + mm/Makefile | 2 +- mm/internal.h | 16 ++ mm/memory.c | 52 ++++- mm/mmap.c | 87 ++++++++ mm/ptshare.c | 262 +++++++++++++++++++++++++ 7 files changed, 418 insertions(+), 3 deletions(-) create mode 100644 mm/ptshare.c diff --git a/include/linux/fs.h b/include/linux/fs.h index 59ae95ddb679..f940bf03303b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -441,6 +441,7 @@ struct address_space { spinlock_t private_lock; struct list_head private_list; void *private_data; + void *ptshare_data; } __attribute__((aligned(sizeof(long)))) __randomize_layout; /* * On most architectures that alignment is already the case; but diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h index 6ce1f1ceb432..4d23456b5915 100644 --- a/include/uapi/asm-generic/mman-common.h +++ b/include/uapi/asm-generic/mman-common.h @@ -29,6 +29,7 @@ #define MAP_HUGETLB 0x040000 /* create a huge page mapping */ #define MAP_SYNC 0x080000 /* perform synchronous page faults for the mapping */ #define MAP_FIXED_NOREPLACE 0x100000 /* MAP_FIXED which doesn't unmap underlying mapping */ +#define MAP_SHARED_PT 0x200000 /* Shared page table mappings */ #define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be * uninitialized */ diff --git a/mm/Makefile b/mm/Makefile index 8e105e5b3e29..d9bb14fdf220 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -40,7 +40,7 @@ mmu-y := nommu.o mmu-$(CONFIG_MMU) := highmem.o memory.o mincore.o \ mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \ msync.o page_vma_mapped.o pagewalk.o \ - pgtable-generic.o rmap.o vmalloc.o + pgtable-generic.o rmap.o vmalloc.o ptshare.o ifdef CONFIG_CROSS_MEMORY_ATTACH diff --git a/mm/internal.h b/mm/internal.h index 16083eca720e..22cae2ff97fa 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -861,4 +861,20 @@ static inline bool vma_is_shared(const struct vm_area_struct *vma) return vma->vm_flags & VM_SHARED_PT; } +/* + * mm/ptshare.c + */ +struct mshare_data { + struct mm_struct *mm; + refcount_t refcnt; + unsigned long start; + unsigned long size; + unsigned long mode; +}; +int ptshare_new_mm(struct file *file, struct vm_area_struct *vma); +void ptshare_del_mm(struct vm_area_struct *vm); +int ptshare_insert_vma(struct mm_struct *mm, struct vm_area_struct *vma); +extern vm_fault_t find_shared_vma(struct vm_area_struct **vmap, + unsigned long *addrp, unsigned int flags); + #endif /* __MM_INTERNAL_H */ diff --git a/mm/memory.c b/mm/memory.c index 8a6d5c823f91..051c82e13627 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -416,15 +416,21 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt, unlink_anon_vmas(vma); unlink_file_vma(vma); + /* + * If vma is sharing page tables through a host mm, do not + * free page tables for it. Those page tables wil be freed + * when host mm is released. + */ if (is_vm_hugetlb_page(vma)) { hugetlb_free_pgd_range(tlb, addr, vma->vm_end, floor, next ? 
next->vm_start : ceiling); - } else { + } else if (!vma_is_shared(vma)) { /* * Optimization: gather nearby vmas into one call down */ while (next && next->vm_start <= vma->vm_end + PMD_SIZE - && !is_vm_hugetlb_page(next)) { + && !is_vm_hugetlb_page(next) + && !vma_is_shared(next)) { vma = next; next = mas_find(&mas, ceiling - 1); unlink_anon_vmas(vma); @@ -5189,6 +5195,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags, struct pt_regs *regs) { vm_fault_t ret; + bool shared = false; + struct mm_struct *orig_mm; __set_current_state(TASK_RUNNING); @@ -5198,6 +5206,16 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, /* do counter updates before entering really critical section. */ check_sync_rss_stat(current); + orig_mm = vma->vm_mm; + if (unlikely(vma_is_shared(vma))) { + ret = find_shared_vma(&vma, &address, flags); + if (ret) + return ret; + if (!vma) + return VM_FAULT_SIGSEGV; + shared = true; + } + if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE, flags & FAULT_FLAG_INSTRUCTION, flags & FAULT_FLAG_REMOTE)) @@ -5219,6 +5237,36 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, lru_gen_exit_fault(); + /* + * Release the read lock on shared VMA's parent mm unless + * __handle_mm_fault released the lock already. + * __handle_mm_fault sets VM_FAULT_RETRY in return value if + * it released mmap lock. If lock was released, that implies + * the lock would have been released on task's original mm if + * this were not a shared PTE vma. To keep lock state consistent, + * make sure to release the lock on task's original mm + */ + if (shared) { + int release_mmlock = 1; + + if (!(ret & VM_FAULT_RETRY)) { + mmap_read_unlock(vma->vm_mm); + release_mmlock = 0; + } else if ((flags & FAULT_FLAG_ALLOW_RETRY) && + (flags & FAULT_FLAG_RETRY_NOWAIT)) { + mmap_read_unlock(vma->vm_mm); + release_mmlock = 0; + } + + /* + * Reset guest vma pointers that were set up in + * find_shared_vma() to process this fault. + */ + vma->vm_mm = orig_mm; + if (release_mmlock) + mmap_read_unlock(orig_mm); + } + if (flags & FAULT_FLAG_USER) { mem_cgroup_exit_user_fault(); /* diff --git a/mm/mmap.c b/mm/mmap.c index 74a84eb33b90..c1adb9d52f5d 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1245,6 +1245,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, struct mm_struct *mm = current->mm; vm_flags_t vm_flags; int pkey = 0; + int ptshare = 0; validate_mm(mm); *populate = 0; @@ -1282,6 +1283,21 @@ unsigned long do_mmap(struct file *file, unsigned long addr, if (mm->map_count > sysctl_max_map_count) return -ENOMEM; + /* + * If MAP_SHARED_PT is set, MAP_SHARED or MAP_SHARED_VALIDATE must + * be set as well + */ + if (flags & MAP_SHARED_PT) { +#if VM_SHARED_PT + if (flags & (MAP_SHARED | MAP_SHARED_VALIDATE)) + ptshare = 1; + else + return -EINVAL; +#else + return -EINVAL; +#endif + } + /* Obtain the address to map to. we verify (or select) it and ensure * that it represents a valid section of the address space. */ @@ -1414,6 +1430,60 @@ unsigned long do_mmap(struct file *file, unsigned long addr, ((vm_flags & VM_LOCKED) || (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE)) *populate = len; + +#if VM_SHARED_PT + /* + * Check if this mapping is a candidate for page table sharing + * at PMD level. 
It is if following conditions hold: + * - It is not anonymous mapping + * - It is not hugetlbfs mapping (for now) + * - flags conatins MAP_SHARED or MAP_SHARED_VALIDATE and + * MAP_SHARED_PT + * - Start address is aligned to PMD size + * - Mapping size is a multiple of PMD size + */ + if (ptshare && file && !is_file_hugepages(file)) { + struct vm_area_struct *vma; + + vma = find_vma(mm, addr); + if (!((vma->vm_start | vma->vm_end) & (PMD_SIZE - 1))) { + struct mshare_data *info = file->f_mapping->ptshare_data; + + /* + * If this mapping has not been set up for page table + * sharing yet, do so by creating a new mm to hold the + * shared page tables for this mapping + */ + if (info == NULL) { + int ret; + + ret = ptshare_new_mm(file, vma); + if (ret < 0) + return ret; + info = file->f_mapping->ptshare_data; + ret = ptshare_insert_vma(info->mm, vma); + if (ret < 0) + addr = ret; + else + vma->vm_flags |= VM_SHARED_PT; + } else { + /* + * Page tables will be shared only if the + * file is mapped in with the same permissions + * across all mappers with same starting + * address and size + */ + if (((prot & info->mode) == info->mode) && + (addr == info->start) && + (len == info->size)) { + vma->vm_flags |= VM_SHARED_PT; + refcount_inc(&info->refcnt); + } + } + } + } +#endif + return addr; } @@ -2491,6 +2561,21 @@ int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm, if (end == start) return -EINVAL; + /* + * Check if this vma uses shared page tables + */ + vma = find_vma_intersection(mm, start, end); + if (vma && unlikely(vma_is_shared(vma))) { + struct mshare_data *info = NULL; + + if (vma->vm_file && vma->vm_file->f_mapping) + info = vma->vm_file->f_mapping->ptshare_data; + /* Don't allow partial munmaps */ + if (info && ((start != info->start) || (len != info->size))) + return -EINVAL; + ptshare_del_mm(vma); + } + /* arch_unmap() might do unmaps itself. */ arch_unmap(mm, start, end); @@ -2660,6 +2745,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr, } } + if (vm_flags & VM_SHARED_PT) + vma->vm_flags |= VM_SHARED_PT; vm_flags = vma->vm_flags; } else if (vm_flags & VM_SHARED) { error = shmem_zero_setup(vma); diff --git a/mm/ptshare.c b/mm/ptshare.c new file mode 100644 index 000000000000..97322f822233 --- /dev/null +++ b/mm/ptshare.c @@ -0,0 +1,262 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Share page table entries when possible to reduce the amount of extra + * memory consumed by page tables + * + * Copyright (C) 2022 Oracle Corp. All rights reserved. + * Authors: Khalid Aziz <khalid.aziz@oracle.com> + * Matthew Wilcox <willy@infradead.org> + */ + +#include <linux/mm.h> +#include <linux/fs.h> +#include <asm/pgalloc.h> +#include "internal.h" + +/* + */ +static pmd_t +*get_pmd(struct mm_struct *mm, unsigned long addr) +{ + pgd_t *pgd; + p4d_t *p4d; + pud_t *pud; + pmd_t *pmd; + + pgd = pgd_offset(mm, addr); + if (pgd_none(*pgd)) + return NULL; + + p4d = p4d_offset(pgd, addr); + if (p4d_none(*p4d)) { + p4d = p4d_alloc(mm, pgd, addr); + if (!p4d) + return NULL; + } + + pud = pud_offset(p4d, addr); + if (pud_none(*pud)) { + pud = pud_alloc(mm, p4d, addr); + if (!pud) + return NULL; + } + + pmd = pmd_offset(pud, addr); + if (pmd_none(*pmd)) { + pmd = pmd_alloc(mm, pud, addr); + if (!pmd) + return NULL; + } + + return pmd; +} + +/* + * Find the shared pmd entries in host mm struct and install them into + * guest page tables. 
+ */ +static int +ptshare_copy_pmd(struct mm_struct *host_mm, struct mm_struct *guest_mm, + struct vm_area_struct *vma, unsigned long addr) +{ + pgd_t *pgd; + p4d_t *p4d; + pud_t *pud; + pmd_t *host_pmd; + spinlock_t *host_ptl, *guest_ptl; + + pgd = pgd_offset(guest_mm, addr); + p4d = p4d_offset(pgd, addr); + if (p4d_none(*p4d)) { + p4d = p4d_alloc(guest_mm, pgd, addr); + if (!p4d) + return 1; + } + + pud = pud_offset(p4d, addr); + if (pud_none(*pud)) { + host_pmd = get_pmd(host_mm, addr); + if (!host_pmd) + return 1; + + get_page(virt_to_page(host_pmd)); + host_ptl = pmd_lockptr(host_mm, host_pmd); + guest_ptl = pud_lockptr(guest_mm, pud); + spin_lock(host_ptl); + spin_lock(guest_ptl); + pud_populate(guest_mm, pud, + (pmd_t *)((unsigned long)host_pmd & PAGE_MASK)); + put_page(virt_to_page(host_pmd)); + spin_unlock(guest_ptl); + spin_unlock(host_ptl); + } + + return 0; +} + +/* + * Find the shared page tables in hosting mm struct and install those in + * the guest mm struct + */ +vm_fault_t +find_shared_vma(struct vm_area_struct **vmap, unsigned long *addrp, + unsigned int flags) +{ + struct mshare_data *info; + struct mm_struct *host_mm; + struct vm_area_struct *host_vma, *guest_vma = *vmap; + unsigned long host_addr; + pmd_t *guest_pmd, *host_pmd; + + if ((!guest_vma->vm_file) || (!guest_vma->vm_file->f_mapping)) + return 0; + info = guest_vma->vm_file->f_mapping->ptshare_data; + if (!info) { + pr_warn("VM_SHARED_PT vma with NULL ptshare_data"); + dump_stack_print_info(KERN_WARNING); + return 0; + } + host_mm = info->mm; + + mmap_read_lock(host_mm); + host_addr = *addrp - guest_vma->vm_start + host_mm->mmap_base; + host_pmd = get_pmd(host_mm, host_addr); + guest_pmd = get_pmd(guest_vma->vm_mm, *addrp); + if (!pmd_same(*guest_pmd, *host_pmd)) { + set_pmd(guest_pmd, *host_pmd); + mmap_read_unlock(host_mm); + return VM_FAULT_NOPAGE; + } + + *addrp = host_addr; + host_vma = find_vma(host_mm, host_addr); + if (!host_vma) + return VM_FAULT_SIGSEGV; + + /* + * Point vm_mm for the faulting vma to the mm struct holding shared + * page tables so the fault handling will happen in the right + * shared context + */ + guest_vma->vm_mm = host_mm; + + return 0; +} + +/* + * Create a new mm struct that will hold the shared PTEs. Pointer to + * this new mm is stored in the data structure mshare_data which also + * includes a refcount for any current references to PTEs in this new + * mm. This refcount is used to determine when the mm struct for shared + * PTEs can be deleted. 
+ */ +int +ptshare_new_mm(struct file *file, struct vm_area_struct *vma) +{ + struct mm_struct *new_mm; + struct mshare_data *info = NULL; + int retval = 0; + unsigned long start = vma->vm_start; + unsigned long len = vma->vm_end - vma->vm_start; + + new_mm = mm_alloc(); + if (!new_mm) { + retval = -ENOMEM; + goto err_free; + } + new_mm->mmap_base = start; + new_mm->task_size = len; + if (!new_mm->task_size) + new_mm->task_size--; + + info = kzalloc(sizeof(*info), GFP_KERNEL); + if (!info) { + retval = -ENOMEM; + goto err_free; + } + info->mm = new_mm; + info->start = start; + info->size = len; + refcount_set(&info->refcnt, 1); + file->f_mapping->ptshare_data = info; + + return retval; + +err_free: + if (new_mm) + mmput(new_mm); + kfree(info); + return retval; +} + +/* + * insert vma into mm holding shared page tables + */ +int +ptshare_insert_vma(struct mm_struct *mm, struct vm_area_struct *vma) +{ + struct vm_area_struct *new_vma; + int err = 0; + + new_vma = vm_area_dup(vma); + if (!new_vma) + return -ENOMEM; + + new_vma->vm_file = NULL; + /* + * This new vma belongs to host mm, so clear the VM_SHARED_PT + * flag on this so we know this is the host vma when we clean + * up page tables. Do not use THP for page table shared regions + */ + new_vma->vm_flags &= ~(VM_SHARED | VM_SHARED_PT); + new_vma->vm_flags |= VM_NOHUGEPAGE; + new_vma->vm_mm = mm; + + err = insert_vm_struct(mm, new_vma); + if (err) + return -ENOMEM; + + /* + * Copy the PMD entries from host mm to guest so they use the + * same PTEs + */ + err = ptshare_copy_pmd(mm, vma->vm_mm, vma, vma->vm_start); + + return err; +} + +/* + * Free the mm struct created to hold shared PTEs and associated data + * structures + */ +static inline void +free_ptshare_mm(struct mshare_data *info) +{ + mmput(info->mm); + kfree(info); +} + +/* + * This function is called when a reference to the shared PTEs in mm + * struct is dropped. It updates refcount and checks to see if last + * reference to the mm struct holding shared PTEs has been dropped. If + * so, it cleans up the mm struct and associated data structures + */ +void +ptshare_del_mm(struct vm_area_struct *vma) +{ + struct mshare_data *info; + struct file *file = vma->vm_file; + + if (!file || (!file->f_mapping)) + return; + info = file->f_mapping->ptshare_data; + WARN_ON(!info); + if (!info) + return; + + if (refcount_dec_and_test(&info->refcnt)) { + free_ptshare_mm(info); + file->f_mapping->ptshare_data = NULL; + } +} -- 2.34.1 ^ permalink raw reply related [flat|nested] 6+ messages in thread
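
One rough way to observe the effect of the series (an illustration, not
part of the patches): have every mapper touch the pages of the shared
window and then compare each mapper's own page table footprint, which
the kernel reports in the VmPTE field of /proc/<pid>/status. With
ordinary MAP_SHARED, every mapper's VmPTE grows with the mapping; with
MAP_SHARED_PT the PTEs are allocated in the host mm, so per-process
VmPTE would be expected to stay roughly flat. The helper below only
reads the field; the mapping setup is assumed to be the sketch shown
after the cover letter.

#include <stdio.h>
#include <string.h>

/* Return this process's page table footprint (VmPTE) in kB, or -1. */
static long read_vmpte_kb(void)
{
	char line[256];
	long kb = -1;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "VmPTE:", 6)) {
			sscanf(line + 6, "%ld", &kb);
			break;
		}
	}
	fclose(f);
	return kb;
}

int main(void)
{
	printf("VmPTE: %ld kB\n", read_vmpte_kb());
	return 0;
}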