* Sharing page tables across processes (mshare)
@ 2023-10-23 22:44 Khalid Aziz
  2023-10-30  2:45 ` Rongwei Wang
  2023-11-01 14:02 ` David Hildenbrand
  0 siblings, 2 replies; 7+ messages in thread
From: Khalid Aziz @ 2023-10-23 22:44 UTC (permalink / raw)
  To: Matthew Wilcox, David Hildenbrand, Mike Kravetz, Peter Xu, rongwei.wang, Mark Hemment
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org

Threads of a process share an address space and page tables, which provides two key advantages:

1. The amount of memory required for PTEs to map physical pages stays low even when a large number of threads share the same pages, since the PTEs are shared across threads.

2. Page protection attributes are shared across threads, and a change of attributes applies immediately to every thread without any overhead of coordinating protection bit changes across threads.

These advantages no longer apply when unrelated processes share pages. Some applications can require 1000s of processes that all access the same set of data on shared pages. For instance, a database server may map a large chunk of the database into memory to give clients fast access to data through its buffer cache. The server may launch new processes to provide services to new clients connecting to the shared database, and each new process maps in the shared database pages. When the PTEs mapping these shared pages are not shared across processes, each process consumes some memory to store its own copies. On x86_64, each page requires a PTE that is only 8 bytes long, which is very small compared to the 4K page size. But when 2000 processes map the same page in their address spaces, each one requires 8 bytes for its PTE, and together that adds up to 16K of memory just to hold the PTEs for one 4K page. On a database server with a 300GB SGA, a system crash from an out-of-memory condition was seen when 1500+ clients tried to share this SGA, even though the system had 512GB of memory. On this server, the worst-case scenario of all 1500 processes mapping every page of the SGA would have required 878GB+ just for the PTEs. If these PTEs could be shared, the amount of memory saved would be very significant.

When PTEs are not shared between processes, each process also ends up with its own set of protection bits for each shared page. Database servers often need to change protection bits for pages as they manipulate and update data in the database. When changing page protection for a shared page, the PTEs in every process that has the shared page mapped need to be updated to ensure data integrity. To accomplish this, the process making the initial change to the protection bits sends messages to every process sharing that page. All processes then block any access to that page, make the appropriate change to their protection bits, and send a confirmation back. To ensure data consistency, access to the shared page can be resumed only after all processes have acknowledged the change. This is a disruptive and expensive coordination process. If PTEs were shared across processes, a change to the page protection of a shared PTE would apply to all processes instantly, with no coordination required to ensure consistency. Changing protection bits across all processes sharing database pages is a common enough operation on Oracle databases that the cost is significant, and the cost goes up with the number of clients.
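As a rough sanity check of the memory-overhead numbers above, the arithmetic can be reproduced with a few lines of C. This is only a back-of-the-envelope illustration: it assumes a 64-bit system with 4K pages and 8-byte PTEs, and it ignores the memory consumed by the upper page table levels:

	#include <stdio.h>

	int main(void)
	{
		const unsigned long page_size = 4096;           /* 4K pages */
		const unsigned long pte_size  = 8;              /* bytes per PTE on x86_64 */
		const unsigned long sga_bytes = 300UL << 30;    /* 300GB SGA */
		const unsigned long nproc     = 1500;           /* processes mapping the SGA */

		unsigned long ptes_per_proc  = sga_bytes / page_size;
		unsigned long bytes_per_proc = ptes_per_proc * pte_size;

		/* Prints ~600MB of PTEs per process and ~878GB across 1500 processes */
		printf("PTE memory per process: %lu MB\n", bytes_per_proc >> 20);
		printf("PTE memory for %lu processes: %lu GB\n",
		       nproc, (nproc * bytes_per_proc) >> 30);
		return 0;
	}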
This is a proposal to extend the page table sharing model that threads already use so that it also works across processes. This will allow processes to tap into the same benefits that threads get from shared page tables.

Sharing page tables across processes opens their address spaces to each other and thus must be done carefully. This proposal suggests sharing PTEs across processes that trust each other and have explicitly agreed to share page tables. The proposal is to add a new flag to the mmap() call - MAP_SHARED_PT. This flag can be specified along with MAP_SHARED by a process to hint to the kernel that it wishes to share the page table entries for this file-mapping mmap region with other processes. Any other process that mmaps the same file with the MAP_SHARED_PT flag can then share the same page table entries. Besides specifying the MAP_SHARED_PT flag, the processes must map the file at a PMD-aligned address, with a size that is a multiple of the PMD size, and at the same virtual address. NOTE: This last requirement of the same virtual address can possibly be relaxed if that is the consensus.

When mmap() is called with the MAP_SHARED_PT flag, a new host mm struct is created to hold the shared page tables. The host mm struct is not attached to a process. The start and size of the host mm are set to the start and size of the mmap region, and a VMA covering this range is also added to the host mm struct. Existing page table entries from the process that creates the mapping are copied over to the host mm struct. All processes mapping this shared region are considered guest processes. When a guest process mmaps the shared region, a vm flag VM_SHARED_PT is added to the VMAs in the guest process. Upon a page fault, the VMA is checked for the presence of the VM_SHARED_PT flag. If the flag is found, the corresponding PMD is updated with the PMD from the host mm struct, so the PMD points to the page tables in the host mm struct. When a new PTE is created, it is created in the host mm struct's page tables, and the PMD in the guest mm points to the same PTEs.
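For illustration only, usage under the current proposal would look roughly like the sketch below; every participating process runs the same code against the same file. MAP_SHARED_PT is only proposed and is not defined in mainline headers, so the flag value here is just a placeholder, the file path is made up, and PMD_SIZE is assumed to be 2MB (x86_64 with 4K pages):

	#include <stdio.h>
	#include <stdlib.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/mman.h>

	#ifndef MAP_SHARED_PT
	#define MAP_SHARED_PT	0x400000		/* placeholder value for the proposed flag */
	#endif

	#define PMD_SIZE	(2UL * 1024 * 1024)	/* assumed: 2MB PMD */
	#define REGION_ADDR	((void *)(2UL << 40))	/* PMD-aligned, same in every process */
	#define REGION_SIZE	(512UL * PMD_SIZE)	/* multiple of PMD size */

	int main(void)
	{
		int fd = open("/dbdata/sga_file", O_RDWR);	/* hypothetical shared file */
		if (fd < 0) {
			perror("open");
			exit(1);
		}

		/*
		 * Same file, same size, same PMD-aligned virtual address, plus
		 * MAP_SHARED_PT: processes that do this share the PTEs for the range.
		 */
		char *addr = mmap(REGION_ADDR, REGION_SIZE, PROT_READ | PROT_WRITE,
				  MAP_SHARED | MAP_SHARED_PT, fd, 0);
		if (addr == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}

		/*
		 * Reads and writes behave like any MAP_SHARED file mapping; the
		 * PTEs backing this range live in the shared host mm.
		 */
		addr[0] = 1;

		munmap(addr, REGION_SIZE);
		close(fd);
		return 0;
	}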
--------------------------
Evolution of this proposal
--------------------------

The original proposal,
<https://lore.kernel.org/lkml/cover.1642526745.git.khalid.aziz@oracle.com/>,
was for an mshare() system call that a donor process calls to create an empty mshare'd region. This shared region is pgdir-aligned and a multiple of the pgdir size. Each mshare'd region creates a corresponding file under /sys/fs/mshare which can be read to get information on the region. Once an empty region has been created, any objects can be mapped into this region and the page tables for those objects will be shared. A snippet of the code that a donor process would run looks like below:

	addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_ANONYMOUS, 0, 0);
	if (addr == MAP_FAILED)
		perror("ERROR: mmap failed");

	err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
			GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
	if (err < 0) {
		perror("mshare() syscall failed");
		exit(1);
	}

	strncpy(addr, "Some random shared text",
		sizeof("Some random shared text"));

A snippet of the code that a consumer process would execute looks like:

	fd = open("testregion", O_RDONLY);
	if (fd < 0) {
		perror("open failed");
		exit(1);
	}

	if ((count = read(fd, &mshare_info, sizeof(mshare_info))) > 0)
		printf("INFO: %ld bytes shared at addr %lx \n",
			mshare_info[1], mshare_info[0]);
	else
		perror("read failed");

	close(fd);

	addr = (char *)mshare_info[0];
	err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0],
			mshare_info[1], O_RDWR, 600);
	if (err < 0) {
		perror("mshare() syscall failed");
		exit(1);
	}

	printf("Guest mmap at %px:\n", addr);
	printf("%s\n", addr);
	printf("\nDone\n");

	err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
	if (err < 0) {
		perror("mshare_unlink() failed");
		exit(1);
	}

This proposal evolved into a completely file- and mmap-based API -
<https://lore.kernel.org/lkml/cover.1656531090.git.khalid.aziz@oracle.com/>.
The new API looks like below:

1. Mount msharefs on /sys/fs/mshare -
	mount -t msharefs msharefs /sys/fs/mshare

2. mshare regions have alignment and size requirements. The start address for the region must be aligned to an address boundary and be a multiple of a fixed size. This alignment and size requirement can be obtained by reading the file /sys/fs/mshare/mshare_info, which returns a number in text format. mshare regions must be aligned to this boundary and be a multiple of this size.

3. For the process creating an mshare region:
	a. Create a file on /sys/fs/mshare, for example -
		fd = open("/sys/fs/mshare/shareme",
				O_RDWR|O_CREAT|O_EXCL, 0600);

	b. mmap this file to establish the starting address and size -
		mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, 0);

	c. Write to and read from the mshare'd region normally.

4. For processes attaching to the mshare'd region:
	a. Open the file on msharefs, for example -
		fd = open("/sys/fs/mshare/shareme", O_RDWR);

	b. Get information about the mshare'd region from the file:
		struct mshare_info {
			unsigned long start;
			unsigned long size;
		} m_info;

		read(fd, &m_info, sizeof(m_info));

	c. mmap the mshare'd region -
		mmap(m_info.start, m_info.size,
			PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

5. To delete the mshare region -
	unlink("/sys/fs/mshare/shareme");

Further discussions over the mailing lists and at LSF/MM resulted in eliminating msharefs and making this entirely mmap-based -
<https://lore.kernel.org/lkml/cover.1682453344.git.khalid.aziz@oracle.com/>.
With this change, if two processes map the same file with the same size, at the same PMD-aligned virtual address, and both specify the MAP_SHARED_PT flag, they start sharing PTEs for the file mapping. These changes eliminate support for arbitrary objects being mapped into the mshare'd region. The last implementation requires sharing a minimum of PMD-sized chunks across processes. These changes made the proposal distinct enough for me to use a new name - ptshare.

----------
What next?
----------

There were some more discussions on this proposal while I was on leave for a few months. There is enough interest in this feature to continue to refine it.
I will refine the code further, but before that I want to make sure we have a common understanding of what this feature should do.

As a result of many discussions, a new version that is distinct from the original proposal has evolved. Which one do we agree to continue forward with - (1) the current version, which restricts sharing to PMD-sized and PMD-aligned file mappings only, using just a new mmap flag (MAP_SHARED_PT), or (2) the original version, which creates an empty page-table-sharing mshare region using msharefs and mmap, into which arbitrary objects can be mapped later?

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Sharing page tables across processes (mshare) 2023-10-23 22:44 Sharing page tables across processes (mshare) Khalid Aziz @ 2023-10-30 2:45 ` Rongwei Wang 2023-10-31 23:01 ` Khalid Aziz 2023-11-01 14:02 ` David Hildenbrand 1 sibling, 1 reply; 7+ messages in thread From: Rongwei Wang @ 2023-10-30 2:45 UTC (permalink / raw) To: Khalid Aziz, Matthew Wilcox, David Hildenbrand, Mike Kravetz, Peter Xu, Mark Hemment Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org On 2023/10/24 06:44, Khalid Aziz wrote: > Threads of a process share address space and page tables that allows for > two key advantages: > > 1. Amount of memory required for PTEs to map physical pages stays low > even when large number of threads share the same pages since PTEs are > shared across threads. > > 2. Page protection attributes are shared across threads and a change > of attributes applies immediately to every thread without any overhead > of coordinating protection bit changes across threads. > > These advantages no longer apply when unrelated processes share pages. > Some applications can require 1000s of processes that all access the > same set of data on shared pages. For instance, a database server may > map in a large chunk of database into memory to provide fast access to > data to the clients using buffer cache. Server may launch new processes > to provide services to new clients connecting to the shared database. > Each new process will map in the shared database pages. When the PTEs > for mapping in shared pages are not shared across processes, each > process will consume some memory to store these PTEs. On x86_64, each > page requires a PTE that is only 8 bytes long which is very small > compared to the 4K page size. When 2000 processes map the same page in > their address space, each one of them requires 8 bytes for its PTE and > together that adds up to 8K of memory just to hold the PTEs for one 4K > page. On a database server with 300GB SGA, a system crash was seen with > out-of-memory condition when 1500+ clients tried to share this SGA even > though the system had 512GB of memory. On this server, in the worst case > scenario of all 1500 processes mapping every page from SGA would have > required 878GB+ for just the PTEs. If these PTEs could be shared, amount > of memory saved is very significant. > > When PTEs are not shared between processes, each process ends up with > its own set of protection bits for each shared page. Database servers > often need to change protection bits for pages as they manipulate and > update data in the database. When changing page protection for a shared > page, all PTEs across all processes that have mapped the shared page in > need to be updated to ensure data integrity. To accomplish this, the > process making the initial change to protection bits sends messages to > every process sharing that page. All processes then block any access to > that page, make the appropriate change to protection bits, and send a > confirmation back. To ensure data consistency, access to shared page > can be resumed when all processes have acknowledged the change. This is > a disruptive and expensive coordination process. If PTEs were shared > across processes, a change to page protection for a shared PTE becomes > applicable to all processes instantly with no coordination required to > ensure consistency. 
Changing protection bits across all processes > sharing database pages is a common enough operation on Oracle databases > that the cost is significant and cost goes up with the number of clients. > > This is a proposal to extend the same model of page table sharing for > threads across processes. This will allow processes to tap into the > same benefits that threads get from shared page tables, > > Sharing page tables across processes opens their address spaces to each > other and thus must be done carefully. This proposal suggests sharing > PTEs across processes that trust each other and have explicitly agreed > to share page tables. The proposal is to add a new flag to mmap() call - > MAP_SHARED_PT. This flag can be specified along with MAP_SHARED by a > process to hint to kernel that it wishes to share page table entries > for this file mapping mmap region with other processes. Any other process > that mmaps the same file with MAP_SHARED_PT flag can then share the same > page table entries. Besides specifying MAP_SHARED_PT flag, the processe > must map the files at a PMD aligned address with a size that is a > multiple of PMD size and at the same virtual addresses. NOTE: This > last requirement of same virtual addresses can possibly be relaxed if > that is the consensus. > > When mmap() is called with MAP_SHARED_PT flag, a new host mm struct > is created to hold the shared page tables. Host mm struct is not > attached to a process. Start and size of host mm are set to the > start and size of the mmap region and a VMA covering this range is > also added to host mm struct. Existing page table entries from the > process that creates the mapping are copied over to the host mm > struct. All processes mapping this shared region are considered > guest processes. When a guest process mmap's the shared region, a vm > flag VM_SHARED_PT is added to the VMAs in guest process. Upon a page > fault, VMA is checked for the presence of VM_SHARED_PT flag. If the > flag is found, its corresponding PMD is updated with the PMD from > host mm struct so the PMD will point to the page tables in host mm > struct. When a new PTE is created, it is created in the host mm struct > page tables and the PMD in guest mm points to the same PTEs. > > > -------------------------- > Evolution of this proposal > -------------------------- > > The original proposal - > <https://lore.kernel.org/lkml/cover.1642526745.git.khalid.aziz@oracle.com/>, > > was for an mshare() system call that a donor process calls to create > an empty mshare'd region. This shared region is pgdir aligned and > multiple of pgdir size. Each mshare'd region creates a corresponding > file under /sys/fs/mshare which can be read to get information on > the region. Once an empty region has been created, any objects can > be mapped into this region and page tables for those objects will be > shared. 
Snippet of the code that a donor process would run looks > like below: > > addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE, > MAP_SHARED | MAP_ANONYMOUS, 0, 0); > if (addr == MAP_FAILED) > perror("ERROR: mmap failed"); > > err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2), > GB(512), O_CREAT|O_RDWR|O_EXCL, 600); > if (err < 0) { > perror("mshare() syscall failed"); > exit(1); > } > > strncpy(addr, "Some random shared text", > sizeof("Some random shared text")); > > > Snippet of code that a consumer process would execute looks like: > > fd = open("testregion", O_RDONLY); > if (fd < 0) { > perror("open failed"); > exit(1); > } > > if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0)) > printf("INFO: %ld bytes shared at addr %lx \n", > mshare_info[1], mshare_info[0]); > else > perror("read failed"); > > close(fd); > > addr = (char *)mshare_info[0]; > err = syscall(MSHARE_SYSCALL, "testregion", (void > *)mshare_info[0], > mshare_info[1], O_RDWR, 600); > if (err < 0) { > perror("mshare() syscall failed"); > exit(1); > } > > printf("Guest mmap at %px:\n", addr); > printf("%s\n", addr); > printf("\nDone\n"); > > err = syscall(MSHARE_UNLINK_SYSCALL, "testregion"); > if (err < 0) { > perror("mshare_unlink() failed"); > exit(1); > } > > > This proposal evolved into completely file and mmap based API - > <https://lore.kernel.org/lkml/cover.1656531090.git.khalid.aziz@oracle.com/>. > > This new API looks like below: > > 1. Mount msharefs on /sys/fs/mshare - > mount -t msharefs msharefs /sys/fs/mshare > > 2. mshare regions have alignment and size requirements. Start > address for the region must be aligned to an address boundary and > be a multiple of fixed size. This alignment and size requirement > can be obtained by reading the file /sys/fs/mshare/mshare_info > which returns a number in text format. mshare regions must be > aligned to this boundary and be a multiple of this size. > > 3. For the process creating mshare region: > a. Create a file on /sys/fs/mshare, for example - > fd = open("/sys/fs/mshare/shareme", > O_RDWR|O_CREAT|O_EXCL, 0600); > > b. mmap this file to establish starting address and size - > mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE, > MAP_SHARED, fd, 0); > > c. Write and read to mshared region normally. > > 4. For processes attaching to mshare'd region: > a. Open the file on msharefs, for example - > fd = open("/sys/fs/mshare/shareme", O_RDWR); > > b. Get information about mshare'd region from the file: > struct mshare_info { > unsigned long start; > unsigned long size; > } m_info; > > read(fd, &m_info, sizeof(m_info)); > > c. mmap the mshare'd region - > mmap(m_info.start, m_info.size, > PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); > > 5. To delete the mshare region - > unlink("/sys/fs/mshare/shareme"); > > > > Further discussions over mailing lists and LSF/MM resulted in eliminating > msharefs and making this entirely mmap based - > <https://lore.kernel.org/lkml/cover.1682453344.git.khalid.aziz@oracle.com/>. > > With this change, if two processes map the same file with same > size, PMD aligned address, same virtual address and both specify > MAP_SHARED_PT flag, they start sharing PTEs for the file mapping. > These changes eliminate support for any arbitrary objects being > mapped in mshare'd region. The last implementation required sharing > minimum PMD sized chunks across processes. These changes were > significant enough to make this proposal distinct enough for me to > use a new name - ptshare. > > > ---------- > What next? 
> ----------
>
> There were some more discussions on this proposal while I was on
> leave for a few months. There is enough interest in this feature to
> continue to refine this. I will refine the code further but before
> that I want to make sure we have a common understanding of what this
> feature should do.
>
> As a result of many discussions, a new distinct version of
> original proposal has evolved. Which one do we agree to continue
> forward with - (1) current version which restricts sharing to PMD sized
> and aligned file mappings only, using just a new mmap flag
> (MAP_SHARED_PT), or (2) original version that creates an empty page
> table shared mshare region using msharefs and mmap for arbitrary
> objects to be mapped into later?

Hi, Khalid

I am unfamiliar with the original version, but I can provide some feedback on the issues we encountered while implementing the current version (mmap & MAP_SHARED_PT). We implemented our internal pgtable sharing support using the current method, but the code is a bit hacky in some places, e.g. (1) page fault: we need to switch to the original mm to flush the TLB or charge the memcg; (2) memory shrinking: handling pte entries like a normal pte mapping is a bit complicated; (3) munmap/madvise support.

If those hacks can be resolved, the current method already seems simple and usable enough (just my humble opinion).

Besides the issues above, our internal version does not care about memory migration, compaction, etc. I'm not sure which operations pgtable sharing needs to support. Maybe we can have a discussion about that first, and then decide which one? Here are the things we support in our pgtable sharing:

a. share pgtables only between parent and child processes;
b. support anonymous shared memory and id-known (SYSV) shared memory;
c. madvise(MADV_DONTNEED, MADV_DONTDUMP, MADV_DODUMP), with DONTNEED supporting 2M granularity;
d. reclaim pgtable-shared memory in the shrinker;

The above support is what our internal users actually requested. Plus, we simply skip memory migration, compaction, mprotect, mremap, etc. IMHO, supporting all memory behaviors like a normal pte mapping is unnecessary?
(Next, it seems I need to study your original version :-))

Thanks,
-wrw

>
> Thanks,
> Khalid

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Sharing page tables across processes (mshare) 2023-10-30 2:45 ` Rongwei Wang @ 2023-10-31 23:01 ` Khalid Aziz 2023-11-01 13:00 ` Rongwei Wang 0 siblings, 1 reply; 7+ messages in thread From: Khalid Aziz @ 2023-10-31 23:01 UTC (permalink / raw) To: Rongwei Wang Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, Matthew Wilcox, David Hildenbrand, Mike Kravetz, Peter Xu, Mark Hemment On 10/29/23 20:45, Rongwei Wang wrote: > > > On 2023/10/24 06:44, Khalid Aziz wrote: >> Threads of a process share address space and page tables that allows for >> two key advantages: >> >> 1. Amount of memory required for PTEs to map physical pages stays low >> even when large number of threads share the same pages since PTEs are >> shared across threads. >> >> 2. Page protection attributes are shared across threads and a change >> of attributes applies immediately to every thread without any overhead >> of coordinating protection bit changes across threads. >> >> These advantages no longer apply when unrelated processes share pages. >> Some applications can require 1000s of processes that all access the >> same set of data on shared pages. For instance, a database server may >> map in a large chunk of database into memory to provide fast access to >> data to the clients using buffer cache. Server may launch new processes >> to provide services to new clients connecting to the shared database. >> Each new process will map in the shared database pages. When the PTEs >> for mapping in shared pages are not shared across processes, each >> process will consume some memory to store these PTEs. On x86_64, each >> page requires a PTE that is only 8 bytes long which is very small >> compared to the 4K page size. When 2000 processes map the same page in >> their address space, each one of them requires 8 bytes for its PTE and >> together that adds up to 8K of memory just to hold the PTEs for one 4K >> page. On a database server with 300GB SGA, a system crash was seen with >> out-of-memory condition when 1500+ clients tried to share this SGA even >> though the system had 512GB of memory. On this server, in the worst case >> scenario of all 1500 processes mapping every page from SGA would have >> required 878GB+ for just the PTEs. If these PTEs could be shared, amount >> of memory saved is very significant. >> >> When PTEs are not shared between processes, each process ends up with >> its own set of protection bits for each shared page. Database servers >> often need to change protection bits for pages as they manipulate and >> update data in the database. When changing page protection for a shared >> page, all PTEs across all processes that have mapped the shared page in >> need to be updated to ensure data integrity. To accomplish this, the >> process making the initial change to protection bits sends messages to >> every process sharing that page. All processes then block any access to >> that page, make the appropriate change to protection bits, and send a >> confirmation back. To ensure data consistency, access to shared page >> can be resumed when all processes have acknowledged the change. This is >> a disruptive and expensive coordination process. If PTEs were shared >> across processes, a change to page protection for a shared PTE becomes >> applicable to all processes instantly with no coordination required to >> ensure consistency. 
Changing protection bits across all processes >> sharing database pages is a common enough operation on Oracle databases >> that the cost is significant and cost goes up with the number of clients. >> >> This is a proposal to extend the same model of page table sharing for >> threads across processes. This will allow processes to tap into the >> same benefits that threads get from shared page tables, >> >> Sharing page tables across processes opens their address spaces to each >> other and thus must be done carefully. This proposal suggests sharing >> PTEs across processes that trust each other and have explicitly agreed >> to share page tables. The proposal is to add a new flag to mmap() call - >> MAP_SHARED_PT. This flag can be specified along with MAP_SHARED by a >> process to hint to kernel that it wishes to share page table entries >> for this file mapping mmap region with other processes. Any other process >> that mmaps the same file with MAP_SHARED_PT flag can then share the same >> page table entries. Besides specifying MAP_SHARED_PT flag, the processe >> must map the files at a PMD aligned address with a size that is a >> multiple of PMD size and at the same virtual addresses. NOTE: This >> last requirement of same virtual addresses can possibly be relaxed if >> that is the consensus. >> >> When mmap() is called with MAP_SHARED_PT flag, a new host mm struct >> is created to hold the shared page tables. Host mm struct is not >> attached to a process. Start and size of host mm are set to the >> start and size of the mmap region and a VMA covering this range is >> also added to host mm struct. Existing page table entries from the >> process that creates the mapping are copied over to the host mm >> struct. All processes mapping this shared region are considered >> guest processes. When a guest process mmap's the shared region, a vm >> flag VM_SHARED_PT is added to the VMAs in guest process. Upon a page >> fault, VMA is checked for the presence of VM_SHARED_PT flag. If the >> flag is found, its corresponding PMD is updated with the PMD from >> host mm struct so the PMD will point to the page tables in host mm >> struct. When a new PTE is created, it is created in the host mm struct >> page tables and the PMD in guest mm points to the same PTEs. >> >> >> -------------------------- >> Evolution of this proposal >> -------------------------- >> >> The original proposal - >> <https://lore.kernel.org/lkml/cover.1642526745.git.khalid.aziz@oracle.com/>, >> was for an mshare() system call that a donor process calls to create >> an empty mshare'd region. This shared region is pgdir aligned and >> multiple of pgdir size. Each mshare'd region creates a corresponding >> file under /sys/fs/mshare which can be read to get information on >> the region. Once an empty region has been created, any objects can >> be mapped into this region and page tables for those objects will be >> shared. 
Snippet of the code that a donor process would run looks >> like below: >> >> addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE, >> MAP_SHARED | MAP_ANONYMOUS, 0, 0); >> if (addr == MAP_FAILED) >> perror("ERROR: mmap failed"); >> >> err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2), >> GB(512), O_CREAT|O_RDWR|O_EXCL, 600); >> if (err < 0) { >> perror("mshare() syscall failed"); >> exit(1); >> } >> >> strncpy(addr, "Some random shared text", >> sizeof("Some random shared text")); >> >> >> Snippet of code that a consumer process would execute looks like: >> >> fd = open("testregion", O_RDONLY); >> if (fd < 0) { >> perror("open failed"); >> exit(1); >> } >> >> if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0)) >> printf("INFO: %ld bytes shared at addr %lx \n", >> mshare_info[1], mshare_info[0]); >> else >> perror("read failed"); >> >> close(fd); >> >> addr = (char *)mshare_info[0]; >> err = syscall(MSHARE_SYSCALL, "testregion", (void *)mshare_info[0], >> mshare_info[1], O_RDWR, 600); >> if (err < 0) { >> perror("mshare() syscall failed"); >> exit(1); >> } >> >> printf("Guest mmap at %px:\n", addr); >> printf("%s\n", addr); >> printf("\nDone\n"); >> >> err = syscall(MSHARE_UNLINK_SYSCALL, "testregion"); >> if (err < 0) { >> perror("mshare_unlink() failed"); >> exit(1); >> } >> >> >> This proposal evolved into completely file and mmap based API - >> <https://lore.kernel.org/lkml/cover.1656531090.git.khalid.aziz@oracle.com/>. >> This new API looks like below: >> >> 1. Mount msharefs on /sys/fs/mshare - >> mount -t msharefs msharefs /sys/fs/mshare >> >> 2. mshare regions have alignment and size requirements. Start >> address for the region must be aligned to an address boundary and >> be a multiple of fixed size. This alignment and size requirement >> can be obtained by reading the file /sys/fs/mshare/mshare_info >> which returns a number in text format. mshare regions must be >> aligned to this boundary and be a multiple of this size. >> >> 3. For the process creating mshare region: >> a. Create a file on /sys/fs/mshare, for example - >> fd = open("/sys/fs/mshare/shareme", >> O_RDWR|O_CREAT|O_EXCL, 0600); >> >> b. mmap this file to establish starting address and size - >> mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE, >> MAP_SHARED, fd, 0); >> >> c. Write and read to mshared region normally. >> >> 4. For processes attaching to mshare'd region: >> a. Open the file on msharefs, for example - >> fd = open("/sys/fs/mshare/shareme", O_RDWR); >> >> b. Get information about mshare'd region from the file: >> struct mshare_info { >> unsigned long start; >> unsigned long size; >> } m_info; >> >> read(fd, &m_info, sizeof(m_info)); >> >> c. mmap the mshare'd region - >> mmap(m_info.start, m_info.size, >> PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); >> >> 5. To delete the mshare region - >> unlink("/sys/fs/mshare/shareme"); >> >> >> >> Further discussions over mailing lists and LSF/MM resulted in eliminating >> msharefs and making this entirely mmap based - >> <https://lore.kernel.org/lkml/cover.1682453344.git.khalid.aziz@oracle.com/>. >> With this change, if two processes map the same file with same >> size, PMD aligned address, same virtual address and both specify >> MAP_SHARED_PT flag, they start sharing PTEs for the file mapping. >> These changes eliminate support for any arbitrary objects being >> mapped in mshare'd region. The last implementation required sharing >> minimum PMD sized chunks across processes. 
These changes were >> significant enough to make this proposal distinct enough for me to >> use a new name - ptshare. >> >> >> ---------- >> What next? >> ---------- >> >> There were some more discussions on this proposal while I was on >> leave for a few months. There is enough interest in this feature to >> continue to refine this. I will refine the code further but before >> that I want to make sure we have a common understanding of what this >> feature should do. >> >> As a result of many discussions, a new distinct version of >> original proposal has evolved. Which one do we agree to continue >> forward with - (1) current version which restricts sharing to PMD sized >> and aligned file mappings only, using just a new mmap flag >> (MAP_SHARED_PT), or (2) original version that creates an empty page >> table shared mshare region using msharefs and mmap for arbitrary >> objects to be mapped into later? > Hi, Khalid > > I am unfamiliar to original version, but I can provide some feedback on the issues encountered > during the implementation of current version (mmap & MAP_SHARED_PT). > We realize our internal pgtable sharing version in the current method, but the codes > are a bit hack in some places, e.g. (1) page fault, need to switch original mm to flush TLB or > charge memcg; (2) shrink memory, a bit complicated to to handle pte entries like normal pte mapping; > (3) munmap/madvise support; > > If these hack codes can be resolved, the current method seems already simple and usable enough (just my humble opinion). Thanks for taking the time to review. Yes, the code could use some improvement and I expect to do that as I get feedback. Can I ask you what you mean by "internal pgtable sharing version"? Are you using the patch I had sent out or a modified version of it on internal test machines? Thanks, Khalid > > > And besides above issues, we (our internal version) do not care memory migration, compaction, etc,. I'm not sure what > functions pgtable sharing needs to support. Maybe we can have a discussion about that firstly, then decide > which one? Here are the things we support in pgtable sharing: > > a. share pgtables only between parent and child processes; > b. support anonymous shared memory and id-known (SYSV shared memory); > c. madvise(MADV_DONTNEED, MADV_DONTDUMP, MADV_DODUMP), DONTNEED supports 2M granularity; > d. reclaim pgtable sharing memory in shrinker; > > The above support is actually requested by our internal user. Plus, we skip memory migration, compaction, mprotect, > mremap etc, directly. > IMHO, support all memory behavior likes normal pte mapping is unnecessary? > (Next, It seems I need to study your original version :-)) > > Thanks, > -wrw >> >> Thanks, >> Khalid > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Sharing page tables across processes (mshare) 2023-10-31 23:01 ` Khalid Aziz @ 2023-11-01 13:00 ` Rongwei Wang 0 siblings, 0 replies; 7+ messages in thread From: Rongwei Wang @ 2023-11-01 13:00 UTC (permalink / raw) To: Khalid Aziz Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org, Matthew Wilcox, David Hildenbrand, Mike Kravetz, Peter Xu, Mark Hemment On 2023/11/1 07:01, Khalid Aziz wrote: > On 10/29/23 20:45, Rongwei Wang wrote: >> >> >> On 2023/10/24 06:44, Khalid Aziz wrote: >>> Threads of a process share address space and page tables that allows >>> for >>> two key advantages: >>> >>> 1. Amount of memory required for PTEs to map physical pages stays low >>> even when large number of threads share the same pages since PTEs are >>> shared across threads. >>> >>> 2. Page protection attributes are shared across threads and a change >>> of attributes applies immediately to every thread without any overhead >>> of coordinating protection bit changes across threads. >>> >>> These advantages no longer apply when unrelated processes share pages. >>> Some applications can require 1000s of processes that all access the >>> same set of data on shared pages. For instance, a database server may >>> map in a large chunk of database into memory to provide fast access to >>> data to the clients using buffer cache. Server may launch new processes >>> to provide services to new clients connecting to the shared database. >>> Each new process will map in the shared database pages. When the PTEs >>> for mapping in shared pages are not shared across processes, each >>> process will consume some memory to store these PTEs. On x86_64, each >>> page requires a PTE that is only 8 bytes long which is very small >>> compared to the 4K page size. When 2000 processes map the same page in >>> their address space, each one of them requires 8 bytes for its PTE and >>> together that adds up to 8K of memory just to hold the PTEs for one 4K >>> page. On a database server with 300GB SGA, a system crash was seen with >>> out-of-memory condition when 1500+ clients tried to share this SGA even >>> though the system had 512GB of memory. On this server, in the worst >>> case >>> scenario of all 1500 processes mapping every page from SGA would have >>> required 878GB+ for just the PTEs. If these PTEs could be shared, >>> amount >>> of memory saved is very significant. >>> >>> When PTEs are not shared between processes, each process ends up with >>> its own set of protection bits for each shared page. Database servers >>> often need to change protection bits for pages as they manipulate and >>> update data in the database. When changing page protection for a shared >>> page, all PTEs across all processes that have mapped the shared page in >>> need to be updated to ensure data integrity. To accomplish this, the >>> process making the initial change to protection bits sends messages to >>> every process sharing that page. All processes then block any access to >>> that page, make the appropriate change to protection bits, and send a >>> confirmation back. To ensure data consistency, access to shared page >>> can be resumed when all processes have acknowledged the change. This is >>> a disruptive and expensive coordination process. If PTEs were shared >>> across processes, a change to page protection for a shared PTE becomes >>> applicable to all processes instantly with no coordination required to >>> ensure consistency. 
Changing protection bits across all processes >>> sharing database pages is a common enough operation on Oracle databases >>> that the cost is significant and cost goes up with the number of >>> clients. >>> >>> This is a proposal to extend the same model of page table sharing for >>> threads across processes. This will allow processes to tap into the >>> same benefits that threads get from shared page tables, >>> >>> Sharing page tables across processes opens their address spaces to each >>> other and thus must be done carefully. This proposal suggests sharing >>> PTEs across processes that trust each other and have explicitly agreed >>> to share page tables. The proposal is to add a new flag to mmap() >>> call - >>> MAP_SHARED_PT. This flag can be specified along with MAP_SHARED by a >>> process to hint to kernel that it wishes to share page table entries >>> for this file mapping mmap region with other processes. Any other >>> process >>> that mmaps the same file with MAP_SHARED_PT flag can then share the >>> same >>> page table entries. Besides specifying MAP_SHARED_PT flag, the processe >>> must map the files at a PMD aligned address with a size that is a >>> multiple of PMD size and at the same virtual addresses. NOTE: This >>> last requirement of same virtual addresses can possibly be relaxed if >>> that is the consensus. >>> >>> When mmap() is called with MAP_SHARED_PT flag, a new host mm struct >>> is created to hold the shared page tables. Host mm struct is not >>> attached to a process. Start and size of host mm are set to the >>> start and size of the mmap region and a VMA covering this range is >>> also added to host mm struct. Existing page table entries from the >>> process that creates the mapping are copied over to the host mm >>> struct. All processes mapping this shared region are considered >>> guest processes. When a guest process mmap's the shared region, a vm >>> flag VM_SHARED_PT is added to the VMAs in guest process. Upon a page >>> fault, VMA is checked for the presence of VM_SHARED_PT flag. If the >>> flag is found, its corresponding PMD is updated with the PMD from >>> host mm struct so the PMD will point to the page tables in host mm >>> struct. When a new PTE is created, it is created in the host mm struct >>> page tables and the PMD in guest mm points to the same PTEs. >>> >>> >>> -------------------------- >>> Evolution of this proposal >>> -------------------------- >>> >>> The original proposal - >>> <https://lore.kernel.org/lkml/cover.1642526745.git.khalid.aziz@oracle.com/>, >>> >>> was for an mshare() system call that a donor process calls to create >>> an empty mshare'd region. This shared region is pgdir aligned and >>> multiple of pgdir size. Each mshare'd region creates a corresponding >>> file under /sys/fs/mshare which can be read to get information on >>> the region. Once an empty region has been created, any objects can >>> be mapped into this region and page tables for those objects will be >>> shared. 
Snippet of the code that a donor process would run looks >>> like below: >>> >>> addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE, >>> MAP_SHARED | MAP_ANONYMOUS, 0, 0); >>> if (addr == MAP_FAILED) >>> perror("ERROR: mmap failed"); >>> >>> err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2), >>> GB(512), O_CREAT|O_RDWR|O_EXCL, 600); >>> if (err < 0) { >>> perror("mshare() syscall failed"); >>> exit(1); >>> } >>> >>> strncpy(addr, "Some random shared text", >>> sizeof("Some random shared text")); >>> >>> >>> Snippet of code that a consumer process would execute looks like: >>> >>> fd = open("testregion", O_RDONLY); >>> if (fd < 0) { >>> perror("open failed"); >>> exit(1); >>> } >>> >>> if ((count = read(fd, &mshare_info, sizeof(mshare_info)) > 0)) >>> printf("INFO: %ld bytes shared at addr %lx \n", >>> mshare_info[1], mshare_info[0]); >>> else >>> perror("read failed"); >>> >>> close(fd); >>> >>> addr = (char *)mshare_info[0]; >>> err = syscall(MSHARE_SYSCALL, "testregion", (void >>> *)mshare_info[0], >>> mshare_info[1], O_RDWR, 600); >>> if (err < 0) { >>> perror("mshare() syscall failed"); >>> exit(1); >>> } >>> >>> printf("Guest mmap at %px:\n", addr); >>> printf("%s\n", addr); >>> printf("\nDone\n"); >>> >>> err = syscall(MSHARE_UNLINK_SYSCALL, "testregion"); >>> if (err < 0) { >>> perror("mshare_unlink() failed"); >>> exit(1); >>> } >>> >>> >>> This proposal evolved into completely file and mmap based API - >>> <https://lore.kernel.org/lkml/cover.1656531090.git.khalid.aziz@oracle.com/>. >>> >>> This new API looks like below: >>> >>> 1. Mount msharefs on /sys/fs/mshare - >>> mount -t msharefs msharefs /sys/fs/mshare >>> >>> 2. mshare regions have alignment and size requirements. Start >>> address for the region must be aligned to an address boundary and >>> be a multiple of fixed size. This alignment and size requirement >>> can be obtained by reading the file /sys/fs/mshare/mshare_info >>> which returns a number in text format. mshare regions must be >>> aligned to this boundary and be a multiple of this size. >>> >>> 3. For the process creating mshare region: >>> a. Create a file on /sys/fs/mshare, for example - >>> fd = open("/sys/fs/mshare/shareme", >>> O_RDWR|O_CREAT|O_EXCL, 0600); >>> >>> b. mmap this file to establish starting address and size - >>> mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE, >>> MAP_SHARED, fd, 0); >>> >>> c. Write and read to mshared region normally. >>> >>> 4. For processes attaching to mshare'd region: >>> a. Open the file on msharefs, for example - >>> fd = open("/sys/fs/mshare/shareme", O_RDWR); >>> >>> b. Get information about mshare'd region from the file: >>> struct mshare_info { >>> unsigned long start; >>> unsigned long size; >>> } m_info; >>> >>> read(fd, &m_info, sizeof(m_info)); >>> >>> c. mmap the mshare'd region - >>> mmap(m_info.start, m_info.size, >>> PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); >>> >>> 5. To delete the mshare region - >>> unlink("/sys/fs/mshare/shareme"); >>> >>> >>> >>> Further discussions over mailing lists and LSF/MM resulted in >>> eliminating >>> msharefs and making this entirely mmap based - >>> <https://lore.kernel.org/lkml/cover.1682453344.git.khalid.aziz@oracle.com/>. >>> >>> With this change, if two processes map the same file with same >>> size, PMD aligned address, same virtual address and both specify >>> MAP_SHARED_PT flag, they start sharing PTEs for the file mapping. >>> These changes eliminate support for any arbitrary objects being >>> mapped in mshare'd region. 
The last implementation required sharing >>> minimum PMD sized chunks across processes. These changes were >>> significant enough to make this proposal distinct enough for me to >>> use a new name - ptshare. >>> >>> >>> ---------- >>> What next? >>> ---------- >>> >>> There were some more discussions on this proposal while I was on >>> leave for a few months. There is enough interest in this feature to >>> continue to refine this. I will refine the code further but before >>> that I want to make sure we have a common understanding of what this >>> feature should do. >>> >>> As a result of many discussions, a new distinct version of >>> original proposal has evolved. Which one do we agree to continue >>> forward with - (1) current version which restricts sharing to PMD sized >>> and aligned file mappings only, using just a new mmap flag >>> (MAP_SHARED_PT), or (2) original version that creates an empty page >>> table shared mshare region using msharefs and mmap for arbitrary >>> objects to be mapped into later? >> Hi, Khalid >> >> I am unfamiliar to original version, but I can provide some feedback >> on the issues encountered >> during the implementation of current version (mmap & MAP_SHARED_PT). >> We realize our internal pgtable sharing version in the current >> method, but the codes >> are a bit hack in some places, e.g. (1) page fault, need to switch >> original mm to flush TLB or >> charge memcg; (2) shrink memory, a bit complicated to to handle pte >> entries like normal pte mapping; >> (3) munmap/madvise support; >> >> If these hack codes can be resolved, the current method seems already >> simple and usable enough (just my humble opinion). > Thanks for taking the time to review. Yes, the code could use some > improvement and I expect to do that as I get feedback. Can I ask you > what you mean by "internal pgtable sharing version"? Are you using the > patch I had sent out or a modified version of it on internal test > machines? Yes, a modified version with functions mentioned in the previous mail based on your mmap(MAP_SHARED_PT) patchset. That realized in kernel-5.10. And if everyone thinks it's helpful for this discussion, I can send it out next. > > Thanks, > Khalid > >> >> >> And besides above issues, we (our internal version) do not care >> memory migration, compaction, etc,. I'm not sure what >> functions pgtable sharing needs to support. Maybe we can have a >> discussion about that firstly, then decide >> which one? Here are the things we support in pgtable sharing: >> >> a. share pgtables only between parent and child processes; > b. >> support anonymous shared memory and id-known (SYSV shared memory); >> c. madvise(MADV_DONTNEED, MADV_DONTDUMP, MADV_DODUMP), DONTNEED >> supports 2M granularity; >> d. reclaim pgtable sharing memory in shrinker; >> >> The above support is actually requested by our internal user. Plus, >> we skip memory migration, compaction, mprotect, mremap etc, directly. >> IMHO, support all memory behavior likes normal pte mapping is >> unnecessary? >> (Next, It seems I need to study your original version :-)) >> >> Thanks, >> -wrw >>> >>> Thanks, >>> Khalid >> ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Sharing page tables across processes (mshare)
  2023-10-23 22:44 Sharing page tables across processes (mshare) Khalid Aziz
  2023-10-30  2:45 ` Rongwei Wang
@ 2023-11-01 14:02 ` David Hildenbrand
  2023-11-01 22:40   ` Khalid Aziz
  1 sibling, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2023-11-01 14:02 UTC (permalink / raw)
  To: Khalid Aziz, Matthew Wilcox, Mike Kravetz, Peter Xu, rongwei.wang, Mark Hemment
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org

> ----------
> What next?
> ----------
>
> There were some more discussions on this proposal while I was on
> leave for a few months. There is enough interest in this feature to
> continue to refine this. I will refine the code further but before
> that I want to make sure we have a common understanding of what this
> feature should do.

Did you follow up on the alternatives discussed in a bi-weekly mm session on this topic, or is there some other reason you are leaving that out?

To be precise, I raised that both problems should likely be decoupled (sharing of page tables as an optimization, NOT using mprotect to catch write access to pagecache pages). And that page table sharing better remains an implementation detail.

Sharing of page tables (as learned by hugetlb) can easily be beneficial to other use cases -- for example, multi-process VMs; no need to bring in mshare. There was the concern that it might not always be reasonable to share page tables, so one could just have some kind of hint (madvise? mmap flag?) that it might be reasonable to try sharing page tables. But it would be a pure internal optimization. Just like it is for hugetlb, we would unshare as soon as someone does an mprotect() etc. Initially, you could simply ignore any such hint for filesystems that don't support it. Starting with shmem sounds reasonable.

Write access to pagecache pages (or also read access?) would ideally be handled on the pagecache level, so you could catch any write (page tables, write(), ... and eventually later read access if required) and either notify someone (uffd-style, just on an fd) or send a signal to the faulting process. That would be a new feature, of course. But we do have writenotify infrastructure in place to catch write access to pagecache pages already, whereby we inform the FS that someone wants to write to a PTE-read-only pagecache page.

Once you combine both features, you can easily update only a single shared page table when updating the page protection as triggered by the FS/yet-to-be-named feature and have all processes sharing these page tables see the change in one go.

>
> As a result of many discussions, a new distinct version of
> original proposal has evolved. Which one do we agree to continue
> forward with - (1) current version which restricts sharing to PMD sized
> and aligned file mappings only, using just a new mmap flag
> (MAP_SHARED_PT), or (2) original version that creates an empty page
> table shared mshare region using msharefs and mmap for arbitrary
> objects to be mapped into later?

So far my opinion on this is unchanged: turning an implementation detail (sharing of page tables) into a feature to bypass per-process VMA permissions sounds absolutely bad to me.

The original concept of mshare certainly sounds interesting, but as discussed a couple of times (LSF/mm), it similarly sounds "dangerous" the way it was originally proposed.
Having some kind of container that multiple processes can mmap (an fd?), where *selected* mmap()/mprotect() calls get rerouted to the container, could be interesting; but it might be reasonable to then have separate operations to work on such an fd (ioctl), and *not* use mmap()/mprotect() for that. And one might only want to allow mmap'ing that fd with a superset of all permissions used inside the container (and only MAP_SHARED), and strictly filter what we allow to be mapped into such a container. Page table sharing would likely be an implementation detail.

Just some random thoughts (some of which I previously raised). Probably makes sense to discuss that in a bi-weekly mm meeting (again, this time with you as well).

-- 
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Sharing page tables across processes (mshare)
  2023-11-01 14:02 ` David Hildenbrand
@ 2023-11-01 22:40   ` Khalid Aziz
  2023-11-02 20:25     ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: Khalid Aziz @ 2023-11-01 22:40 UTC (permalink / raw)
  To: David Hildenbrand, Matthew Wilcox, Mike Kravetz, Peter Xu, rongwei.wang, Mark Hemment
  Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org

On 11/1/23 08:02, David Hildenbrand wrote:
>
>> ----------
>> What next?
>> ----------
>>
>> There were some more discussions on this proposal while I was on
>> leave for a few months. There is enough interest in this feature to
>> continue to refine this. I will refine the code further but before
>> that I want to make sure we have a common understanding of what this
>> feature should do.
>
> Did you follow-up on the alternatives discussed in a bi-weekly mm session on this topic or is there some other reason
> you are leaving that out?

I did a poor job of addressing it :) What we are trying to implement here is to allow disjoint processes to share page tables AND page protection across all processes. It is not the intent to simply catch a process trying to write to a protected page; a mechanism already exists for that. The intent is that when page protection is changed by one process, the change applies instantly to all processes. Using mprotect to catch a change in page protection perpetuates the same problem the database is experiencing. The database wants to be able to change read/write permissions on terabytes of data for all clients very quickly and simultaneously. Today it requires coordination across 1000s of processes to accomplish that. It is slow and impacts database performance significantly. Having each process handle a fault/signal whenever page protection is changed impacts every process. By sharing the same PTEs across all processes, any page protection change applies instantly to all processes (there is the TLB shootdown issue, but as discussed in the meeting, it can be handled). The mshare proposal implements the instant page protection change while bringing in the benefits of shared page tables at the same time. So the two requirements of this feature are not separable. It is a requirement of the feature to bypass per-process VMA permissions. Processes that require per-process VMA permissions would not use mshare, hence the requirement for a process to opt into mshare. Matthew, your thoughts?

Hopefully I understood your suggestion to separate the two requirements correctly. We can discuss it at another alignment meeting if that helps.
>
> To be precise, I raised that both problems should likely be decoupled (sharing of page tables as an optimization, NOT
> using mprotect to catch write access to pagecache pages). And that page table sharing better remains an implementation
> detail.
>
> Sharing of page tables (as learned by hugetlb) can easily be beneficial to other use cases -- for example, multi-process
> VMs; no need to bring in mshare. There was the concern that it might not always be reasonable to share page tables, so
> one could just have some kind of hint (madvise? mmap flag?) that it might be reasonable to try sharing page tables. But
> it would be a pure internal optimization. Just like it is for hugetlb we would unshare as soon as someone does an
> mprotect() etc. Initially, you could simply ignore any such hint for filesystems that don't support it. Starting with
> shmem sounds reasonable.
>
> Write access to pagecache pages (or also read-access?) would ideally be handled on the pagecache level, so you could
> catch any write (page tables, write(), ... and eventually later read access if required) and either notify someone
> (uffd-style, just on a fd) or send a signal to the faulting process. That would be a new feature, of course. But we do
> have writenotify infrastructure in place to catch write access to pagecache pages already, whereby we inform the FS that
> someone wants to write to a PTE-read-only pagecache page.
>
> Once you combine both features, you can easily update only a single shared page table when updating the page protection
> as triggered by the FS/yet-to-be-named-feature and have all processes sharing these page tables see the change in one go.
>
>> As a result of many discussions, a new distinct version of
>> original proposal has evolved. Which one do we agree to continue
>> forward with - (1) current version which restricts sharing to PMD sized
>> and aligned file mappings only, using just a new mmap flag
>> (MAP_SHARED_PT), or (2) original version that creates an empty page
>> table shared mshare region using msharefs and mmap for arbitrary
>> objects to be mapped into later?

At the meeting, Matthew expressed a desire to support mapping arbitrary objects into the mshare'd region, which makes this feature much more versatile. That was the intent of the original proposal, which was not fd and MAP_SHARED_PT based. That is (2) above. The current version was largely based upon your suggestion at LSF/MM to restrict this feature to file mappings only.

>
> So far my opinion on this is unchanged: turning an implementation detail (sharing of page tables) into a feature to
> bypass per-process VMA permissions sounds absolutely bad to me.

I agree that if a feature silently bypasses per-process VMA permissions, that is a terrible idea. This is why we have an explicit opt-in requirement, and the intent is to bypass per-VMA permissions by sharing PTEs, as opposed to shared PTEs bringing in bypassed per-VMA permissions as a side effect.

>
> The original concept of mshare certainly sounds interesting, but as discussed a couple of times (LSF/mm), it similarly
> sounds "dangerous" the way it was originally proposed.
>
> Having some kind of container that multiple process can mmap (fd?), and *selected* mmap()/mprotect()/ get rerouted to
> the container could be interesting; but it might be reasonable to then have separate operations to work on such an fd
> (ioctl), and *not* using mmap()/mprotect() for that. And one might only want to allow to mmap that fd with a superset of
> all permissions used inside the container (and only MAP_SHARED), and strictly filter what we allow to map into such a
> container. page table sharing would likely be an implementation detail.
>
> Just some random thoughts (some of which I previously raised). Probably makes sense to discuss that in a bi-weekly mm
> meeting (again, this time with you as well).

I appreciate your thoughts and your help in moving this discussion forward.

Thanks,
Khalid

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: Sharing page tables across processes (mshare) 2023-11-01 22:40 ` Khalid Aziz @ 2023-11-02 20:25 ` David Hildenbrand 0 siblings, 0 replies; 7+ messages in thread From: David Hildenbrand @ 2023-11-02 20:25 UTC (permalink / raw) To: Khalid Aziz, Matthew Wilcox, Mike Kravetz, Peter Xu, rongwei.wang, Mark Hemment Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-arch@vger.kernel.org

On 01.11.23 23:40, Khalid Aziz wrote:

> On 11/1/23 08:02, David Hildenbrand wrote:
>>
>>> ----------
>>> What next?
>>> ----------
>>>
>>> There were some more discussions on this proposal while I was on leave for a few months. There is enough interest in this feature to continue to refine this. I will refine the code further but before that I want to make sure we have a common understanding of what this feature should do.
>>
>> Did you follow up on the alternatives discussed in a bi-weekly mm session on this topic, or is there some other reason you are leaving that out?

Hi Khalid,

> I did a poor job of addressing it :) What we are trying to implement here is to allow disjoint processes to share page tables AND page protection across all processes. It is not the intent to simply catch a process trying to write to a protected page; a mechanism already exists for that. The intent is that when page protection is changed for one process, it applies instantly to all processes. Using mprotect() to catch a change in page protection perpetuates the same problem the database is experiencing. The database wants to be able to change read/write permissions on terabytes of data for all clients very quickly and simultaneously. Today it requires coordination across thousands of processes to accomplish that.

Right, because you have to issue an mprotect() in each and every process context ...

> It is slow and impacts database performance significantly. For each process to have to handle a fault/signal whenever page protection is changed impacts every process.

... and everyone has to get the fault and mprotect() again, which is one of the reasons why I said that mprotect() is simply the wrong tool to use here.

You want to protect a pagecache page from write access, catch write access and handle it, and then allow write access again without a successive fault->signal. Something similar is being done by filesystems already with the writenotify infrastructure, I believe. You just don't get a signal on write access, because it's all handled internally in the FS.

Literally anything is better than using mprotect() here (uffd-wp would also be a faster alternative, but it similarly suffers from the multi-process setup; back when uffd-wp was introduced for shmem, I already raised that an alternative for multi-process use would be to write-protect on the pagecache level instead of on individual VMAs. But Peter's position was that uffd-wp as is might also be helpful for some use cases that are single-process and we simply want to support shmem as well).

> By sharing the same PTEs across all processes, any page protection change applies instantly to all processes (there is the TLB shootdown issue, but as discussed in the meeting, it can be handled). The mshare proposal implements the instant page protection change while bringing in the benefits of shared page tables at the same time, so the two requirements of this feature are not separable.

Right, and I think we should talk about the problem we are trying to solve and not a solution to the problem. Because the current solution really requires sharing of page tables, which I absolutely don't like. It absolutely makes no sense to bring in mprotect and VMAs when wanting to catch all write accesses to a pagecache page. And if we decide to do so anyway, we have to come up with ways of making page table sharing a user-visible feature with weird VMA semantics.

> It is a requirement of the feature to bypass per-process VMA permissions. Processes that require per-process VMA permissions would not use mshare, hence the requirement for a process to opt into mshare. Matthew, your thoughts?
>
> Hopefully I understood your suggestion to separate the two requirements correctly. We can discuss it at another alignment meeting if that helps.

Yes, I believe there are cleaner alternatives that (a) don't use mprotect() and (b) don't imply page table sharing (although it's a reasonable thing to use internally as an implementation detail to speed things up further). Whether it's some API to write-protect on the pagecache level plus page table sharing as an optimization, or some modified form of mshare (below), I can't tell.

>> To be precise, I raised that both problems should likely be decoupled (sharing of page tables as an optimization, NOT using mprotect to catch write access to pagecache pages), and that page table sharing better remains an implementation detail.
>>
>> Sharing of page tables (as learned from hugetlb) can easily be beneficial to other use cases -- for example, multi-process VMs; no need to bring in mshare. There was the concern that it might not always be reasonable to share page tables, so one could just have some kind of hint (madvise? mmap flag?) that it might be reasonable to try sharing page tables. But it would be a pure internal optimization. Just like it is for hugetlb, we would unshare as soon as someone does an mprotect() etc. Initially, you could simply ignore any such hint for filesystems that don't support it. Starting with shmem sounds reasonable.
>>
>> Write access to pagecache pages (or also read access?) would ideally be handled on the pagecache level, so you could catch any write (page tables, write(), ... and eventually later read access if required) and either notify someone (uffd-style, just on an fd) or send a signal to the faulting process. That would be a new feature, of course. But we do have writenotify infrastructure in place to catch write access to pagecache pages already, whereby we inform the FS that someone wants to write to a PTE-read-only pagecache page.
>>
>> Once you combine both features, you can easily update only a single shared page table when updating the page protection as triggered by the FS/yet-to-be-named feature and have all processes sharing these page tables see the change in one go.
>>
>>> As a result of many discussions, a new distinct version of the original proposal has evolved. Which one do we agree to continue forward with - (1) the current version, which restricts sharing to PMD-sized and aligned file mappings only, using just a new mmap flag (MAP_SHARED_PT), or (2) the original version that creates an empty page table shared mshare region using msharefs and mmap, for arbitrary objects to be mapped into later?
>
> At the meeting Matthew expressed a desire to support mapping any objects into the mshare'd region, which makes this feature much more versatile. That was the intent of the original proposal, which was not fd and MAP_SHARED_PT based. That is (2) above. The current version was largely based upon your suggestion at LSF/MM to restrict this feature to file mappings only.

It's been a while, but I remember that the feedback in the room was primarily that: (a) the original mshare approach/implementation had a very dangerous smell to it; rerouting mmap/mprotect/... is just absolutely nasty, and (b) that pure page table sharing might itself be a reasonable optimization worth having.

I still think generic page table sharing (as a pure optimization) can be something reasonable to have, and can help existing use cases without the need to modify any software (well, except maybe give a hint that it might be reasonable).

As said, I see value in some fd-thingy that can be mmaped, but is internally assembled from other fds (using protect ioctls, not mmap) with sub-protection (using protect ioctls, not mprotect). The ioctls would be minimal and clearly specified. Most madvise()/uffd/... would simply fail when seeing a VMA that mmaps such an fd thingy. No rerouting of mmap, munmap, mprotect, ... Under the hood, one can use an MM to manage all that and share page tables. But it would be an implementation detail.

>> So far my opinion on this is unchanged: turning an implementation detail (sharing of page tables) into a feature to bypass per-process VMA permissions sounds absolutely bad to me.
>
> I agree that if a feature silently bypasses per-process VMA permissions, that is a terrible idea. This is why we have the explicit opt-in requirement, and the intent is to bypass per-VMA permissions by sharing PTEs, as opposed to shared PTEs bringing in the feature of bypassing per-VMA permissions.

Let's talk about cleaner alternatives, at least we should try :)

>> The original concept of mshare certainly sounds interesting, but as discussed a couple of times (LSF/mm), it similarly sounds "dangerous" the way it was originally proposed.
>>
>> Having some kind of container that multiple processes can mmap (fd?), and *selected* mmap()/mprotect() get rerouted to the container, could be interesting; but it might be reasonable to then have separate operations to work on such an fd (ioctl), and *not* use mmap()/mprotect() for that. And one might only want to allow mmap of that fd with a superset of all permissions used inside the container (and only MAP_SHARED), and strictly filter what we allow to map into such a container. Page table sharing would likely be an implementation detail.
>>
>> Just some random thoughts (some of which I previously raised). Probably makes sense to discuss that in a bi-weekly mm meeting (again, this time with you as well).
>
> I appreciate your thoughts and your helping move this discussion forward.

Yes, I'm happy to discuss further. In a bi-weekly MM meeting, off-list or here.

--
Cheers,

David / dhildenb

^ permalink raw reply [flat|nested] 7+ messages in thread
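For reference, the uffd-wp alternative mentioned above is an existing mainline mechanism; a rough sketch of using it to write-protect a shmem (memfd) mapping follows. This is not part of the mshare proposal: it only illustrates the current per-VMA approach, and, as noted in the discussion, every sharing process would still have to register and write-protect its own mapping. Availability and feature flags vary by kernel version (shmem uffd-wp support is relatively recent, and some kernels also want UFFD_FEATURE_WP_HUGETLBFS_SHMEM requested), and unprivileged use may be restricted by the vm.unprivileged_userfaultfd sysctl.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = (size_t)(16 * page);

    /* A shmem-backed shared mapping standing in for the shared data. */
    int memfd = memfd_create("shared-data", 0);
    if (memfd < 0 || ftruncate(memfd, (off_t)len) < 0) {
        perror("memfd");
        return 1;
    }
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    int uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0) {
        perror("userfaultfd");
        return 1;
    }

    struct uffdio_api api = {
        .api = UFFD_API,
        .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
    };
    if (ioctl(uffd, UFFDIO_API, &api) < 0) {
        perror("UFFDIO_API");
        return 1;
    }

    /* Register this process's VMA for write-protect tracking.  Every
     * process sharing the data would have to repeat this for its own
     * mapping, which is the multi-process limitation discussed above. */
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
        perror("UFFDIO_REGISTER");
        return 1;
    }

    /* Write-protect the whole range without changing VMA permissions. */
    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };
    if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) < 0) {
        perror("UFFDIO_WRITEPROTECT");
        return 1;
    }

    /* A monitor would now read() uffd_msg events from uffd when someone
     * writes, and resolve each one by clearing the protection on the
     * faulting range (mode = 0) once the write may proceed. */

    munmap(addr, len);
    close(memfd);
    close(uffd);
    return 0;
}

By contrast, the shared-PTE approach discussed in this thread would make a single protection update in the shared page tables visible to all sharing processes at once, without per-process registration or fault handling.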