From: Khalid Aziz <khalid.aziz@oracle.com>
To: akpm@linux-foundation.org, willy@infradead.org
Cc: Khalid Aziz <khalid.aziz@oracle.com>,
aneesh.kumar@linux.ibm.com, arnd@arndb.de, 21cnbao@gmail.com,
corbet@lwn.net, dave.hansen@linux.intel.com, david@redhat.com,
ebiederm@xmission.com, hagen@jauu.net, jack@suse.cz,
keescook@chromium.org, kirill@shutemov.name, kucharsk@gmail.com,
linkinjeon@kernel.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
longpeng2@huawei.com, luto@kernel.org, markhemm@googlemail.com,
pcc@google.com, rppt@kernel.org, sieberf@amazon.com,
sjpark@amazon.de, surenb@google.com, tst@schoebel-theuer.de,
yzaikin@google.com
Subject: [PATCH v2 0/9] Add support for shared PTEs across processes
Date: Wed, 29 Jun 2022 16:53:51 -0600
Message-ID: <cover.1656531090.git.khalid.aziz@oracle.com>

Memory pages shared between processes require a page table entry
(PTE) for each process. Each of these PTEs consumes some memory,
and as long as the number of mappings being maintained is small
enough, the space consumed by page tables is not objectionable.
When very few memory pages are shared between processes, the
number of PTEs to maintain is mostly constrained by the number of
pages of memory on the system. As the number of shared pages and
the number of times pages are shared go up, the amount of memory
consumed by page tables starts to become significant. This issue
does not apply to threads. Any number of threads can share the
same pages inside a process while sharing the same PTEs. Extending
this same model to sharing pages across processes can eliminate
this issue for sharing across processes as well.

Some field deployments commonly see memory pages shared across
1000s of processes. On x86_64, each page requires a PTE that is
only 8 bytes long, which is very small compared to the 4K page
size. When 2000 processes map the same page in their address
space, each one of them requires 8 bytes for its PTE, and together
that adds up to roughly 16K of memory just to hold the PTEs for one
4K page. On a database server with a 300GB SGA, a system crash was
seen with an out-of-memory condition when 1500+ clients tried to
share this SGA even though the system had 512GB of memory. On this
server, the worst-case scenario of all 1500 processes mapping every
page of the SGA would have required 878GB+ for just the PTEs. If
these PTEs could be shared, the amount of memory saved would be
very significant.
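
To make the arithmetic above easy to check, here is a small sketch
that computes PTE overhead for a region mapped by many processes. It
assumes 8-byte PTEs and 4K pages as stated above, and it counts only
the last-level PTEs (upper page table levels are ignored). The names
pte_bytes() and print_pte_overhead() are made up for this
illustration and are not part of the series.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_SIZE	8ULL	/* bytes per PTE on x86_64 */
#define PAGE_SZ		4096ULL	/* base page size (4K) */

/*
 * Bytes of last-level PTEs needed when nprocs processes each map
 * all of a region of region_bytes without sharing page tables.
 */
static uint64_t pte_bytes(uint64_t region_bytes, uint64_t nprocs)
{
	assert(region_bytes % PAGE_SZ == 0);
	return (region_bytes / PAGE_SZ) * PTE_SIZE * nprocs;
}

/* Print the two cases discussed in the text. */
static void print_pte_overhead(void)
{
	uint64_t one_page = pte_bytes(PAGE_SZ, 2000);
	uint64_t sga = pte_bytes(300ULL << 30, 1500);

	printf("one 4K page, 2000 mappers: %llu bytes\n",
	       (unsigned long long)one_page);
	printf("300GB SGA, 1500 mappers: %llu GB\n",
	       (unsigned long long)(sga >> 30));
}
```

For one 4K page mapped by 2000 processes this gives 16000 bytes, and
for a 300GB region mapped by 1500 processes it gives 878GB (rounded
down), which matches the worst-case figure quoted for the SGA.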

This patch series implements a mechanism in the kernel to allow
userspace processes to opt into sharing PTEs. It adds a new
in-memory filesystem - msharefs. A file created on msharefs creates
a new shared region where all processes sharing that region will
share the PTEs as well. A process can create a new file on msharefs
and then mmap it, which assigns a starting address and size to this
mshare'd region. Another process that has the right permission to
open the file on msharefs can then mmap this file in its address
space at the same virtual address and with the same size, and share
this region through shared PTEs. An unlink() on the file marks the
mshare'd region for deletion once there are no more users of the
region. When the mshare region is deleted, all the pages used by
the region are freed.

API
===
mshare does not introduce a new API. It instead uses existing APIs
to implement page table sharing. The steps to use this feature are:

1. Mount msharefs on /sys/fs/mshare -
	mount -t msharefs msharefs /sys/fs/mshare

2. mshare regions have alignment and size requirements. The start
   address of the region must be aligned to an address boundary and
   the size must be a multiple of a fixed size. This alignment and
   size requirement can be obtained by reading the file
   /sys/fs/mshare/mshare_info, which returns a number in text
   format. mshare regions must be aligned to this boundary and be a
   multiple of this size.

3. For the process creating mshare region:
	a. Create a file on /sys/fs/mshare, for example -
		fd = open("/sys/fs/mshare/shareme",
				O_RDWR|O_CREAT|O_EXCL, 0600);

	b. mmap this file to establish starting address and size -
		mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
				MAP_SHARED, fd, 0);

	c. Write to and read from the mshare'd region normally.

4. For processes attaching to mshare'd region:
	a. Open the file on msharefs, for example -
		fd = open("/sys/fs/mshare/shareme", O_RDWR);

	b. Get information about mshare'd region from the file:
		struct mshare_info {
			unsigned long start;
			unsigned long size;
		} m_info;

		read(fd, &m_info, sizeof(m_info));

	c. mmap the mshare'd region -
		mmap((void *)m_info.start, m_info.size,
			PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

5. To delete the mshare region -
	unlink("/sys/fs/mshare/shareme");

Example Code
============
Snippet of the code that a donor process would run looks like below:
-----------------
	/* Read the alignment/size requirement (a number in text form) */
	fd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
	read(fd, req, 128);
	alignsize = atoi(req);
	close(fd);

	/* Create the msharefs file and mmap it to set start and size */
	fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600);
	start = alignsize * 4;
	size = alignsize * 2;
	addr = mmap((void *)start, size, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("ERROR: mmap failed");
		exit(1);
	}

	strncpy(addr, "Some random shared text",
			sizeof("Some random shared text"));
-----------------
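
The donor snippet reads the alignment value with atoi() and derives
start and size from it. A slightly more defensive version of that
step could look like the sketch below. It assumes the text in
mshare_info is a single decimal number, which is what the atoi()
call above implies. The helper names parse_alignsize() and
mshare_range_ok() are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Parse the text read from mshare_info (assumed to be a decimal
 * number). Returns 0 if no valid number could be parsed.
 */
static unsigned long parse_alignsize(const char *text)
{
	char *end;
	unsigned long val;

	errno = 0;
	val = strtoul(text, &end, 10);
	if (errno != 0 || end == text)
		return 0;
	return val;
}

/*
 * Check that start is aligned to alignsize and that size is a
 * non-zero multiple of alignsize, per the requirement in step 2.
 */
static bool mshare_range_ok(unsigned long start, unsigned long size,
			    unsigned long alignsize)
{
	assert(alignsize != 0);
	return size != 0 && start % alignsize == 0 &&
	       size % alignsize == 0;
}
```

With this, a caller can reject a bad start/size pair before calling
mmap() instead of relying on mmap() to fail.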
Snippet of code that a consumer process would execute looks like:
-----------------
	struct mshare_info {
		unsigned long start;
		unsigned long size;
	} minfo;

	/* Open read-write so the region can be mapped PROT_WRITE */
	fd = open("/sys/fs/mshare/shareme", O_RDWR);

	/* Fetch start address and size of the mshare'd region */
	if ((count = read(fd, &minfo, sizeof(struct mshare_info))) > 0)
		printf("INFO: %lu bytes shared at addr 0x%lx \n",
				minfo.size, minfo.start);

	/* Map at the same address and size as the donor */
	addr = mmap((void *)minfo.start, minfo.size, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, 0);
	printf("Guest mmap at %p:\n", addr);
	printf("%s\n", (char *)addr);
	printf("\nDone\n");
-----------------
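
The consumer snippet treats any positive return from read() as
success. A stricter caller might want to insist on getting the whole
structure before trusting start and size. The sketch below shows one
way to do that with plain read(). The struct layout is copied from
the snippet above. The helper name read_mshare_info() is
hypothetical, and whether msharefs can ever return a short read is
an assumption made only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

struct mshare_info {
	unsigned long start;
	unsigned long size;
};

/*
 * Read exactly one struct mshare_info from fd.
 * Returns 0 on success, -1 on error or if EOF is hit first.
 */
static int read_mshare_info(int fd, struct mshare_info *info)
{
	char *buf = (char *)info;
	size_t done = 0;

	assert(info != NULL);
	while (done < sizeof(*info)) {
		ssize_t n = read(fd, buf + done, sizeof(*info) - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (n == 0)
			return -1;	/* EOF before a full struct */
		done += (size_t)n;
	}
	return 0;
}
```

The caller would then pass info.start and info.size to mmap() only
when this returns 0.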
v1 -> v2:
- Eliminated mshare and mshare_unlink system calls and
replaced API with standard mmap and unlink (Based upon
v1 patch discussions and LSF/MM discussions)
- All fd based API (based upon feedback and suggestions from
Andy Lutomirski, Eric Biederman, Kirill and others)
- Added a file /sys/fs/mshare/mshare_info to provide
alignment and size requirement info (based upon feedback
from Dave Hansen, Mark Hemment and discussions at LSF/MM)
- Addressed TODOs in v1
- Added support for directories in msharefs
- Added locks around any time vma is touched (Dave Hansen)
- Eliminated the need to point vm_mm in original vmas to the
newly synthesized mshare mm
- Ensured mmap_read_unlock is called for correct mm in
handle_mm_fault (Dave Hansen)
Khalid Aziz (9):
mm: Add msharefs filesystem
mm/mshare: pre-populate msharefs with information file
mm/mshare: make msharefs writable and support directories
mm/mshare: Add a read operation for msharefs files
mm/mshare: Add vm flag for shared PTE
mm/mshare: Add mmap operation
mm/mshare: Add unlink and munmap support
mm/mshare: Add basic page table sharing support
mm/mshare: Enable mshare region mapping across processes
Documentation/filesystems/msharefs.rst | 19 +
include/linux/mm.h | 10 +
include/trace/events/mmflags.h | 3 +-
include/uapi/linux/magic.h | 1 +
include/uapi/linux/mman.h | 5 +
mm/Makefile | 2 +-
mm/internal.h | 7 +
mm/memory.c | 101 ++++-
mm/mshare.c | 575 +++++++++++++++++++++++++
9 files changed, 719 insertions(+), 4 deletions(-)
create mode 100644 Documentation/filesystems/msharefs.rst
create mode 100644 mm/mshare.c
--
2.32.0