linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Mike Rapoport <rppt@kernel.org>
To: Khalid Aziz <khalid.aziz@oracle.com>
Cc: akpm@linux-foundation.org, willy@infradead.org,
	longpeng2@huawei.com, arnd@arndb.de, dave.hansen@linux.intel.com,
	david@redhat.com, surenb@google.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-api@vger.kernel.org
Subject: Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
Date: Sat, 22 Jan 2022 13:31:38 +0200	[thread overview]
Message-ID: <YevrGs3WE7ywB+lH@kernel.org> (raw)
In-Reply-To: <cover.1642526745.git.khalid.aziz@oracle.com>

(added linux-api)

On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
> Page tables in kernel consume some of the memory and as long as
> number of mappings being maintained is small enough, this space
> consumed by page tables is not objectionable. When very few memory
> pages are shared between processes, the number of page table entries
> (PTEs) to maintain is mostly constrained by the number of pages of
> memory on the system. As the number of shared pages and the number
> of times pages are shared goes up, amount of memory consumed by page
> tables starts to become significant.
> 
> Some of the field deployments commonly see memory pages shared
> across 1000s of processes. On x86_64, each page requires a PTE that
> is only 8 bytes long which is very small compared to the 4K page
> size. When 2000 processes map the same page in their address space,
> each one of them requires 8 bytes for its PTE and together that adds
> up to 8K of memory just to hold the PTEs for one 4K page. On a
> database server with 300GB SGA, a system carsh was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, amount of memory saved is very significant.
> 
> This is a proposal to implement a mechanism in kernel to allow
> userspace processes to opt into sharing PTEs. The proposal is to add
> a new system call - mshare(), which can be used by a process to
> create a region (we will call it mshare'd region) which can be used
> by other processes to map same pages using shared PTEs. Other
> process(es), assuming they have the right permissions, can then make
> the mashare() system call to map the shared pages into their address
> space using the shared PTEs.  When a process is done using this
> mshare'd region, it makes a mshare_unlink() system call to end its
> access. When the last process accessing mshare'd region calls
> mshare_unlink(), the mshare'd region is torn down and memory used by
> it is freed.
> 
> 
> API Proposal
> ============
> 
> The mshare API consists of two system calls - mshare() and mshare_unlink()
> 
> --
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
> 
> mshare() creates and opens a new, or opens an existing mshare'd
> region that will be shared at PTE level. "name" refers to shared object
> name that exists under /sys/fs/mshare. "addr" is the starting address
> of this shared memory area and length is the size of this area.
> oflags can be one of:
> 
> - O_RDONLY opens shared memory area for read only access by everyone
> - O_RDWR opens shared memory area for read and write access
> - O_CREAT creates the named shared memory area if it does not exist
> - O_EXCL If O_CREAT was also specified, and a shared memory area
>   exists with that name, return an error.
> 
> mode represents the creation mode for the shared object under
> /sys/fs/mshare.
> 
> mshare() returns an error code if it fails, otherwise it returns 0.

Did you consider returning a file descriptor from mshare() system call?
Then there would be no need in mshare_unlink() as close(fd) would work.
 
> PTEs are shared at pgdir level and hence it imposes following
> requirements on the address and size given to the mshare():
> 
> - Starting address must be aligned to pgdir size (512GB on x86_64)
> - Size must be a multiple of pgdir size
> - Any mappings created in this address range at any time become
>   shared automatically
> - Shared address range can have unmapped addresses in it. Any access
>   to unmapped address will result in SIGBUS
> 
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in
> the shared area. All other processes that want mshare'd access to
> this memory area can do so by calling mshare(). After this call, the
> address range given by mshare becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.
> 
> A file under /sys/fs/mshare can be opened and read from. A read from
> this file returns two long values - (1) starting address, and (2)
> size of the mshare'd region.

Maybe read should return a structure containing some data identifier and
the data itself, so that it could be extended in the future.
 
> --
> int mshare_unlink(char *name)
> 
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the  shared named object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.
> 
> mshare_unlink() returns 0 on success or -1 on error.

-- 
Sincerely yours,
Mike.


  parent reply	other threads:[~2022-01-22 11:31 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-18 21:19 [RFC PATCH 0/6] Add support for shared PTEs across processes Khalid Aziz
2022-01-18 21:19 ` [RFC PATCH 1/6] mm: Add new system calls mshare, mshare_unlink Khalid Aziz
2022-01-18 21:19 ` [RFC PATCH 2/6] mm: Add msharefs filesystem Khalid Aziz
2022-01-18 21:19 ` [RFC PATCH 3/6] mm: Add read for msharefs Khalid Aziz
2022-01-18 21:19 ` [RFC PATCH 4/6] mm: implement mshare_unlink syscall Khalid Aziz
2022-01-18 21:19 ` [RFC PATCH 5/6] mm: Add locking to msharefs syscalls Khalid Aziz
2022-01-18 21:19 ` [RFC PATCH 6/6] mm: Add basic page table sharing using mshare Khalid Aziz
2022-01-18 21:41 ` [RFC PATCH 0/6] Add support for shared PTEs across processes Dave Hansen
2022-01-18 21:46   ` Matthew Wilcox
2022-01-18 22:47     ` Khalid Aziz
2022-01-18 22:06 ` Dave Hansen
2022-01-18 22:52   ` Khalid Aziz
2022-01-19 11:38 ` Mark Hemment
2022-01-19 17:02   ` Khalid Aziz
2022-01-20 12:49     ` Mark Hemment
2022-01-20 19:15       ` Khalid Aziz
2022-01-24 15:15         ` Mark Hemment
2022-01-24 15:27           ` Matthew Wilcox
2022-01-24 22:20           ` Khalid Aziz
2022-01-21  1:08 ` Barry Song
2022-01-21  2:13   ` Matthew Wilcox
2022-01-21  7:35     ` Barry Song
2022-01-21 14:47       ` Matthew Wilcox
2022-01-21 16:41         ` Khalid Aziz
2022-01-22  1:39           ` Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
2022-01-22  1:41             ` Matthew Wilcox
2022-01-22 10:18               ` Thomas Schoebel-Theuer
2022-01-22 16:09                 ` Matthew Wilcox
2022-01-22 11:31 ` Mike Rapoport [this message]
2022-01-22 18:29   ` Andy Lutomirski
2022-01-24 18:48   ` Khalid Aziz
2022-01-24 19:45     ` Andy Lutomirski
2022-01-24 22:30       ` Khalid Aziz
2022-01-24 23:16         ` Andy Lutomirski
2022-01-24 23:44           ` Khalid Aziz
2022-01-25 11:42 ` Kirill A. Shutemov
2022-01-25 12:09   ` William Kucharski
2022-01-25 13:18     ` David Hildenbrand
2022-01-25 14:01       ` Kirill A. Shutemov
2022-01-25 13:23   ` Matthew Wilcox
2022-01-25 13:59     ` Kirill A. Shutemov
2022-01-25 14:09       ` Matthew Wilcox
2022-01-25 18:57         ` Kirill A. Shutemov
2022-01-25 18:59           ` Matthew Wilcox
2022-01-26  4:04             ` Matthew Wilcox
2022-01-26 10:16               ` David Hildenbrand
2022-01-26 13:38                 ` Matthew Wilcox
2022-01-26 13:55                   ` David Hildenbrand
2022-01-26 14:12                     ` Matthew Wilcox
2022-01-26 14:30                       ` David Hildenbrand
2022-01-26 14:12                   ` Mike Rapoport
2022-01-26 13:42               ` Kirill A. Shutemov
2022-01-26 14:18                 ` Mike Rapoport
2022-01-26 17:33                   ` Khalid Aziz

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YevrGs3WE7ywB+lH@kernel.org \
    --to=rppt@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=arnd@arndb.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=david@redhat.com \
    --cc=khalid.aziz@oracle.com \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=longpeng2@huawei.com \
    --cc=surenb@google.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).