* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
[not found] <cover.1642526745.git.khalid.aziz@oracle.com>
@ 2022-01-22 11:31 ` Mike Rapoport
2022-01-22 18:29 ` Andy Lutomirski
2022-01-24 18:48 ` Khalid Aziz
0 siblings, 2 replies; 7+ messages in thread
From: Mike Rapoport @ 2022-01-22 11:31 UTC (permalink / raw)
To: Khalid Aziz
Cc: akpm, willy, longpeng2, arnd, dave.hansen, david, surenb,
linux-kernel, linux-mm, linux-api
(added linux-api)
On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
> Page tables in the kernel consume some memory, and as long as the
> number of mappings being maintained is small enough, the space
> consumed by page tables is not objectionable. When very few memory
> pages are shared between processes, the number of page table entries
> (PTEs) to maintain is mostly constrained by the number of pages of
> memory on the system. As the number of shared pages and the number
> of times pages are shared goes up, the amount of memory consumed by
> page tables starts to become significant.
>
> Some field deployments commonly see memory pages shared across
> 1000s of processes. On x86_64, each page requires a PTE that is
> only 8 bytes long, which is very small compared to the 4K page
> size. When 2000 processes map the same page in their address space,
> each one of them requires 8 bytes for its PTE, and together that
> adds up to about 16K of memory just to hold the PTEs for one 4K
> page. On a database server with a 300GB SGA, a system crash was
> seen with an out-of-memory condition when 1500+ clients tried to
> share this SGA, even though the system had 512GB of memory. On this
> server, the worst case scenario of all 1500 processes mapping every
> page from the SGA would have required 878GB+ for just the PTEs. If
> these PTEs could be shared, the amount of memory saved is very
> significant.
>
> This is a proposal to implement a mechanism in the kernel to allow
> userspace processes to opt into sharing PTEs. The proposal is to add
> a new system call - mshare(), which can be used by a process to
> create a region (we will call it an mshare'd region) which can be
> used by other processes to map the same pages using shared PTEs.
> Other process(es), assuming they have the right permissions, can
> then make the mshare() system call to map the shared pages into
> their address space using the shared PTEs. When a process is done
> using this mshare'd region, it makes a mshare_unlink() system call
> to end its access. When the last process accessing an mshare'd
> region calls mshare_unlink(), the mshare'd region is torn down and
> the memory used by it is freed.
>
>
> API Proposal
> ============
>
> The mshare API consists of two system calls - mshare() and mshare_unlink()
>
> --
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>
> mshare() creates and opens a new, or opens an existing, mshare'd
> region that will be shared at the PTE level. "name" refers to a
> shared object name that exists under /sys/fs/mshare. "addr" is the
> starting address of this shared memory area and "length" is the
> size of this area. oflags can be a combination of:
>
> - O_RDONLY opens shared memory area for read only access by everyone
> - O_RDWR opens shared memory area for read and write access
> - O_CREAT creates the named shared memory area if it does not exist
> - O_EXCL If O_CREAT was also specified, and a shared memory area
> exists with that name, return an error.
>
> mode represents the creation mode for the shared object under
> /sys/fs/mshare.
>
> mshare() returns an error code if it fails, otherwise it returns 0.
Did you consider returning a file descriptor from the mshare() system call?
Then there would be no need for mshare_unlink(), as close(fd) would work.
> PTEs are shared at the pgdir level, which imposes the following
> requirements on the address and size given to mshare():
>
> - Starting address must be aligned to pgdir size (512GB on x86_64)
> - Size must be a multiple of pgdir size
> - Any mappings created in this address range at any time become
> shared automatically
> - Shared address range can have unmapped addresses in it. Any access
> to unmapped address will result in SIGBUS
>
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in
> the shared area. All other processes that want mshare'd access to
> this memory area can do so by calling mshare(). After this call, the
> address range given by mshare becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.
>
> A file under /sys/fs/mshare can be opened and read from. A read from
> this file returns two long values - (1) starting address, and (2)
> size of the mshare'd region.
Maybe read should return a structure containing some data identifier and
the data itself, so that it could be extended in the future.
> --
> int mshare_unlink(char *name)
>
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the shared named object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.
>
> mshare_unlink() returns 0 on success or -1 on error.
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
2022-01-22 11:31 ` [RFC PATCH 0/6] Add support for shared PTEs across processes Mike Rapoport
@ 2022-01-22 18:29 ` Andy Lutomirski
2022-01-24 18:48 ` Khalid Aziz
1 sibling, 0 replies; 7+ messages in thread
From: Andy Lutomirski @ 2022-01-22 18:29 UTC (permalink / raw)
To: Mike Rapoport
Cc: Khalid Aziz, akpm, willy, longpeng2, arnd, dave.hansen, david,
surenb, linux-kernel, linux-mm, linux-api, Andy Lutomirski
> On Jan 22, 2022, at 3:31 AM, Mike Rapoport <rppt@kernel.org> wrote:
>
> (added linux-api)
>
>> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
>> Page tables in kernel consume some of the memory and as long as
>> number of mappings being maintained is small enough, this space
>> consumed by page tables is not objectionable. When very few memory
>> pages are shared between processes, the number of page table entries
>> (PTEs) to maintain is mostly constrained by the number of pages of
>> memory on the system. As the number of shared pages and the number
>> of times pages are shared goes up, amount of memory consumed by page
>> tables starts to become significant.
Sharing PTEs is nice, but merely sharing a chunk of address space
regardless of optimizations is nontrivial. It’s also quite useful,
potentially. So I think a good way to start would be to make a nice
design for just sharing address space and then, on top of it, figure
out how to share page tables.
See here for an earlier proposal:
https://lore.kernel.org/all/CALCETrUSUp_7svg8EHNTk3nQ0x9sdzMCU=h8G-Sy6=SODq5GHg@mail.gmail.com/
Alternatively, one could try to optimize memfd so that large similarly
aligned mappings in different processes could share page tables.
Any of the above will require some interesting thought as to whether
TLB shootdowns are managed by the core rmap code or by mmu notifiers.
* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
2022-01-22 11:31 ` [RFC PATCH 0/6] Add support for shared PTEs across processes Mike Rapoport
2022-01-22 18:29 ` Andy Lutomirski
@ 2022-01-24 18:48 ` Khalid Aziz
2022-01-24 19:45 ` Andy Lutomirski
1 sibling, 1 reply; 7+ messages in thread
From: Khalid Aziz @ 2022-01-24 18:48 UTC (permalink / raw)
To: Mike Rapoport
Cc: akpm, willy, longpeng2, arnd, dave.hansen, david, surenb,
linux-kernel, linux-mm, linux-api
On 1/22/22 04:31, Mike Rapoport wrote:
> (added linux-api)
>
> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
>> [...]
>> mshare() returns an error code if it fails, otherwise it returns 0.
>
> Did you consider returning a file descriptor from mshare() system call?
> Then there would be no need in mshare_unlink() as close(fd) would work.
That is an interesting idea. It could work, and it eliminates the need for a second new system call. It could be
confusing for application writers, though: a close() call with the side effect of deleting a shared mapping would be
odd. One of the use cases for having named files for mshare'd regions is to allow orphaned mshare'd regions to be
cleaned up by calling mshare_unlink() with the region name. In the current implementation this can require calling
mshare_unlink() multiple times to bring the refcount for the mshare'd region to 0, at which point mshare_unlink()
finally cleans up the region. That would be problematic with close() semantics unless there were another way to force
the refcount to 0. Right?
>
>> [...]
>>
>> A file under /sys/fs/mshare can be opened and read from. A read from
>> this file returns two long values - (1) starting address, and (2)
>> size of the mshare'd region.
>
> Maybe read should return a structure containing some data identifier and
> the data itself, so that it could be extended in the future.
I like that idea. I will work on it.
Thanks!
--
Khalid
>
>> --
>> int mshare_unlink(char *name)
>>
>> A shared address range created by mshare() can be destroyed using
>> mshare_unlink() which removes the shared named object. Once all
>> processes have unmapped the shared object, the shared address range
>> references are de-allocated and destroyed.
>>
>> mshare_unlink() returns 0 on success or -1 on error.
>
* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
2022-01-24 18:48 ` Khalid Aziz
@ 2022-01-24 19:45 ` Andy Lutomirski
2022-01-24 22:30 ` Khalid Aziz
0 siblings, 1 reply; 7+ messages in thread
From: Andy Lutomirski @ 2022-01-24 19:45 UTC (permalink / raw)
To: Khalid Aziz
Cc: Mike Rapoport, akpm, willy, longpeng2, arnd, dave.hansen, david,
surenb, linux-kernel, linux-mm, linux-api
On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> On 1/22/22 04:31, Mike Rapoport wrote:
> > (added linux-api)
> >
> > On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
> >> [...]
> >
> > Did you consider returning a file descriptor from mshare() system call?
> > Then there would be no need in mshare_unlink() as close(fd) would work.
>
> That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though
> for application writers. A close() call with a side-effect of deleting shared mapping would be odd. One of the use cases
> for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling
> mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to
> bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. This would be problematic
> with a close() semantics though unless there was another way to force refcount to 0. Right?
>
I'm not sure I understand the problem. If you're sharing a portion of
an mm and the mm goes away, then all that should be left are some
struct files that are no longer useful. They'll go away when their
refcount goes to zero.
--Andy
* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
2022-01-24 19:45 ` Andy Lutomirski
@ 2022-01-24 22:30 ` Khalid Aziz
2022-01-24 23:16 ` Andy Lutomirski
0 siblings, 1 reply; 7+ messages in thread
From: Khalid Aziz @ 2022-01-24 22:30 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Mike Rapoport, akpm, willy, longpeng2, arnd, dave.hansen, david,
surenb, linux-kernel, linux-mm, linux-api
On 1/24/22 12:45, Andy Lutomirski wrote:
> On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> On 1/22/22 04:31, Mike Rapoport wrote:
>>> (added linux-api)
>>>
>>> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
>>>> [...]
>>>
>>> Did you consider returning a file descriptor from mshare() system call?
>>> Then there would be no need in mshare_unlink() as close(fd) would work.
>>
>> That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though
>> for application writers. A close() call with a side-effect of deleting shared mapping would be odd. One of the use cases
>> for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling
>> mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to
>> bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. This would be problematic
>> with a close() semantics though unless there was another way to force refcount to 0. Right?
>>
>
> I'm not sure I understand the problem. If you're sharing a portion of
> an mm and the mm goes away, then all that should be left are some
> struct files that are no longer useful. They'll go away when their
> refcount goes to zero.
>
> --Andy
>
The mm that holds the shared PTEs is a separate mm, not tied to a task. I started out by sharing a portion of the donor
process's mm initially, but that necessitated keeping the donor process alive. If the donor process dies, its mm and the
mshare'd portion go away.
One of the requirements I have is that the process that creates an mshare'd region can terminate, possibly
involuntarily, while the mshare'd region persists and the rest of the consumer processes continue without a hiccup. So I
create a separate mm to hold the shared PTEs, and that mm is cleaned up when all references to the mshare'd region go
away. Each call to mshare() increments the refcount and each call to mshare_unlink() decrements it.
--
Khalid
* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
2022-01-24 22:30 ` Khalid Aziz
@ 2022-01-24 23:16 ` Andy Lutomirski
2022-01-24 23:44 ` Khalid Aziz
0 siblings, 1 reply; 7+ messages in thread
From: Andy Lutomirski @ 2022-01-24 23:16 UTC (permalink / raw)
To: Khalid Aziz
Cc: Mike Rapoport, akpm, willy, longpeng2, arnd, dave.hansen, david,
surenb, linux-kernel, linux-mm, linux-api
On Mon, Jan 24, 2022 at 2:34 PM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> On 1/24/22 12:45, Andy Lutomirski wrote:
> > On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
> >>
> >> On 1/22/22 04:31, Mike Rapoport wrote:
> >>> (added linux-api)
> >>>
> >>> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
> >>>> [...]
> >>>
> >>> Did you consider returning a file descriptor from mshare() system call?
> >>> Then there would be no need in mshare_unlink() as close(fd) would work.
> >>
> >> That is an interesting idea. It could work and eliminates the need for a new system call. It could be confusing though
> >> for application writers. A close() call with a side-effect of deleting shared mapping would be odd. One of the use cases
> >> for having files for mshare'd regions is to allow for orphaned mshare'd regions to be cleaned up by calling
> >> mshare_unlink() with region name. This can require calling mshare_unlink() multiple times in current implementation to
> >> bring the refcount for mshare'd region to 0 when mshare_unlink() finally cleans up the region. This would be problematic
> >> with a close() semantics though unless there was another way to force refcount to 0. Right?
> >>
> >
> > I'm not sure I understand the problem. If you're sharing a portion of
> > an mm and the mm goes away, then all that should be left are some
> > struct files that are no longer useful. They'll go away when their
> > refcount goes to zero.
> >
> > --Andy
> >
>
> The mm that holds shared PTEs is a separate mm not tied to a task. I started out by sharing portion of the donor process
> initially but that necessitated keeping the donor process alive. If the donor process dies, its mm and the mshare'd
> portion go away.
>
> One of the requirements I have is the process that creates mshare'd region can terminate, possibly involuntarily, but
> the mshare'd region persists and rest of the consumer processes continue without hiccup. So I create a separate mm to
> hold shared PTEs and that mm is cleaned up when all references to mshare'd region go away. Each call to mshare()
> increments the refcount and each call to mshare_unlink() decrements the refcount.
In general, objects which are kept alive by name tend to be quite
awkward. Things like network namespaces essentially have to work that
way and end up with awkward APIs. Things like shared memory don't
actually have to be kept alive by name, and the cases that do keep
them alive by name (tmpfs, shmget) can end up being so awkward that
people invent nameless variants like memfd.
So I would strongly suggest you see how the design works out with no
names and no external keep-alive mechanism. Either have the continued
existence of *any* fd keep the whole thing alive or make it be a pair
of struct files, one that controls the region (and can map things into
it, etc) and one that can map it. SCM_RIGHTS is pretty good for
passing objects like this around.
--Andy
* Re: [RFC PATCH 0/6] Add support for shared PTEs across processes
2022-01-24 23:16 ` Andy Lutomirski
@ 2022-01-24 23:44 ` Khalid Aziz
0 siblings, 0 replies; 7+ messages in thread
From: Khalid Aziz @ 2022-01-24 23:44 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Mike Rapoport, akpm, willy, longpeng2, arnd, dave.hansen, david,
surenb, linux-kernel, linux-mm, linux-api
On 1/24/22 16:16, Andy Lutomirski wrote:
> On Mon, Jan 24, 2022 at 2:34 PM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> On 1/24/22 12:45, Andy Lutomirski wrote:
>>> On Mon, Jan 24, 2022 at 10:54 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>>
>>>> On 1/22/22 04:31, Mike Rapoport wrote:
>>>>> (added linux-api)
>>>>>
>>>>> On Tue, Jan 18, 2022 at 02:19:12PM -0700, Khalid Aziz wrote:
>>>>>> Page tables in kernel consume some of the memory and as long as
>>>>>> number of mappings being maintained is small enough, this space
>>>>>> consumed by page tables is not objectionable. When very few memory
>>>>>> pages are shared between processes, the number of page table entries
>>>>>> (PTEs) to maintain is mostly constrained by the number of pages of
>>>>>> memory on the system. As the number of shared pages and the number
>>>>>> of times pages are shared goes up, amount of memory consumed by page
>>>>>> tables starts to become significant.
>>>>>>
>>>>>> Some of the field deployments commonly see memory pages shared
>>>>>> across 1000s of processes. On x86_64, each page requires a PTE that
>>>>>> is only 8 bytes long which is very small compared to the 4K page
>>>>>> size. When 2000 processes map the same page in their address space,
>>>>>> each one of them requires 8 bytes for its PTE and together that adds
>>>>>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>>>>>> database server with 300GB SGA, a system crash was seen with
>>>>>> out-of-memory condition when 1500+ clients tried to share this SGA
>>>>>> even though the system had 512GB of memory. On this server, in the
>>>>>> worst case scenario of all 1500 processes mapping every page from
>>>>>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>>>>>> could be shared, amount of memory saved is very significant.
>>>>>>
>>>>>> This is a proposal to implement a mechanism in kernel to allow
>>>>>> userspace processes to opt into sharing PTEs. The proposal is to add
>>>>>> a new system call - mshare(), which can be used by a process to
>>>>>> create a region (we will call it mshare'd region) which can be used
>>>>>> by other processes to map same pages using shared PTEs. Other
>>>>>> process(es), assuming they have the right permissions, can then make
>>>>>> the mshare() system call to map the shared pages into their address
>>>>>> space using the shared PTEs. When a process is done using this
>>>>>> mshare'd region, it makes a mshare_unlink() system call to end its
>>>>>> access. When the last process accessing mshare'd region calls
>>>>>> mshare_unlink(), the mshare'd region is torn down and memory used by
>>>>>> it is freed.
>>>>>>
>>>>>>
>>>>>> API Proposal
>>>>>> ============
>>>>>>
>>>>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>>>>>
>>>>>> --
>>>>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>>>>>
>>>>>> mshare() creates and opens a new, or opens an existing mshare'd
>>>>>> region that will be shared at PTE level. "name" refers to shared object
>>>>>> name that exists under /sys/fs/mshare. "addr" is the starting address
>>>>>> of this shared memory area and length is the size of this area.
>>>>>> oflags can be one of:
>>>>>>
>>>>>> - O_RDONLY opens shared memory area for read only access by everyone
>>>>>> - O_RDWR opens shared memory area for read and write access
>>>>>> - O_CREAT creates the named shared memory area if it does not exist
>>>>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>>>>> exists with that name, return an error.
>>>>>>
>>>>>> mode represents the creation mode for the shared object under
>>>>>> /sys/fs/mshare.
>>>>>>
>>>>>> mshare() returns an error code if it fails, otherwise it returns 0.
>>>>>
>>>>> Did you consider returning a file descriptor from mshare() system call?
>>>>> Then there would be no need in mshare_unlink() as close(fd) would work.
>>>>
>>>> That is an interesting idea. It could work and would eliminate the need for a new system call. It could be confusing
>>>> for application writers, though: a close() call with the side effect of deleting a shared mapping would be odd. One of
>>>> the use cases for backing mshare'd regions with files is to allow orphaned mshare'd regions to be cleaned up by calling
>>>> mshare_unlink() with the region name. In the current implementation this can require calling mshare_unlink() multiple
>>>> times to bring the refcount for the mshare'd region to 0, at which point mshare_unlink() finally cleans up the region.
>>>> That would be problematic with close() semantics unless there were another way to force the refcount to 0. Right?
>>>>
>>>
>>> I'm not sure I understand the problem. If you're sharing a portion of
>>> an mm and the mm goes away, then all that should be left are some
>>> struct files that are no longer useful. They'll go away when their
>>> refcount goes to zero.
>>>
>>> --Andy
>>>
>>
>> The mm that holds shared PTEs is a separate mm not tied to a task. I started out by sharing portion of the donor process
>> initially but that necessitated keeping the donor process alive. If the donor process dies, its mm and the mshare'd
>> portion go away.
>>
>> One of the requirements I have is the process that creates mshare'd region can terminate, possibly involuntarily, but
>> the mshare'd region persists and rest of the consumer processes continue without hiccup. So I create a separate mm to
>> hold shared PTEs and that mm is cleaned up when all references to mshare'd region go away. Each call to mshare()
>> increments the refcount and each call to mshare_unlink() decrements the refcount.
>
> In general, objects which are kept alive by name tend to be quite
> awkward. Things like network namespaces essentially have to work that
> way and end up with awkward APIs. Things like shared memory don't
> actually have to be kept alive by name, and the cases that do keep
> them alive by name (tmpfs, shmget) can end up being so awkward that
> people invent nameless variants like memfd.
>
> So I would strongly suggest you see how the design works out with no
> names and no external keep-alive mechanism. Either have the continued
> existence of *any* fd keep the whole thing alive or make it be a pair
> of struct files, one that controls the region (and can map things into
> it, etc) and one that can map it. SCM_RIGHTS is pretty good for
> passing objects like this around.
>
> --Andy
>
These are certainly good ideas for simplifying this feature. My very first implementation of mshare did not have msharefs.
It was fd-based: the fd could be passed to any other process using SCM_RIGHTS, and the process creating the mshare'd
region had to stay alive for the region to exist. That certainly made the implementation simpler. Feedback from my
customers for this feature (DB developers and people deploying DB systems) was that it imposes a hard dependency on the
server process that creates the mshare'd region and passes the fd to the other processes needing access to it. This
dependency creates a weak link in system reliability that is too risky: if the server process dies for any reason, the
entire system becomes unavailable. They asked for a more robust implementation they could depend upon. I then went
through the process of implementing this on top of shmfs, since POSIX shm has those attributes. That turned out to be
kludgier than a clean implementation using a separate in-memory msharefs, which brought me to the RFC implementation I
sent out.
I do agree with you that a name-based persistent object makes the implementation more complex (maintaining a separate mm
not tied to a process requires quite a bit of work to keep things consistent and to clean the mm up properly as users of
this shared mm terminate), but I see the reliability point of view. Does that logic resonate with you?
Thanks,
Khalid
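[Editor's note: for reference, the mshare()/mshare_unlink() API quoted above would be used roughly as in this sketch. It is hypothetical C pseudocode: these system calls exist only in the RFC patches, the prototypes below are transcribed from the proposal, and the region name, address, and size are invented for illustration.]

```
/* Prototypes as proposed in the RFC (not in any released kernel): */
int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode);
int mshare_unlink(char *name);

/* Creator: establish an mshare'd region visible as /sys/fs/mshare/sga */
size_t len  = 512UL << 20;			/* illustrative size */
void  *addr = (void *)0x600000000000UL;		/* fixed address used by all sharers */

if (mshare("sga", addr, len, O_CREAT | O_RDWR | O_EXCL, 0600) < 0)
	/* handle error */;

/* Consumers: attach to the same region with shared PTEs */
if (mshare("sga", addr, len, O_RDWR, 0600) < 0)
	/* handle error */;

/* Each sharer drops its reference when done; the last
 * mshare_unlink() tears the region down and frees its memory. */
mshare_unlink("sga");
```

Each mshare() call takes a reference on the region's standalone mm, and each mshare_unlink() drops one, which is the name-based keep-alive behavior under discussion in this subthread.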
Thread overview: 7+ messages
[not found] <cover.1642526745.git.khalid.aziz@oracle.com>
2022-01-22 11:31 ` [RFC PATCH 0/6] Add support for shared PTEs across processes Mike Rapoport
2022-01-22 18:29 ` Andy Lutomirski
2022-01-24 18:48 ` Khalid Aziz
2022-01-24 19:45 ` Andy Lutomirski
2022-01-24 22:30 ` Khalid Aziz
2022-01-24 23:16 ` Andy Lutomirski
2022-01-24 23:44 ` Khalid Aziz