* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
[not found] <155793276388.13922.18064660723547377633.stgit@localhost.localdomain>
@ 2019-05-16 13:30 ` Michal Hocko
2019-05-16 13:52 ` Michal Hocko
2019-05-16 13:32 ` Jann Horn
1 sibling, 1 reply; 5+ messages in thread
From: Michal Hocko @ 2019-05-16 13:30 UTC
To: Kirill Tkhai
Cc: akpm, dan.j.williams, keith.busch, kirill.shutemov,
pasha.tatashin, alexander.h.duyck, ira.weiny, andreyknvl, arunks,
vbabka, cl, riel, keescook, hannes, npiggin, mathieu.desnoyers,
shakeelb, guro, aarcange, hughd, jglisse, mgorman,
daniel.m.jordan, linux-kernel, linux-mm, linux-api
[You are defining a new user visible API, please always add linux-api
mailing list - now done]
On Wed 15-05-19 18:11:15, Kirill Tkhai wrote:
> This patchset adds a new syscall, which makes it possible
> to clone a mapping from one process to another.
> The syscall supplements the functionality provided
> by the process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situations.
>
> For example, it allows a zero-copy transfer of data
> where process_vm_writev() was previously used:
>
> struct iovec local_iov, remote_iov;
> void *buf;
>
> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, ...);
> recv(sock, buf, n * PAGE_SIZE, 0);
>
> local_iov.iov_base = buf;
> local_iov.iov_len = n * PAGE_SIZE;
> remote_iov = ...;
>
> process_vm_writev(pid, &local_iov, 1, &remote_iov, 1, 0);
> munmap(buf, n * PAGE_SIZE);
>
> (Note that the above completely ignores error handling.)
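[Editor's sketch, not from the original mail.] The quoted example skips error handling. A minimal version of the process_vm_writev() step with error and partial-transfer handling might look like the following. The helper name write_remote() is hypothetical, and the loop assumes the caller already knows a valid destination address in the target process.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Copy len bytes from the local buffer buf to remote_addr in process pid.
 * process_vm_writev() may transfer fewer bytes than requested, so loop
 * until everything is written. Returns 0 on success or a negative errno.
 * (write_remote is a hypothetical helper name used for illustration.)
 */
static int write_remote(pid_t pid, void *remote_addr, const void *buf, size_t len)
{
	size_t done = 0;

	assert(len == 0 || (buf != NULL && remote_addr != NULL));

	while (done < len) {
		struct iovec local_iov = {
			.iov_base = (char *)buf + done,
			.iov_len = len - done,
		};
		struct iovec remote_iov = {
			.iov_base = (char *)remote_addr + done,
			.iov_len = len - done,
		};
		ssize_t n = process_vm_writev(pid, &local_iov, 1, &remote_iov, 1, 0);

		if (n < 0)
			return -errno;	/* e.g. EPERM, ESRCH, EFAULT */
		if (n == 0)
			return -EFAULT;	/* no progress; avoid looping forever */
		done += (size_t)n;
	}
	return 0;
}
```

Note that this only makes the copy robust. It does not remove the page duplication or swap-in that the cover letter describes.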
>
> There are several problems with process_vm_writev() in this example:
>
> 1) it causes a page fault on the remote process's memory and forces
> allocation of a new page (if one was not preallocated);
>
> 2) the amount of memory for this example is doubled for a moment --
> n pages in the current task and n pages in the remote task are occupied
> at the same time;
>
> 3) the received data has no chance to be properly swapped for
> a long time.
>
> The third is the most critical when the remote process touches
> the data pages only some time after process_vm_writev() was called.
> Imagine the node is under memory pressure:
>
> a) the kernel moves @buf pages into swap right after recv();
> b) process_vm_writev() reads the data back from swap into pages;
> c) process_vm_writev() allocates duplicate pages in the remote
> process and populates them;
> d) munmap() unmaps @buf;
> e) 5 minutes later the remote task touches the data.
>
> In stages "a" and "b" the kernel submits unneeded IO and makes
> system IO throughput worse. To perform "b" and "c", the kernel
> reclaims memory and moves pages of some other processes
> to swap, so they have to read their pages back from swap. Also,
> unneeded copying of pages occurs, while zero-copy is
> preferable.
>
> We observe a similar problem during online migration of sufficiently
> large containers: after doubling the container's size, the time
> increases 100 times. The system stays under high IO load and
> throws out useful caches.
>
> The proposed syscall aims to introduce an interface that
> supplements the existing process_vm_writev() and
> process_vm_readv() and solves the problem with
> anonymous memory transfer. The above example may be rewritten as:
>
> void *buf;
>
> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, ...);
> recv(sock, buf, n * PAGE_SIZE, 0);
>
> /* Sign of @pid is direction: "from @pid task to current" or vice versa. */
> process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
> munmap(buf, n * PAGE_SIZE);
>
> It is swap-friendly: if the memory is swapped out right after recv(),
> the syscall just copies page table entries like we do on fork(),
> so no real access to the pages occurs, and no IO is needed.
> No excess pages are reclaimed, and the number of pages is not doubled.
> Also, the transfer is zero-copy, which further reduces overhead.
>
> The patchset does not introduce much new code, since we simply
> reuse the existing copy_page_range() and copy_vma() functions.
> We extend copy_vma() to be able to merge VMAs in the remote task [2/5],
> and teach copy_page_range() to work with different local and
> remote addresses [3/5]. Patch [5/5] introduces the syscall logic,
> which mostly consists of sanity checks. The rest of the patches
> are preparatory.
>
> This syscall may be used for page servers like in the example
> above, for migration (I assume even virtual machines may
> want something like this), for users of process_vm_writev()
> and process_vm_readv() who want zero-copy, for debugging
> purposes, etc. It requires the same permissions as the
> existing process_vm_xxx() syscalls.
>
> The tests I used may be obtained here:
>
> [1]https://gist.github.com/tkhai/198d32fdc001ec7812a5e1ccf091f275
> [2]https://gist.github.com/tkhai/f52dbaeedad5a699f3fb386fda676562
>
> ---
>
> Kirill Tkhai (5):
> mm: Add process_vm_mmap() syscall declaration
> mm: Extend copy_vma()
> mm: Extend copy_page_range()
> mm: Export round_hint_to_min()
> mm: Add process_vm_mmap()
>
>
> arch/x86/entry/syscalls/syscall_32.tbl | 1
> arch/x86/entry/syscalls/syscall_64.tbl | 2
> include/linux/huge_mm.h | 6 +
> include/linux/mm.h | 11 ++
> include/linux/mm_types.h | 2
> include/linux/mman.h | 14 +++
> include/linux/syscalls.h | 5 +
> include/uapi/asm-generic/mman-common.h | 5 +
> include/uapi/asm-generic/unistd.h | 5 +
> init/Kconfig | 9 +-
> kernel/fork.c | 5 +
> kernel/sys_ni.c | 2
> mm/huge_memory.c | 30 ++++--
> mm/memory.c | 165 +++++++++++++++++++++-----------
> mm/mmap.c | 154 ++++++++++++++++++++++++++----
> mm/mremap.c | 4 -
> mm/process_vm_access.c | 71 ++++++++++++++
> 17 files changed, 392 insertions(+), 99 deletions(-)
>
> --
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
--
Michal Hocko
SUSE Labs
* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
[not found] <155793276388.13922.18064660723547377633.stgit@localhost.localdomain>
2019-05-16 13:30 ` [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping Michal Hocko
@ 2019-05-16 13:32 ` Jann Horn
2019-05-16 13:56 ` Kirill Tkhai
1 sibling, 1 reply; 5+ messages in thread
From: Jann Horn @ 2019-05-16 13:32 UTC
To: Kirill Tkhai
Cc: Andrew Morton, Dan Williams, Michal Hocko, keith.busch,
Kirill A . Shutemov, pasha.tatashin, Alexander Duyck, ira.weiny,
Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
Rik van Riel, Kees Cook, hannes, npiggin, Mathieu Desnoyers,
Shakeel Butt, Roman Gushchin, Andrea Arcangeli, Hugh Dickins,
Jerome Glisse
On Wed, May 15, 2019 at 5:11 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> This patchset adds a new syscall, which makes it possible
> to clone a mapping from one process to another.
> The syscall supplements the functionality provided
> by the process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situations.
[...]
> The proposed syscall aims to introduce an interface that
> supplements the existing process_vm_writev() and
> process_vm_readv() and solves the problem with
> anonymous memory transfer. The above example may be rewritten as:
>
> void *buf;
>
> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, ...);
> recv(sock, buf, n * PAGE_SIZE, 0);
>
> /* Sign of @pid is direction: "from @pid task to current" or vice versa. */
> process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
> munmap(buf, n * PAGE_SIZE);
In this specific example, an alternative would be to splice() from the
socket into /proc/$pid/mem, or something like that, right?
proc_mem_operations has no ->splice_read() at the moment, and it'd
need that to be more efficient, but that could be built without
creating new UAPI, right?
But I guess maybe your workload is not that simple? What do you
actually do with the received data between receiving it and shoving it
over into the other process?
* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
2019-05-16 13:30 ` [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping Michal Hocko
@ 2019-05-16 13:52 ` Michal Hocko
2019-05-16 14:22 ` Kirill Tkhai
0 siblings, 1 reply; 5+ messages in thread
From: Michal Hocko @ 2019-05-16 13:52 UTC
To: Kirill Tkhai
Cc: akpm, dan.j.williams, keith.busch, kirill.shutemov,
pasha.tatashin, alexander.h.duyck, ira.weiny, andreyknvl, arunks,
vbabka, cl, riel, keescook, hannes, npiggin, mathieu.desnoyers,
shakeelb, guro, aarcange, hughd, jglisse, mgorman,
daniel.m.jordan, linux-kernel, linux-mm, linux-api
On Thu 16-05-19 15:30:34, Michal Hocko wrote:
> [You are defining a new user visible API, please always add linux-api
> mailing list - now done]
>
> On Wed 15-05-19 18:11:15, Kirill Tkhai wrote:
[...]
> > The proposed syscall aims to introduce an interface that
> > supplements the existing process_vm_writev() and
> > process_vm_readv() and solves the problem with
> > anonymous memory transfer. The above example may be rewritten as:
> >
> > void *buf;
> >
> > buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> > MAP_PRIVATE|MAP_ANONYMOUS, ...);
> > recv(sock, buf, n * PAGE_SIZE, 0);
> >
> > /* Sign of @pid is direction: "from @pid task to current" or vice versa. */
> > process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
> > munmap(buf, n * PAGE_SIZE);
AFAIU this means that you actually want to do an mmap of an anonymous
memory with a COW semantic to the remote process right? How does the
remote process find out where and what has been mmaped? What if the
range collides? This sounds quite scary to me TBH. Why cannot you simply
use shared memory for that?
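[Editor's sketch, not from the original mail.] For comparison, the shared-memory alternative asked about here could look roughly like this, assuming a memfd whose descriptor the consumer inherits or otherwise receives. The function name share_via_memfd() and the fork()-based consumer are illustrative assumptions. Unlike the proposed syscall, the consumer has to mmap() the region itself, and the result is a shared, shmem-backed mapping rather than a private anonymous one.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * The producer fills a memfd-backed MAP_SHARED buffer; a forked consumer
 * maps the same fd on its own and checks the contents.
 * Returns 0 if the consumer saw the data, -1 on any failure.
 * (share_via_memfd is a hypothetical name used for illustration.)
 */
static int share_via_memfd(const char *msg)
{
	assert(msg != NULL);

	size_t len = strlen(msg) + 1;
	int fd = memfd_create("xfer", MFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}

	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		close(fd);
		return -1;
	}
	memcpy(buf, msg, len);	/* stands in for recv() plus processing */

	pid_t pid = fork();
	if (pid < 0) {
		munmap(buf, len);
		close(fd);
		return -1;
	}
	if (pid == 0) {
		/* Consumer: must create its own mapping of the shared object. */
		char *view = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
		_exit(view != MAP_FAILED && strcmp(view, msg) == 0 ? 0 : 1);
	}

	int status = 0;
	int ok = waitpid(pid, &status, 0) == pid &&
		 WIFEXITED(status) && WEXITSTATUS(status) == 0;

	munmap(buf, len);
	close(fd);
	return ok ? 0 : -1;
}
```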
--
Michal Hocko
SUSE Labs
* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
2019-05-16 13:32 ` Jann Horn
@ 2019-05-16 13:56 ` Kirill Tkhai
0 siblings, 0 replies; 5+ messages in thread
From: Kirill Tkhai @ 2019-05-16 13:56 UTC
To: Jann Horn
Cc: Andrew Morton, Dan Williams, Michal Hocko, keith.busch,
Kirill A . Shutemov, pasha.tatashin, Alexander Duyck, ira.weiny,
Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
Rik van Riel, Kees Cook, hannes, npiggin, Mathieu Desnoyers,
Shakeel Butt, Roman Gushchin, Andrea Arcangeli, Hugh Dickins,
Jerome Glisse
On 16.05.2019 16:32, Jann Horn wrote:
> On Wed, May 15, 2019 at 5:11 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> This patchset adds a new syscall, which makes it possible
>> to clone a mapping from one process to another.
>> The syscall supplements the functionality provided
>> by the process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situations.
> [...]
>> The proposed syscall aims to introduce an interface that
>> supplements the existing process_vm_writev() and
>> process_vm_readv() and solves the problem with
>> anonymous memory transfer. The above example may be rewritten as:
>>
>> void *buf;
>>
>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>> MAP_PRIVATE|MAP_ANONYMOUS, ...);
>> recv(sock, buf, n * PAGE_SIZE, 0);
>>
>> /* Sign of @pid is direction: "from @pid task to current" or vice versa. */
>> process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
>> munmap(buf, n * PAGE_SIZE);
>
> In this specific example, an alternative would be to splice() from the
> socket into /proc/$pid/mem, or something like that, right?
> proc_mem_operations has no ->splice_read() at the moment, and it'd
> need that to be more efficient, but that could be built without
> creating new UAPI, right?
I have simply never seen socket memory get pushed out to swap.
If so, there is a fundamental problem.
But, anyway, like you guessed below:
> But I guess maybe your workload is not that simple? What do you
> actually do with the received data between receiving it and shoving it
> over into the other process?
Data is usually sent over the socket encrypted and compressed, so it is not
possible to go this way. You may want to do anything with the data
before passing it to another process.
Kirill
* Re: [PATCH RFC 0/5] mm: process_vm_mmap() -- syscall for duplication a process mapping
2019-05-16 13:52 ` Michal Hocko
@ 2019-05-16 14:22 ` Kirill Tkhai
0 siblings, 0 replies; 5+ messages in thread
From: Kirill Tkhai @ 2019-05-16 14:22 UTC
To: Michal Hocko
Cc: akpm, dan.j.williams, keith.busch, kirill.shutemov,
pasha.tatashin, alexander.h.duyck, ira.weiny, andreyknvl, arunks,
vbabka, cl, riel, keescook, hannes, npiggin, mathieu.desnoyers,
shakeelb, guro, aarcange, hughd, jglisse, mgorman,
daniel.m.jordan, linux-kernel, linux-mm, linux-api
On 16.05.2019 16:52, Michal Hocko wrote:
> On Thu 16-05-19 15:30:34, Michal Hocko wrote:
>> [You are defining a new user visible API, please always add linux-api
>> mailing list - now done]
>>
>> On Wed 15-05-19 18:11:15, Kirill Tkhai wrote:
> [...]
>>> The proposed syscall aims to introduce an interface that
>>> supplements the existing process_vm_writev() and
>>> process_vm_readv() and solves the problem with
>>> anonymous memory transfer. The above example may be rewritten as:
>>>
>>> void *buf;
>>>
>>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>> MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>> recv(sock, buf, n * PAGE_SIZE, 0);
>>>
>>> /* Sign of @pid is direction: "from @pid task to current" or vice versa. */
>>> process_vm_mmap(-pid, buf, n * PAGE_SIZE, remote_addr, PVMMAP_FIXED);
>>> munmap(buf, n * PAGE_SIZE);
>
> AFAIU this means that you actually want to do an mmap of an anonymous
> memory with a COW semantic to the remote process right?
Yes.
> How does the remote process find out where and what has been mmaped?
In any way it likes. Isn't this a trivial task? :) You may use a socket or any
other appropriate Linux feature to communicate between them.
> What if the range collides? This sounds quite scary to me TBH.
If the range collides, the overlapping part of the old VMA becomes unmapped.
This is the same way we behave on an ordinary mmap. You may intersect a range
that another thread has mapped, so you need synchronization between
them. There is no difference in principle.
Also, I'm going to add a flag to prevent unmapping, as Kees suggested.
Please see his message.
> Why cannot you simply use shared memory for that?
Because the remote task may want a specific type of VMA. It may not want to
share a VMA with its children.
Speaking about online migration, a task wants its anonymous private VMAs
to remain the same after the migration. Otherwise, imagine the situation
where a task's stack becomes a shared VMA after the migration.
Also, the task wants an anonymous mapping to remain anonymous.
In general, if shared memory were enough for everything, we would
never have had the process_vm_writev() and process_vm_readv() syscalls.
Kirill