* Re: [RFC][PATCH] Cross Memory Attach
[not found] <20100915104855.41de3ebf@lilo>
@ 2010-09-15 8:02 ` Ingo Molnar
2010-09-15 8:16 ` Ingo Molnar
2010-09-15 13:20 ` Christopher Yeoh
2010-09-15 10:58 ` Avi Kivity
1 sibling, 2 replies; 22+ messages in thread
From: Ingo Molnar @ 2010-09-15 8:02 UTC (permalink / raw)
To: Christopher Yeoh
Cc: linux-kernel, Andrew Morton, Linus Torvalds, Peter Zijlstra,
linux-mm
(Interesting patch found on lkml, more folks Cc:-ed)
* Christopher Yeoh <cyeoh@au1.ibm.com> wrote:
> The basic idea behind cross memory attach is to allow MPI programs
> doing intra-node communication to do a single copy of the message
> rather than a double copy of the message via shared memory.
>
> The following patch attempts to achieve this by allowing a destination
> process, given an address and size from a source process, to copy
> memory directly from the source process into its own address space via
> a system call. There is also a symmetrical ability to copy from the
> current process's address space into a destination process's address
> space.
>
> Use of vmsplice instead was considered, but has problems. Since you
> need the reader and writer working co-operatively if the pipe is not
> drained then you block. Which requires some wrapping to do non
> blocking on the send side or polling on the receive. In all to all
> communication it requires ordering otherwise you can deadlock. And in
> the example of many MPI tasks writing to one MPI task vmsplice
> serialises the copying.
>
> I've added the use of this capability to OpenMPI and run some MPI
> benchmarks on a 64-way (with SMT off) Power6 machine which see
> improvements in the following areas:
>
> HPCC results:
> =============
>
> MB/s                       Num Processes
> Naturally Ordered        4      8     16     32
> Base                  1235    935    622    419
> CMA                   4741   3769   1977    703
>
>
> MB/s                       Num Processes
> Randomly Ordered         4      8     16     32
> Base                  1227    947    638    412
> CMA                   4666   3682   1978    710
>
> MB/s                       Num Processes
> Max Ping Pong            4      8     16     32
> Base                  2028   1938   1928   1882
> CMA                   7424   7510   7598   7708
>
>
> NPB:
> ====
> BT - 12% improvement
> FT - 15% improvement
> IS - 30% improvement
> SP - 34% improvement
>
> IMB:
> ===
>
> Ping Pong - ~30% improvement
> Ping Ping - ~120% improvement
> SendRecv - ~100% improvement
> Exchange - ~150% improvement
> Gather(v) - ~20% improvement
> Scatter(v) - ~20% improvement
> AlltoAll(v) - 30-50% improvement
>
> Patch is as below. Any comments?
Impressive numbers!
What did those OpenMPI facilities use before your patch - shared memory
or sockets?
I have an observation about the interface:
> +asmlinkage long sys_copy_from_process(pid_t pid, unsigned long addr,
> + unsigned long len,
> + char __user *buf, int flags);
> +asmlinkage long sys_copy_to_process(pid_t pid, unsigned long addr,
> + unsigned long len,
> + char __user *buf, int flags);
A small detail: 'int flags' should probably be 'unsigned long flags' -
it leaves more space.
Also, note that there is a further performance optimization possible
here: if the other task's ->mm is the same as this task's (they share
the MM), then the copy can be done straight in this process context,
without GUP. User-space might not necessarily be aware of this so it
might make sense to express this special case in the kernel too.
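For concreteness, that special case might look roughly like this inside
the copy routine (a sketch only, not part of the posted patch; the
parameter names loosely follow the patch, and copy_in_user() is only
available on some architectures):

	/*
	 * Hypothetical fast path: if the target shares our mm, both
	 * buffers are mapped in the current address space, so a plain
	 * user-to-user copy suffices - no get_user_pages(), no kmap().
	 */
	if (task->mm == current->mm) {
		void __user *dst = copy_to ? (void __user *)addr : buf;
		const void __user *src = copy_to ? buf : (void __user *)addr;

		if (copy_in_user(dst, src, len))
			return -EFAULT;
		return len;
	}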
More fundamentally, wouldn't it make sense to create an iovec interface
here? If the Gather(v) / Scatter(v) / AlltoAll(v) workloads have any
fragmentation on the user-space buffer side then the copy of multiple
areas could be done in a single syscall. (The MM lock has to be touched
only once, the target task has to be looked up only once, etc.)
Plus, a small naming detail, shouldn't the naming be more IO-like:
sys_process_vm_read()
sys_process_vm_write()
Basically a regular read()/write() interface, but instead of fd's we'd
have (PID,addr) identifiers for remote buffers, and instant execution
(no buffering).
This makes these somewhat special syscalls a bit less special :-)
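Putting the two suggestions together, the interface might end up looking
something like this (hypothetical prototypes to illustrate the idea, not
something from the posted patch):

	asmlinkage ssize_t sys_process_vm_read(pid_t pid,
			const struct iovec __user *local_iov,
			unsigned long liovcnt,
			const struct iovec __user *remote_iov,
			unsigned long riovcnt,
			unsigned long flags);

	asmlinkage ssize_t sys_process_vm_write(pid_t pid,
			const struct iovec __user *local_iov,
			unsigned long liovcnt,
			const struct iovec __user *remote_iov,
			unsigned long riovcnt,
			unsigned long flags);

Each call would return the number of bytes transferred, and a whole batch
of scattered areas could be moved with one MM lock acquisition and one
task lookup.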
[ In theory we could also use this new ABI in a way to help the various
RDMA efforts as well - but it looks way too complex. RDMA is rather
difficult from an OS design POV - and this special case you have
implemented is much easier to do, as we are in a single trust domain. ]
Thanks,
Ingo
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-15 8:02 ` [RFC][PATCH] Cross Memory Attach Ingo Molnar
@ 2010-09-15 8:16 ` Ingo Molnar
2010-09-15 13:23 ` Christopher Yeoh
2010-09-15 13:20 ` Christopher Yeoh
1 sibling, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2010-09-15 8:16 UTC (permalink / raw)
To: Christopher Yeoh
Cc: linux-kernel, Andrew Morton, Linus Torvalds, Peter Zijlstra,
linux-mm
> > NPB:
> > ====
> > BT - 12% improvement
> > FT - 15% improvement
> > IS - 30% improvement
> > SP - 34% improvement
> >
> > IMB:
> > ===
> >
> > Ping Pong - ~30% improvement
> > Ping Ping - ~120% improvement
> > SendRecv - ~100% improvement
> > Exchange - ~150% improvement
> > Gather(v) - ~20% improvement
> > Scatter(v) - ~20% improvement
> > AlltoAll(v) - 30-50% improvement
btw., how does OpenMPI signal the target tasks that something happened
to their address space - is there some pipe/socket side-channel, or
perhaps purely based on flags in the modified memory areas, which are
polled?
Ingo
* Re: [RFC][PATCH] Cross Memory Attach
[not found] <20100915104855.41de3ebf@lilo>
2010-09-15 8:02 ` [RFC][PATCH] Cross Memory Attach Ingo Molnar
@ 2010-09-15 10:58 ` Avi Kivity
2010-09-15 13:51 ` Ingo Molnar
` (2 more replies)
1 sibling, 3 replies; 22+ messages in thread
From: Avi Kivity @ 2010-09-15 10:58 UTC (permalink / raw)
To: Christopher Yeoh; +Cc: linux-kernel, Linux Memory Management List, Ingo Molnar
On 09/15/2010 03:18 AM, Christopher Yeoh wrote:
> The basic idea behind cross memory attach is to allow MPI programs doing
> intra-node communication to do a single copy of the message rather than
> a double copy of the message via shared memory.
If the host has a dma engine (many modern ones do) you can reduce this
to zero copies (at least, zero processor copies).
> The following patch attempts to achieve this by allowing a
> destination process, given an address and size from a source process, to
> copy memory directly from the source process into its own address space
> via a system call. There is also a symmetrical ability to copy from
> the current process's address space into a destination process's
> address space.
>
>
Instead of those two syscalls, how about a vmfd(pid_t pid, ulong start,
ulong len) system call which returns a file descriptor that represents
a portion of the process address space. You can then use preadv() and
pwritev() to copy memory, and io_submit(IO_CMD_PREADV) and
io_submit(IO_CMD_PWRITEV) for asynchronous variants (especially useful
with a dma engine, since that adds latency).
With some care (and use of mmu_notifiers) you can even mmap() your vmfd
and access remote process memory directly.
A nice property of file descriptors is that you can pass them around
securely via SCM_RIGHTS. So a process can create a window into its
address space and pass it to other processes.
(or you could just use a shared memory object and pass it around)
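To make the idea concrete, usage from the reading side might look
something like this (vmfd() is hypothetical, of course; preadv() and the
iovec handling are standard):

	#include <sys/uio.h>

	/* Hypothetical: get a window into the sender's address space. */
	int fd = vmfd(sender_pid, remote_addr, remote_len);

	struct iovec iov = {
		.iov_base = local_buf,
		.iov_len  = msg_len,
	};

	/* Copy msg_len bytes from offset 'off' within the window;
	 * offsets are relative to remote_addr. */
	ssize_t n = preadv(fd, &iov, 1, off);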
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-15 8:02 ` [RFC][PATCH] Cross Memory Attach Ingo Molnar
2010-09-15 8:16 ` Ingo Molnar
@ 2010-09-15 13:20 ` Christopher Yeoh
1 sibling, 0 replies; 22+ messages in thread
From: Christopher Yeoh @ 2010-09-15 13:20 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Andrew Morton, Linus Torvalds, Peter Zijlstra,
linux-mm
On Wed, 15 Sep 2010 10:02:35 +0200
Ingo Molnar <mingo@elte.hu> wrote:
>
> What did those OpenMPI facilities use before your patch - shared
> memory or sockets?
This comparison is against OpenMPI using the shared memory btl.
> I have an observation about the interface:
>
> A small detail: 'int flags' should probably be 'unsigned long flags'
> - it leaves more space.
ok.
> Also, note that there is a further performance optimization possible
> here: if the other task's ->mm is the same as this task's (they share
> the MM), then the copy can be done straight in this process context,
> without GUP. User-space might not necessarily be aware of this so it
> might make sense to express this special case in the kernel too.
ok.
> More fundamentally, wouldnt it make sense to create an iovec
> interface here? If the Gather(v) / Scatter(v) / AlltoAll(v) workloads
> have any fragmentation on the user-space buffer side then the copy of
> multiple areas could be done in a single syscall. (the MM lock has to
> be touched only once, target task only be looked up only once, etc.)
Yes, I think so. Currently where I'm using the interface in OpenMPI I
can't take advantage of this, but it could be changed in the future - and
it's likely other MPI implementations could take advantage of it already.
> Plus, a small naming detail, shouldnt the naming be more IO like:
>
> sys_process_vm_read()
> sys_process_vm_write()
Yes, that looks better to me. I really wasn't sure how to name them.
Regards,
Chris
--
cyeoh@au.ibm.com
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-15 8:16 ` Ingo Molnar
@ 2010-09-15 13:23 ` Christopher Yeoh
0 siblings, 0 replies; 22+ messages in thread
From: Christopher Yeoh @ 2010-09-15 13:23 UTC (permalink / raw)
To: Ingo Molnar
Cc: linux-kernel, Andrew Morton, Linus Torvalds, Peter Zijlstra,
linux-mm
On Wed, 15 Sep 2010 10:16:53 +0200
Ingo Molnar <mingo@elte.hu> wrote:
>
> btw., how does OpenMPI signal the target tasks that something
> happened to their address space - is there some pipe/socket
> side-channel, or perhaps purely based on flags in the modified memory
> areas, which are polled?
The shared memory btl signals through shared memory, though when
threading is enabled in OpenMPI (I think it's mostly used with threading
support disabled) there is also signalling done through a pipe.
Regards,
Chris
--
cyeoh@au.ibm.com
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-15 10:58 ` Avi Kivity
@ 2010-09-15 13:51 ` Ingo Molnar
2010-09-15 16:10 ` Avi Kivity
2010-09-15 14:42 ` Christopher Yeoh
2010-09-15 14:46 ` Bryan Donlan
2 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2010-09-15 13:51 UTC (permalink / raw)
To: Avi Kivity
Cc: Christopher Yeoh, linux-kernel, Linux Memory Management List,
Andrew Morton, Linus Torvalds, Peter Zijlstra
* Avi Kivity <avi@redhat.com> wrote:
> On 09/15/2010 03:18 AM, Christopher Yeoh wrote:
>
> > The basic idea behind cross memory attach is to allow MPI programs
> > doing intra-node communication to do a single copy of the message
> > rather than a double copy of the message via shared memory.
>
> If the host has a dma engine (many modern ones do) you can reduce this
> to zero copies (at least, zero processor copies).
>
> > The following patch attempts to achieve this by allowing a
> > destination process, given an address and size from a source
> > process, to copy memory directly from the source process into its
> > own address space via a system call. There is also a symmetrical
> > ability to copy from the current process's address space into a
> > destination process's address space.
>
> Instead of those two syscalls, how about a vmfd(pid_t pid, ulong
> start, ulong len) system call which returns an file descriptor that
> represents a portion of the process address space. You can then use
> preadv() and pwritev() to copy memory, and io_submit(IO_CMD_PREADV)
> and io_submit(IO_CMD_PWRITEV) for asynchronous variants (especially
> useful with a dma engine, since that adds latency).
>
> With some care (and use of mmu_notifiers) you can even mmap() your
> vmfd and access remote process memory directly.
>
> A nice property of file descriptors is that you can pass them around
> securely via SCM_RIGHTS. So a process can create a window into its
> address space and pass it to other processes.
>
> (or you could just use a shared memory object and pass it around)
Interesting, but how will that work in a scalable way with lots of
non-thread tasks?
Say we have 100 processes. We'd have to have 100 fd's - each has to be
passed to a new worker process.
In that sense a PID is just as good a reference as an fd - it can be
looked up lockless, etc. - but has the added advantage that it can be
passed along just by number.
Thanks,
Ingo
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-15 10:58 ` Avi Kivity
2010-09-15 13:51 ` Ingo Molnar
@ 2010-09-15 14:42 ` Christopher Yeoh
2010-09-15 14:52 ` Linus Torvalds
2010-09-16 6:32 ` Brice Goglin
2010-09-15 14:46 ` Bryan Donlan
2 siblings, 2 replies; 22+ messages in thread
From: Christopher Yeoh @ 2010-09-15 14:42 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-kernel, Linux Memory Management List, Ingo Molnar
On Wed, 15 Sep 2010 12:58:15 +0200
Avi Kivity <avi@redhat.com> wrote:
> On 09/15/2010 03:18 AM, Christopher Yeoh wrote:
> > The basic idea behind cross memory attach is to allow MPI programs
> > doing intra-node communication to do a single copy of the message
> > rather than a double copy of the message via shared memory.
>
> If the host has a dma engine (many modern ones do) you can reduce
> this to zero copies (at least, zero processor copies).
Yes, this interface doesn't really support that. I've tried to keep
things really simple here, but I see potential for increasing levels of
support/complexity, with diminishing returns:
1. single copy (basically what the current implementation does)
2. support for async dma offload (rather arch specific)
3. ability to map part of another process's address space directly into
the current one. Would have setup/tear down overhead, but this would
be useful specifically for reduction operations where we don't even
need to really copy the data once at all, but use it directly in
arithmetic/logical operations on the receiver.
For reference, there is also knem http://runtime.bordeaux.inria.fr/knem/
which does implement (2) for I/OAT, though it looks to me like the
interface and implementation are, relatively speaking, quite a bit more
complex.
> Instead of those two syscalls, how about a vmfd(pid_t pid, ulong
> start, ulong len) system call which returns an file descriptor that
> represents a portion of the process address space. You can then use
> preadv() and pwritev() to copy memory, and io_submit(IO_CMD_PREADV)
> and io_submit(IO_CMD_PWRITEV) for asynchronous variants (especially
> useful with a dma engine, since that adds latency).
>
> With some care (and use of mmu_notifiers) you can even mmap() your
> vmfd and access remote process memory directly.
That interface sounds interesting (I'm not sure I understand how
this would be implemented), though this would mean that a file
descriptor would need to be created for every message that
each process sent, wouldn't it?
Regards,
Chris
--
cyeoh@au.ibm.com
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-15 10:58 ` Avi Kivity
2010-09-15 13:51 ` Ingo Molnar
2010-09-15 14:42 ` Christopher Yeoh
@ 2010-09-15 14:46 ` Bryan Donlan
2010-09-15 16:13 ` Avi Kivity
` (2 more replies)
2 siblings, 3 replies; 22+ messages in thread
From: Bryan Donlan @ 2010-09-15 14:46 UTC (permalink / raw)
To: Avi Kivity
Cc: Christopher Yeoh, linux-kernel, Linux Memory Management List,
Ingo Molnar
On Wed, Sep 15, 2010 at 19:58, Avi Kivity <avi@redhat.com> wrote:
> Instead of those two syscalls, how about a vmfd(pid_t pid, ulong start,
> ulong len) system call which returns an file descriptor that represents a
> portion of the process address space. You can then use preadv() and
> pwritev() to copy memory, and io_submit(IO_CMD_PREADV) and
> io_submit(IO_CMD_PWRITEV) for asynchronous variants (especially useful with
> a dma engine, since that adds latency).
>
> With some care (and use of mmu_notifiers) you can even mmap() your vmfd and
> access remote process memory directly.
Rather than introducing a new vmfd() API for this, why not just add
implementations for these more efficient operations to the existing
/proc/$pid/mem interface?
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-15 14:42 ` Christopher Yeoh
@ 2010-09-15 14:52 ` Linus Torvalds
2010-09-15 15:44 ` Robin Holt
2010-09-16 6:32 ` Brice Goglin
1 sibling, 1 reply; 22+ messages in thread
From: Linus Torvalds @ 2010-09-15 14:52 UTC (permalink / raw)
To: Christopher Yeoh
Cc: Avi Kivity, linux-kernel, Linux Memory Management List,
Ingo Molnar
On Wed, Sep 15, 2010 at 7:42 AM, Christopher Yeoh <cyeoh@au1.ibm.com> wrote:
> On Wed, 15 Sep 2010 12:58:15 +0200
> Avi Kivity <avi@redhat.com> wrote:
>
>> On 09/15/2010 03:18 AM, Christopher Yeoh wrote:
>> > The basic idea behind cross memory attach is to allow MPI programs
>> > doing intra-node communication to do a single copy of the message
>> > rather than a double copy of the message via shared memory.
>>
>> If the host has a dma engine (many modern ones do) you can reduce
>> this to zero copies (at least, zero processor copies).
>
> Yes, this interface doesn't really support that. I've tried to keep
> things really simple here, but I see potential for increasing
> level/complexity of support with diminishing returns:
I think keeping things simple is a good goal. The vmfd() approach
might be worth looking into, but your patch certainly is pretty simple
as-is.
That said, it's also buggy. You can't just get a task and then do
down_read(task->mm->mmap_sem)
on it. Not even if you have a refcount. The mm may well go away. You
need to do the same thing "get_task_mm()" does, i.e. look up the mm
under task_lock, and get a reference to it. You already get the
task-lock for permission testing, so it looks like doing it there
would likely work out.
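For reference, get_task_mm() (kernel/fork.c) does roughly the following
- paraphrased, so check the actual source for the details:

	struct mm_struct *mm;

	task_lock(task);
	mm = task->mm;
	if (mm) {
		if (task->flags & PF_KTHREAD)
			mm = NULL;		/* kernel threads only borrow an mm */
		else
			atomic_inc(&mm->mm_users);	/* pin it */
	}
	task_unlock(task);
	/* ... use mm, then mmput(mm) when done ... */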
> 3. ability to map part of another process's address space directly into
> the current one. Would have setup/tear down overhead, but this would
> be useful specifically for reduction operations where we don't even
> need to really copy the data once at all, but use it directly in
> arithmetic/logical operations on the receiver.
Don't even think about this. If you want to map another task's memory,
use shared memory. The shared memory code knows about that. The races
for anything else are crazy.
Linus
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-15 14:52 ` Linus Torvalds
@ 2010-09-15 15:44 ` Robin Holt
0 siblings, 0 replies; 22+ messages in thread
From: Robin Holt @ 2010-09-15 15:44 UTC (permalink / raw)
To: Linus Torvalds
Cc: Christopher Yeoh, Avi Kivity, linux-kernel,
Linux Memory Management List, Ingo Molnar
> > 3. ability to map part of another process's address space directly into
> > the current one. Would have setup/tear down overhead, but this would
> > be useful specifically for reduction operations where we don't even
> > need to really copy the data once at all, but use it directly in
> > arithmetic/logical operations on the receiver.
>
> Don't even think about this. If you want to map another tasks memory,
> use shared memory. The shared memory code knows about that. The races
> for anything else are crazy.
SGI has a similar, but significantly more difficult, problem to solve and
has written a fairly complex driver to handle exactly the scenario IBM
is proposing. In our case, not only are we trying to directly access one
process's memory, we are doing it from a completely different operating
system instance operating on the same NUMA fabric.
In our case (I have not looked at IBM's patch), we are actually using
get_user_pages() to get extra references on struct pages. We are
judicious about reference counting the mm and we use get_task_mm in all
places with the exception of process teardown (ignorable detail for now).
We have a fault handler inserting PFNs as appropriate. You can guess
at the complexity. Even with all its complexity, we still need to
caveat certain functionality as not being supported.
If we were to try and get that driver included in the kernel, how would
you suggest we expand the shared memory code to include support for the
coordination needed between those separate operating system instances?
I am genuinely interested and not trying to be argumentative. This has
been on my "Get done before Aug-1" list for months and I have not had
any time to pursue it.
Thanks,
Robin
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-15 13:51 ` Ingo Molnar
@ 2010-09-15 16:10 ` Avi Kivity
0 siblings, 0 replies; 22+ messages in thread
From: Avi Kivity @ 2010-09-15 16:10 UTC (permalink / raw)
To: Ingo Molnar
Cc: Christopher Yeoh, linux-kernel, Linux Memory Management List,
Andrew Morton, Linus Torvalds, Peter Zijlstra
On 09/15/2010 03:51 PM, Ingo Molnar wrote:
> * Avi Kivity<avi@redhat.com> wrote:
>
>> On 09/15/2010 03:18 AM, Christopher Yeoh wrote:
>>
>>> The basic idea behind cross memory attach is to allow MPI programs
>>> doing intra-node communication to do a single copy of the message
>>> rather than a double copy of the message via shared memory.
>> If the host has a dma engine (many modern ones do) you can reduce this
>> to zero copies (at least, zero processor copies).
>>
>>> The following patch attempts to achieve this by allowing a
>>> destination process, given an address and size from a source
>>> process, to copy memory directly from the source process into its
>>> own address space via a system call. There is also a symmetrical
>>> ability to copy from the current process's address space into a
>>> destination process's address space.
>> Instead of those two syscalls, how about a vmfd(pid_t pid, ulong
>> start, ulong len) system call which returns an file descriptor that
>> represents a portion of the process address space. You can then use
>> preadv() and pwritev() to copy memory, and io_submit(IO_CMD_PREADV)
>> and io_submit(IO_CMD_PWRITEV) for asynchronous variants (especially
>> useful with a dma engine, since that adds latency).
>>
>> With some care (and use of mmu_notifiers) you can even mmap() your
>> vmfd and access remote process memory directly.
>>
>> A nice property of file descriptors is that you can pass them around
>> securely via SCM_RIGHTS. So a process can create a window into its
>> address space and pass it to other processes.
>>
>> (or you could just use a shared memory object and pass it around)
> Interesting, but how will that work in a scalable way with lots of
> non-thread tasks?
>
> Say we have 100 processes. We'd have to have 100 fd's - each has to be
> passed to a new worker process.
>
> In that sense a PID is just as good of a reference as an fd - it can be
> looked up lockless, etc. - but has the added advantage that it can be
> passed along just by number.
>
>
It also has better life-cycle control (with just a pid, you never know
what it refers to unless you're its parent). It would have been better if
clone() returned an fd from which you could derive the pid if you wanted
to present it to the user.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-15 14:46 ` Bryan Donlan
@ 2010-09-15 16:13 ` Avi Kivity
2010-09-15 19:35 ` Eric W. Biederman
2010-09-16 1:18 ` Christopher Yeoh
2010-09-16 1:58 ` KOSAKI Motohiro
2 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2010-09-15 16:13 UTC (permalink / raw)
To: Bryan Donlan
Cc: Christopher Yeoh, linux-kernel, Linux Memory Management List,
Ingo Molnar
On 09/15/2010 04:46 PM, Bryan Donlan wrote:
> On Wed, Sep 15, 2010 at 19:58, Avi Kivity<avi@redhat.com> wrote:
>
>> Instead of those two syscalls, how about a vmfd(pid_t pid, ulong start,
>> ulong len) system call which returns an file descriptor that represents a
>> portion of the process address space. You can then use preadv() and
>> pwritev() to copy memory, and io_submit(IO_CMD_PREADV) and
>> io_submit(IO_CMD_PWRITEV) for asynchronous variants (especially useful with
>> a dma engine, since that adds latency).
>>
>> With some care (and use of mmu_notifiers) you can even mmap() your vmfd and
>> access remote process memory directly.
> Rather than introducing a new vmfd() API for this, why not just add
> implementations for these more efficient operations to the existing
> /proc/$pid/mem interface?
Yes, opening that file should be equivalent (and you could certainly
implement aio via dma for it).
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-15 16:13 ` Avi Kivity
@ 2010-09-15 19:35 ` Eric W. Biederman
0 siblings, 0 replies; 22+ messages in thread
From: Eric W. Biederman @ 2010-09-15 19:35 UTC (permalink / raw)
To: Avi Kivity
Cc: Bryan Donlan, Christopher Yeoh, linux-kernel,
Linux Memory Management List, Ingo Molnar, Linus Torvalds,
Valdis.Kletnieks, Alan Cox, Robin Holt
Avi Kivity <avi@redhat.com> writes:
> On 09/15/2010 04:46 PM, Bryan Donlan wrote:
>> On Wed, Sep 15, 2010 at 19:58, Avi Kivity<avi@redhat.com> wrote:
>>
>>> Instead of those two syscalls, how about a vmfd(pid_t pid, ulong start,
>>> ulong len) system call which returns an file descriptor that represents a
>>> portion of the process address space. You can then use preadv() and
>>> pwritev() to copy memory, and io_submit(IO_CMD_PREADV) and
>>> io_submit(IO_CMD_PWRITEV) for asynchronous variants (especially useful with
>>> a dma engine, since that adds latency).
>>>
>>> With some care (and use of mmu_notifiers) you can even mmap() your vmfd and
>>> access remote process memory directly.
>> Rather than introducing a new vmfd() API for this, why not just add
>> implementations for these more efficient operations to the existing
>> /proc/$pid/mem interface?
>
> Yes, opening that file should be equivalent (and you could certainly implement
> aio via dma for it).
I will second this: /proc/$pid/mem is semantically the same, and it would
really be good if this patch became a patch optimizing that case.
Otherwise we have code duplication and thus dilution of knowledge in
two different places for no discernible reason, hindering long-term
maintenance.
+int copy_to_from_process_allowed(struct task_struct *task)
+{
+ /* Allow copy_to_from_process to access another process using
+ the same critera as a process would be allowed to ptrace
+ that same process */
+ const struct cred *cred = current_cred(), *tcred;
+
+ rcu_read_lock();
+ tcred = __task_cred(task);
+ if ((cred->uid != tcred->euid ||
+ cred->uid != tcred->suid ||
+ cred->uid != tcred->uid ||
+ cred->gid != tcred->egid ||
+ cred->gid != tcred->sgid ||
+ cred->gid != tcred->gid) &&
+ !capable(CAP_SYS_PTRACE)) {
+ rcu_read_unlock();
+ return 0;
+ }
+ rcu_read_unlock();
+ return 1;
+}
This hunk of the patch is a copy of __ptrace_may_access with the security
hooks removed. The code duplication, the removal of the dumpable check,
and the removal of the security hooks all look like a bad idea.
Removing the other checks in check_mem_permission seems reasonable as
those appear to be overly paranoid.
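For comparison, the tail of kernel/ptrace.c:__ptrace_may_access() - the
part the copied check drops - looks roughly like this (condensed and
paraphrased from the kernels of that era):

	smp_rmb();
	if (task->mm)
		dumpable = get_dumpable(task->mm);
	if (!dumpable && !capable(CAP_SYS_PTRACE))
		return -EPERM;

	/* and finally the LSM hook */
	return security_ptrace_access_check(task, mode);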
Hmm. This is weird:
+ /* Get the pages we're interested in */
+ pages_pinned = get_user_pages(task, task->mm, pa,
+ nr_pages_to_copy,
+ copy_to, 0, process_pages, NULL);
+
+ if (pages_pinned != nr_pages_to_copy)
+ goto end;
+
+ /* Do the copy for each page */
+ for (i = 0; i < nr_pages_to_copy; i++) {
+ target_kaddr = kmap(process_pages[i]) + start_offset;
+ bytes_to_copy = min(PAGE_SIZE - start_offset,
+ len - *bytes_copied);
+ if (start_offset)
+ start_offset = 0;
+
+ if (copy_to) {
+ ret = copy_from_user(target_kaddr,
+ user_buf + *bytes_copied,
+ bytes_to_copy);
+ if (ret) {
+ kunmap(process_pages[i]);
+ goto end;
+ }
+ } else {
+ ret = copy_to_user(user_buf + *bytes_copied,
+ target_kaddr, bytes_to_copy);
+ if (ret) {
+ kunmap(process_pages[i]);
+ goto end;
+ }
+ }
+ kunmap(process_pages[i]);
+ *bytes_copied += bytes_to_copy;
+ }
+
That hunk of code appears to be a copy of mm/memory.c:access_process_vm.
It is a little more optimized, taking the get_user_pages() out of the
inner loop, but otherwise it is pretty much the same code.
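For comparison, the core of access_process_vm() at the time looked
roughly like this (condensed and paraphrased; the real function also
handles VMAs without struct pages via vm_ops->access):

	mm = get_task_mm(tsk);
	if (!mm)
		return 0;

	down_read(&mm->mmap_sem);
	while (len) {
		/* one page per get_user_pages() call - the part the
		 * patch hoists out of its inner loop */
		ret = get_user_pages(tsk, mm, addr, 1, write, 1, &page, &vma);
		if (ret <= 0)
			break;

		offset = addr & (PAGE_SIZE - 1);
		bytes = min_t(int, len, PAGE_SIZE - offset);

		maddr = kmap(page);
		if (write) {
			copy_to_user_page(vma, page, addr,
					  maddr + offset, buf, bytes);
			set_page_dirty_lock(page);
		} else {
			copy_from_user_page(vma, page, addr,
					    buf, maddr + offset, bytes);
		}
		kunmap(page);
		page_cache_release(page);

		len -= bytes; buf += bytes; addr += bytes;
	}
	up_read(&mm->mmap_sem);
	mmput(mm);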
So I would argue it makes sense to optimize access_process_vm.
So unless there are fundamental bottlenecks to performance that I am not
seeing, please optimize the existing code paths in the kernel that do
exactly what you are trying to do.
Thanks,
Eric
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-15 14:46 ` Bryan Donlan
2010-09-15 16:13 ` Avi Kivity
@ 2010-09-16 1:18 ` Christopher Yeoh
2010-09-16 9:26 ` Avi Kivity
2010-09-16 1:58 ` KOSAKI Motohiro
2 siblings, 1 reply; 22+ messages in thread
From: Christopher Yeoh @ 2010-09-16 1:18 UTC (permalink / raw)
To: Bryan Donlan
Cc: Avi Kivity, linux-kernel, Linux Memory Management List,
Ingo Molnar
On Wed, 15 Sep 2010 23:46:09 +0900
Bryan Donlan <bdonlan@gmail.com> wrote:
> On Wed, Sep 15, 2010 at 19:58, Avi Kivity <avi@redhat.com> wrote:
>
> > Instead of those two syscalls, how about a vmfd(pid_t pid, ulong
> > start, ulong len) system call which returns an file descriptor that
> > represents a portion of the process address space. You can then
> > use preadv() and pwritev() to copy memory, and
> > io_submit(IO_CMD_PREADV) and io_submit(IO_CMD_PWRITEV) for
> > asynchronous variants (especially useful with a dma engine, since
> > that adds latency).
> >
> > With some care (and use of mmu_notifiers) you can even mmap() your
> > vmfd and access remote process memory directly.
>
> Rather than introducing a new vmfd() API for this, why not just add
> implementations for these more efficient operations to the existing
> /proc/$pid/mem interface?
Perhaps I'm misunderstanding something here, but
accessing /proc/$pid/mem requires ptracing the target process.
We can't really have all these MPI processes ptracing each other
just to send/receive a message....
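The check in question is fs/proc/base.c:check_mem_permission(), which at
the time looked roughly like this (paraphrased - details vary between
kernel versions):

	if (task == current)
		return 0;	/* a task may always access its own mem */

	/*
	 * Otherwise the caller must already be ptracing the stopped or
	 * traced target, and be allowed to attach to it.
	 */
	if (task_is_stopped_or_traced(task)) {
		int match;

		rcu_read_lock();
		match = (tracehook_tracer_task(task) == current);
		rcu_read_unlock();
		if (match && ptrace_may_access(task, PTRACE_MODE_ATTACH))
			return 0;
	}
	return -EPERM;

So the target would effectively have to be stopped under ptrace by the
caller for every transfer.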
Regards,
Chris
--
cyeoh@au.ibm.com
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-15 14:46 ` Bryan Donlan
2010-09-15 16:13 ` Avi Kivity
2010-09-16 1:18 ` Christopher Yeoh
@ 2010-09-16 1:58 ` KOSAKI Motohiro
2010-09-16 8:08 ` Ingo Molnar
2 siblings, 1 reply; 22+ messages in thread
From: KOSAKI Motohiro @ 2010-09-16 1:58 UTC (permalink / raw)
To: Bryan Donlan
Cc: kosaki.motohiro, Avi Kivity, Christopher Yeoh, linux-kernel,
Linux Memory Management List, Ingo Molnar
> On Wed, Sep 15, 2010 at 19:58, Avi Kivity <avi@redhat.com> wrote:
>
> > Instead of those two syscalls, how about a vmfd(pid_t pid, ulong start,
> > ulong len) system call which returns an file descriptor that represents a
> > portion of the process address space. You can then use preadv() and
> > pwritev() to copy memory, and io_submit(IO_CMD_PREADV) and
> > io_submit(IO_CMD_PWRITEV) for asynchronous variants (especially useful with
> > a dma engine, since that adds latency).
> >
> > With some care (and use of mmu_notifiers) you can even mmap() your vmfd and
> > access remote process memory directly.
>
> Rather than introducing a new vmfd() API for this, why not just add
> implementations for these more efficient operations to the existing
> /proc/$pid/mem interface?
As far as I heard from my friend, the old HP MPI implementation used /proc/$pid/mem
for this purpose (I don't know its current status). However, almost no implementation
does that, because /proc/$pid/mem requires that the process is ptraced.
As far as I understand, very old /proc/$pid/mem didn't require it, but that changed
out of security concern. Since then, nobody has wanted to change this interface because
they worry it would break security.
But I don't know exactly what the "the process is ptraced" check protects. If anyone
can explain the reason and we can remove it, I'm not against that at all.
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-15 14:42 ` Christopher Yeoh
2010-09-15 14:52 ` Linus Torvalds
@ 2010-09-16 6:32 ` Brice Goglin
2010-09-16 9:15 ` Brice Goglin
1 sibling, 1 reply; 22+ messages in thread
From: Brice Goglin @ 2010-09-16 6:32 UTC (permalink / raw)
To: Christopher Yeoh
Cc: Avi Kivity, linux-kernel, Linux Memory Management List,
Ingo Molnar
On 15/09/2010 16:42, Christopher Yeoh wrote:
> On Wed, 15 Sep 2010 12:58:15 +0200
> Avi Kivity <avi@redhat.com> wrote:
>
>
>> On 09/15/2010 03:18 AM, Christopher Yeoh wrote:
>>
>>> The basic idea behind cross memory attach is to allow MPI programs
>>> doing intra-node communication to do a single copy of the message
>>> rather than a double copy of the message via shared memory.
>>>
>> If the host has a dma engine (many modern ones do) you can reduce
>> this to zero copies (at least, zero processor copies).
>>
> Yes, this interface doesn't really support that. I've tried to keep
> things really simple here, but I see potential for increasing
> level/complexity of support with diminishing returns:
>
> 1. single copy (basically what the current implementation does)
> 2. support for async dma offload (rather arch specific)
> 3. ability to map part of another process's address space directly into
> the current one. Would have setup/tear down overhead, but this would
> be useful specifically for reduction operations where we don't even
> need to really copy the data once at all, but use it directly in
> arithmetic/logical operations on the receiver.
>
> For reference, there is also knem http://runtime.bordeaux.inria.fr/knem/
> which does implement (2) for I/OAT, though it looks to me the interface
> and implementation are relatively speaking quite a bit more complex.
>
I am the guy doing KNEM so I can comment on this. The I/OAT part of KNEM
was mostly a research topic; it's mostly useless on current machines
since memcpy performance is much higher than the I/OAT DMA engine's. We
also have an offload model with a kernel thread, but it hasn't been used
a lot so far. These features can be ignored for the current discussion.
We've been working on this for a while with MPICH and OpenMPI developers
(both already use KNEM), and here's what I think is missing in
Christopher's proposal:
* Vectorial buffer support: MPI likes things like datatypes, which make
buffers non-contiguous. You could add vectorial buffer support to your
interface, but the users would have to store the data representation of
each process in all processes. Not a good idea; it's easier to keep the
knowledge of the non-contiguousness of the remote buffer only in the
remote process.
* Collectives: You don't want to pin/unpin the same region over and
over; it's overkill when multiple processes are reading from the same
exact buffer (broadcast) or from contiguous parts of the same buffer
(scatter).
So what we do in KNEM is:
* declare a memory region (sets of non-contiguous segments + protection),
i.e. get_user_pages them, and return an associated cookie id
* have syscalls to read/write from region given a cookie, an offset in
the region and a length
This one-sided interface looks like an InfiniBand model, but only for
intra-node data transfers.
So OpenMPI and MPICH declare regions, pass their cookies through their
shared-memory buffer, and the remote process reads from there. Then,
they notify the first process that it may destroy the region (can be
automatic if the region creator passed a specific flag saying destroy
after first use).
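In rough pseudo-C, the model is something like the following - the names
here are made up for illustration and are not the actual KNEM API:

	/* Sender: pin a (possibly non-contiguous) buffer, get a cookie. */
	uint64_t cookie = region_declare(segs, nsegs, PROT_READ,
					 DESTROY_AFTER_FIRST_USE);

	/* ... pass 'cookie' to the peer via the shared-memory channel ... */

	/* Receiver: copy 'len' bytes starting at 'offset' inside the
	 * region into local iovecs, naming the remote buffer only by
	 * its cookie. */
	region_copy(cookie, offset, local_iov, iovcnt, len);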
Brice
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-16 1:58 ` KOSAKI Motohiro
@ 2010-09-16 8:08 ` Ingo Molnar
0 siblings, 0 replies; 22+ messages in thread
From: Ingo Molnar @ 2010-09-16 8:08 UTC (permalink / raw)
To: KOSAKI Motohiro, Alexander Viro, Chris Wright, Andrew Morton
Cc: Bryan Donlan, Avi Kivity, Christopher Yeoh, linux-kernel,
Linux Memory Management List, Linus Torvalds
* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > On Wed, Sep 15, 2010 at 19:58, Avi Kivity <avi@redhat.com> wrote:
> >
> > > Instead of those two syscalls, how about a vmfd(pid_t pid, ulong start,
> > > ulong len) system call which returns an file descriptor that represents a
> > > portion of the process address space. You can then use preadv() and
> > > pwritev() to copy memory, and io_submit(IO_CMD_PREADV) and
> > > io_submit(IO_CMD_PWRITEV) for asynchronous variants (especially useful with
> > > a dma engine, since that adds latency).
> > >
> > > With some care (and use of mmu_notifiers) you can even mmap() your vmfd and
> > > access remote process memory directly.
> >
> > Rather than introducing a new vmfd() API for this, why not just add
> > implementations for these more efficient operations to the existing
> > /proc/$pid/mem interface?
>
> As far as I heared from my friend, old HP MPI implementation used
> /proc/$pid/mem for this purpose. (I don't know current status).
> However almost implementation doesn't do that because /proc/$pid/mem
> required the process is ptraced. As far as I understand , very old
> /proc/$pid/mem doesn't require it. but It changed for security
> concern. Then, Anybody haven't want to change this interface because
> they worry break security.
>
> But, I don't know what exactly protected "the process is ptraced"
> check. If anyone explain the reason and we can remove it. I'm not
> againt at all.
I did some Git digging - that ptrace check for /proc/$pid/mem read/write
goes all the way back to the beginning of written human history, aka
Linux v2.6.12-rc2.
I researched the fragmented history of the stone ages as well, I checked
out numerous cave paintings, and while much was lost, I was able to
recover this old fragment of a clue in the cave called 'patch-2.3.27',
carbon-dated back as far as the previous millennium (!):
mem_read() in fs/proc/base.c:
+ * 1999, Al Viro. Rewritten. Now it covers the whole per-process part.
+ * Instead of using magical inumbers to determine the kind of object
+ * we allocate and fill in-core inodes upon lookup. They don't even
+ * go into icache. We cache the reference to task_struct upon lookup too.
+ * Eventually it should become a filesystem in its own. We don't use the
+ * rest of procfs anymore.
In such a long timespan language has changed much, so not all of this
scribbling can be interpreted - but one thing appears to be sure: this
is where the MAY_PTRACE() restriction was introduced to /proc/$pid/mem -
as part of a massive rewrite.
Alas, the reason for the restriction was not documented, and is feared
to be lost forever.
Thanks,
Ingo
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-16 6:32 ` Brice Goglin
@ 2010-09-16 9:15 ` Brice Goglin
2010-09-16 14:00 ` Christopher Yeoh
0 siblings, 1 reply; 22+ messages in thread
From: Brice Goglin @ 2010-09-16 9:15 UTC (permalink / raw)
To: Christopher Yeoh
Cc: Avi Kivity, linux-kernel, Linux Memory Management List,
Ingo Molnar
On 16/09/2010 08:32, Brice Goglin wrote:
> I am the guy doing KNEM so I can comment on this. The I/OAT part of KNEM
> was mostly a research topic, it's mostly useless on current machines
> since the memcpy performance is much larger than I/OAT DMA Engine. We
> also have an offload model with a kernel thread, but it wasn't used a
> lot so far. These features can be ignored for the current discussion.
I've just created a knem branch where I removed all the above, and some
other stuff that is not necessary for normal users. So it just contains
the region management code and two commands to copy between regions or
between a region and some local iovecs.
Commands are visible at (it still uses ioctls, since that doesn't matter
while discussing the features):
https://gforge.inria.fr/scm/viewvc.php/*checkout*/branches/kernel/driver/linux/knem_main.c?root=knem&content-type=text%2Fplain
And the actual driver is at:
https://gforge.inria.fr/scm/viewvc.php/*checkout*/branches/kernel/common/knem_io.h?root=knem&content-type=text%2Fplain
Brice
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-16 1:18 ` Christopher Yeoh
@ 2010-09-16 9:26 ` Avi Kivity
2010-11-02 3:37 ` Christopher Yeoh
0 siblings, 1 reply; 22+ messages in thread
From: Avi Kivity @ 2010-09-16 9:26 UTC (permalink / raw)
To: Christopher Yeoh
Cc: Bryan Donlan, linux-kernel, Linux Memory Management List,
Ingo Molnar
On 09/16/2010 03:18 AM, Christopher Yeoh wrote:
> On Wed, 15 Sep 2010 23:46:09 +0900
> Bryan Donlan<bdonlan@gmail.com> wrote:
>
> > On Wed, Sep 15, 2010 at 19:58, Avi Kivity<avi@redhat.com> wrote:
> >
> > > Instead of those two syscalls, how about a vmfd(pid_t pid, ulong
> > > start, ulong len) system call which returns an file descriptor that
> > > represents a portion of the process address space. You can then
> > > use preadv() and pwritev() to copy memory, and
> > > io_submit(IO_CMD_PREADV) and io_submit(IO_CMD_PWRITEV) for
> > > asynchronous variants (especially useful with a dma engine, since
> > > that adds latency).
> > >
> > > With some care (and use of mmu_notifiers) you can even mmap() your
> > > vmfd and access remote process memory directly.
> >
> > Rather than introducing a new vmfd() API for this, why not just add
> > implementations for these more efficient operations to the existing
> > /proc/$pid/mem interface?
>
> Perhaps I'm misunderstanding something here, but
> accessing /proc/$pid/mem requires ptracing the target process.
> We can't really have all these MPI processes ptraceing each other
> just to send/receive a message....
>
You could have each process open /proc/self/mem and pass the fd using
SCM_RIGHTS.
That eliminates a race; with copy_to_process(), by the time the pid is
looked up it might designate a different process.
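Passing the fd itself is plain SCM_RIGHTS ancillary data over a Unix
socket - a minimal sketch (error handling omitted):

	#include <string.h>
	#include <sys/socket.h>
	#include <sys/uio.h>

	static void send_fd(int sock, int fd)
	{
		char dummy = 'x';
		struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
		char cbuf[CMSG_SPACE(sizeof(int))];
		struct msghdr msg = { 0 };
		struct cmsghdr *cmsg;

		msg.msg_iov = &iov;
		msg.msg_iovlen = 1;
		msg.msg_control = cbuf;
		msg.msg_controllen = sizeof(cbuf);

		cmsg = CMSG_FIRSTHDR(&msg);
		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_RIGHTS;
		cmsg->cmsg_len = CMSG_LEN(sizeof(int));
		memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

		sendmsg(sock, &msg, 0);
	}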
--
error compiling committee.c: too many arguments to function
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-16 9:15 ` Brice Goglin
@ 2010-09-16 14:00 ` Christopher Yeoh
0 siblings, 0 replies; 22+ messages in thread
From: Christopher Yeoh @ 2010-09-16 14:00 UTC (permalink / raw)
To: Brice Goglin; +Cc: linux-kernel, Linux Memory Management List
On Thu, 16 Sep 2010 11:15:10 +0200
Brice Goglin <Brice.Goglin@inria.fr> wrote:
> On 16/09/2010 08:32, Brice Goglin wrote:
> > I am the guy doing KNEM so I can comment on this. The I/OAT part of
> > KNEM was mostly a research topic, it's mostly useless on current
> > machines since the memcpy performance is much larger than I/OAT DMA
> > Engine. We also have an offload model with a kernel thread, but it
> > wasn't used a lot so far. These features can be ignored for the
> > current discussion.
>
> I've just created a knem branch where I removed all the above, and
> some other stuff that are not necessary for normal users. So it just
> contains the region management code and two commands to copy between
> regions or between a region and some local iovecs.
When I did the original hpcc runs for CMA vs shared mem double copy I
also did some KNEM runs as a bit of a sanity check. The CMA OpenMPI
implementation actually uses the infrastructure KNEM put into the
OpenMPI shared mem btl - thanks for that btw, it made things much easier
for me to test CMA.
Interestingly, although KNEM and CMA fundamentally are doing very
similar things, at least with hpcc I didn't see as much of a gain with
KNEM as with CMA:
MB/s
Naturally Ordered        4      8     16     32
Base                  1235    935    622    419
CMA                   4741   3769   1977    703
KNEM                  3362   3091   1857    681

MB/s
Randomly Ordered         4      8     16     32
Base                  1227    947    638    412
CMA                   4666   3682   1978    710
KNEM                  3348   3050   1883    684

MB/s
Max Ping Pong            4      8     16     32
Base                  2028   1938   1928   1882
CMA                   7424   7510   7598   7708
KNEM                  5661   5476   6050   6290
I don't know the reason behind the difference - if it's something
peculiar to hpcc, or if there's extra overhead in the way that
knem does setup for copying, or if knem wasn't configured
optimally. I haven't done any comparison IMB or NPB runs...
Syscall and setup overhead does have some measurable effect - although I
don't have the numbers for it here, neither KNEM nor CMA does quite as
well with hpcc when compared against a hacked version of hpcc where
everything is declared ahead of time as shared memory so the receiver
can just do a single copy from userspace - which I think is
representative of a theoretical maximum gain from the single copy
approach.
Chris
--
cyeoh@au.ibm.com
* Re: [RFC][PATCH] Cross Memory Attach
2010-09-16 9:26 ` Avi Kivity
@ 2010-11-02 3:37 ` Christopher Yeoh
2010-11-02 11:10 ` Avi Kivity
0 siblings, 1 reply; 22+ messages in thread
From: Christopher Yeoh @ 2010-11-02 3:37 UTC (permalink / raw)
To: Avi Kivity
Cc: Bryan Donlan, linux-kernel, Linux Memory Management List,
Ingo Molnar
On Thu, 16 Sep 2010 11:26:36 +0200
Avi Kivity <avi@redhat.com> wrote:
> On 09/16/2010 03:18 AM, Christopher Yeoh wrote:
> > On Wed, 15 Sep 2010 23:46:09 +0900
> > Bryan Donlan<bdonlan@gmail.com> wrote:
> >
> > > On Wed, Sep 15, 2010 at 19:58, Avi Kivity<avi@redhat.com> wrote:
> > >
> > > > Instead of those two syscalls, how about a vmfd(pid_t pid,
> > > > ulong start, ulong len) system call which returns an file
> > > > descriptor that represents a portion of the process address
> > > > space. You can then use preadv() and pwritev() to copy
> > > > memory, and io_submit(IO_CMD_PREADV) and
> > > > io_submit(IO_CMD_PWRITEV) for asynchronous variants
> > > > (especially useful with a dma engine, since that adds latency).
> > > >
> > > > With some care (and use of mmu_notifiers) you can even mmap()
> > > > your vmfd and access remote process memory directly.
> > >
> > > Rather than introducing a new vmfd() API for this, why not just
> > > add implementations for these more efficient operations to the
> > > existing /proc/$pid/mem interface?
> >
> > Perhaps I'm misunderstanding something here, but
> > accessing /proc/$pid/mem requires ptracing the target process.
> > We can't really have all these MPI processes ptraceing each other
> > just to send/receive a message....
> >
>
> You could have each process open /proc/self/mem and pass the fd using
> SCM_RIGHTS.
>
> That eliminates a race; with copy_to_process(), by the time the pid
> is looked up it might designate a different process.
Just to revive an old thread (I've been on holidays) - this doesn't
work either. The ptrace check is done by mem_read (i.e. on each read), so
even if you do pass the fd using SCM_RIGHTS, reads on the fd still
fail.
So unless there's good reason to believe that the ptrace permission
check is no longer needed, the /proc/pid/mem interface doesn't seem to
be an option for what we want to do.
Oh, and interestingly, reading from /proc/pid/mem involves a double copy
- a copy to a temporary kernel page and then out to userspace. But that
is fixable.
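Concretely, fs/proc/base.c:mem_read() bounces through a kernel page,
roughly like this (paraphrased from the source of that era, error
handling omitted):

	char *page = (char *)__get_free_page(GFP_TEMPORARY);

	/* copy 1: target process -> temporary kernel page */
	this_len = access_process_vm(task, src, page, this_len, 0);

	/* copy 2: kernel page -> reader's buffer */
	if (copy_to_user(buf, page, this_len))
		ret = -EFAULT;

	free_page((unsigned long)page);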
Regards,
Chris
--
cyeoh@ozlabs.org
* Re: [RFC][PATCH] Cross Memory Attach
2010-11-02 3:37 ` Christopher Yeoh
@ 2010-11-02 11:10 ` Avi Kivity
0 siblings, 0 replies; 22+ messages in thread
From: Avi Kivity @ 2010-11-02 11:10 UTC (permalink / raw)
To: Christopher Yeoh
Cc: Bryan Donlan, linux-kernel, Linux Memory Management List,
Ingo Molnar
On 11/01/2010 11:37 PM, Christopher Yeoh wrote:
> >
> > You could have each process open /proc/self/mem and pass the fd using
> > SCM_RIGHTS.
> >
> > That eliminates a race; with copy_to_process(), by the time the pid
> > is looked up it might designate a different process.
>
> Just to revive an old thread (I've been on holidays), but this doesn't
> work either. the ptrace check is done by mem_read (eg on each read) so
> even if you do pass the fd using SCM_RIGHTS, reads on the fd still
> fail.
>
> So unless there's good reason to believe that the ptrace permission
> check is no longer needed, the /proc/pid/mem interface doesn't seem to
> be an option for what we want to do.
>
Perhaps move the check to open(). I can understand the desire to avoid
letting random processes peek at each other's memory, but once a process
has opened its own /proc/self/mem and explicitly passed it to another,
we should allow it.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Thread overview: 22+ messages
[not found] <20100915104855.41de3ebf@lilo>
2010-09-15 8:02 ` [RFC][PATCH] Cross Memory Attach Ingo Molnar
2010-09-15 8:16 ` Ingo Molnar
2010-09-15 13:23 ` Christopher Yeoh
2010-09-15 13:20 ` Christopher Yeoh
2010-09-15 10:58 ` Avi Kivity
2010-09-15 13:51 ` Ingo Molnar
2010-09-15 16:10 ` Avi Kivity
2010-09-15 14:42 ` Christopher Yeoh
2010-09-15 14:52 ` Linus Torvalds
2010-09-15 15:44 ` Robin Holt
2010-09-16 6:32 ` Brice Goglin
2010-09-16 9:15 ` Brice Goglin
2010-09-16 14:00 ` Christopher Yeoh
2010-09-15 14:46 ` Bryan Donlan
2010-09-15 16:13 ` Avi Kivity
2010-09-15 19:35 ` Eric W. Biederman
2010-09-16 1:18 ` Christopher Yeoh
2010-09-16 9:26 ` Avi Kivity
2010-11-02 3:37 ` Christopher Yeoh
2010-11-02 11:10 ` Avi Kivity
2010-09-16 1:58 ` KOSAKI Motohiro
2010-09-16 8:08 ` Ingo Molnar