From: ebiederm@xmission.com (Eric W. Biederman)
To: Terje Eggestad <terje.eggestad@scali.com>
Cc: Christoph Hellwig <hch@infradead.org>,
Arjan van de Ven <arjanv@redhat.com>,
linux-kernel <linux-kernel@vger.kernel.org>,
D.A.Fedorov@inp.nsk.su
Subject: Re: The disappearing sys_call_table export.
Date: 06 May 2003 03:21:28 -0600
Message-ID: <m13cjsbfc7.fsf@frodo.biederman.org>
In-Reply-To: <1052208877.15887.13.camel@pc-16.office.scali.no>
Terje Eggestad <terje.eggestad@scali.com> writes:
> On Tue, 2003-05-06 at 09:30, Eric W. Biederman wrote:
> > Christoph Hellwig <hch@infradead.org> writes:
> >
>
> > Handling mpi_send/mpi_recv is more difficult. MPI specifies
> > that the data can be copied it just does not require it so in
> > sufficiently weird situations a copy slow path can be taken.
> >
> > So there are really two questions here.
> > 1) What is a clean way to provide a high performance message
> > passing layer. Assuming you have a network card for which
> > it is safe to mmap a subset of control registers.
> >
> > 2) What is a good way to map MPI onto that clean layer.
> >
>
> All applications pretty much use send/recv.
>
> > I believe the answer on how to do a clean safe interface is
> > to allocate the memory and tell the card about it in the driver,
> > and then allow user space to mmap it. With the driver mmap operation
> > informing the network card of the mapping.
> >
>
> You can't mmap() a buffer every time you're going to do a send/recv, it's
> way too costly.
Definitely not. But if the memory malloc returns originally comes
from an mmapped buffer area (mmapped from your driver) it can be useful.
I assume somewhere your card has the smarts to transform virtual to
physical addresses and this is what the mmap sets up.
That can be handled in user space by querying the mmapped region. But
if the card does not have the smarts to do the virtual to physical
translation, or at the very least limit the set of physical pages
user space can DMA to/from, that is a fundamental security issue and
means all of the optimizations are not safe. And you must enter/exit
the kernel to send a DMA transaction.
> > A good implementation of mpi on top of that is an interesting
> > question. Replacing malloc and free and having everything run on
> > top of the mmapped buffer sounds like a possibility. But it is
> > additionally desirable for the memory used by an MPI job to come
> > from hugetlbfs, or the equivalent. And I don't know if a driver
> > can provide huge pages.
> >
> > At this point I am strongly tempted to see what it would take to come
> > up with an MPI-2.1 to fix this issue.
> >
>
> all current MPI apps uses MPI-1
Given that mpich does not even implement mpi_put/mpi_get I can
easily believe it for this case. The MPI file I/O, which also gets
used at least to some extent, is likewise part of MPI-2.
> > > so use get_user_pages.
> > >
> > > > 6. point 4: pinning is VERY expensive (point 1), so I can't pin the
> > > > buffers every time they're used.
> > >
> > > Umm, pinning memory all the time means you get a bunch of nice DoS
> > > attacks due to the huge amount of memory.
> >
> > I wonder if there is an easy way to optimize this if you don't have
> > swap configured. In general it is a bug if an MPI job swaps.
> >
>
> hmm, it's not a problem as long as you only page out data pages used only
> during initialization, or pages that are used very infrequently. That is
> actually a good thing, since you could fit a bit more live data in
> memory.
Right. Defining it as a bug was to emphasize the point that paging is
a non-issue and for the most part an MPI job is already pinned in
memory. I totally agree that having swapping enabled and being able
to page out every unused page in the job is useful.
> > In general there is one mpi process per cpu running on a machine. So
> > I have trouble seeing this as a denial of service.
> >
> > > > 7. The only way to cache buffers (to see if they're used before and
> > > > hence pinned) is the user space virtual address. A syscall, thus ioctl
> > > > to a device file is prohibitively expensive under point 1.
> > >
> > > That's a horribly b0rked approach..
> > >
> > > Again, where's your driver source so we can help you to find a better
> > > approach out of that mess?
> >
> > With some digging I can find the source for both quadrics and myrinet
> > drivers, and they have the same issues. This is a general problem
> > for running MPI jobs so it is probably worth finding a solution that
> > works for those people whose source we can obtain.
> >
>
> Hmm, no, the drivers don't have the issue; the MPI implementations
> do.
The drivers have the issue of how to provide an interface for
the mpi implementation that sits on top of them. I totally agree this
looks like a bug in MPI.
> The two used approaches are 1) replace malloc() and friends, which break
> with fortran 90 compilers 2) tell glibc never to release alloced memory
> thru sbrk(-n) or munmap() which also break with f90 compilers, and run
> the risk of bloating memory usage.
Actually there is a third. Hack the vm layer and require a highly
patched kernel. That is the approach quadrics was using last time I
looked although they promised something different in their next major
rev.
Is it PGI's or Intel's f90 compilers that break, and how do they break?
Replacing malloc and friends should be well defined if you simply
replace or wrap the symbols glibc provides.
Quite possibly the answer is to call those compilers ABI
non-conformant and get them fixed. Especially given that they are not
compatible with g77 in fortran mode, there is a good case for this. By
default the native compiler is correct.
So far the only fortran issues I have seen that could affect malloc
are adding extra underscores. What issue are you running into?
Eric