qemu-devel.nongnu.org archive mirror
From: Andrea Arcangeli <aarcange@redhat.com>
To: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
Date: Sun, 30 Nov 2008 18:20:38 +0100
Message-ID: <20081130172038.GA32172@random.random>
In-Reply-To: <20081128191819.GA18031@shareable.org>

On Fri, Nov 28, 2008 at 07:18:19PM +0000, Jamie Lokier wrote:
> Blue Swirl wrote:
> > >  I wonder how can possibly aio_readv/writev be missing in posix aio?
> > >  Unbelievable. It'd be totally trivial to add those to glibc, much
> > >  easier infact than to pthread_create by hand, but how can we add a
> > >  dependency on a certain glibc version? Ironically it'll be more
> > >  user-friendly to add dependency on linux kernel-aio implementation
> > >  that is already available for ages and it's guaranteed to run faster
> > >  (or at least not slower).
> > 
> > There's also lio_listio that provides for vectored AIO.
> 
> I think lio_listio is the missing aio_readv/writev.
> 
> It's more versatile, and that'll be why POSIX never bothered with
> aio_readv/writev.
> 
> Doesn't explain why they didn't _start_ with aio_readv before
> inventing lio_listio, but there you go.  Unix history.

Well, earlier I grepped for the readv or writev syscalls inside
glibc-2.6.1/sysdeps/pthread and there was nothing there, so lio_listio
doesn't seem to be helpful at all. If it were a _kernel_ API, the
kernel could see the whole queue immediately and coalesce all
outstanding contiguous I/O into a single DMA operation, but the
userland queue here will not be visible to the kernel. So unless we
can execute the readv and writev syscalls, O_DIRECT performance with
the direct DMA API will be destroyed compared to bounce buffering: the
guest OS will submit large DMA operations that get executed as 4k DMA
operations by the storage hardware, and the memcpy overhead we're
eliminating is minor compared to such a major I/O bottleneck with
qemu cache=off.

The only way we could possibly use lio_listio would be to improve
glibc so that the lio_listio implementation is smart enough to call
readv/writev when it finds contiguous I/O being queued, but overall
this would still be largely inefficient. If you check the DMA API, I'm
preparing a struct iovec *iov ready to submit to the kernel either
through the nonexistent aio_readv/writev or through the kernel-AIO
IOCB_CMD_PREADV/IOCB_CMD_PWRITEV opcodes (both obviously take the
well defined struct iovec as a parameter, so there's zero overhead).

So even if we improve lio_listio, it would still introduce artificial
splitting/re-coalescing overhead just because of its weird API. It
would be entirely different if lio_listio resembled the kernel
io_submit API and had a PREADV/PWRITEV type to submit iovecs, but it
only has LIO_READ/LIO_WRITE, with no sign of LIO_READV/LIO_WRITEV
unfortunately :(. Admittedly it's not so common to have to use
readv/writev on contiguous I/O, but emulated DMA with SG truly
requires it. We can't use anything that can't handle a native iovec.

Likely we'll have to add our own pthread_create-based aio
implementation for non-Linux, use kernel AIO on Linux, and get rid of
librt as a whole. It's pointless to mix our own userland aio (which
will support readv/writev too) with the POSIX one. If this were a
Linux-only project, kernel AIO would suffice. All the databases I
know of already use kernel AIO, since they need readv/writev with
AIO and O_DIRECT for reasons similar to ours.


Thread overview: 32+ messages
2008-11-27 12:35 [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Andrea Arcangeli
2008-11-27 12:43 ` [Qemu-devel] [RFC 2/2] bdrv_aio_readv/writev_em Andrea Arcangeli
2008-11-28 11:09   ` Jamie Lokier
2008-11-27 19:14 ` [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Blue Swirl
2008-11-28  1:56   ` Andrea Arcangeli
2008-11-28 17:59     ` Blue Swirl
2008-11-28 18:50       ` Andrea Arcangeli
2008-11-28 19:03         ` Blue Swirl
2008-11-28 19:18           ` Jamie Lokier
2008-11-29 19:49             ` Avi Kivity
2008-11-30 17:20             ` Andrea Arcangeli [this message]
2008-11-30 22:31             ` Anthony Liguori
2008-11-30 18:04           ` Andrea Arcangeli
2008-11-30 17:41         ` [Qemu-devel] [RFC 1/1] pci-dma-api-v2 Andrea Arcangeli
2008-11-30 18:36           ` [Qemu-devel] " Blue Swirl
2008-11-30 19:04             ` Andrea Arcangeli
2008-11-30 19:11               ` Blue Swirl
2008-11-30 19:20                 ` Andrea Arcangeli
2008-11-30 21:36                   ` Blue Swirl
2008-11-30 22:54                     ` Anthony Liguori
2008-11-30 22:50           ` [Qemu-devel] " Anthony Liguori
2008-12-01  9:41             ` Avi Kivity
2008-12-01 16:37               ` Anthony Liguori
2008-12-02  9:45                 ` Avi Kivity
2008-11-30 22:38         ` [Qemu-devel] [RFC 1/2] pci-dma-api-v1 Anthony Liguori
2008-11-30 22:51           ` Jamie Lokier
2008-11-30 22:34       ` Anthony Liguori
2008-11-29 19:48   ` Avi Kivity
2008-11-30 17:29     ` Andrea Arcangeli
2008-11-30 20:27       ` Avi Kivity
2008-11-30 22:33         ` Andrea Arcangeli
2008-11-30 22:33   ` Anthony Liguori
