From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43) id 1L6pyU-0007Hm-G3 for qemu-devel@nongnu.org; Sun, 30 Nov 2008 12:20:50 -0500
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43) id 1L6pyQ-0007Fe-S9 for qemu-devel@nongnu.org; Sun, 30 Nov 2008 12:20:50 -0500
Received: from [199.232.76.173] (port=34506 helo=monty-python.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1L6pyQ-0007FO-IZ for qemu-devel@nongnu.org; Sun, 30 Nov 2008 12:20:46 -0500
Received: from mx2.redhat.com ([66.187.237.31]:48759) by monty-python.gnu.org with esmtp (Exim 4.60) (envelope-from ) id 1L6pyQ-0007yF-1I for qemu-devel@nongnu.org; Sun, 30 Nov 2008 12:20:46 -0500
Received: from int-mx2.corp.redhat.com (int-mx2.corp.redhat.com [172.16.27.26]) by mx2.redhat.com (8.13.8/8.13.8) with ESMTP id mAUHKe5L017520 for ; Sun, 30 Nov 2008 12:20:42 -0500
Received: from ns3.rdu.redhat.com (ns3.rdu.redhat.com [10.11.255.199]) by int-mx2.corp.redhat.com (8.13.1/8.13.1) with ESMTP id mAUHKeiT017224 for ; Sun, 30 Nov 2008 12:20:40 -0500
Received: from random.random (vpn-10-12.str.redhat.com [10.32.10.12]) by ns3.rdu.redhat.com (8.13.8/8.13.8) with ESMTP id mAUHKdju021269 for ; Sun, 30 Nov 2008 12:20:39 -0500
Date: Sun, 30 Nov 2008 18:20:38 +0100
From: Andrea Arcangeli
Subject: Re: [Qemu-devel] [RFC 1/2] pci-dma-api-v1
Message-ID: <20081130172038.GA32172@random.random>
References: <20081127123538.GC10348@random.random> <20081128015602.GA31011@random.random> <20081128185001.GD31011@random.random> <20081128191819.GA18031@shareable.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20081128191819.GA18031@shareable.org>
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
To: qemu-devel@nongnu.org

On Fri, Nov 28, 2008 at 07:18:19PM +0000, Jamie Lokier wrote:
> Blue Swirl wrote:
> > > I wonder how can
> > > possibly aio_readv/writev be missing in posix aio?
> > > Unbelievable. It'd be totally trivial to add those to glibc, much
> > > easier in fact than to pthread_create by hand, but how can we add a
> > > dependency on a certain glibc version? Ironically it'll be more
> > > user-friendly to add dependency on linux kernel-aio implementation
> > > that is already available for ages and it's guaranteed to run faster
> > > (or at least not slower).
> >
> > There's also lio_listio that provides for vectored AIO.
>
> I think lio_listio is the missing aio_readv/writev.
>
> It's more versatile, and that'll be why POSIX never bothered with
> aio_readv/writev.
>
> Doesn't explain why they didn't _start_ with aio_readv before
> inventing lio_listio, but there you go. Unix history.

Well, I grepped earlier for the readv and writev syscalls inside
glibc-2.6.1/sysdeps/pthread and found nothing there, so lio_listio
doesn't seem to be helpful at all. If it were a _kernel_ API, the
kernel could see the whole queue immediately and coalesce all
outstanding contiguous I/O into a single DMA operation, but the
userland queue here is not visible to the kernel.

So unless we can execute the readv and writev syscalls, O_DIRECT
performance with the direct DMA API will be destroyed compared to
bounce buffering: the guest OS will submit large DMA operations that
will be executed as 4k DMA operations in the storage hardware. The
memcpy overhead that we're eliminating is minor compared to such a
major I/O bottleneck with qemu cache=off.

The only way we could possibly use lio_listio would be to improve
glibc so that lio_listio is smart enough to call readv/writev when it
finds contiguous I/O being queued, but overall this would still be
largely inefficient.
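To make the difference concrete, here is a minimal sketch, not qemu code. It assumes a libc that provides preadv(), and the names read_sg_vectored, read_sg_split and sg_demo are made up for illustration. It fills the same scatter-gather list from one contiguous file range in two ways: with a single vectored request, and with one request per element. The second is roughly what a userland queue that only knows LIO_READ appears to end up doing, going by the grep result above.

```c
#define _GNU_SOURCE             /* preadv() is an extension, not POSIX */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define SEG 4096                /* one 4k page per scatter-gather element */

/* Vectored: a single request that describes every element at once. */
ssize_t read_sg_vectored(int fd, const struct iovec *iov, int cnt, off_t off)
{
    assert(cnt > 0);
    return preadv(fd, iov, cnt, off);
}

/* Split: one request per element; *reqs counts the requests issued. */
ssize_t read_sg_split(int fd, const struct iovec *iov, int cnt, off_t off,
                      int *reqs)
{
    ssize_t total = 0;

    *reqs = 0;
    for (int i = 0; i < cnt; i++) {
        ssize_t n = pread(fd, iov[i].iov_base, iov[i].iov_len, off + total);
        if (n < 0)
            return -1;
        (*reqs)++;
        total += n;
        if ((size_t)n < iov[i].iov_len)
            break;              /* short read, e.g. end of file */
    }
    return total;
}

/*
 * Hypothetical self-check: reads 4 contiguous pages both ways from a
 * temporary file. Returns the number of requests the split path needed,
 * or -1 if anything failed or the two results differ.
 */
int sg_demo(void)
{
    static char src[4 * SEG], a[4][SEG], b[4][SEG];
    char path[] = "/tmp/sg-demo-XXXXXX";
    struct iovec va[4], vb[4];
    int fd = mkstemp(path), reqs = -1, ok;

    if (fd < 0)
        return -1;
    unlink(path);               /* the file goes away once fd is closed */

    for (size_t i = 0; i < sizeof(src); i++)
        src[i] = (char)(i / SEG + 1);
    for (int i = 0; i < 4; i++) {
        va[i].iov_base = a[i]; va[i].iov_len = SEG;
        vb[i].iov_base = b[i]; vb[i].iov_len = SEG;
    }

    ok = write(fd, src, sizeof(src)) == (ssize_t)sizeof(src) &&
         read_sg_vectored(fd, va, 4, 0) == (ssize_t)sizeof(src) &&
         read_sg_split(fd, vb, 4, 0, &reqs) == (ssize_t)sizeof(src) &&
         memcmp(a, src, sizeof(src)) == 0 &&
         memcmp(a, b, sizeof(a)) == 0;
    close(fd);
    return ok ? reqs : -1;
}
```

Both paths return the same data; the vectored one hands the kernel the whole 16k range in one request, while the split one issues four separate 4k requests.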
If you check the dma api, I'm preparing a struct iovec *iov ready to
submit to the kernel, either through the nonexistent aio_readv/writev
or with the kernel-API IOCB_CMD_PREADV/PWRITEV (both obviously take
the well-defined struct iovec as a parameter, so there's zero
overhead). So even if we improve lio_listio, it would introduce
artificial splitting-recoalescing overhead just because of its weird
API. It would be entirely different if lio_listio resembled the kernel
sys_io_submit API and had a PREADV/PWRITEV type to submit iovecs, but
it only has LIO_READ/WRITE, no sign of LIO_READV/WRITEV
unfortunately :(.

Admittedly it's not so common to have to use readv/writev on
contiguous I/O, but the emulated DMA with SG truly requires this.

Anything that can't handle a native iovec we can't use.

Likely we'll have to add our own pthread_create-based aio
implementation for non-linux and kernel-AIO for linux, and get rid of
librt as a whole. It's pointless to mix our own userland aio (that
will support readv/writev too) with the posix one. And if this were
just a linux project, kernel AIO would suffice. All the DBs that I
know of, which need to use readv/writev with AIO and O_DIRECT for
reasons similar to ours, already use kernel AIO.
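As a rough idea of what a pthread_create-based aio that accepts a native iovec could look like, here is a minimal sketch. It is not the proposed implementation: struct vaio_req, vaio_submit and vaio_wait are hypothetical names, it assumes a libc with preadv()/pwritev(), and it spawns one thread per request where a real implementation would more likely use a thread pool and a completion notification.

```c
#define _GNU_SOURCE             /* preadv()/pwritev() are extensions */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical request descriptor: the iovec is passed through untouched. */
struct vaio_req {
    int fd;
    const struct iovec *iov;
    int iovcnt;
    off_t offset;
    int is_write;
    ssize_t ret;                /* bytes transferred, or -errno on failure */
    pthread_t thread;
};

/* Runs in the helper thread: one vectored syscall for the whole request. */
static void *vaio_worker(void *opaque)
{
    struct vaio_req *req = opaque;
    ssize_t n = req->is_write
        ? pwritev(req->fd, req->iov, req->iovcnt, req->offset)
        : preadv(req->fd, req->iov, req->iovcnt, req->offset);

    req->ret = n < 0 ? -errno : n;
    return NULL;
}

/* Start the request; returns 0 or the error number from pthread_create(). */
int vaio_submit(struct vaio_req *req)
{
    return pthread_create(&req->thread, NULL, vaio_worker, req);
}

/* Block until the request has completed and return its result. */
ssize_t vaio_wait(struct vaio_req *req)
{
    int err = pthread_join(req->thread, NULL);

    assert(err == 0);           /* joining our own thread should not fail */
    (void)err;
    return req->ret;
}
```

The caller fills in fd, iov, iovcnt, offset and is_write, calls vaio_submit(), carries on with other work, and later collects the byte count (or a negative errno value) with vaio_wait(). The point of the design is that the scatter-gather list reaches the kernel as one readv/writev-style request, with no splitting in userland.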