From: Paul Brook
Subject: Re: [Qemu-devel] Faster, generic IO/DMA model with vectored AIO?
Date: Sun, 28 Oct 2007 02:29:09 +0100
Message-Id: <200710280129.10640.paul@codesourcery.com>
To: qemu-devel@nongnu.org
Cc: Blue Swirl

> I changed Slirp output to use vectored IO to avoid the slowdown from
> memcpy (see the patch for the work in progress, gives a small
> performance improvement). But then I got the idea that using AIO would
> be nice at the outgoing end of the network IO processing. In fact, a
> vectored AIO model could even be used for the generic DMA! The benefit
> is that no buffering or copying should be needed.

An interesting idea; however, I don't want to underestimate the
difficulty of implementing this correctly. I suspect that to get real
benefits you need to support zero-copy async operation all the way
through. Things get really hairy if you allow some operations to
complete synchronously and some to be deferred.

I've done async operation for SCSI and USB. The latter is really not
pretty, and the former has some notable warts. A generic IO/DMA
framework needs to make sure it covers these requirements without
making things worse. Hopefully it'll also help fix the things that are
wrong with them.

> For the specific Sparc32 case, unfortunately Lance bus byte swapping
> makes buffering necessary at that stage, unless we can make N vectors
> with just a single byte each faster than memcpy + bswap of a memory
> block of size N.

We really want to be dealing with largeish blocks. A {ptr, size}
vector is 64 or 128 bits per element, so the overhead on blocks
smaller than 64 bytes is going to be really brutal. Also, the time
taken to do address translation will be O(number of vector elements).

> Inside Qemu the vectors would use target physical addresses (struct
> qemu_iovec), but at some point the addresses would change to host
> pointers suitable for real AIO.

Phrases like "at some point" worry me :-) I think it would be good to
get a top-down description of what each of the different entities
(initiating device, host endpoint, bus translation, memory) is
responsible for, and how they all fit together.

I have some ideas, but without more detailed investigation I can't
tell whether they will actually work in practice, or whether they fit
the code fragments you've posted. My suspicion is they don't, as I
can't make head or tail of how your gdma_aiov.diff patch would be
used in practice.

Paul
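
P.S. To be concrete about the shape I have in mind, here's a minimal
sketch of the vector-and-translate step. Every name in it (dma_vec,
dma_translate, dma_to_host_iov, and the stand-in definition of
target_phys_addr_t) is invented for illustration; none of this exists
in the tree.

#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Stand-in for the guest physical address type. */
typedef uint64_t target_phys_addr_t;

/* A DMA transfer described as a vector of guest-physical ranges. */
typedef struct {
    target_phys_addr_t addr;  /* guest physical address */
    size_t len;
} dma_vec;

/* Invented placeholder for the bus translation step.  A real version
 * would walk the softmmu mappings, and would have to fail (or bounce)
 * for MMIO regions. */
extern void *dma_translate(target_phys_addr_t addr, size_t len);

/* Convert a guest-physical vector into a host iovec for real AIO.
 * This loop is O(count), which is why tiny elements are so costly. */
static int dma_to_host_iov(const dma_vec *vec, int count,
                           struct iovec *iov)
{
    int i;

    for (i = 0; i < count; i++) {
        iov[i].iov_base = dma_translate(vec[i].addr, vec[i].len);
        if (!iov[i].iov_base) {
            return -1;  /* unmapped or MMIO: fall back to a bounce buffer */
        }
        iov[i].iov_len = vec[i].len;
    }
    return 0;
}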
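
The sync-versus-deferred completion problem in miniature, again with
invented names. If submission is allowed to run the completion
callback before it returns, the only safe pattern is to finish all
request bookkeeping before submitting:

#include <stddef.h>

typedef void (*complete_fn)(void *opaque, int ret);

struct request {
    int pending;
    /* ... per-request device state ... */
};

/* May complete synchronously (callback runs before this returns) or
 * be deferred (callback runs later from the IO handler). */
extern int aio_submit(void *buf, size_t len, complete_fn cb,
                      void *opaque);

static void request_complete(void *opaque, int ret)
{
    struct request *req = opaque;

    req->pending = 0;
    /* ... raise the device's completion interrupt, maybe free req ... */
}

static void start_request(struct request *req, void *buf, size_t len)
{
    req->pending = 1;  /* must happen *before* submission ... */
    aio_submit(buf, len, request_complete, req);
    /* ... because if the operation completed synchronously,
     * request_complete has already run by this point.  Touching
     * req->pending here, or assuming req is still live, is exactly
     * the kind of bug that makes mixed completion so hairy. */
}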
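
And on the Lance question, the bounce-buffer side of the comparison is
just a swapping copy (I'm assuming the 16-bit swap applied by the DMA
controller here). The alternative, N single-byte vector elements, pays
the per-element descriptor and translation costs above:

#include <stddef.h>
#include <stdint.h>

/* Copy an N-byte block, swapping each 16-bit word as it goes
 * (sketch; assumes an even-sized block). */
static void bswap16_copy(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i + 1];
        dst[i + 1] = src[i];
    }
}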