From: Paul Brook
Subject: Re: [Qemu-devel] Faster, generic IO/DMA model with vectored AIO?
Date: Sun, 28 Oct 2007 02:29:09 +0100
Message-Id: <200710280129.10640.paul@codesourcery.com>
To: qemu-devel@nongnu.org
Cc: Blue Swirl

> I changed Slirp output to use vectored IO to avoid the slowdown from
> memcpy (see the patch for the work in progress, gives a small
> performance improvement). But then I got the idea that using AIO would
> be nice at the outgoing end of the network IO processing. In fact, a
> vectored AIO model could even be used for the generic DMA! The benefit
> is that no buffering or copying should be needed.

An interesting idea; however, I don't want to underestimate the
difficulty of implementing this correctly. I suspect that to get real
benefits you need to support zero-copy async operation all the way
through. Things get really hairy if you allow some operations to
complete synchronously and some to be deferred.

I've done async operation for SCSI and USB. The latter is really not
pretty, and the former has some notable warts. A generic IO/DMA
framework needs to make sure it covers these requirements without
making things worse. Hopefully it'll also help fix the things that are
wrong with them.

> For the specific Sparc32 case, unfortunately Lance bus byte swapping
> makes buffering necessary at that stage, unless we can make N vectors
> with just a single byte each faster than memcpy + bswap of a memory
> block of size N.

We really want to be dealing with largeish blocks. A {ptr, size}
vector is 64 or 128 bits per element, so the overhead on blocks
smaller than 64 bytes is going to be really brutal. Also, the time
taken to do address translation will be O(number of vector elements).

> Inside Qemu the vectors would use target physical addresses (struct
> qemu_iovec), but at some point the addresses would change to host
> pointers suitable for real AIO.

Phrases like "at some point" worry me :-) I think it would be good to
get a top-down description of what each of the different entities
(initiating device, host endpoint, bus translation, memory) is
responsible for, and how they all fit together.

I have some ideas, but without more detailed investigation I can't
tell whether they will actually work in practice, or whether they fit
the code fragments you've posted. My suspicion is they don't, as I
can't make head or tail of how your gdma_aiov.diff patch would be
used in practice.

Paul
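
P.S. To be concrete about the shape I have in mind, here's a minimal
sketch of the vector-and-translate step. Every name in it (dma_vec,
dma_translate, dma_to_host_iov, and the stand-in definition of
target_phys_addr_t) is invented for illustration; none of this exists
in the tree.

#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Stand-in for the guest physical address type. */
typedef uint64_t target_phys_addr_t;

/* A DMA transfer described as a vector of guest-physical ranges. */
typedef struct {
    target_phys_addr_t addr;  /* guest physical address */
    size_t len;
} dma_vec;

/* Invented placeholder for the bus translation step.  A real version
 * would walk the softmmu mappings, and would have to fail (or bounce)
 * for MMIO regions. */
extern void *dma_translate(target_phys_addr_t addr, size_t len);

/* Convert a guest-physical vector into a host iovec for real AIO.
 * This loop is O(count), which is why tiny elements are so costly. */
static int dma_to_host_iov(const dma_vec *vec, int count,
                           struct iovec *iov)
{
    int i;

    for (i = 0; i < count; i++) {
        iov[i].iov_base = dma_translate(vec[i].addr, vec[i].len);
        if (!iov[i].iov_base) {
            return -1;  /* unmapped or MMIO: fall back to a bounce buffer */
        }
        iov[i].iov_len = vec[i].len;
    }
    return 0;
}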
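
The sync-versus-deferred completion problem in miniature, again with
invented names. If submission is allowed to run the completion
callback before it returns, the only safe pattern is to finish all
request bookkeeping before submitting:

#include <stddef.h>

typedef void (*complete_fn)(void *opaque, int ret);

struct request {
    int pending;
    /* ... per-request device state ... */
};

/* May complete synchronously (callback runs before this returns) or
 * be deferred (callback runs later from the IO handler). */
extern int aio_submit(void *buf, size_t len, complete_fn cb,
                      void *opaque);

static void request_complete(void *opaque, int ret)
{
    struct request *req = opaque;

    req->pending = 0;
    /* ... raise the device's completion interrupt, maybe free req ... */
}

static void start_request(struct request *req, void *buf, size_t len)
{
    req->pending = 1;  /* must happen *before* submission ... */
    aio_submit(buf, len, request_complete, req);
    /* ... because if the operation completed synchronously,
     * request_complete has already run by this point.  Touching
     * req->pending here, or assuming req is still live, is exactly
     * the kind of bug that makes mixed completion so hairy. */
}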
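
And on the Lance question, the bounce-buffer side of the comparison is
just a swapping copy (I'm assuming the 16-bit swap applied by the DMA
controller here). The alternative, N single-byte vector elements, pays
the per-element descriptor and translation costs above:

#include <stddef.h>
#include <stdint.h>

/* Copy an N-byte block, swapping each 16-bit word as it goes
 * (sketch; assumes an even-sized block). */
static void bswap16_copy(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i + 1];
        dst[i + 1] = src[i];
    }
}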