Message-ID: <4974C694.8070004@redhat.com>
Date: Mon, 19 Jan 2009 20:29:40 +0200
From: Avi Kivity
To: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH 1/5] Add target memory mapping API
In-Reply-To: <18804.48642.929024.908906@mariner.uk.xensource.com>
References: <1232308399-21679-1-git-send-email-avi@redhat.com>
 <1232308399-21679-2-git-send-email-avi@redhat.com>
 <18804.34053.211615.181730@mariner.uk.xensource.com>
 <4974943B.4020507@redhat.com>
 <18804.44271.868488.32192@mariner.uk.xensource.com>
 <4974B82F.9020805@redhat.com>
 <18804.48642.929024.908906@mariner.uk.xensource.com>

Ian Jackson wrote:
>>> Efficient read-modify-write may be very hard for some setups to
>>> achieve.  It can't be done with the bounce buffer implementation.
>>> I think one good rule of thumb would be to make sure that the
>>> interface as specified can be implemented in terms of
>>> cpu_physical_memory_rw.
>>>
>> What is the motivation for efficient rmw?
>>
>
> I think you've misunderstood me.  I don't think there is such a
> motivation.  I was saying it was so difficult to implement that we
> might as well exclude it.
>

Then we agree.  The map API is for read OR write operations, not both
at the same time.

>
>>> That would be one alternative, but isn't it the case that (for
>>> example) with a partial DMA completion, the guest can assume that
>>> the supposedly-untouched parts of the DMA target memory actually
>>> remain untouched rather than (say) zeroed?
>>>
>> For block devices, I don't think it can.
>>
>
> `Block devices'?  We're talking about (say) IDE controllers here.  I
> would be very surprised if an IDE controller used DMA to overwrite RAM
> beyond the amount of successful transfer.
>
> If a Unix variant does zero-copy IO using DMA direct into process
> memory space, then it must even rely on the IDE controller not doing
> DMA beyond the end of the successful transfer, as the read(2) API
> promises to the calling process that data beyond the successful read
> is left untouched.
>
> And even if the IDE spec happily says that the (IDE) host (ie our
> guest) is not allowed to assume that that memory (ie the memory beyond
> the extent of the successful part of a partially successful transfer)
> is unchanged, there will almost certainly be some other IO device on
> some platform that will make that promise.
>
> So we need a call into the DMA API from the device model to say which
> regions have actually been touched.
>

It's not possible to implement this efficiently.  The qemu block layer
will submit the results of the map operation to the kernel in an async
zero-copy operation.  The kernel may break up this operation into
several parts (if the underlying backing store is fragmented) and
submit them in parallel to the underlying device(s).  Those requests
will complete out of order, so you can't guarantee that if an error
occurs all memory before the failure point will have been written and
none after it.

I really doubt that any guest will be affected by this.  It's a
tradeoff between decent performance and needlessly accurate emulation.
I don't see how we can choose the latter.
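
For concreteness, the shape of the interface is roughly the following
(paraphrased and simplified here; patch 1/5 has the authoritative
signatures).  A mapping is created either for reading or for writing,
never for both:

/* Sketch only; see patch 1/5 for the exact definitions.
 * target_phys_addr_t is QEMU's guest physical address type.
 *
 * Map a guest physical range for reading or for writing (is_write
 * selects the direction).  On return, *plen may be smaller than the
 * requested length, for example when only a bounce buffer could be
 * provided.
 */
void *cpu_physical_memory_map(target_phys_addr_t addr,
                              target_phys_addr_t *plen,
                              int is_write);

/* Release a mapping.  access_len is the number of bytes actually
 * transferred, so that only those bytes need to be copied back from a
 * bounce buffer or marked dirty.
 */
void cpu_physical_memory_unmap(void *buffer, target_phys_addr_t len,
                               int is_write, target_phys_addr_t access_len);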
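
And to make the partial-completion point concrete, a device model ends
up doing roughly this (backend_read() below is a made-up stand-in, not
something from the patch).  The only completion information it can hand
back is a single length; it cannot say precisely which bytes beyond
that were or were not touched:

/* Illustrative sketch only.  backend_read() stands in for whatever
 * (normally asynchronous) I/O the device model performs; in real code
 * the unmap would happen in the completion callback.
 */
static void dma_read_into_guest(target_phys_addr_t addr,
                                target_phys_addr_t len)
{
    target_phys_addr_t plen = len;
    /* is_write = 1: the device will write into guest memory. */
    void *buf = cpu_physical_memory_map(addr, &plen, 1);

    if (!buf) {
        /* No direct mapping and no bounce buffer free; retry later. */
        return;
    }

    /* 'done' is the backend's count of bytes known to have been
     * transferred.  With the request split up and completed out of
     * order underneath us, this is a conservative single length, not
     * a map of touched regions.
     */
    target_phys_addr_t done = backend_read(buf, plen);

    cpu_physical_memory_unmap(buf, plen, 1, done);
}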

>>> In a system where we're trying to do zero copy, we may issue the
>>> map request for a large transfer before we know how much the host
>>> kernel will actually provide.
>>>
>> Won't it be at least 1GB?  Partition your requests to that size.
>>
>
> No, I mean, before we know how much data qemu's read(2) will transfer.
>

You don't know afterwards either.  Maybe read() is specced as you say,
but practical implementations will return the minimum number of bytes
read, not the exact count.  Think software RAID.

>> In any case, this will only occur with mmio.  I don't think the
>> guest can assume much in such cases.
>>
>
> No, it won't only occur with mmio.
>
> In the initial implementation in Xen, we will almost certainly simply
> emulate everything with cpu_physical_memory_rw.  So it will happen all
> the time.
>

Try it out.  I'm sure it will work just fine (if incredibly slowly,
unless you provide multiple bounce buffers).

>>> Err, no, I don't really see that.  In my proposal the `handle' is
>>> actually allocated by the caller.  The implementation provides the
>>> private data, and that can be empty.  There is no additional memory
>>> allocation.
>>>
>> You need to store multiple handles (one per sg element), so you need
>> to allocate a variable-size vector for them.  Preallocation may be
>> possible but perhaps wasteful.
>>
>
> See my reply to Anthony Liguori, which shows how this can be avoided.
> Since you hope for a single call to map everything, you can do an sg
> list with a single handle.
>

That's a very different API.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.