From: Anthony Liguori
Date: Mon, 11 Jun 2012 20:33:52 -0500
Message-ID: <4FD69C80.3080807@codemonkey.ws>
In-Reply-To: <1339458400.9220.49.camel@pasglop>
Subject: Re: [Qemu-devel] [RFC] QOMification of AXI streams
To: Benjamin Herrenschmidt
Cc: Peter Maydell, Michal Simek, "qemu-devel@nongnu.org Developers",
 Peter Crosthwaite, Paul Brook, "Edgar E. Iglesias", Andreas Färber,
 John Williams, Avi Kivity

On 06/11/2012 06:46 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2012-06-11 at 17:29 -0500, Anthony Liguori wrote:
>
>> When we add a subregion, it always removes the offset from the address
>> when it dispatches.  More often than not that works out well, but for
>> what you're describing above, it sounds like you'd really want to get an
>> adjusted size (that could be transformed).
>>
>> Today we generate a linear dispatch table.  This prevents us from
>> applying device-level transforms.
>
> Subregions being relative to the parent would probably work for me.
>
> Typically I have something like:
>
>   0x1234_0000_0000 .. 0x1234_3fff_ffff : window to PCI memory
>
> Which generates PCI cycles into the range:
>
>   0x0000_c000_0000 .. 0x0000_ffff_ffff
>
> Which is a combination of bit masking & offset, ie, the HW algorithm is
> to forward some bits (0x3fff_ffff in this case) and replace the other
> ones (with 0xc000_0000 in this case).
>
> Is that properly represented in the current scheme by a subregion?

Does this mean that both 0x1234_0000_0000 and 0x0000_c000_0000 are valid
addresses for issuing PCI requests?

If they both are valid, that sounds pretty much like what aliases were
made for.
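For illustration, such a window maps fairly naturally onto the existing
alias machinery.  A rough sketch, with made-up names, assuming the current
memory_region_init_alias()/memory_region_add_subregion() signatures; note
this only models the downstream CPU-to-PCI direction, not the upstream DMA
path being discussed:

    static MemoryRegion pci_window;

    /* Expose a 1GB window at CPU address 0x1234_0000_0000 that aliases
     * PCI memory at 0xc000_0000.  The alias offset substitutes the top
     * bits (0xc000_0000) while the low bits (0x3fff_ffff) pass through
     * unchanged, matching the masking + offset behaviour above. */
    static void map_pci_window(MemoryRegion *pci_mem)
    {
        memory_region_init_alias(&pci_window, "pci-window", pci_mem,
                                 0xc0000000, 0x40000000);
        memory_region_add_subregion(get_system_memory(),
                                    0x123400000000ULL, &pci_window);
    }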
>> So if you stick with the notion of subregions, you would still have a
>> single MemoryRegion at the PCI bus layer that has all of its children as
>> subregions.  Presumably that "scan for all siblings" is a binary search,
>> which shouldn't really be that expensive considering that we're likely
>> to have a shallow depth in the memory hierarchy.
>
> Possibly, but on every DMA access ...

We can always add a TLB :-)

>>> Additionally, to handle iommus etc. you need the option for a given
>>> memory region to have functions to perform the transformation in the
>>> upstream direction.
>>
>> I think that transformation function lives in the bus-layer
>> MemoryRegion.  It's a bit tricky though, because you need some sort of
>> notion of "who is asking".  So you need:
>>
>>   dma_memory_write(MemoryRegion *parent, DeviceState *caller,
>>                    const void *data, size_t size);
>
> Why do you need the parent argument?  Can't it be implied from the
> DeviceState?  Or do you envision devices sitting on multiple busses?  Or
> is it just a matter that each device "class" will store that separately?

Yeah, you may have multiple MemoryRegions you want to write to.  I don't
think we should assume all devices have exactly one address space they
write to.
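As a purely hypothetical illustration of that point, a caller of such an
API might look as follows.  The addr parameter and the MyDevState fields
are my additions (a DMA write presumably also needs a target address,
which the prototype above elides):

    typedef struct MyDevState {
        DeviceState qdev;      /* illustrative qdev-style device      */
        MemoryRegion *bus_mr;  /* bus-level region the device sits on */
    } MyDevState;

    static void mydev_dma_write(MyDevState *s, target_phys_addr_t addr,
                                const void *buf, size_t len)
    {
        /* Passing the caller along with the region lets the bus layer
         * answer "who is asking" and apply that device's upstream
         * transform, e.g. an iommu translation. */
        dma_memory_write(s->bus_mr, &s->qdev, addr, buf, len);
    }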
> Also you want the parent to be the direct parent P2P, no?  Not the host
> bridge?  Ie. to be completely correct, you want to look at every sibling
> device under a given bridge, then go up if there is no match, etc...
> until you reach the PHB, at which point you hit the iommu and eventually
> the system fabric.

Yes.

>> Not quite... subtractive decoding only happens for very specific devices
>> IIUC.  For instance, a PCI-ISA bridge.  Normally it's positive decoding,
>> and a bridge has to describe the full region of MMIO/PIO that it
>> handles.
>
> That's for downstream.  Upstream is essentially subtractive as far as I
> can tell.  IE. if no sibling device decodes the cycle, then it goes up,
> no?

Why would that be necessary?  Are upstream requests p2p?  I would have
assumed that they went to the host bus and then were dispatched by the
host bus (but I'm just guessing).

>> So it's only necessary to traverse down the tree again for the very
>> special case of PCI-ISA bridges.  Normally you can tell just by looking
>> at siblings.
>>
>>> That will be a significant overhead for your DMA ops I believe, though
>>> doable.
>>
>> Worst case scenario: 256 devices with what, a 3-level-deep hierarchy?
>> We're still talking about 24 simple address compares.  That shouldn't be
>> so bad.
>
> Per DMA access...
>
>>> Then we'd have to add map/unmap to MemoryRegion as well, with the
>>> understanding that they may not be supported at every level...
>>
>> map/unmap can always fall back to bounce buffers.
>
> Right.
>
>>> So yeah, it sounds doable, and it would handle what DMAContext doesn't
>>> handle, which is access to peer devices without going all the way back
>>> to the "top level", but it's complex and ... I need something in qemu
>>> 1.2 :-)
>>
>> I think we need a longer-term vision here.  We can find incremental
>> solutions for the short term, but I'm pretty nervous about having two
>> parallel APIs only to discover that we need to converge in 2 years.
>
> Can we agree that what we need to avoid having to change every second
> day is the driver API?
>
> In which case, what about I come up with some pci_* DMA APIs such as the
> ones you suggested above and fix up a handful of devices we care about to
> use them?

Yes!  If I recall, that was what I suggested originally and you talked me
out of it :-)

I think that's the best future-proof design.

Regards,

Anthony Liguori

> Under the hood, we can make it use the DMAContext for now and have that
> working for us, but we can easily "merge" DMAContext and MemoryRegion
> later since it's not directly exposed to devices.
>
> IE. The devices would only use:
>
>   pci_device_read/write/map/unmap(PCIDevice, ...)
>
> And DMAContext remains buried in the implementation.
>
> Would that be ok with you?
>
> Cheers,
> Ben.
>
>> Regards,
>>
>> Anthony Liguori
>>
>>> In addition there's the memory barrier business, so we probably want to
>>> keep the idea of having DMA-specific accessors ...
>>>
>>> Could we keep the DMAContext for now and just rename it to MemoryRegion
>>> (keeping the accessors) when we go for a more in-depth transformation?
>>>
>>> Cheers,
>>> Ben.
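For concreteness, the thin device-facing wrappers being proposed could
plausibly be inline helpers over the in-flight DMAContext work, so that
DMAContext can later be merged with MemoryRegion without touching any
driver.  A sketch, assuming the dma_memory_read/write and pci_dma_context()
helpers from the DMAContext series (note these take a DMAContext, unlike
the MemoryRegion-based signature sketched earlier in the thread), with
names following the discussion rather than a committed API:

    /* Sketch: device-visible PCI DMA accessors that keep DMAContext as
     * an implementation detail of the PCI layer. */
    static inline int pci_device_read(PCIDevice *dev, dma_addr_t addr,
                                      void *buf, dma_addr_t len)
    {
        return dma_memory_read(pci_dma_context(dev), addr, buf, len);
    }

    static inline int pci_device_write(PCIDevice *dev, dma_addr_t addr,
                                       const void *buf, dma_addr_t len)
    {
        return dma_memory_write(pci_dma_context(dev), addr, buf, len);
    }

If DMAContext and MemoryRegion converge later, only pci_dma_context() and
these two helpers need to change; drivers keep calling pci_device_read()
and pci_device_write().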