LinuxPPC-Dev Archive on lore.kernel.org

* Re: kvm PCI assignment & VFIO ramblings
From: Alex Williamson @ 2011-07-30 18:20 UTC
  To: Benjamin Herrenschmidt
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, David Gibson, aafabbri,
	iommu, Anthony Liguori, linuxppc-dev, benve
In-Reply-To: <1311983933.8793.42.camel@pasglop>

On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> Hi folks !
> 
> So I promised Anthony I would try to summarize some of the comments &
> issues we have vs. VFIO after we've tried to use it for PCI pass-through
> on POWER. It's pretty long, there are various items with more or less
> impact, some of it is easily fixable, some are API issues, and we'll
> probably want to discuss them separately, but for now here's a brain
> dump.

Thanks Ben.  For those wondering what happened to VFIO and where it
lives now, Tom Lyon turned it over to me.  I've been continuing to hack
and bug fix and prep it for upstream.  My trees are here:

git://github.com/awilliam/linux-vfio.git vfio
git://github.com/awilliam/qemu-vfio.git vfio

I was hoping we were close to being ready for an upstream push, but we
obviously need to work through the issues Ben and company have been
hitting.

> David, Alexei, please make sure I haven't missed anything :-)
> 
> * Granularity of pass-through
> 
> So let's first start with what is probably the main issue and the most
> contentious, which is the problem of dealing with the various
> constraints which define the granularity of pass-through, along with
> exploiting features like the VTd iommu domains.
> 
> For the sake of clarity, let me first talk a bit about the "granularity"
> issue I've mentioned above.
> 
> There are various constraints that can/will force several devices to be
> "owned" by the same guest and on the same side of the host/guest
> boundary. This is generally because some kind of HW resource is shared
> and thus not doing so would break the isolation barrier and enable a
> guest to disrupt the operations of the host and/or another guest.
> 
> Some of those constraints are well known, such as shared interrupts. Some
> are more subtle, for example, if a PCIe->PCI bridge exists in the system,
> there is no way for the iommu to identify transactions from devices
> coming from the PCI segment of that bridge with a granularity other than
> "behind the bridge". So typically an EHCI/OHCI/OHCI combo (a classic)
> behind such a bridge must be treated as a single "entity" for
> pass-through purposes.

On x86, the USB controllers don't typically live behind a PCIe-to-PCI
bridge, so don't suffer the source identifier problem, but they do often
share an interrupt.  But even then, we can count on most modern devices
supporting PCI2.3, and thus the DisINTx feature, which allows us to
share interrupts.  In any case, yes, it's more rare but we need to know
how to handle devices behind PCI bridges.  However I disagree that we
need to assign all the devices behind such a bridge to the guest.
There's a difference between removing the device from the host and
exposing the device to the guest.  If I have a NIC and HBA behind a
bridge, it's perfectly reasonable that I might only assign the NIC to
the guest, but as you describe, we then need to prevent the host, or any
other guest from making use of the HBA.
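
As a rough illustration of the DisINTx mechanism mentioned above, here is
a minimal user-space sketch in C. The two bit values follow PCI 2.3
(command register bit 10 is Interrupt Disable, status register bit 3 is
Interrupt Status). The struct and function names are hypothetical and
only model config space; this is not the VFIO implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions defined by PCI 2.3. */
#define PCI_COMMAND_INTX_DISABLE 0x0400 /* command register, bit 10 */
#define PCI_STATUS_INTERRUPT     0x0008 /* status register, bit 3 */

/* Hypothetical stand-in for a device's command/status registers. */
struct fake_pci_dev {
	uint16_t command;
	uint16_t status;
};

/* True if this device is the one asserting the shared INTx line. */
static bool intx_pending(const struct fake_pci_dev *dev)
{
	return (dev->status & PCI_STATUS_INTERRUPT) != 0;
}

/* Mask INTx at the device so it stops asserting the shared line. */
static void intx_mask(struct fake_pci_dev *dev)
{
	dev->command |= PCI_COMMAND_INTX_DISABLE;
}

/* Clear the Interrupt Disable bit again, leaving other bits untouched. */
static void intx_unmask(struct fake_pci_dev *dev)
{
	dev->command &= (uint16_t)~PCI_COMMAND_INTX_DISABLE;
}

/*
 * Shared-interrupt handler sketch: claim the interrupt only if our
 * device raised it, and mask it at the source until it is serviced.
 * Returns true if the interrupt was ours.
 */
static bool handle_shared_intx(struct fake_pci_dev *dev)
{
	if (!intx_pending(dev))
		return false;
	intx_mask(dev);
	return true;
}
```

Because the mask lives in the device itself, other devices on the same
line keep working while this one waits to be serviced.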

> In IBM POWER land, we call this a "partitionable endpoint" (the term
> "endpoint" here is historic, such a PE can be made of several PCIe
> "endpoints"). I think "partitionable" is a pretty good name tho to
> represent the constraints, so I'll call this a "partitionable group"
> from now on. 
> 
> Other examples of such HW imposed constraints can be a shared iommu with
> no filtering capability (some older POWER hardware which we might want
> to support fall into that category, each PCI host bridge is its own
> domain but doesn't have a finer granularity... however those machines
> tend to have a lot of host bridges :)
> 
> If we are ever going to consider applying some of this to non-PCI
> devices (see the ongoing discussions here), then we will be faced with
> the craziness of embedded designers which probably means all sorts of new
> constraints we can't even begin to think about.
> 
> This leads me to those initial conclusions:
> 
> - The -minimum- granularity of pass-through is not always a single
> device and not always under SW control

But IMHO, we need to preserve the granularity of exposing a device to a
guest as a single device.  That might mean some devices are held hostage
by an agent on the host.
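
The "held hostage" policy can be sketched as a simple check. This is a
minimal sketch in C under stated assumptions: the `group` id, the
`owner` states, and every name below are hypothetical, and it models
only the rule that a device may be exposed to a guest when nothing else
in its group is still driven by the host or owned by another guest.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ownership states for a device. */
enum owner { OWNER_HOST, OWNER_STUB, OWNER_GUEST };

struct dev {
	int group;        /* hypothetical isolation-group id */
	enum owner owner;
	int guest;        /* guest id, meaningful only for OWNER_GUEST */
};

/*
 * A group is safe for 'guest' if no device in it is still driven by the
 * host or owned by a different guest. Devices parked on a stub driver
 * are fine: they are removed from the host but not exposed to anyone.
 */
static bool group_safe_for_guest(const struct dev *devs, size_t n,
				 int group, int guest)
{
	for (size_t i = 0; i < n; i++) {
		if (devs[i].group != group)
			continue;
		if (devs[i].owner == OWNER_HOST)
			return false;
		if (devs[i].owner == OWNER_GUEST && devs[i].guest != guest)
			return false;
	}
	return true;
}
```

With a NIC assigned to guest 7 and the HBA behind the same bridge parked
on a stub, the check passes; it fails as soon as the HBA is handed back
to a host driver.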

> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> knowledge of PCI resource management and getting it wrong in many many
> cases, something that took years to fix essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in a
> way or another what those constraints are, what those "partitionable
> groups" are.
> 
> - That does -not- mean that we cannot specify for each individual device
> within such a group where we want to put it in qemu (what devfn etc...).
> As long as there is a clear understanding that the "ownership" of the
> device goes with the group, this is somewhat orthogonal to how they are
> represented in qemu. (Not completely... if the iommu is exposed to the
> guest, via paravirt for example, some of these constraints must be
> exposed but I'll talk about that more later).

Or we can choose not to expose all of the devices in the group to the
guest?

> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains !), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.

To be fair, libvirt's "magic foo" is built out of the necessity that
nobody else is defining the rules.

> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.
> 
> I'll talk a little bit more about recent POWER iommu's here to
> illustrate where I'm coming from with my idea of groups:
> 
> On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> of domain and a per-RID filtering. However it differs from VTd in a few
> ways:
> 
> The "domains" (aka PEs) encompass more than just an iommu filtering
> scheme. The MMIO space and PIO space are also segmented, and those
> segments assigned to domains. Interrupts (well, MSI ports at least) are
> assigned to domains. Inbound PCIe error messages are targeted to
> domains, etc...
> 
> Basically, the PEs provide a very strong isolation feature which
> includes errors, and has the ability to immediately "isolate" a PE on
> the first occurrence of an error. For example, if an inbound PCIe error
> is signaled by a device on a PE or such a device does a DMA to a
> non-authorized address, the whole PE gets into error state. All
> subsequent stores (both DMA and MMIO) are swallowed and reads return all
> 1's, interrupts are blocked. This is designed to prevent any propagation
> of bad data, which is a very important feature in large high reliability
> systems.
> 
> Software then has the ability to selectively turn back on MMIO and/or
> DMA, perform diagnostics, reset devices etc...
> 
> Because the domains encompass more than just DMA, but also segment the
> MMIO space, it is not practical at all to dynamically reconfigure them
> at runtime to "move" devices into domains. The firmware or early kernel
> code (it depends) will assign device BARs using an algorithm that keeps
> them within PE segment boundaries, etc....
> 
> Additionally (and this is indeed a "restriction" compared to VTd, though
> I expect our future IO chips to lift it to some extent), PEs don't get
> separate DMA address spaces. There is one 64-bit DMA address space per
> PCI host bridge, and it is 'segmented' with each segment being assigned
> to a PE. Due to the way PE assignment works in hardware, it is not
> practical to make several devices share a segment unless they are on the
> same bus. Also the resulting limit in the amount of 32-bit DMA space a
> device can access means that it's impractical to put too many devices in
> a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> more about that later).
> 
> The above essentially extends the granularity requirement (or rather is
> another factor defining what the granularity of partitionable entities
> is). You can think of it as "pre-existing" domains.
> 
> I believe the way to solve that is to introduce a kernel interface to
> expose those "partitionable entities" to userspace. In addition, it
> occurs to me that the ability to manipulate VTd domains essentially
> boils down to manipulating those groups (creating larger ones with
> individual components).
> 
> I like the idea of defining / playing with those groups statically
> (using a command line tool or sysfs, possibly having a config file
> defining them in a persistent way) rather than having their lifetime
> tied to a uiommu file descriptor.
> 
> It also makes it a LOT easier to have a channel to manipulate
> platform/arch specific attributes of those domains if any.
> 
> So we could define an API or representation in sysfs that exposes what
> the partitionable entities are, and we may add to it an API to
> manipulate them. But we don't have to and I'm happy to keep the
> additional SW grouping you can do on VTd as a separate "add-on" API
> (tho I don't like at all the way it works with uiommu). However, qemu
> needs to know what the grouping is regardless of the domains, and it's
> not nice if it has to manipulate two different concepts here so
> eventually those "partitionable entities" from a qemu standpoint must
> look like domains.
> 
> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> the devices in there and expose them to the guest.
> 
> This can be done in a way that isn't PCI specific as well (the
> definition of the groups and what is grouped would obviously be
> somewhat bus specific and handled by platform code in the kernel).
> 
> Maybe something like /sys/devgroups ? This probably warrants involving
> more kernel people into the discussion.

I don't yet buy into passing groups to qemu since I don't buy into the
idea of always exposing all of those devices to qemu.  Would it be
sufficient to expose iommu nodes in sysfs that link to the devices
behind them and describe properties and capabilities of the iommu
itself?  More on this at the end.

> * IOMMU
> 
> Now more on iommu. I think I've described in enough detail how ours
> works, there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
> 
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
> 
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.
> 
> This means:
> 
>   - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
> 
>   - It requires the guest to be pinned. Pass-through -> no more swap
> 
>   - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices, so with a guest with more than a few G of RAM (I don't know the
> exact limit on x86, depends on your IO hole I suppose), you end up
> back to swiotlb & bounce buffering.
> 
>   - It doesn't work for POWER server anyways because of our need to
> provide a paravirt iommu interface to the guest since that's how pHyp
> works today and how existing OSes expect to operate.
> 
> Now some of this can be fixed with tweaks, and we've started doing it
> (we have a working pass-through using VFIO, forgot to mention that, it's
> just that we don't like what we had to do to get there).

This is a result of wanting to support *unmodified* x86 guests.  We
don't have the luxury of having a predefined pvDMA spec that all x86
OSes adhere to.  The 32bit problem is unfortunate, but the priority use
case for assigning devices to guests is high performance I/O, which
usually entails modern, 64bit hardware.  I'd like to see us get to the
point of having emulated IOMMU hardware on x86, which could then be
backed by VFIO, but for now guest pinning is the most practical and
useful.

> Basically, what we do today is:
> 
> - We add an ioctl to VFIO to expose to qemu the segment information. IE.
> What is the DMA address and size of the DMA "window" usable for a given
> device. This is a tweak, that should really be handled at the "domain"
> level.
> 
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specified in PAPR) which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?
> 
> - We hook up the paravirt call to insert/remove a translation from the
> iommu to the VFIO map/unmap ioctl's.
> 
> This limps along but it's not great. Some of the problems are:
> 
> - I've already mentioned, the domain problem again :-) 
> 
> - Performance sucks of course, the vfio map ioctl wasn't meant for that
> and has quite a bit of overhead. However we'll want to do the paravirt
> call directly in the kernel eventually ...
> 
>   - ... which isn't trivial to get back to our underlying arch specific
> iommu object from there. We'll probably need a set of arch specific
> "sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
> link them to the real thing kernel-side.
> 
> - PAPR (the specification of our paravirt interface and the expectation
> of current OSes) wants iommu pages to be 4k by default, regardless of
> the kernel host page size, which makes things a bit tricky since our
> enterprise host kernels have a 64k base page size. Additionally, we have
> new PAPR interfaces that we want to exploit, to allow the guest to
> create secondary iommu segments (in 64-bit space), which can be used
> (under guest control) to do things like map the entire guest (here it
> is :-) or use larger iommu page sizes (if permitted by the host kernel,
> in our case we could allow 64k iommu page size with a 64k host kernel).
> 
> The above means we need arch specific APIs. So arch specific vfio
> ioctl's, either that or kvm ones going to vfio or something ... the
> current structure of vfio/kvm interaction doesn't make it easy.

FYI, we also have large page support for x86 VT-d, but it seems to only
be opportunistic right now.  I'll try to come back to the rest of this
below.

> * IO space
> 
> On most (if not all) non-x86 archs, each PCI host bridge provides a
> completely separate PCI address space. Qemu doesn't deal with that very
> well. For MMIO it can be handled since those PCI address spaces are
> "remapped" holes in the main CPU address space so devices can be
> registered by using BAR + offset of that window in qemu MMIO mapping.
> 
> For PIO things get nasty. We have totally separate PIO spaces and qemu
> doesn't seem to like that. We can try to play the offset trick as well,
> we haven't tried yet, but basically that's another one to fix. Not a
> huge deal I suppose but heh ...
> 
> Also our next generation chipset may drop support for PIO completely.
> 
> On the other hand, because PIO is just a special range of MMIO for us,
> we can do normal pass-through on it and don't need any of the emulation
> done by qemu.

Maybe we can add mmap support to PIO regions on non-x86.

>   * MMIO constraints
> 
> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
> 
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforces
> validation anyways as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses, the guest
> will call hypercalls to configure things anyways.

With interrupt remapping, we can allow the guest access to the MSI-X
table, but since that takes the host out of the loop, there's
effectively no way for the guest to correctly program it directly by
itself.

> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are a
> multiple of the page size always). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot, there is no risk of side effect
> outside of the guest boundaries.

Sure, this could be some kind of capability flag, maybe even implicit in
certain configurations.

> In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> paravirt guests expect the BARs to have been already allocated for them
> by the firmware and will pick up the addresses from the device-tree :-)
> 
> Today we use a "hack", putting all 0's in there and triggering the linux
> code path to reassign unassigned resources (which will use BAR
> emulation) but that's not what we are -supposed- to do. Not a big deal
> and having the emulation there won't -hurt- us, it's just that we don't
> really need any of it.
> 
> We have a small issue with ROMs. Our current KVM only works with huge
> pages for guest memory but that is being fixed. So the way qemu maps the
> ROM copy into the guest address space doesn't work. It might be handy
> anyways to have a way for qemu to use MMIO emulation for ROM access as a
> fallback. I'll look into it.

So that means ROMs don't work for you on emulated devices either?  The
reason we read it once and map it into the guest is because Michael
Tsirkin found a section in the PCI spec that indicates devices can share
address decoders between BARs and ROM.  This means we can't just leave
the enabled bit set in the ROM BAR, because it could actually disable an
address decoder for a regular BAR.  We could slow-map the actual ROM,
enabling it around each read, but shadowing it seemed far more
efficient.
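
The read-once shadowing approach can be sketched as follows. This is a
minimal user-space model in C, not the QEMU code: the device struct and
helper names are hypothetical. The only hardware fact relied on is that
bit 0 of the expansion ROM BAR is the ROM enable bit.

```c
#include <stddef.h>
#include <stdint.h>

#define PCI_ROM_ADDRESS_ENABLE 0x01u /* bit 0 of the expansion ROM BAR */

/* Hypothetical model of a device with an expansion ROM. */
struct fake_rom_dev {
	uint32_t rom_bar;   /* modelled expansion ROM BAR */
	const uint8_t *rom; /* backing ROM contents */
	size_t rom_size;
};

/* Reads return 0xff unless the ROM decoder is enabled and in range. */
static uint8_t rom_read8(const struct fake_rom_dev *dev, size_t off)
{
	if (!(dev->rom_bar & PCI_ROM_ADDRESS_ENABLE) || off >= dev->rom_size)
		return 0xff;
	return dev->rom[off];
}

/*
 * Copy the ROM into 'buf' once, keeping the enable bit set only for the
 * duration of the copy so a shared address decoder is handed back to
 * the regular BARs afterwards. Returns the number of bytes copied.
 */
static size_t rom_shadow(struct fake_rom_dev *dev, uint8_t *buf, size_t len)
{
	size_t n = len < dev->rom_size ? len : dev->rom_size;

	dev->rom_bar |= PCI_ROM_ADDRESS_ENABLE;
	for (size_t i = 0; i < n; i++)
		buf[i] = rom_read8(dev, i);
	dev->rom_bar &= ~PCI_ROM_ADDRESS_ENABLE;
	return n;
}
```

The slow-map alternative would toggle the enable bit around every guest
read instead of once, which is why the shadow copy is cheaper.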

>   * EEH
> 
> This is the name of those fancy error handling & isolation features I
> mentioned earlier. To some extent it's a superset of AER, but we don't
> generally expose AER to guests (or even the host), it's swallowed by
> firmware into something else that provides a superset (well mostly) of
> the AER information, and allow us to do those additional things like
> isolating/de-isolating, reset control etc...
> 
> Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> huge deal, I mention it for completeness.

We expect to do AER via the VFIO netlink interface, which, even though
it's bashed below, would be quite extensible to supporting different
kinds of errors.

>    * Misc
> 
> There's lots of small bits and pieces... in no special order:
> 
>  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> netlink and a bit of ioctl's ... it's not like there's something
> fundamentally better for netlink vs. ioctl... it really depends what
> you are doing, and in this case I fail to see what netlink brings you
> other than bloat and more stupid userspace library deps.

The netlink interface is primarily for host->guest signaling.  I've only
implemented the remove command (since we're lacking a pcie-host in qemu
to do AER), but it seems to work quite well.  If you have suggestions
for how else we might do it, please let me know.  This seems to be the
sort of thing netlink is supposed to be used for.

>  - I don't like too much the fact that VFIO provides yet another
> different API to do what we already have at least 2 kernel APIs for, ie,
> BAR mapping and config space access. At least it should be better at
> using the backend infrastructure of the 2 others (sysfs & procfs). I
> understand it wants to filter in some case (config space) and -maybe-
> yet another API is the right way to go but allow me to have my doubts.

The use of PCI sysfs is actually one of my complaints about current
device assignment.  To do assignment with an unprivileged guest we need
to open the PCI sysfs config file for it, then change ownership on a
handful of other PCI sysfs files, then there's this other pci-stub thing
to maintain ownership, but the kvm ioctls don't actually require it and
can grab onto any free device...  We are duplicating some of that in
VFIO, but we also put the ownership of the device behind a single device
file.  We do have the uiommu problem that we can't give an unprivileged
user ownership of that, but your usage model may actually make that
easier.  More below...

> One thing I thought about but you don't seem to like it ... was to use
> the need to represent the partitionable entity as groups in sysfs that I
> talked about earlier. Those could have per-device subdirs with the usual
> config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. It might or might not
> be practical in the end, tbd, but it would allow apps using a slightly
> modified libpci for example to exploit some of this.

I may be tainted by our disagreement that all the devices in a group
need to be exposed to the guest and qemu could just take a pointer to a
sysfs directory.  That seems very unlike qemu and pushes more of the
policy into qemu, which seems like the wrong direction.

>  - The qemu vfio code hooks directly into ioapic ... of course that
> won't fly with anything !x86

I spent a lot of time looking for an architecture neutral solution here,
but I don't think it exists.  Please prove me wrong.  The problem is
that we have to disable INTx on an assigned device after it fires (VFIO
does this automatically).  If we don't do this, a non-responsive or
malicious guest could sit on the interrupt, causing it to fire
repeatedly as a DoS on the host.  The only indication that we can rely
on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
We can't just wait for device accesses because a) the device CSRs are
(hopefully) direct mapped and we'd have to slow map them or attempt to
do some kind of dirty logging to detect when they're accessed, and b) what
constitutes an interrupt service is device specific.

That means we need to figure out how PCI interrupt 'A' (or B...)
translates to a GSI (Global System Interrupt - ACPI definition, but
hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
which will also see the APIC EOI.  And just to spice things up, the
guest can change the PCI to GSI mappings via ACPI.  I think the set of
callbacks I've added are generic (maybe I left ioapic in the name), but
yes they do need to be implemented for other architectures.  Patches
appreciated from those with knowledge of the systems and/or access to
device specs.  This is the only reason that I make QEMU VFIO only build
for x86.
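
The mask-on-fire, unmask-on-EOI flow described above can be modelled as
a small state machine. This is a sketch in C under stated assumptions:
the struct, its fields, and the three entry points are hypothetical
names, and the GSI numbers are arbitrary. It is not the set of callbacks
in the QEMU VFIO tree.

```c
#include <stdbool.h>

/* Hypothetical per-device INTx state. */
struct intx_state {
	bool masked;       /* INTx disabled at the device */
	bool pending;      /* injected into the guest, awaiting EOI */
	int gsi;           /* GSI the guest currently routes this pin to */
	unsigned injected; /* number of interrupts delivered to the guest */
};

/* Host side: the physical interrupt fired. */
static void intx_fired(struct intx_state *s)
{
	if (s->masked)
		return;     /* device cannot re-fire until unmasked */
	s->masked = true;   /* stop a stuck guest from causing a storm */
	s->pending = true;
	s->injected++;
}

/* The guest wrote an EOI for 'gsi'. */
static void guest_eoi(struct intx_state *s, int gsi)
{
	if (gsi != s->gsi || !s->pending)
		return;     /* EOI for some other interrupt */
	s->pending = false;
	s->masked = false;  /* safe to let the device interrupt again */
}

/* The guest changed the PCI-to-GSI routing (e.g. via ACPI). */
static void guest_reroute(struct intx_state *s, int new_gsi)
{
	s->gsi = new_gsi;
}
```

The only architecture-specific part is how `guest_eoi` gets called,
which is where the IOAPIC hook comes in on x86.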

>  - The various "objects" dealt with here, -especially- interrupts and
> iommu, need a better in-kernel API so that fast in-kernel emulation can
> take over from qemu based emulation. The way we need to do some of this
> on POWER differs from x86. We can elaborate later, it's not necessarily
> a killer either but essentially we'll take the bulk of interrupt
> handling away from VFIO to the point where it won't see any of it at
> all.

The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
bypass QEMU.  This is exactly what VHOST does today and fairly trivial
to enable for MSI once we get it merged.  INTx would require us to be
able to define a level triggered irqfd in KVM and it's not yet clear if
we care that much about INTx performance.
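
For readers unfamiliar with eventfds, here is a minimal sketch in C of
the signaling primitive itself, using the Linux eventfd(2) call. In the
plan above KVM would consume the eventfd through an irqfd; in this
sketch a plain read() stands in for the consumer, and the two helper
names are hypothetical.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Producer side: add one to the eventfd counter (one "interrupt"). */
static int signal_irq(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consumer side: read and reset the counter. Returns the number of
 * signals accumulated since the last read, or -1 if none are pending
 * on a non-blocking eventfd (or on any other read error).
 */
static int64_t drain_irqs(int efd)
{
	uint64_t count;

	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return (int64_t)count;
}
```

A caller would create the descriptor with `eventfd(0, EFD_NONBLOCK)`.
Signals coalesce into a counter, which suits edge-style MSI delivery; a
level-triggered INTx needs extra state, as noted above.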

We don't currently have a plan for accelerating IOMMU access since our
current usage model doesn't need one.  We also need to consider MSI-X
table acceleration for x86.  I hope we'll be able to use the new KVM
ioctls for this.

>   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> majority here is platform devices. There's quite a bit of vfio that
> isn't intrinsically PCI specific. We could have an in-kernel platform
> driver like we have an in-kernel PCI driver to attach to. The mapping of
> resources to userspace is rather generic, as goes for interrupts. I
> don't know whether that idea can be pushed much further, I don't have
> the bandwidth to look into it much at this point, but maybe it would be
> possible to refactor vfio a bit to better separate what is PCI specific
> to what is not. The idea would be to move the PCI specific bits to
> inside the "placeholder" PCI driver, and same goes for platform bits.
> "generic" ioctl's go to VFIO core, anything it doesn't handle, it
> passes to the driver which allows the PCI one to handle things
> differently than the platform one, maybe an amba one while at it,
> etc.... just a thought, I haven't gone into the details at all.

This is on my radar, but I don't have a good model for it either.  I
suspect there won't be a whole lot left of VFIO if we make all the PCI
bits optional.  The right approach might be to figure out what's missing
between UIO and VFIO for non-PCI, implement that as a driver, then see
if we can base VFIO on using that for MMIO/PIO/INTx, leaving config and
MSI as a VFIO layer on top of the new UIO driver.

> I think that's all I had on my plate today, it's a long enough email
> anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> wiki-disabled myself so he proposed to pickup my email and do that. We
> should probably discuss the various items in here separately as
> different threads to avoid too much confusion.
> 
> One other thing we should do on our side is publish somewhere our
> current hacks to get you an idea of where we are going and what we had
> to do (code speaks more than words). We'll try to do that asap, possibly
> next week.
> 
> Note that I'll be on/off the next few weeks, travelling and doing
> bringup. So expect latency in my replies.

Thanks for the write up, I think it will be good to let everyone digest
it before we discuss this at KVM forum.

Rather than your "groups" idea, I've been mulling over whether we can
just expose the dependencies, configuration, and capabilities in sysfs
and build qemu command lines to describe it.  For instance, if we simply
start with creating iommu nodes in sysfs, we could create links under
each iommu directory to the devices behind them.  Some kind of
capability file could define properties like whether it's page table
based or fixed iova window or the granularity of mapping the devices
behind it.  Once we have that, we could probably make uiommu attach to
each of those nodes.
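
To make the capability idea concrete, here is a minimal sketch in C of
how a consumer might validate a DMA mapping request against such a
description. Every name here is an assumption: no such sysfs files or
structures exist yet, and the fields simply mirror the properties listed
above (page table based vs. fixed iova window, mapping granularity).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical IOMMU flavours a capability file might describe. */
enum iommu_type { IOMMU_PAGE_TABLE, IOMMU_FIXED_WINDOW };

struct iommu_caps {
	enum iommu_type type;
	uint64_t window_start; /* used for IOMMU_FIXED_WINDOW only */
	uint64_t window_size;  /* used for IOMMU_FIXED_WINDOW only */
	uint64_t page_size;    /* mapping granularity, must be non-zero */
};

/* Check whether [iova, iova + len) is a mapping this IOMMU can accept. */
static bool iova_map_ok(const struct iommu_caps *c, uint64_t iova,
			uint64_t len)
{
	/* Mappings must be non-empty and aligned to the granularity. */
	if (len == 0 || iova % c->page_size || len % c->page_size)
		return false;
	/* Reject ranges that wrap around the 64-bit address space. */
	if (iova + len < iova)
		return false;
	if (c->type == IOMMU_FIXED_WINDOW) {
		if (iova < c->window_start || len > c->window_size)
			return false;
		/* Offset into the window plus len must fit inside it. */
		if (iova - c->window_start > c->window_size - len)
			return false;
	}
	return true;
}
```

A page-table IOMMU accepts any aligned range, while a fixed-window one
(such as a segmented PE window) rejects anything outside its segment.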

That means we know /dev/uiommu7 (random example) is our access to a
specific iommu with a given set of devices behind it.  If that iommu is
a PE (via those capability files), then a user space entity (trying hard
not to call it libvirt) can unbind all those devices from the host,
maybe bind the ones it wants to assign to a guest to vfio and bind the
others to pci-stub for safe keeping.  If you trust a user with
everything in a PE, bind all the devices to VFIO, chown all
the /dev/vfioX entries for those devices, and the /dev/uiommuX device.

We might then come up with qemu command lines to describe interesting
configurations, such as:

-device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
-device pci-bus,...,iommu=iommu.0,id=pci.0 \
-device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0

The userspace entity would obviously need to put things in the same PE
in the right place, but it doesn't seem to take a lot of sysfs info to
get that right.

Today we do DMA mapping via the VFIO device because the capabilities of
the IOMMU domains change depending on which devices are connected (for
VT-d, the least common denominator of the IOMMUs in play).  Forcing the
DMA mappings through VFIO naturally forces the call order.  If we moved
to something like above, we could switch the DMA mapping to the uiommu
device, since the IOMMU would have fixed capabilities.

What gaps would something like this leave for your IOMMU granularity
problems?  I'll need to think through how it works when we don't want to
expose the iommu to the guest, maybe a model=none (default) that doesn't
need to be connected to a pci bus and maps all guest memory.  Thanks,

Alex

* [PATCH] perf: powerpc: Disable pagefaults during callchain stack read
From: David Ahern @ 2011-07-30 20:53 UTC
  To: benh, anton
  Cc: Peter Zijlstra, peterz, linux-kernel, paulus, acme, David Ahern,
	mingo, linuxppc-dev

Panic observed on an older kernel when collecting call chains for
the context-switch software event:

 [<b0180e00>]rb_erase+0x1b4/0x3e8
 [<b00430f4>]__dequeue_entity+0x50/0xe8
 [<b0043304>]set_next_entity+0x178/0x1bc
 [<b0043440>]pick_next_task_fair+0xb0/0x118
 [<b02ada80>]schedule+0x500/0x614
 [<b02afaa8>]rwsem_down_failed_common+0xf0/0x264
 [<b02afca0>]rwsem_down_read_failed+0x34/0x54
 [<b02aed4c>]down_read+0x3c/0x54
 [<b0023b58>]do_page_fault+0x114/0x5e8
 [<b001e350>]handle_page_fault+0xc/0x80
 [<b0022dec>]perf_callchain+0x224/0x31c
 [<b009ba70>]perf_prepare_sample+0x240/0x2fc
 [<b009d760>]__perf_event_overflow+0x280/0x398
 [<b009d914>]perf_swevent_overflow+0x9c/0x10c
 [<b009db54>]perf_swevent_ctx_event+0x1d0/0x230
 [<b009dc38>]do_perf_sw_event+0x84/0xe4
 [<b009dde8>]perf_sw_event_context_switch+0x150/0x1b4
 [<b009de90>]perf_event_task_sched_out+0x44/0x2d4
 [<b02ad840>]schedule+0x2c0/0x614
 [<b0047dc0>]__cond_resched+0x34/0x90
 [<b02adcc8>]_cond_resched+0x4c/0x68
 [<b00bccf8>]move_page_tables+0xb0/0x418
 [<b00d7ee0>]setup_arg_pages+0x184/0x2a0
 [<b0110914>]load_elf_binary+0x394/0x1208
 [<b00d6e28>]search_binary_handler+0xe0/0x2c4
 [<b00d834c>]do_execve+0x1bc/0x268
 [<b0015394>]sys_execve+0x84/0xc8
 [<b001df10>]ret_from_syscall+0x0/0x3c

A page fault occurred walking the callchain while creating a perf
sample for the context-switch event. To handle the page fault the
mmap_sem is needed, but it is currently held by setup_arg_pages.
(setup_arg_pages calls shift_arg_pages with the mmap_sem held.
shift_arg_pages then calls move_page_tables which has a cond_resched
at the top of its for loop - hitting that cond_resched is what caused
the context switch.)
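
The effect of the fix can be modelled in user space. The sketch below is
in C and is only a model: the real pagefault_disable() and
__get_user_inatomic() are kernel primitives, and here they are replaced
by hypothetical stand-ins (a depth counter and model_get_user()) to show
why a disabled fault handler fails fast instead of sleeping on mmap_sem.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-ins for kernel state; all of this is a user-space model. */
static int pagefault_disabled_depth;
static bool page_resident;          /* is the user stack page mapped? */
static unsigned long user_word = 0x1234;
static int sleeps;                  /* times we "slept" on mmap_sem */

static void pagefault_disable(void) { pagefault_disabled_depth++; }
static void pagefault_enable(void)  { pagefault_disabled_depth--; }

/* Model of a user access that may fault. */
static int model_get_user(unsigned long *ret)
{
	if (!page_resident) {
		if (pagefault_disabled_depth > 0)
			return -EFAULT; /* fail fast, never touch mmap_sem */
		sleeps++;               /* would take mmap_sem and may sleep */
		page_resident = true;
	}
	*ret = user_word;
	return 0;
}

/* Single-exit shape, so disable and enable are always paired. */
static int read_user_stack(unsigned long *ret)
{
	int rc;

	pagefault_disable();
	rc = model_get_user(ret);
	pagefault_enable();
	return rc;
}
```

With the page not resident, read_user_stack() returns -EFAULT without
"sleeping", so the callchain walk can stop or fall back rather than
deadlocking on a semaphore the interrupted context already holds.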

This is an extension of Anton's proposed patch:
https://lkml.org/lkml/2011/7/24/151
adding a case for 32-bit ppc.

Tested on the system that first generated the panic and then again
with latest kernel using a PPC VM. I am not able to test the 64-bit
path - I do not have H/W for it and 64-bit PPC VMs (qemu on Intel)
are horribly slow.

Signed-off-by: David Ahern <dsahern@gmail.com>
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Anton Blanchard <anton@samba.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Paul Mackerras <paulus@samba.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-kernel@vger.kernel.org

---
 arch/powerpc/kernel/perf_callchain.c |   20 +++++++++++++++++---
 1 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/perf_callchain.c b/arch/powerpc/kernel/perf_callchain.c
index d05ae42..564c1d8 100644
--- a/arch/powerpc/kernel/perf_callchain.c
+++ b/arch/powerpc/kernel/perf_callchain.c
@@ -154,8 +154,12 @@ static int read_user_stack_64(unsigned long __user *ptr, unsigned long *ret)
 	    ((unsigned long)ptr & 7))
 		return -EFAULT;
 
-	if (!__get_user_inatomic(*ret, ptr))
+	pagefault_disable();
+	if (!__get_user_inatomic(*ret, ptr)) {
+		pagefault_enable();
 		return 0;
+	}
+	pagefault_enable();
 
 	return read_user_stack_slow(ptr, ret, 8);
 }
@@ -166,8 +170,12 @@ static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
 	    ((unsigned long)ptr & 3))
 		return -EFAULT;
 
-	if (!__get_user_inatomic(*ret, ptr))
+	pagefault_disable();
+	if (!__get_user_inatomic(*ret, ptr)) {
+		pagefault_enable();
 		return 0;
+	}
+	pagefault_enable();
 
 	return read_user_stack_slow(ptr, ret, 4);
 }
@@ -294,11 +302,17 @@ static inline int current_is_64bit(void)
  */
 static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
 {
+	int rc;
+
 	if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
 	    ((unsigned long)ptr & 3))
 		return -EFAULT;
 
-	return __get_user_inatomic(*ret, ptr);
+	pagefault_disable();
+	rc = __get_user_inatomic(*ret, ptr);
+	pagefault_enable();
+
+	return rc;
 }
 
 static inline void perf_callchain_user_64(struct perf_callchain_entry *entry,
-- 
1.7.6

^ permalink raw reply related

* Re: kvm PCI assignment & VFIO ramblings
From: Benjamin Herrenschmidt @ 2011-07-30 22:21 UTC (permalink / raw)
  To: kvm
  Cc: Alexey Kardashevskiy, Paul Mackerras, linux-pci@vger.kernel.org,
	David Gibson, Alex Williamson, Anthony Liguori, linuxppc-dev
In-Reply-To: <1311983933.8793.42.camel@pasglop>

On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> Hi folks !
> 
> So I promised Anthony I would try to summarize some of the comments &
> issues we have vs. VFIO after we've tried to use it for PCI pass-through
> on POWER. It's pretty long, there are various items with more or less
> impact, some of it is easily fixable, some are API issues, and we'll
> probably want to discuss them separately, but for now here's a brain
> dump.
> 
> David, Alexei, please make sure I haven't missed anything :-)

And I think I have :-)

  * Config space

VFIO currently handles that as a byte stream. It's quite gross to be
honest and it's not right. You shouldn't lose access size information
between guest and host when performing real accesses.

Some config space registers can have side effects and not respecting
access sizes can be nasty.
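
To make the point concrete, here is a rough sketch of what "keeping the access size" could look like. The helper names are made up for illustration and this is not the VFIO interface. The idea is that a guest access is forwarded as one access of the same width when it is a naturally aligned 1, 2 or 4 byte access, whereas a byte stream turns one guest access into several device accesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the VFIO ABI: decide whether a guest config
 * space access can be forwarded to the device as a single access of the
 * same width.
 */
bool cfg_access_ok(uint32_t offset, uint32_t size, uint32_t space_size)
{
	if (size != 1 && size != 2 && size != 4)
		return false;   /* only byte, word and dword accesses exist */
	if (offset % size != 0)
		return false;   /* must be naturally aligned */
	return offset <= space_size && size <= space_size - offset;
}

/* Number of device accesses generated for one guest access. */
unsigned int accesses_as_byte_stream(uint32_t size)
{
	return size;            /* one device access per byte */
}

unsigned int accesses_size_preserving(uint32_t size)
{
	(void)size;
	return 1;               /* the guest's width is kept as-is */
}
```

A register with read or write side effects sees four events instead of one when a 32-bit access goes through the byte-stream path, which is where the nastiness comes from.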

Cheers,
Ben.

> * Granularity of pass-through
> 
> So let's first start with what is probably the main issue and the most
> contentious, which is the problem of dealing with the various
> constraints which define the granularity of pass-through, along with
> exploiting features like the VTd iommu domains.
> 
> For the sake of clarity, let me first talk a bit about the "granularity"
> issue I've mentioned above.
> 
> There are various constraints that can/will force several devices to be
> "owned" by the same guest and on the same side of the host/guest
> boundary. This is generally because some kind of HW resource is shared
> and thus not doing so would break the isolation barrier and enable a
> guest to disrupt the operations of the host and/or another guest.
> 
> Some of those constraints are well known, such as shared interrupts. Some
> are more subtle, for example, if a PCIe->PCI bridge exists in the system,
> there is no way for the iommu to identify transactions from devices
> coming from the PCI segment of that bridge with a granularity other than
> "behind the bridge". So typically an EHCI/OHCI/OHCI combo (a classic)
> behind such a bridge must be treated as a single "entity" for
> pass-through purposes.
> 
> In IBM POWER land, we call this a "partitionable endpoint" (the term
> "endpoint" here is historic, such a PE can be made of several PCIe
> "endpoints"). I think "partitionable" is a pretty good name tho to
> represent the constraints, so I'll call this a "partitionable group"
> from now on. 
> 
> Other examples of such HW imposed constraints can be a shared iommu with
> no filtering capability (some older POWER hardware which we might want
> to support fall into that category, each PCI host bridge is its own
> domain but doesn't have a finer granularity... however those machines
> tend to have a lot of host bridges :)
> 
> If we are ever going to consider applying some of this to non-PCI
> devices (see the ongoing discussions here), then we will be faced with
> the craziness of embedded designers which probably means all sorts of new
> constraints we can't even begin to think about.
> 
> This leads me to those initial conclusions:
> 
> - The -minimum- granularity of pass-through is not always a single
> device and not always under SW control
> 
> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> knowledge of PCI resource management and getting it wrong in many many
> cases, something that took years to fix essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in a
> way or another what those constraints are, what those "partitionable
> groups" are.
> 
> - That does -not- mean that we cannot specify for each individual device
> within such a group where we want to put it in qemu (what devfn etc...).
> As long as there is a clear understanding that the "ownership" of the
> device goes with the group, this is somewhat orthogonal to how they are
> represented in qemu. (Not completely... if the iommu is exposed to the
> guest, via paravirt for example, some of these constraints must be
> exposed but I'll talk about that more later).
> 
> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains !), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.
> 
> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.
> 
> I'll talk a little bit more about recent POWER iommu's here to
> illustrate where I'm coming from with my idea of groups:
> 
> On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> of domain and a per-RID filtering. However it differs from VTd in a few
> ways:
> 
> The "domains" (aka PEs) encompass more than just an iommu filtering
> scheme. The MMIO space and PIO space are also segmented, and those
> segments assigned to domains. Interrupts (well, MSI ports at least) are
> assigned to domains. Inbound PCIe error messages are targeted to
> domains, etc...
> 
> Basically, the PEs provide a very strong isolation feature which
> includes errors, and has the ability to immediately "isolate" a PE on
> the first occurrence of an error. For example, if an inbound PCIe error
> is signaled by a device on a PE or such a device does a DMA to a
> non-authorized address, the whole PE gets into error state. All
> subsequent stores (both DMA and MMIO) are swallowed and reads return all
> 1's, interrupts are blocked. This is designed to prevent any propagation
> of bad data, which is a very important feature in large high reliability
> systems.
> 
> Software then has the ability to selectively turn back on MMIO and/or
> DMA, perform diagnostics, reset devices etc...
> 
> Because the domains encompass more than just DMA, but also segment the
> MMIO space, it is not practical at all to dynamically reconfigure them
> at runtime to "move" devices into domains. The firmware or early kernel
> code (it depends) will assign devices BARs using an algorithm that keeps
> them within PE segment boundaries, etc....
> 
> Additionally (and this is indeed a "restriction" compared to VTd, though
> I expect our future IO chips to lift it to some extent), PE don't get
> separate DMA address spaces. There is one 64-bit DMA address space per
> PCI host bridge, and it is 'segmented' with each segment being assigned
> to a PE. Due to the way PE assignment works in hardware, it is not
> practical to make several devices share a segment unless they are on the
> same bus. Also the resulting limit in the amount of 32-bit DMA space a
> device can access means that it's impractical to put too many devices in
> a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> more about that later).
> 
> The above essentially extends the granularity requirement (or rather is
> another factor defining what the granularity of partitionable entities
> is). You can think of it as "pre-existing" domains.
> 
> I believe the way to solve that is to introduce a kernel interface to
> expose those "partitionable entities" to userspace. In addition, it
> occurs to me that the ability to manipulate VTd domains essentially
> boils down to manipulating those groups (creating larger ones with
> individual components).
> 
> I like the idea of defining / playing with those groups statically
> (using a command line tool or sysfs, possibly having a config file
> defining them in a persistent way) rather than having their lifetime
> tied to a uiommu file descriptor.
> 
> It also makes it a LOT easier to have a channel to manipulate
> platform/arch specific attributes of those domains if any.
> 
> So we could define an API or representation in sysfs that exposes what
> the partitionable entities are, and we may add to it an API to
> manipulate them. But we don't have to and I'm happy to keep the
> additional SW grouping you can do on VTd as a separate "add-on" API
> (tho I don't like at all the way it works with uiommu). However, qemu
> needs to know what the grouping is regardless of the domains, and it's
> not nice if it has to manipulate two different concepts here so
> eventually those "partitionable entities" from a qemu standpoint must
> look like domains.
> 
> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> the devices in there and expose them to the guest.
> 
> This can be done in a way that isn't PCI specific as well (the
> definition of the groups and what is grouped would obviously be
> somewhat bus specific and handled by platform code in the kernel).
> 
> Maybe something like /sys/devgroups ? This probably warrants involving
> more kernel people into the discussion.
> 
> * IOMMU
> 
> Now more on iommu. I've described I think in enough details how ours
> work, there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
> 
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
> 
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.
> 
> This means:
> 
>   - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
> 
>   - It requires the guest to be pinned. Pass-through -> no more swap
> 
>   - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices, thus a guest with more than a few G of RAM (I don't know the
> exact limit on x86, depends on your IO hole I suppose), and you end up
> back to swiotlb & bounce buffering.
> 
>   - It doesn't work for POWER server anyways because of our need to
> provide a paravirt iommu interface to the guest since that's how pHyp
> works today and how existing OSes expect to operate.
> 
> Now some of this can be fixed with tweaks, and we've started doing it
> (we have a working pass-through using VFIO, forgot to mention that, it's
> just that we don't like what we had to do to get there).
> 
> Basically, what we do today is:
> 
> - We add an ioctl to VFIO to expose to qemu the segment information. IE.
> What is the DMA address and size of the DMA "window" usable for a given
> device. This is a tweak, that should really be handled at the "domain"
> level.
> 
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specified in PAPR) which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?
> 
> - We hook up the paravirt call to insert/remove a translation from the
> iommu to the VFIO map/unmap ioctl's.
> 
> This limps along but it's not great. Some of the problems are:
> 
> - I've already mentioned, the domain problem again :-) 
> 
> - Performance sucks of course, the vfio map ioctl wasn't meant for that
> and has quite a bit of overhead. However we'll want to do the paravirt
> call directly in the kernel eventually ...
> 
>   - ... which isn't trivial to get back to our underlying arch specific
> iommu object from there. We'll probably need a set of arch specific
> "sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
> link them to the real thing kernel-side.
> 
> - PAPR (the specification of our paravirt interface and the expectation
> of current OSes) wants iommu pages to be 4k by default, regardless of
> the kernel host page size, which makes things a bit tricky since our
> enterprise host kernels have a 64k base page size. Additionally, we have
> new PAPR interfaces that we want to exploit, to allow the guest to
> create secondary iommu segments (in 64-bit space), which can be used
> (under guest control) to do things like map the entire guest (here it
> is :-) or use larger iommu page sizes (if permitted by the host kernel,
> in our case we could allow 64k iommu page size with a 64k host kernel).
> 
> The above means we need arch specific APIs. So arch specific vfio
> ioctl's, either that or kvm ones going to vfio or something ... the
> current structure of vfio/kvm interaction doesn't make it easy.
> 
> * IO space
> 
> On most (if not all) non-x86 archs, each PCI host bridge provides a
> completely separate PCI address space. Qemu doesn't deal with that very
> well. For MMIO it can be handled since those PCI address spaces are
> "remapped" holes in the main CPU address space so devices can be
> registered by using BAR + offset of that window in qemu MMIO mapping.
> 
> For PIO things get nasty. We have totally separate PIO spaces and qemu
> doesn't seem to like that. We can try to play the offset trick as well,
> we haven't tried yet, but basically that's another one to fix. Not a
> huge deal I suppose but heh ...
> 
> Also our next generation chipset may drop support for PIO completely.
> 
> On the other hand, because PIO is just a special range of MMIO for us,
> we can do normal pass-through on it and don't need any of the emulation
> done by qemu.
> 
>   * MMIO constraints
> 
> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
> 
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforces
> validation anyways as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses, the guest
> will call hypercalls to configure things anyways.
> 
> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are a
> multiple of the page size always). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot, there is no risk of side effect
> outside of the guest boundaries.
> 
> In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> paravirt guests expect the BARs to have been already allocated for them
> by the firmware and will pick up the addresses from the device-tree :-)
> 
> Today we use a "hack", putting all 0's in there and triggering the linux
> code path to reassign unassigned resources (which will use BAR
> emulation) but that's not what we are -supposed- to do. Not a big deal
> and having the emulation there won't -hurt- us, it's just that we don't
> really need any of it.
> 
> We have a small issue with ROMs. Our current KVM only works with huge
> pages for guest memory but that is being fixed. So the way qemu maps the
> ROM copy into the guest address space doesn't work. It might be handy
> anyways to have a way for qemu to use MMIO emulation for ROM access as a
> fallback. I'll look into it.
> 
>   * EEH
> 
> This is the name of those fancy error handling & isolation features I
> mentioned earlier. To some extent it's a superset of AER, but we don't
> generally expose AER to guests (or even the host), it's swallowed by
> firmware into something else that provides a superset (well mostly) of
> the AER information, and allow us to do those additional things like
> isolating/de-isolating, reset control etc...
> 
> Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> huge deal, I mention it for completeness.
> 
>    * Misc
> 
> There's lots of small bits and pieces... in no special order:
> 
>  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> netlink and a bit of ioctl's ... it's not like there's something
> fundamentally  better for netlink vs. ioctl... it really depends what
> you are doing, and in this case I fail to see what netlink brings you
> other than bloat and more stupid userspace library deps.
> 
>  - I don't like too much the fact that VFIO provides yet another
> different API to do what we already have at least 2 kernel APIs for, ie,
> BAR mapping and config space access. At least it should be better at
> using the backend infrastructure of the 2 others (sysfs & procfs). I
> understand it wants to filter in some case (config space) and -maybe-
> yet another API is the right way to go but allow me to have my doubts.
> 
> One thing I thought about but you don't seem to like it ... was to use
> the need to represent the partitionable entity as groups in sysfs that I
> talked about earlier. Those could have per-device subdirs with the usual
> config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. It might or might not
> be practical in the end, tbd, but it would allow apps using a slightly
> modified libpci for example to exploit some of this.
> 
>  - The qemu vfio code hooks directly into ioapic ... of course that
> won't fly with anything !x86
> 
>  - The various "objects" dealt with here, -especially- interrupts and
> iommu, need a better in-kernel API so that fast in-kernel emulation can
> take over from qemu based emulation. The way we need to do some of this
> on POWER differs from x86. We can elaborate later, it's not necessarily
> a killer either but essentially we'll take the bulk of interrupt
> handling away from VFIO to the point where it won't see any of it at
> all.
> 
>   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> majority here is platform devices. There's quite a bit of vfio that
> isn't intrinsically PCI specific. We could have an in-kernel platform
> driver like we have an in-kernel PCI driver to attach to. The mapping of
> resources to userspace is rather generic, as goes for interrupts. I
> don't know whether that idea can be pushed much further, I don't have
> the bandwidth to look into it much at this point, but maybe it would be
> possible to refactor vfio a bit to better separate what is PCI specific
> to what is not. The idea would be to move the PCI specific bits to
> inside the "placeholder" PCI driver, and same goes for platform bits.
> "generic" ioctl's go to VFIO core, anything it doesn't handle, it
> passes to the driver which allows the PCI one to handle things
> differently than the platform one, maybe an amba one while at it,
> etc.... just a thought, I haven't gone into the details at all.
> 
> I think that's all I had on my plate today, it's a long enough email
> anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> wiki-disabled myself so he proposed to pickup my email and do that. We
> should probably discuss the various items in here separately as
> different threads to avoid too much confusion.
> 
> One other thing we should do on our side is publish somewhere our
> current hacks to get you an idea of where we are going and what we had
> to do (code speaks more than words). We'll try to do that asap, possibly
> next week.
> 
> Note that I'll be on/off the next few weeks, travelling and doing
> bringup. So expect latency in my replies.
> 
> Cheers,
> Ben.

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: Benjamin Herrenschmidt @ 2011-07-30 23:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, David Gibson, aafabbri,
	iommu, Anthony Liguori, linuxppc-dev, benve
In-Reply-To: <1312050011.2265.185.camel@x201.home>

On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:

> On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> bridge, so don't suffer the source identifier problem, but they do often
> share an interrupt.  But even then, we can count on most modern devices
> supporting PCI2.3, and thus the DisINTx feature, which allows us to
> share interrupts.  In any case, yes, it's more rare but we need to know
> how to handle devices behind PCI bridges.  However I disagree that we
> need to assign all the devices behind such a bridge to the guest.

Well, ok so let's dig a bit more here :-) First, yes I agree they don't
all need to appear to the guest. My point is really that we must prevent
them from being "used" by somebody else, either host or another guest.

Now once you get there, I personally prefer having a clear "group"
ownership rather than having devices stay in some "limbo" under vfio
control but it's an implementation detail.

Regarding DisINTx, well, it's a bit like putting separate PCIe functions
into separate guests, it looks good ... but you are taking a chance.
Note that I do intend to do some of that for power ... well I think, I
haven't completely made up my mind.

pHyp has a stricter requirement, PEs essentially are everything
behind a bridge. If you have a slot, you have some kind of bridge above
this slot and everything on it will be a PE.

The problem I see is that with your filtering of config space, BAR
emulation, DisINTx etc... you essentially assume that you can reasonably
reliably isolate devices. But in practice, it's chancy. Some devices for
example have "backdoors" into their own config space via MMIO. If I have
such a device in a guest, I can completely override your DisINTx and
thus DOS your host or another guest with a shared interrupt. I can move
my MMIO around and DOS another function by overlapping the addresses.
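
A tiny model of why DisINTx alone is a weak guarantee, as a sketch only. The two bit positions are the PCI 2.3 Interrupt Disable bit in the command register and the Interrupt Status bit in the status register. Everything else (the struct, the function names, the notion of a "backdoor write") is a hypothetical stand-in for whatever device-specific MMIO alias a given device might offer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CMD_INTX_DISABLE 0x0400u /* command register bit 10 (PCI 2.3) */
#define STS_INTX_PENDING 0x0008u /* status register bit 3 (PCI 2.3) */

/* Hypothetical model of one PCI function's command/status registers. */
struct fn_model {
	uint16_t command;
	uint16_t status;
};

/* Host masks the function's INTx so a shared line can be used by others. */
void host_mask_intx(struct fn_model *f)
{
	f->command |= CMD_INTX_DISABLE;
}

/* The line is driven only if an interrupt is pending and not disabled. */
bool intx_line_asserted(const struct fn_model *f)
{
	return (f->status & STS_INTX_PENDING) &&
	       !(f->command & CMD_INTX_DISABLE);
}

/*
 * Assumed device-specific backdoor: the guest reaches the command
 * register through MMIO, so config space filtering never sees the write.
 */
void guest_backdoor_write_command(struct fn_model *f, uint16_t val)
{
	f->command = val;
}
```

In this model the host's mask holds only as long as every path to the command register goes through the filter, which is exactly what such a backdoor breaks.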

You can really only protect yourself against a device if you have it
behind a bridge (in addition to having a filtering iommu), which limits
the MMIO span (and thus letting the guest whack the BARs randomly will
only allow that guest to shoot itself in the foot).

Some bridges also provide a way to block INTx below them which comes in
handy but it's bridge specific. Some devices can be coerced to send the
INTx "assert" message and never de-assert it (for example by doing a
soft-reset while it's asserted, which can be done with some devices with
an MMIO).

Anything below a PCIe -> PCI/PCI-X bridge also needs to be "grouped" due to
simple lack of proper filtering by the iommu (PCI-X in theory has RIDs
and forwards them up, but this isn't very reliable, for example it falls
over with split transactions).

Fortunately in PCIe land, we mostly have bridges above everything. The
problem somewhat remains with functions of a device, how can you be sure
that there isn't a way via some MMIO to create side effects on the other
functions of the device ? (For example by checkstopping the whole
thing). You can't really :-)

So it boils down to the "level" of safety/isolation you want to provide,
and I suppose to some extent it's a user decision but the user needs to
be informed to some extent. A hard problem :-)

> There's a difference between removing the device from the host and
> exposing the device to the guest.  If I have a NIC and HBA behind a
> bridge, it's perfectly reasonable that I might only assign the NIC to
> the guest, but as you describe, we then need to prevent the host, or any
> other guest from making use of the HBA.

Yes. However the other device is in "limbo" and it may not be clear to
the user why it can't be used anymore :-)

The question is more, the user needs to "know" (or libvirt does, or
somebody ... ) that in order to pass-through device A, it must also
"remove" device B from the host. How can you even provide a meaningful
error message to the user if all VFIO does is give you something like
-EBUSY ?

So the information about the grouping constraint must trickle down
somewhat.

Look at it from a GUI perspective for example. Imagine a front-end
showing you devices in your system and allowing you to "Drag & drop"
them to your guest. How do you represent that need for grouping ? First
how do you expose it from kernel/libvirt to the GUI tool and how do you
represent it to the user ?

By grouping the devices in logical groups which end up being the
"objects" you can drag around, at least you provide some amount of
clarity. Now if you follow that path down to how the GUI app, libvirt
and possibly qemu need to know / resolve the dependency, being given the
"groups" as the primary information of what can be used for pass-through
makes everything a lot simpler.
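
As a rough illustration of what having the groups as primary information buys a tool, here is a sketch with entirely made-up types. No such kernel or libvirt interface exists in this thread. Given the group membership, the tool can name the device that blocks a pass-through instead of just reporting a bare -EBUSY.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ownership states for a device. */
enum owner { OWNER_HOST, OWNER_UNBOUND, OWNER_GUEST };

/* Hypothetical view of a device as a tool might see it. */
struct dev_model {
	const char *name;
	int         group;  /* id of the partitionable group */
	enum owner  owner;
};

/*
 * Returns NULL if every device in 'group' is free of host drivers,
 * otherwise the name of the first device still used by the host.
 */
const char *group_blocker(const struct dev_model *devs, size_t n, int group)
{
	for (size_t i = 0; i < n; i++) {
		if (devs[i].group == group && devs[i].owner == OWNER_HOST)
			return devs[i].name;
	}
	return NULL;
}
```

A front-end could then tell the user "cannot pass through the NIC: the HBA in the same group is still in use by the host", which is the kind of message a plain error code cannot carry.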

> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> But IMHO, we need to preserve the granularity of exposing a device to a
> guest as a single device.  That might mean some devices are held hostage
> by an agent on the host.

Maybe but wouldn't that be even more confusing from a user perspective ?
And I think it makes it harder from an implementation of admin &
management tools perspective too.

> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> > 
> > - That does -not- mean that we cannot specify for each individual device
> > within such a group where we want to put it in qemu (what devfn etc...).
> > As long as there is a clear understanding that the "ownership" of the
> > device goes with the group, this is somewhat orthogonal to how they are
> > represented in qemu. (Not completely... if the iommu is exposed to the
> > guest, via paravirt for example, some of these constraints must be
> > exposed but I'll talk about that more later).
> 
> Or we can choose not to expose all of the devices in the group to the
> guest?

As I said, I don't mind if you don't, I'm just worried about the
consequences of that from a usability standpoint. Having advanced
command line options to fine-tune is fine. Being able to specify within a
"group" which devices to show and at what address is fine.

But I believe the basic entity to be manipulated from an interface
standpoint remains the group.

To get back to my GUI example, once you've D&D your group of devices
over, you can have the option to open that group and check/uncheck
individual devices & assign them addresses if you want. That doesn't
change the fact that practically speaking, the whole group is now owned
by the guest.

I will go further than that actually. If you look at how the isolation
HW works on POWER, the fact that I have the MMIO segmentation means that
I can simply give the entire group MMIO space to the guest. No problem
of small BARs, no need to slow-map them ... etc.. that's a pretty handy
feature don't you think ?

But that means that those other devices -will- be there, mapped along
with the one you care about. We may not expose it in config space but it
will be accessible. I suppose we can keep its IO/MEM decoding disabled.
But my point is that for all intents and purposes, it's actually owned by
the guest.

> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> 
> To be fair, libvirt's "magic foo" is built out of the necessity that
> nobody else is defining the rules.

Sure, which is why I propose that the kernel exposes the rules since
it's really the one right place to have that sort of HW constraint
knowledge, especially since it can be partially at least platform
specific.

 .../...

> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
> 
> I don't yet buy into passing groups to qemu since I don't buy into the
> idea of always exposing all of those devices to qemu.  Would it be
> sufficient to expose iommu nodes in sysfs that link to the devices
> behind them and describe properties and capabilities of the iommu
> itself?  More on this at the end.

Well, iommu aren't the only factor. I mentioned shared interrupts (and
my unwillingness to always trust DisINTx), there's also the MMIO
grouping I mentioned above (in which case it's an x86 -limitation- with
small BARs that I don't want to inherit, especially since it's based on
PAGE_SIZE and we commonly have 64K page size on POWER), etc...

So I'm not a big fan of making it entirely look like the iommu is the
primary factor, but we -can-, that would be workable. I still prefer
calling a cat a cat and exposing the grouping for what it is, as I think
I've explained already above, tho. 

 .../...

> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
> 
> This is a result of wanting to support *unmodified* x86 guests.  We
> don't have the luxury of having a predefined pvDMA spec that all x86
> OSes adhere to. 

No but you could emulate a HW iommu no ?

>  The 32bit problem is unfortunate, but the priority use
> case for assigning devices to guests is high performance I/O, which
> usually entails modern, 64bit hardware.  I'd like to see us get to the
> point of having emulated IOMMU hardware on x86, which could then be
> backed by VFIO, but for now guest pinning is the most practical and
> useful.

For your current case maybe. It's just not very future proof imho.
Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

 .../...

> > Also our next generation chipset may drop support for PIO completely.
> > 
> > On the other hand, because PIO is just a special range of MMIO for us,
> > we can do normal pass-through on it and don't need any of the emulation
> > done in qemu.
> 
> Maybe we can add mmap support to PIO regions on non-x86.

We have to yes. I haven't looked into it yet, it should be easy if VFIO
kernel side starts using the "proper" PCI mmap interfaces in kernel (the
same interfaces sysfs & proc use).

> >   * MMIO constraints
> > 
> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> > 
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforces
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hypercalls to configure things anyways.
> 
> With interrupt remapping, we can allow the guest access to the MSI-X
> table, but since that takes the host out of the loop, there's
> effectively no way for the guest to correctly program it directly by
> itself.

Right, I think what we need here is some kind of capabilities to
"disable" those "features" of qemu vfio.c that aren't needed on our
platform :-) Shouldn't be too hard. We need to make this runtime tho
since different machines can have different "capabilities".

> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
> 
> Sure, this could be some kind of capability flag, maybe even implicit in
> certain configurations.

Yup.

> > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > paravirt guests expect the BARs to have been already allocated for them
> > by the firmware and will pick up the addresses from the device-tree :-)
> > 
> > Today we use a "hack", putting all 0's in there and triggering the linux
> > code path to reassign unassigned resources (which will use BAR
> > emulation) but that's not what we are -supposed- to do. Not a big deal
> > and having the emulation there won't -hurt- us, it's just that we don't
> > really need any of it.
> > 
> > We have a small issue with ROMs. Our current KVM only works with huge
> > pages for guest memory but that is being fixed. So the way qemu maps the
> > ROM copy into the guest address space doesn't work. It might be handy
> > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > fallback. I'll look into it.
> 
> So that means ROMs don't work for you on emulated devices either?  The
> reason we read it once and map it into the guest is because Michael
> Tsirkin found a section in the PCI spec that indicates devices can share
> address decoders between BARs and ROM.

Yes, he is correct.

>   This means we can't just leave
> the enabled bit set in the ROM BAR, because it could actually disable an
> address decoder for a regular BAR.  We could slow-map the actual ROM,
> enabling it around each read, but shadowing it seemed far more
> efficient.

Right. We can slow map the ROM, or we can not care :-) At the end of the
day, what is the difference here between a "guest" under qemu and the
real thing bare metal on the machine ? IE. They have the same issue vs.
accessing the ROM. IE. I don't see why qemu should try to make it safe
to access it at any time while it isn't on a real machine. Since VFIO
resets the devices before putting them in guest space, they should be
accessible no ? (Might require a hard reset for some devices tho ... )

In any case, it's not a big deal and we can sort it out, I'm happy to
fallback to slow map to start with and eventually we will support small
pages mappings on POWER anyways, it's a temporary limitation.
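
The trade-off between the two ROM strategies discussed above (slow-mapping with the decoder enabled around each read, versus reading once and shadowing) can be sketched with a toy model. Everything here is hypothetical and for illustration only: RomBar, slow_map_read and shadow_rom are not VFIO or qemu names, and the model assumes that reads with the decoder disabled return all-ones.

```python
class RomBar:
    """Toy model of an expansion ROM behind an enable bit (hypothetical)."""

    def __init__(self, image: bytes):
        self._image = bytes(image)
        self.enabled = False
        self.enable_toggles = 0  # counts writes that flip the enable bit

    def set_enabled(self, on: bool) -> None:
        if on != self.enabled:
            self.enable_toggles += 1
        self.enabled = on

    def read(self, offset: int, length: int) -> bytes:
        # Assumption of this model: a disabled decoder reads as all-ones.
        if not self.enabled:
            return b"\xff" * length
        return self._image[offset:offset + length]


def slow_map_read(rom: RomBar, offset: int, length: int) -> bytes:
    """Slow map: enable the ROM only for the duration of one read."""
    rom.set_enabled(True)
    try:
        return rom.read(offset, length)
    finally:
        # Always disable again so a shared address decoder is released.
        rom.set_enabled(False)


def shadow_rom(rom: RomBar, size: int) -> bytes:
    """Shadowing: one enable/disable cycle, then serve reads from the copy."""
    return slow_map_read(rom, 0, size)


rom = RomBar(b"\x55\xaa" + bytes(14))
shadow = shadow_rom(rom, 16)             # two toggles in total
signature = slow_map_read(rom, 0, 2)     # two more toggles per access
```

With shadowing the enable bit is flipped twice for the lifetime of the guest, while the slow map pays two flips on every access; in both cases the decoder is left disabled between reads.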

> >   * EEH
> > 
> > This is the name of those fancy error handling & isolation features I
> > mentioned earlier. To some extent it's a superset of AER, but we don't
> > generally expose AER to guests (or even the host), it's swallowed by
> > firmware into something else that provides a superset (well mostly) of
> > the AER information, and allow us to do those additional things like
> > isolating/de-isolating, reset control etc...
> > 
> > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > huge deal, I mention it for completeness.
> 
> We expect to do AER via the VFIO netlink interface, which even though
> it's bashed below, would be quite extensible to supporting different
> kinds of errors.

As could platform specific ioctls :-)

> >    * Misc
> > 
> > There's lots of small bits and pieces... in no special order:
> > 
> >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > netlink and a bit of ioctl's ... it's not like there's something
> > fundamentally  better for netlink vs. ioctl... it really depends what
> > you are doing, and in this case I fail to see what netlink brings you
> > other than bloat and more stupid userspace library deps.
> 
> The netlink interface is primarily for host->guest signaling.  I've only
> implemented the remove command (since we're lacking a pcie-host in qemu
> to do AER), but it seems to work quite well.  If you have suggestions
> for how else we might do it, please let me know.  This seems to be the
> sort of thing netlink is supposed to be used for.

I don't understand what the advantage of netlink is compared to just
extending your existing VFIO ioctl interface, possibly using children
fd's as we do for example with spufs but it's not a huge deal. It's just
that netlink has its own gotchas and I don't like multi-headed
interfaces.

> >  - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > understand it wants to filter in some case (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
> 
> The use of PCI sysfs is actually one of my complaints about current
> device assignment.  To do assignment with an unprivileged guest we need
> to open the PCI sysfs config file for it, then change ownership on a
> handful of other PCI sysfs files, then there's this other pci-stub thing
> to maintain ownership, but the kvm ioctls don't actually require it and
> can grab onto any free device...  We are duplicating some of that in
> VFIO, but we also put the ownership of the device behind a single device
> file.  We do have the uiommu problem that we can't give an unprivileged
> user ownership of that, but your usage model may actually make that
> easier.  More below...
> 
> > One thing I thought about but you don't seem to like it ... was to use
> > the need to represent the partitionable entity as groups in sysfs that I
> > talked about earlier. Those could have per-device subdirs with the usual
> > config & resource files, same semantic as the ones in the real device,
> > but when accessed via the group they get filtering. It might or might not
> > be practical in the end, tbd, but it would allow apps using a slightly
> > modified libpci for example to exploit some of this.
> 
> I may be tainted by our disagreement that all the devices in a group
> need to be exposed to the guest and qemu could just take a pointer to a
> sysfs directory.  That seems very unlike qemu and pushes more of the
> policy into qemu, which seems like the wrong direction.

I don't see how it pushes "policy" into qemu.

The "policy" here is imposed by the HW setup and exposed by the
kernel :-) Giving qemu a group means qemu takes "ownership" of that bunch
of devices, so far I don't see what's policy about that. From there, it
would be "handy" for people to just stop there and just see all the
devices of the group show up in the guest, but by all means feel free to
suggest a command line interface that allows to more precisely specify
which of the devices in the group to pass through and at what address.

> >  - The qemu vfio code hooks directly into ioapic ... of course that
> > won't fly with anything !x86
> 
> I spent a lot of time looking for an architecture neutral solution here,
> but I don't think it exists.  Please prove me wrong.

No it doesn't I agree, that's why it should be some kind of notifier or
function pointer setup by the platform specific code.

>   The problem is
> that we have to disable INTx on an assigned device after it fires (VFIO
> does this automatically).  If we don't do this, a non-responsive or
> malicious guest could sit on the interrupt, causing it to fire
> repeatedly as a DoS on the host.  The only indication that we can rely
> on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> We can't just wait for device accesses because a) the device CSRs are
> (hopefully) direct mapped and we'd have to slow map them or attempt to
> do some kind of dirty logging to detect when they're accessed, and b) what
> constitutes an interrupt service is device specific.
> 
> That means we need to figure out how PCI interrupt 'A' (or B...)
> translates to a GSI (Global System Interrupt - ACPI definition, but
> hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> which will also see the APIC EOI.  And just to spice things up, the
> guest can change the PCI to GSI mappings via ACPI.  I think the set of
> callbacks I've added are generic (maybe I left ioapic in the name), but
> yes they do need to be implemented for other architectures.  Patches
> appreciated from those with knowledge of the systems and/or access to
> device specs.  This is the only reason that I make QEMU VFIO only build
> for x86.

Right, and we need to cook a similar sauce for POWER, it's an area that
has to be arch specific (and in fact specific to the specific HW machine
being emulated), so we just need to find out what's the cleanest way for
the platform to "register" the right callbacks here.

Not a big deal, I just felt like mentioning it :-)
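
As a rough illustration of the scheme discussed above (mask INTx when it fires, unmask only when the guest signals end-of-interrupt, with the platform supplying the hook), here is a minimal sketch. All names (IntxLine, EoiNotifier, the interrupt number 11) are made up for the example and do not correspond to the actual VFIO or qemu callbacks.

```python
from typing import Callable, Dict, List


class IntxLine:
    """One assigned device's INTx line (hypothetical model)."""

    def __init__(self, name: str):
        self.name = name
        self.masked = False
        self.delivered = 0

    def fire(self) -> bool:
        """Device asserts INTx; returns True if it was forwarded to the guest."""
        if self.masked:
            return False
        # Mask right away so an unresponsive or malicious guest cannot
        # leave the line asserted and flood the host with interrupts.
        self.masked = True
        self.delivered += 1
        return True

    def unmask(self) -> None:
        self.masked = False


class EoiNotifier:
    """Registry that platform specific code fills in.

    On x86 the key would be derived from the IOAPIC pin; another platform
    would use whatever number its own interrupt controller model provides.
    """

    def __init__(self) -> None:
        self._handlers: Dict[int, List[Callable[[], None]]] = {}

    def register(self, irq: int, callback: Callable[[], None]) -> None:
        self._handlers.setdefault(irq, []).append(callback)

    def eoi(self, irq: int) -> None:
        # Called by the emulated interrupt controller on guest EOI.
        for callback in self._handlers.get(irq, []):
            callback()


line = IntxLine("hostdev0")
notifier = EoiNotifier()
notifier.register(11, line.unmask)

first = line.fire()     # forwarded, line is now masked
second = line.fire()    # dropped while the guest has not sent EOI
notifier.eoi(11)        # guest EOI re-enables the line
third = line.fire()     # forwarded again
```

The device model only knows about the notifier, so the x86 IOAPIC code and a POWER interrupt controller model could both drive the same eoi() entry point.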

> >  - The various "objects" dealt with here, -especially- interrupts and
> > iommu, need a better in-kernel API so that fast in-kernel emulation can
> > take over from qemu based emulation. The way we need to do some of this
> > on POWER differs from x86. We can elaborate later, it's not necessarily
> > a killer either but essentially we'll take the bulk of interrupt
> > handling away from VFIO to the point where it won't see any of it at
> > all.
> 
> The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
> bypass QEMU.  This is exactly what VHOST does today and fairly trivial
> to enable for MSI once we get it merged.  INTx would require us to be
> able to define a level triggered irqfd in KVM and it's not yet clear if
> we care that much about INTx performance.

I care enough because our exit cost to qemu is much higher than x86, and
I can pretty easily emulate my PIC entirely in real mode (from within
the guest context) which is what I intend to do :-)

On the other hand, I have no reason to treat MSI or LSI differently, so
all I really need to do is get back to the underlying platform HW interrupt
number and I think I can do that. So as long as I have a hook to know
what's there and what has been enabled, these interrupts will simply
cease to be visible to either qemu or vfio.

Another reason why I don't like allowing shared interrupts in different
guests with DisINTx :-) Because that means that such interrupts would
have to go back all the way to qemu/vfio :-) But I can always have a
fallback there, it's really the problem of "trusting" DisINTx that
concerns me.

> We don't currently have a plan for accelerating IOMMU access since our
> current usage model doesn't need one.  We also need to consider MSI-X
> table acceleration for x86.  I hope we'll be able to use the new KVM
> ioctls for this.

Ok, we can give direct access to the MSI-X table to the guest on power
so that isn't an issue for us.

> Thanks for the write up, I think it will be good to let everyone digest
> it before we discuss this at KVM forum.

Agreed. As I think I may have mentioned already, I won't be able to make
it to the forum, but Paulus will and I'll be in a closeby timezone, so I
might be able to join a call if it's deemed useful.

> Rather than your "groups" idea, I've been mulling over whether we can
> just expose the dependencies, configuration, and capabilities in sysfs
> and build qemu commandlines to describe it.  For instance, if we simply
> start with creating iommu nodes in sysfs, we could create links under
> each iommu directory to the devices behind them.  Some kind of
> capability file could define properties like whether it's page table
> based or fixed iova window or the granularity of mapping the devices
> behind it.  Once we have that, we could probably make uiommu attach to
> each of those nodes.

Well, s/iommu/groups and you are pretty close to my original idea :-)

I don't mind that much what the details are, but I like the idea of not
having to construct a 3-pages command line every time I want to
pass-through a device, most "simple" usage scenarios don't care that
much.

> That means we know /dev/uiommu7 (random example) is our access to a
> specific iommu with a given set of devices behind it.

Linking those sysfs iommus or groups to a /dev/ entry is fine by me.

>   If that iommu is
> a PE (via those capability files), then a user space entity (trying hard
> not to call it libvirt) can unbind all those devices from the host,
> maybe bind the ones it wants to assign to a guest to vfio and bind the
> others to pci-stub for safe keeping.  If you trust a user with
> everything in a PE, bind all the devices to VFIO, chown all
> the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
>
> We might then come up with qemu command lines to describe interesting
> configurations, such as:
> 
> -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> -device pci-bus,...,iommu=iommu.0,id=pci.0 \
> -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> 
> The userspace entity would obviously need to put things in the same PE
> in the right place, but it doesn't seem to take a lot of sysfs info to
> get that right.
> 
> Today we do DMA mapping via the VFIO device because the capabilities of
> the IOMMU domains change depending on which devices are connected (for
> VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> DMA mappings through VFIO naturally forces the call order.  If we moved
> to something like above, we could switch the DMA mapping to the uiommu
> device, since the IOMMU would have fixed capabilities.

That makes sense.

> What gaps would something like this leave for your IOMMU granularity
> problems?  I'll need to think through how it works when we don't want to
> expose the iommu to the guest, maybe a model=none (default) that doesn't
> need to be connected to a pci bus and maps all guest memory.  Thanks,

Well, I would map those "iommus" to PEs, so what remains is the path to
put all the "other" bits and pieces such as inform qemu of the location
and size of the MMIO segment(s) (so we can map the whole thing and not
bother with individual BARs) etc... 

Cheers,
Ben.


* Re: kvm PCI assignment & VFIO ramblings
From: Benjamin Herrenschmidt @ 2011-07-30 23:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, David Gibson, aafabbri,
	iommu, Anthony Liguori, linuxppc-dev, benve
In-Reply-To: <1312050011.2265.185.camel@x201.home>

On Sat, 2011-07-30 at 12:20 -0600, Alex Williamson wrote:

> On x86, the USB controllers don't typically live behind a PCIe-to-PCI
> bridge, so don't suffer the source identifier problem, but they do often
> share an interrupt.  But even then, we can count on most modern devices
> supporting PCI2.3, and thus the DisINTx feature, which allows us to
> share interrupts.  In any case, yes, it's more rare but we need to know
> how to handle devices behind PCI bridges.  However I disagree that we
> need to assign all the devices behind such a bridge to the guest.

Well, ok so let's dig a bit more here :-) First, yes I agree they don't
all need to appear to the guest. My point is really that we must prevent
them to be "used" by somebody else, either host or another guest.

Now once you get there, I personally prefer having a clear "group"
ownership rather than having devices stay in some "limbo" under vfio
control but it's an implementation detail.

Regarding DisINTx, well, it's a bit like putting separate PCIe functions
into separate guests, it looks good ... but you are taking a chance.
Note that I do intend to do some of that for power ... well I think, I
haven't completely made up my mind.

pHyp has a stricter requirement, PEs essentially are everything
behind a bridge. If you have a slot, you have some kind of bridge above
this slot and everything on it will be a PE.

The problem I see is that with your filtering of config space, BAR
emulation, DisINTx etc... you essentially assume that you can reasonably
reliably isolate devices. But in practice, it's chancy. Some devices for
example have "backdoors" into their own config space via MMIO. If I have
such a device in a guest, I can completely override your DisINTx and
thus DOS your host or another guest with a shared interrupt. I can move
my MMIO around and DOS another function by overlapping the addresses.

You can really only protect yourself against a device if you have it
behind a bridge (in addition to having a filtering iommu), which limits
the MMIO span (and thus letting the guest whack the BARs randomly will
only allow that guest to shoot itself in the foot).
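
To make the bridge argument concrete: a bridge only forwards downstream the addresses that fall inside its memory window, so whatever the guest writes into a BAR, the device can only ever claim the part of that range that lies inside the window. The sketch below is illustrative only; reachable_span and the addresses are invented for the example.

```python
from typing import Optional, Tuple

Span = Tuple[int, int]  # (inclusive start, exclusive end)


def reachable_span(window: Span, bar_base: int, bar_size: int) -> Optional[Span]:
    """Part of a BAR that the upstream bridge will actually forward.

    Returns None when the BAR has been programmed entirely outside the
    bridge window, in which case the device never sees those accesses.
    """
    win_lo, win_hi = window
    lo = max(win_lo, bar_base)
    hi = min(win_hi, bar_base + bar_size)
    return (lo, hi) if lo < hi else None


# Hypothetical layout: the guest's device sits behind a bridge with this
# window; another function lives elsewhere at 0xE0200000.
GUEST_WINDOW = (0xE0000000, 0xE0100000)

inside = reachable_span(GUEST_WINDOW, 0xE0010000, 0x1000)
straddling = reachable_span(GUEST_WINDOW, 0xE00FF000, 0x2000)
hijack = reachable_span(GUEST_WINDOW, 0xE0200000, 0x1000)  # overlap attempt
```

Moving the BAR on top of the other function's addresses yields None: the guest has only made its own device unreachable, which is the "shoot itself in the foot" case.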

Some bridges also provide a way to block INTx below them which comes in
handy but it's bridge specific. Some devices can be coerced to send the
INTx "assert" message and never de-assert it (for example by doing a
soft-reset while it's asserted, which can be done with some devices with
an MMIO).

Anything below a PCIe -> PCI/PCI-X needs to also be "grouped" due to
simple lack of proper filtering by the iommu (PCI-X in theory has RIDs
and forwards them up, but this isn't very reliable, for example it falls
over with split transactions).

Fortunately in PCIe land, we mostly have bridges above everything. The
problem somewhat remains with functions of a device, how can you be sure
that there isn't a way via some MMIO to create side effects on the other
functions of the device ? (For example by checkstopping the whole
thing). You can't really :-)

So it boils down to the "level" of safety/isolation you want to provide,
and I suppose to some extent it's a user decision but the user needs to
be informed to some extent. A hard problem :-)

> There's a difference between removing the device from the host and
> exposing the device to the guest.  If I have a NIC and HBA behind a
> bridge, it's perfectly reasonable that I might only assign the NIC to
> the guest, but as you describe, we then need to prevent the host, or any
> other guest from making use of the HBA.

Yes. However the other device is in "limbo" and it may be not clear to
the user why it can't be used anymore :-)

The question is more, the user needs to "know" (or libvirt does, or
somebody ... ) that in order to pass-through device A, it must also
"remove" device B from the host. How can you even provide a meaningful
error message to the user if all VFIO does is give you something like
-EBUSY ?

So the information about the grouping constraint must trickle down
somewhat.

Look at it from a GUI perspective for example. Imagine a front-end
showing you devices in your system and allowing you to "Drag & drop"
them to your guest. How do you represent that need for grouping ? First
how do you expose it from kernel/libvirt to the GUI tool and how do you
represent it to the user ?

By grouping the devices in logical groups which end up being the
"objects" you can drag around, at least you provide some amount of
clarity. Now if you follow that path down to how the GUI app, libvirt
and possibly qemu need to know / resolve the dependency, being given the
"groups" as the primary information of what can be used for pass-through
makes everything a lot simpler.
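
One way to picture how such groups could be derived and then used to give the user a meaningful message instead of a bare -EBUSY is the sketch below. It assumes the kernel (or whoever knows the HW) can supply pairwise "must stay together" constraints; build_groups, co_owned and the device addresses are hypothetical and not an existing kernel or libvirt interface.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_groups(devices: Iterable[str],
                 constraints: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Merge devices into groups using a simple union-find.

    Each constraint (a, b) says that a and b cannot be isolated from each
    other, e.g. because they sit behind the same PCIe-to-PCI bridge.
    """
    devices = list(devices)
    parent: Dict[str, str] = {d: d for d in devices}

    def find(d: str) -> str:
        while parent[d] != d:
            parent[d] = parent[parent[d]]  # path halving
            d = parent[d]
        return d

    for a, b in constraints:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[str, Set[str]] = defaultdict(set)
    for d in devices:
        groups[find(d)].add(d)
    return list(groups.values())


def co_owned(device: str, groups: List[Set[str]]) -> List[str]:
    """Other devices that change owner along with `device`."""
    for group in groups:
        if device in group:
            return sorted(group - {device})
    raise KeyError(device)


# Made-up example: a NIC and an HBA behind the same legacy bridge, plus an
# unrelated device that can be isolated on its own.
DEVICES = ["0000:05:00.0", "0000:05:01.0", "0000:02:00.0"]
CONSTRAINTS = [("0000:05:00.0", "0000:05:01.0")]

groups = build_groups(DEVICES, CONSTRAINTS)
also_taken = co_owned("0000:05:00.0", groups)
message = ("passing through 0000:05:00.0 also removes %s from the host"
           % ", ".join(also_taken))
```

A front-end would then present each group as the object that can be dragged to a guest, and use co_owned() to explain why a device that was not explicitly selected is no longer available to the host.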

> > - The -minimum- granularity of pass-through is not always a single
> > device and not always under SW control
> 
> But IMHO, we need to preserve the granularity of exposing a device to a
> guest as a single device.  That might mean some devices are held hostage
> by an agent on the host.

Maybe but wouldn't that be even more confusing from a user perspective ?
And I think it makes it harder from an implementation of admin &
management tools perspective too.

> > - Having a magic heuristic in libvirt to figure out those constraints is
> > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> > knowledge of PCI resource management and getting it wrong in many many
> > cases, something that took years to fix essentially by ripping it all
> > out. This is kernel knowledge and thus we need the kernel to expose in a
> > way or another what those constraints are, what those "partitionable
> > groups" are.
> > 
> > - That does -not- mean that we cannot specify for each individual device
> > within such a group where we want to put it in qemu (what devfn etc...).
> > As long as there is a clear understanding that the "ownership" of the
> > device goes with the group, this is somewhat orthogonal to how they are
> > represented in qemu. (Not completely... if the iommu is exposed to the
> > guest, via paravirt for example, some of these constraints must be
> > exposed but I'll talk about that more later).
> 
> Or we can choose not to expose all of the devices in the group to the
> guest?

As I said, I don't mind if you don't, I'm just worried about the
consequences of that from a usability standpoint. Having advanced
command line option to fine tune is fine. Being able to specify within a
"group" which devices to show and at what address is fine.

But I believe the basic entity to be manipulated from an interface
standpoint remains the group.

To get back to my GUI example, once you've D&D your group of devices
over, you can have the option to open that group and check/uncheck
individual devices & assign them addresses if you want. That doesn't
change the fact that practically speaking, the whole group is now owned
by the guest.

I will go further than that actually. If you look at how the isolation
HW works on POWER, the fact that I have the MMIO segmentation means that
I can simply give the entire group MMIO space to the guest. No problem
of small BARs, no need to slow-map them ... etc.. that's a pretty handy
feature don't you think ?

But that means that those other devices -will- be there, mapped along
with the one you care about. We may not expose it in config space but it
will be accessible. I suppose we can keep its IO/MEM decoding disabled.
But my point is that for all intents and purposes, it's actually owned by
the guest.

> > The interface currently proposed for VFIO (and associated uiommu)
> > doesn't handle that problem at all. Instead, it is entirely centered
> > around a specific "feature" of the VTd iommu's for creating arbitrary
> > domains with arbitrary devices (tho those devices -do- have the same
> > constraints exposed above, don't try to put 2 legacy PCI devices behind
> > the same bridge into 2 different domains !), but the API totally ignores
> > the problem, leaves it to libvirt "magic foo" and focuses on something
> > that is both quite secondary in the grand scheme of things, and quite
> > x86 VTd specific in the implementation and API definition.
> 
> To be fair, libvirt's "magic foo" is built out of the necessity that
> nobody else is defining the rules.

Sure, which is why I propose that the kernel exposes the rules since
it's really the one right place to have that sort of HW constraint
knowledge, especially since it can be partially at least platform
specific.

 .../...

> > Maybe something like /sys/devgroups ? This probably warrants involving
> > more kernel people into the discussion.
> 
> I don't yet buy into passing groups to qemu since I don't buy into the
> idea of always exposing all of those devices to qemu.  Would it be
> sufficient to expose iommu nodes in sysfs that link to the devices
> behind them and describe properties and capabilities of the iommu
> itself?  More on this at the end.

Well, iommus aren't the only factor. I mentioned shared interrupts (and
my unwillingness to always trust DisINTx), there's also the MMIO
grouping I mentioned above (in which case it's an x86 -limitation- with
small BARs that I don't want to inherit, especially since it's based on
PAGE_SIZE and we commonly have 64K page size on POWER), etc...

So I'm not a big fan of making it entirely look like the iommu is the
primary factor, but we -can-, that would be workable. I still prefer
calling a cat a cat and exposing the grouping for what it is, as I think
I've explained already above, tho. 

 .../...

> > Now some of this can be fixed with tweaks, and we've started doing it
> > (we have a working pass-through using VFIO, forgot to mention that, it's
> > just that we don't like what we had to do to get there).
> 
> This is a result of wanting to support *unmodified* x86 guests.  We
> don't have the luxury of having a predefined pvDMA spec that all x86
> OSes adhere to. 

No but you could emulate a HW iommu no ?

>  The 32bit problem is unfortunate, but the priority use
> case for assigning devices to guests is high performance I/O, which
> usually entails modern, 64bit hardware.  I'd like to see us get to the
> point of having emulated IOMMU hardware on x86, which could then be
> backed by VFIO, but for now guest pinning is the most practical and
> useful.

For your current case maybe. It's just not very future proof imho.
Anyways, it's fixable, but the APIs as they are make it a bit clumsy.

 .../...

> > Also our next generation chipset may drop support for PIO completely.
> > 
> > On the other hand, because PIO is just a special range of MMIO for us,
> > we can do normal pass-through on it and don't need any of the emulation
> > done in qemu.
> 
> Maybe we can add mmap support to PIO regions on non-x86.

We have to yes. I haven't looked into it yet, it should be easy if VFIO
kernel side starts using the "proper" PCI mmap interfaces in kernel (the
same interfaces sysfs & proc use).

> >   * MMIO constraints
> > 
> > The QEMU side VFIO code hard wires various constraints that are entirely
> > based on various requirements you decided you have on x86 but don't
> > necessarily apply to us :-)
> > 
> > Due to our paravirt nature, we don't need to masquerade the MSI-X table
> > for example. At all. If the guest configures crap into it, too bad, it
> > can only shoot itself in the foot since the host bridge enforces
> > validation anyways as I explained earlier. Because it's all paravirt, we
> > don't need to "translate" the interrupt vectors & addresses, the guest
> > will call hypercalls to configure things anyways.
> 
> With interrupt remapping, we can allow the guest access to the MSI-X
> table, but since that takes the host out of the loop, there's
> effectively no way for the guest to correctly program it directly by
> itself.

Right, I think what we need here is some kind of capabilities to
"disable" those "features" of qemu vfio.c that aren't needed on our
platform :-) Shouldn't be too hard. We need to make this runtime tho
since different machines can have different "capabilities".

> > We don't need to prevent MMIO pass-through for small BARs at all. This
> > should be some kind of capability or flag passed by the arch. Our
> > segmentation of the MMIO domain means that we can give entire segments
> > to the guest and let it access anything in there (those segments are a
> > multiple of the page size always). Worst case it will access outside of
> > a device BAR within a segment and will cause the PE to go into error
> > state, shooting itself in the foot, there is no risk of side effect
> > outside of the guest boundaries.
> 
> Sure, this could be some kind of capability flag, maybe even implicit in
> certain configurations.

Yup.

> > In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> > paravirt guests expect the BARs to have been already allocated for them
> > by the firmware and will pick up the addresses from the device-tree :-)
> > 
> > Today we use a "hack", putting all 0's in there and triggering the linux
> > code path to reassign unassigned resources (which will use BAR
> > emulation) but that's not what we are -supposed- to do. Not a big deal
> > and having the emulation there won't -hurt- us, it's just that we don't
> > really need any of it.
> > 
> > We have a small issue with ROMs. Our current KVM only works with huge
> > pages for guest memory but that is being fixed. So the way qemu maps the
> > ROM copy into the guest address space doesn't work. It might be handy
> > anyways to have a way for qemu to use MMIO emulation for ROM access as a
> > fallback. I'll look into it.
> 
> So that means ROMs don't work for you on emulated devices either?  The
> reason we read it once and map it into the guest is because Michael
> Tsirkin found a section in the PCI spec that indicates devices can share
> address decoders between BARs and ROM.

Yes, he is correct.

>   This means we can't just leave
> the enabled bit set in the ROM BAR, because it could actually disable an
> address decoder for a regular BAR.  We could slow-map the actual ROM,
> enabling it around each read, but shadowing it seemed far more
> efficient.

Right. We can slow map the ROM, or we can not care :-) At the end of the
day, what is the difference here between a "guest" under qemu and the
real thing bare metal on the machine ? IE. They have the same issue vs.
accessing the ROM. IE. I don't see why qemu should try to make it safe
to access it at any time while it isn't on a real machine. Since VFIO
resets the devices before putting them in guest space, they should be
accessible no ? (Might require a hard reset for some devices tho ... )

In any case, it's not a big deal and we can sort it out, I'm happy to
fallback to slow map to start with and eventually we will support small
pages mappings on POWER anyways, it's a temporary limitation.
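
To make the two options concrete, here is a minimal userspace sketch of "slow map" versus "shadow". It is only a model: struct fake_dev and both helpers are invented for illustration and are not VFIO or qemu code. The point is how often the ROM enable bit has to be flipped in each case.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROM_SIZE 16

/* Hypothetical model of a device whose ROM may share a decoder with a BAR. */
struct fake_dev {
    int rom_enabled;          /* ROM BAR enable bit */
    unsigned enable_toggles;  /* how often the bit was switched on */
    uint8_t rom[ROM_SIZE];    /* contents behind the ROM BAR */
};

/* Slow map: enable the ROM around every single read. */
static uint8_t rom_read_slow(struct fake_dev *d, size_t off)
{
    uint8_t v;

    d->rom_enabled = 1;
    d->enable_toggles++;
    v = d->rom[off % ROM_SIZE];
    d->rom_enabled = 0;
    return v;
}

/* Shadow: enable once, copy everything, then leave the decoder alone. */
static void rom_shadow(struct fake_dev *d, uint8_t *copy)
{
    d->rom_enabled = 1;
    d->enable_toggles++;
    memcpy(copy, d->rom, ROM_SIZE);
    d->rom_enabled = 0;
}
```

With the shadow copy the enable bit is touched once per device; with the slow map it is touched once per guest read, which is the cost being weighed above.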

> >   * EEH
> > 
> > This is the name of those fancy error handling & isolation features I
> > mentioned earlier. To some extent it's a superset of AER, but we don't
> > generally expose AER to guests (or even the host), it's swallowed by
> > firmware into something else that provides a superset (well mostly) of
> > the AER information, and allow us to do those additional things like
> > isolating/de-isolating, reset control etc...
> > 
> > Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> > huge deal, I mention it for completeness.
> 
> We expect to do AER via the VFIO netlink interface, which even though
> it's bashed below, would be quite extensible to supporting different
> kinds of errors.

As could platform specific ioctls :-)

> >    * Misc
> > 
> > There's lots of small bits and pieces... in no special order:
> > 
> >  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> > netlink and a bit of ioctl's ... it's not like there's something
> > fundamentally  better for netlink vs. ioctl... it really depends what
> > you are doing, and in this case I fail to see what netlink brings you
> > other than bloat and more stupid userspace library deps.
> 
> The netlink interface is primarily for host->guest signaling.  I've only
> implemented the remove command (since we're lacking a pcie-host in qemu
> to do AER), but it seems to work quite well.  If you have suggestions
> for how else we might do it, please let me know.  This seems to be the
> sort of thing netlink is supposed to be used for.

I don't understand what the advantage of netlink is compared to just
extending your existing VFIO ioctl interface, possibly using children
fd's as we do for example with spufs but it's not a huge deal. It's just
that netlink has its own gotchas and I don't like multi-headed
interfaces.

> >  - I don't like too much the fact that VFIO provides yet another
> > different API to do what we already have at least 2 kernel APIs for, ie,
> > BAR mapping and config space access. At least it should be better at
> > using the backend infrastructure of the 2 others (sysfs & procfs). I
> > understand it wants to filter in some case (config space) and -maybe-
> > yet another API is the right way to go but allow me to have my doubts.
> 
> The use of PCI sysfs is actually one of my complaints about current
> device assignment.  To do assignment with an unprivileged guest we need
> to open the PCI sysfs config file for it, then change ownership on a
> handful of other PCI sysfs files, then there's this other pci-stub thing
> to maintain ownership, but the kvm ioctls don't actually require it and
> can grab onto any free device...  We are duplicating some of that in
> VFIO, but we also put the ownership of the device behind a single device
> file.  We do have the uiommu problem that we can't give an unprivileged
> user ownership of that, but your usage model may actually make that
> easier.  More below...
> 
> > One thing I thought about but you don't seem to like it ... was to use
> > the need to represent the partitionable entity as groups in sysfs that I
> > talked about earlier. Those could have per-device subdirs with the usual
> > config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. It might or might not
> > be practical in the end, tbd, but it would allow apps using a slightly
> > modified libpci for example to exploit some of this.
> 
> I may be tainted by our disagreement that all the devices in a group
> need to be exposed to the guest and qemu could just take a pointer to a
> sysfs directory.  That seems very unlike qemu and pushes more of the
> policy into qemu, which seems like the wrong direction.

I don't see how it pushes "policy" into qemu.

The "policy" here is imposed by the HW setup and exposed by the
kernel :-) Giving qemu a group means qemu takes "ownership" of that bunch
of devices, so far I don't see what's policy about that. From there, it
would be "handy" for people to just stop there and just see all the
devices of the group show up in the guest, but by all means feel free to
suggest a command line interface that allows to more precisely specify
which of the devices in the group to pass through and at what address.

> >  - The qemu vfio code hooks directly into ioapic ... of course that
> > won't fly with anything !x86
> 
> I spent a lot of time looking for an architecture neutral solution here,
> but I don't think it exists.  Please prove me wrong.

No it doesn't I agree, that's why it should be some kind of notifier or
function pointer setup by the platform specific code.

>   The problem is
> that we have to disable INTx on an assigned device after it fires (VFIO
> does this automatically).  If we don't do this, a non-responsive or
> malicious guest could sit on the interrupt, causing it to fire
> repeatedly as a DoS on the host.  The only indication that we can rely
> on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
> We can't just wait for device accesses because a) the device CSRs are
> (hopefully) direct mapped and we'd have to slow map them or attempt to
> do some kind of dirty logging to detect when they're accessed b) what
> constitutes an interrupt service is device specific.
> 
> That means we need to figure out how PCI interrupt 'A' (or B...)
> translates to a GSI (Global System Interrupt - ACPI definition, but
> hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
> which will also see the APIC EOI.  And just to spice things up, the
> guest can change the PCI to GSI mappings via ACPI.  I think the set of
> callbacks I've added are generic (maybe I left ioapic in the name), but
> yes they do need to be implemented for other architectures.  Patches
> appreciated from those with knowledge of the systems and/or access to
> device specs.  This is the only reason that I make QEMU VFIO only build
> for x86.

Right, and we need to cook a similar sauce for POWER, it's an area that
has to be arch specific (and in fact specific to the specific HW machine
being emulated), so we just need to find out what's the cleanest way for
the platform to "register" the right callbacks here.

Not a big deal, I just felt like mentioning it :-)
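
As a rough illustration of the mask-on-fire / unmask-on-EOI cycle with a platform-provided hook, here is a small userspace model. All names (intx_state, platform_irq_ops, example_ops) are made up for this sketch and do not correspond to the actual VFIO or qemu callbacks.

```c
/* Hypothetical per-device INTx state; not the actual VFIO structures. */
struct intx_state {
    int masked;          /* 1 while INTx is disabled at the device */
    unsigned delivered;  /* interrupts injected into the guest */
};

/* Hypothetical hook a platform would register to learn about guest EOIs. */
typedef void (*eoi_fn)(struct intx_state *s);

struct platform_irq_ops {
    eoi_fn eoi;          /* called when the guest signals end-of-interrupt */
};

/* Device asserts INTx: deliver once, then mask until the guest EOIs. */
static void intx_fire(struct intx_state *s)
{
    if (s->masked)
        return;          /* still unserviced, do not let it storm the host */
    s->masked = 1;
    s->delivered++;
}

/* Generic re-enable step, reached through the platform's EOI hook. */
static void intx_unmask_on_eoi(struct intx_state *s)
{
    s->masked = 0;
}

/* What an x86 (IOAPIC) or POWER machine model might each provide. */
static const struct platform_irq_ops example_ops = {
    .eoi = intx_unmask_on_eoi,
};
```

The generic code only ever calls ops->eoi; how the platform detects the guest's EOI (IOAPIC write on x86, something else on POWER) stays behind that pointer.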

> >  - The various "objects" dealt with here, -especially- interrupts and
> > iommu, need a better in-kernel API so that fast in-kernel emulation can
> > take over from qemu based emulation. The way we need to do some of this
> > on POWER differs from x86. We can elaborate later, it's not necessarily
> > a killer either but essentially we'll take the bulk of interrupt
> > handling away from VFIO to the point where it won't see any of it at
> > all.
> 
> The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
> bypass QEMU.  This is exactly what VHOST does today and fairly trivial
> to enable for MSI once we get it merged.  INTx would require us to be
> able to define a level triggered irqfd in KVM and it's not yet clear if
> we care that much about INTx performance.

I care enough because our exit cost to qemu is much higher than x86, and
I can pretty easily emulate my PIC entirely in real mode (from within
the guest context) which is what I intend to do :-)

On the other hand, I have no reason to treat MSI or LSI differently, so
all I really need to do is get back to the underlying platform HW interrupt
number and I think I can do that. So as long as I have a hook to know
what's there and what has been enabled, these interrupts will simply
cease to be visible to either qemu or vfio.

Another reason why I don't like allowing shared interrupts in different
guests with DisINTx :-) Because that means that such interrupts would
have to go back all the way to qemu/vfio :-) But I can always have a
fallback there, it's really the problem of "trusting" DisINTx that
concerns me.

> We don't currently have a plan for accelerating IOMMU access since our
> current usage model doesn't need one.  We also need to consider MSI-X
> table acceleration for x86.  I hope we'll be able to use the new KVM
> ioctls for this.

Ok, we can give direct access to the MSI-X table to the guest on power
so that isn't an issue for us.

> Thanks for the write up, I think it will be good to let everyone digest
> it before we discuss this at KVM forum.

Agreed. As I think I may have mentioned already, I won't be able to make
it to the forum, but Paulus will and I'll be in a closeby timezone, so I
might be able to join a call if it's deemed useful.

> Rather than your "groups" idea, I've been mulling over whether we can
> just expose the dependencies, configuration, and capabilities in sysfs
> and build qemu commandlines to describe it.  For instance, if we simply
> start with creating iommu nodes in sysfs, we could create links under
> each iommu directory to the devices behind them.  Some kind of
> capability file could define properties like whether it's page table
> based or fixed iova window or the granularity of mapping the devices
> behind it.  Once we have that, we could probably make uiommu attach to
> each of those nodes.

Well, s/iommu/groups and you are pretty close to my original idea :-)

I don't mind that much what the details are, but I like the idea of not
having to construct a 3-page command line every time I want to
pass-through a device, most "simple" usage scenarios don't care that
much.

> That means we know /dev/uiommu7 (random example) is our access to a
> specific iommu with a given set of devices behind it.

Linking those sysfs iommus or groups to a /dev/ entry is fine by me.

>   If that iommu is
> a PE (via those capability files), then a user space entity (trying hard
> not to call it libvirt) can unbind all those devices from the host,
> maybe bind the ones it wants to assign to a guest to vfio and bind the
> others to pci-stub for safe keeping.  If you trust a user with
> everything in a PE, bind all the devices to VFIO, chown all
> the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
>
> We might then come up with qemu command lines to describe interesting
> configurations, such as:
> 
> -device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
> -device pci-bus,...,iommu=iommu.0,id=pci.0 \
> -device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0
> 
> The userspace entity would obviously need to put things in the same PE
> in the right place, but it doesn't seem to take a lot of sysfs info to
> get that right.
> 
> Today we do DMA mapping via the VFIO device because the capabilities of
> the IOMMU domains change depending on which devices are connected (for
> VT-d, the least common denominator of the IOMMUs in play).  Forcing the
> DMA mappings through VFIO naturally forces the call order.  If we moved
> to something like above, we could switch the DMA mapping to the uiommu
> device, since the IOMMU would have fixed capabilities.

That makes sense.
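
For what it's worth, the "least common denominator" behaviour can be pictured as a bitwise AND over the capabilities of every IOMMU in the domain. The capability bits below are hypothetical placeholders, not the flags any real IOMMU driver uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits; real IOMMU drivers define their own. */
#define CAP_SNOOP      (1u << 0)
#define CAP_SUPERPAGE  (1u << 1)
#define CAP_48BIT      (1u << 2)

/* Capabilities of a domain spanning several IOMMUs: what all of them support. */
static uint32_t domain_caps(const uint32_t *iommu_caps, size_t n)
{
    uint32_t caps = ~0u;
    size_t i;

    for (i = 0; i < n; i++)
        caps &= iommu_caps[i];
    return n ? caps : 0;  /* an empty domain has no capabilities */
}
```

This is why the result changes as devices are attached today, and why a uiommu bound to one fixed iommu node would have a constant answer.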

> What gaps would something like this leave for your IOMMU granularity
> problems?  I'll need to think through how it works when we don't want to
> expose the iommu to the guest, maybe a model=none (default) that doesn't
> need to be connected to a pci bus and maps all guest memory.  Thanks,

Well, I would map those "iommus" to PEs, so what remains is the path to
put all the "other" bits and pieces such as inform qemu of the location
and size of the MMIO segment(s) (so we can map the whole thing and not
bother with individual BARs) etc... 

Cheers,
Ben.


* GPIO IRQ on P1022
From: Felix Radensky @ 2011-07-31 10:38 UTC (permalink / raw)
  To: linuxppc-dev@ozlabs.org, jacmet, Tabi Timur-B04825

Hi,

I'm running kernel 3.0 on a custom board based on Freescale P1022.
The interrupt line of on-board FPGA is connected to GPIO2_9. FPGA
IRQ is level, active low. The GPIOs are mapped like this:

GPIOs 160-191, /soc@ffe00000/gpio-controller@f200:

GPIOs 192-223, /soc@ffe00000/gpio-controller@f100:

GPIOs 224-255, /soc@ffe00000/gpio-controller@f000:

I've verified that pin muxing is done correctly, and the
FPGA IRQ line is indeed configured as GPIO.

I have the following code in my driver:

     #define FPGA_IRQ_GPIO 169

     err = gpio_request(FPGA_IRQ_GPIO, "FPGA IRQ");
     if (err) {
         printk(KERN_ERR "Failed to request FPGA IRQ GPIO, err=%d\n", err);
         goto out;
     }

     gpio_direction_input(FPGA_IRQ_GPIO);

     irq = gpio_to_irq(FPGA_IRQ_GPIO);
     if (irq < 0) {
         printk(KERN_ERR "Failed to map FPGA GPIO to IRQ\n");
         goto out;
     }

     err = request_irq(irq, gsat_interrupt,
               IRQF_TRIGGER_FALLING, DRVNAME, priv);

Interrupt handler reads FPGA interrupt status register to clear interrupt
and exits.

What happens when I load my driver is single execution of interrupt handler
followed by system freeze. Even if I call disable_irq() in interrupt handler
the system still freezes.

     I've added some prints to mpc8xxx_gpio.c driver, here's what I get:

     mpc8xxx_gpio_to_irq: offset 9
     mpc8xxx_gpio_irq_map: virq 31
     irq: irq 9 on host /soc@ffe00000/gpio-controller@f200 mapped to virtual irq 31
     mpc8xxx_irq_set_type: virq 9 flow_type 2
     mpc8xxx_irq_unmask: irq 9
     mpc8xxx_gpio_irq_cascade: irq 47
     mpc8xxx_irq_mask: irq 9
     mpc8xxx_irq_ack: irq 9


What am I doing wrong ?

Thanks a lot.

Felix.


* Re: GPIO IRQ on P1022
From: Tabi Timur-B04825 @ 2011-07-31 13:59 UTC (permalink / raw)
  To: Felix Radensky; +Cc: linuxppc-dev@ozlabs.org
In-Reply-To: <4E35309E.4000202@embedded-sol.com>

Felix Radensky wrote:
>
>      What happens when I load my driver is single execution of interrupt handler
>      followed by system freeze. Even if I call disable_irq() in interrupt handler the
>      system still freezes.

I don't know anything about the GPIO layer, but I think you're going to
need to debug this a little more.  Where exactly is the freeze?  Are you
sure the interrupt handler is being called only once?  Perhaps you're not
clearing the interrupt status and your handler is being called repeatedly?
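
To illustrate the "called repeatedly" case, here is a small userspace model of a level-triggered line: it stays asserted until the device status is cleared, so a handler that leaves the status set gets re-entered forever. This is only a sketch with invented names, and I can't say whether it is what actually happens on your board.

```c
/* Hypothetical model of a level-triggered, active-low interrupt source. */
struct level_irq {
    int status_pending;  /* device status bit; line is asserted while set */
    unsigned calls;      /* how many times the handler ran */
};

/* Handler that forgets to clear the device status. */
static void handler_no_clear(struct level_irq *irq)
{
    irq->calls++;
}

/* Handler that reads/clears the status, deasserting the line. */
static void handler_clear(struct level_irq *irq)
{
    irq->calls++;
    irq->status_pending = 0;
}

/* Dispatch while the line is asserted, giving up after 'limit' rounds. */
static unsigned dispatch(struct level_irq *irq,
                         void (*handler)(struct level_irq *),
                         unsigned limit)
{
    unsigned rounds = 0;

    while (irq->status_pending && rounds < limit) {
        handler(irq);
        rounds++;
    }
    return rounds;
}
```

A counter like calls, bumped in the real handler and printed from elsewhere, would be one cheap way to tell a single invocation from a storm.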

-- 
Timur Tabi
Linux kernel developer at Freescale


* Re: kvm PCI assignment & VFIO ramblings
From: Avi Kivity @ 2011-07-31 14:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras,
	linux-pci@vger.kernel.org, David Gibson, Alex Williamson,
	Anthony Liguori, linuxppc-dev
In-Reply-To: <1311983933.8793.42.camel@pasglop>

On 07/30/2011 02:58 AM, Benjamin Herrenschmidt wrote:
> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> knowledge of PCI resource management and getting it wrong in many many
> cases, something that took years to fix essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in a
> way or another what those constraints are, what those "partitionable
> groups" are.

How about a sysfs entry partition=<partition-id>? then libvirt knows not 
to assign devices from the same partition to different guests (and not 
to let the host play with them, either).
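
A management tool could then enforce the rule with a check along these lines. This is a sketch only: the partition=<partition-id> attribute is just the proposal above, and struct assignment is a hypothetical record of what the tool read from sysfs and who it plans to give each device to (the host can be modelled as one more owner id).

```c
#include <stddef.h>

/* Hypothetical assignment record: which partition a device sits in
 * (as a sysfs partition=<id> attribute might report) and its owner. */
struct assignment {
    int partition_id;
    int guest_id;
};

/* Return 1 if no partition is split across two different owners. */
static int assignments_valid(const struct assignment *a, size_t n)
{
    size_t i, j;

    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            if (a[i].partition_id == a[j].partition_id &&
                a[i].guest_id != a[j].guest_id)
                return 0;
    return 1;
}
```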

> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains !), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.
>
> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.

I have a feeling you'll be getting the same capabilities sooner or 
later, or you won't be able to make use of SR-IOV VFs.  While we should 
support the older hardware, the interfaces should be designed with the 
newer hardware in mind.

> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> the devices in there and expose them to the guest.

Such magic is nice for a developer playing with qemu but in general less 
useful for a managed system where the various cards need to be exposed 
to the user interface anyway.

> * IOMMU
>
> Now more on iommu. I've described I think in enough details how ours
> work, there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
>
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
>
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.

A single level iommu cannot be exposed to guests.  Well, it can be 
exposed as an iommu that does not provide per-device mapping.

A two level iommu can be emulated and exposed to the guest.  See 
http://support.amd.com/us/Processor_TechDocs/48882.pdf for an example.

> This means:
>
>    - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
>
>    - It requires the guest to be pinned. Pass-through -> no more swap

Newer iommus (and devices, unfortunately) (will) support I/O page faults 
and then the requirement can be removed.

>    - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices, thus a guest with more than a few G of RAM (I don't know the
> exact limit on x86, depends on your IO hole I suppose), and you end up
> back to swiotlb & bounce buffering.

Is this a problem in practice?

>    - It doesn't work for POWER server anyways because of our need to
> provide a paravirt iommu interface to the guest since that's how pHyp
> works today and how existing OSes expect to operate.

Then you need to provide that same interface, and implement it using the 
real iommu.

> - Performance sucks of course, the vfio map ioctl wasn't meant for that
> and has quite a bit of overhead. However we'll want to do the paravirt
> call directly in the kernel eventually ...

Does the guest iomap each request?  Why?

Emulating the iommu in the kernel is of course the way to go if that's 
the case, but won't performance still suck even then?

> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
>
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforces
> validation anyways as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses, the guest
> will call hypercalls to configure things anyways.

So, you have interrupt redirection?  That is, MSI-x table values encode 
the vcpu, not pcpu?

Alex, with interrupt redirection, we can skip this as well?  Perhaps 
only if the guest enables interrupt redirection?

If so, it's not arch specific, it's interrupt redirection specific.

> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are a
> multiple of the page size always). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot, there is no risk of side effect
> outside of the guest boundaries.

Does the BAR value contain the segment base address?  Or is that added 
later?


-- 
error compiling committee.c: too many arguments to function


* Re: [PATCH v2 2/4] powerpc, mpc52xx: add a4m072 board support
From: Grant Likely @ 2011-07-31  4:08 UTC (permalink / raw)
  To: Heiko Schocher; +Cc: devicetree-discuss, linuxppc-dev, Wolfgang Denk
In-Reply-To: <1308739150-31527-1-git-send-email-hs@denx.de>

On Wed, Jun 22, 2011 at 12:39:10PM +0200, Heiko Schocher wrote:
> Signed-off-by: Heiko Schocher <hs@denx.de>
> cc: Grant Likely <grant.likely@secretlab.ca>
> cc: devicetree-discuss@ozlabs.org
> cc: Wolfgang Denk <wd@denx.de>
> cc: Wolfram Sang <w.sang@pengutronix.de>
> ---
> For this patchseries following patch is needed:
> 
> http://patchwork.ozlabs.org/patch/91919/
> 
> Grant? Do you have some comments on that patch?
> 
> changes for v2:
>   add comment from Wolfram Sang:
>   use mpc5200.dtsi
> 
>  arch/powerpc/boot/dts/a4m072.dts             |  172 ++++++++++++++++++++++++++
>  arch/powerpc/platforms/52xx/mpc5200_simple.c |    1 +
>  2 files changed, 173 insertions(+), 0 deletions(-)
>  create mode 100644 arch/powerpc/boot/dts/a4m072.dts
> 
> diff --git a/arch/powerpc/boot/dts/a4m072.dts b/arch/powerpc/boot/dts/a4m072.dts
> new file mode 100644
> index 0000000..adb6746
> --- /dev/null
> +++ b/arch/powerpc/boot/dts/a4m072.dts
> @@ -0,0 +1,172 @@
> +/*
> + * a4m072 board Device Tree Source
> + *
> + * Copyright (C) 2011 DENX Software Engineering GmbH
> + * Heiko Schocher <hs@denx.de>
> + *
> + * Copyright (C) 2007 Semihalf
> + * Marian Balakowicz <m8@semihalf.com>
> + *
> + * This program is free software; you can redistribute  it and/or modify it
> + * under  the terms of  the GNU General  Public License as published by the
> + * Free Software Foundation;  either version 2 of the  License, or (at your
> + * option) any later version.
> + */
> +
> +/include/ "mpc5200b.dtsi"

Ah, I missed this follow up patch.  Yes, this is better.

> +
> +/ {
> +	model = "anonymous,a4m072";
> +	compatible = "anonymous,a4m072";
> +
> +	soc5200@f0000000 {
> +		#address-cells = <1>;
> +		#size-cells = <1>;
> +		compatible = "fsl,mpc5200b-immr";
> +		ranges = <0 0xf0000000 0x0000c000>;
> +		reg = <0xf0000000 0x00000100>;
> +		bus-frequency = <0>; /* From boot loader */
> +		system-frequency = <0>; /* From boot loader */
> +
> +		cdm@200 {
> +			fsl,ext_48mhz_en = <0x0>;
> +			fsl,fd_enable = <0x01>;
> +			fsl,fd_counters = <0xbbbb>;

Are these new properties documented?  They need to be.  Also,
convention is to use '-' instead of '_' in property names.

> +		};
> +
> +		timer@600 {
> +			compatible = "fsl,mpc5200b-gpt","fsl,mpc5200-gpt";
> +			reg = <0x600 0x80>;
> +			interrupts = <1 9 0>;
> +			fsl,has-wdt;
> +		};

Isn't this node already in the mpc5200b.dtsi file?

Otherwise, this patch looks pretty good.

g.


* Re: [PATCH 2/4] powerpc, mpc52xx: add a4m072 board support
From: Grant Likely @ 2011-07-31  4:05 UTC (permalink / raw)
  To: Heiko Schocher; +Cc: devicetree-discuss, linuxppc-dev, Wolfgang Denk
In-Reply-To: <1308729311-15375-3-git-send-email-hs@denx.de>

On Wed, Jun 22, 2011 at 09:55:09AM +0200, Heiko Schocher wrote:
> Signed-off-by: Heiko Schocher <hs@denx.de>
> cc: Grant Likely <grant.likely@secretlab.ca>
> cc: devicetree-discuss@ozlabs.org
> cc: Wolfgang Denk <wd@denx.de>
> ---
> For this patchseries following patch is needed:
> 
> http://patchwork.ozlabs.org/patch/91919/
> 
> Grant? Do you have some comments on that patch?
> 
>  arch/powerpc/boot/dts/a4m072.dts             |  273 ++++++++++++++++++++++++++
>  arch/powerpc/platforms/52xx/mpc5200_simple.c |    1 +
>  2 files changed, 274 insertions(+), 0 deletions(-)
>  create mode 100644 arch/powerpc/boot/dts/a4m072.dts
> 
> diff --git a/arch/powerpc/boot/dts/a4m072.dts b/arch/powerpc/boot/dts/a4m072.dts
> new file mode 100644
> index 0000000..cea1c6f
> --- /dev/null
> +++ b/arch/powerpc/boot/dts/a4m072.dts
> @@ -0,0 +1,273 @@
> +/*
> + * a4m072 board Device Tree Source
> + *
> + * Copyright (C) 2011 DENX Software Engineering GmbH
> + * Heiko Schocher <hs@denx.de>
> + *
> + * Copyright (C) 2007 Semihalf
> + * Marian Balakowicz <m8@semihalf.com>
> + *
> + * This program is free software; you can redistribute  it and/or modify it
> + * under  the terms of  the GNU General  Public License as published by the
> + * Free Software Foundation;  either version 2 of the  License, or (at your
> + * option) any later version.
> + */
> +
> +/dts-v1/;
> +
> +/ {
> +	model = "anonymous,a4m072";
> +	compatible = "anonymous,a4m072";

anonymous?  This bears some description.

Also, 5200b boards can use the mpc5200b.dtsi include file.  This one
should too.

> +	#address-cells = <1>;
> +	#size-cells = <1>;
> +	interrupt-parent = <&mpc5200_pic>;
> +
> +	cpus {
> +		#address-cells = <1>;
> +		#size-cells = <0>;
> +
> +		PowerPC,5200@0 {
> +			device_type = "cpu";
> +			reg = <0>;
> +			d-cache-line-size = <32>;
> +			i-cache-line-size = <32>;
> +			d-cache-size = <0x4000>;	// L1, 16K
> +			i-cache-size = <0x4000>;	// L1, 16K
> +			timebase-frequency = <0>; /* From boot loader */
> +			bus-frequency = <0>; /* From boot loader */
> +			clock-frequency = <0>; /* From boot loader */
> +		};
> +	};
> +
> +	memory {
> +		device_type = "memory";
> +		reg = <0x00000000 0x04000000>;
> +	};
> +
> +	soc5200@f0000000 {
> +		#address-cells = <1>;
> +		#size-cells = <1>;
> +		compatible = "fsl,mpc5200b-immr";
> +		ranges = <0 0xf0000000 0x0000c000>;
> +		reg = <0xf0000000 0x00000100>;
> +		bus-frequency = <0>; /* From boot loader */
> +		system-frequency = <0>; /* From boot loader */
> +
> +		cdm@200 {
> +			compatible = "fsl,mpc5200b-cdm","fsl,mpc5200-cdm";
> +			reg = <0x200 0x38>;
> +			fsl,ext_48mhz_en = <0x0>;
> +			fsl,fd_enable = <0x01>;
> +			fsl,fd_counters = <0xbbbb>;
> +		};
> +
> +		mpc5200_pic: interrupt-controller@500 {
> +			// 5200 interrupts are encoded into two levels;
> +			interrupt-controller;
> +			#interrupt-cells = <3>;
> +			compatible = "fsl,mpc5200b-pic","fsl,mpc5200-pic";
> +			reg = <0x500 0x80>;
> +		};
> +
> +		timer@600 {
> +			compatible = "fsl,mpc5200b-gpt","fsl,mpc5200-gpt";
> +			reg = <0x600 0x80>;
> +			interrupts = <1 9 0>;
> +			fsl,has-wdt;
> +		};
> +
> +		gpt3: timer@630 { /* General Purpose Timer in GPIO mode */
> +			compatible = "fsl,mpc5200b-gpt","fsl,mpc5200-gpt";
> +			reg = <0x630 0x10>;
> +			interrupts = <1 12 0>;
> +			gpio-controller;
> +			#gpio-cells = <2>;
> +		};
> +
> +		gpt4: timer@640 { /* General Purpose Timer in GPIO mode */
> +			compatible = "fsl,mpc5200b-gpt","fsl,mpc5200-gpt";
> +			reg = <0x640 0x10>;
> +			interrupts = <1 13 0>;
> +			gpio-controller;
> +			#gpio-cells = <2>;
> +		};
> +
> +		gpt5: timer@650 { /* General Purpose Timer in GPIO mode */
> +			compatible = "fsl,mpc5200b-gpt","fsl,mpc5200-gpt";
> +			reg = <0x650 0x10>;
> +			interrupts = <1 14 0>;
> +			gpio-controller;
> +			#gpio-cells = <2>;
> +		};
> +
> +		can@900 {
> +			compatible = "fsl,mpc5200b-mscan","fsl,mpc5200-mscan";
> +			interrupts = <2 17 0>;
> +			reg = <0x900 0x80>;
> +		};
> +
> +		can@980 {
> +			compatible = "fsl,mpc5200b-mscan","fsl,mpc5200-mscan";
> +			interrupts = <2 18 0>;
> +			reg = <0x980 0x80>;
> +		};
> +
> +		gpio_simple: gpio@b00 {
> +			compatible = "fsl,mpc5200b-gpio","fsl,mpc5200-gpio";
> +			reg = <0xb00 0x40>;
> +			interrupts = <1 7 0>;
> +			gpio-controller;
> +			#gpio-cells = <2>;
> +			fsl,port_config = <0x19051444>;
> +		};
> +
> +		gpio_wkup: gpio@c00 {
> +			compatible = "fsl,mpc5200b-gpio-wkup","fsl,mpc5200-gpio-wkup";
> +			reg = <0xc00 0x40>;
> +			interrupts = <1 8 0 0 3 0>;
> +			gpio-controller;
> +			#gpio-cells = <2>;
> +		};
> +
> +		usb@1000 {
> +			compatible = "fsl,mpc5200b-ohci","fsl,mpc5200-ohci","ohci-be";
> +			reg = <0x1000 0xff>;
> +			interrupts = <2 6 0>;
> +		};
> +
> +		dma-controller@1200 {
> +			compatible = "fsl,mpc5200b-bestcomm","fsl,mpc5200-bestcomm";
> +			reg = <0x1200 0x80>;
> +			interrupts = <3 0 0  3 1 0  3 2 0  3 3 0
> +			              3 4 0  3 5 0  3 6 0  3 7 0
> +			              3 8 0  3 9 0  3 10 0  3 11 0
> +			              3 12 0  3 13 0  3 14 0  3 15 0>;
> +		};
> +
> +		xlb@1f00 {
> +			compatible = "fsl,mpc5200b-xlb","fsl,mpc5200-xlb";
> +			reg = <0x1f00 0x100>;
> +		};
> +
> +		psc@2000 {
> +			compatible = "fsl,mpc5200b-psc-uart","fsl,mpc5200-psc-uart";
> +			reg = <0x2000 0x100>;
> +			interrupts = <2 1 0>;
> +		};
> +
> +		psc@2200 {
> +			compatible = "fsl,mpc5200b-psc-uart","fsl,mpc5200-psc-uart";
> +			reg = <0x2200 0x100>;
> +			interrupts = <2 2 0>;
> +		};
> +
> +		psc@2400 {
> +			compatible = "fsl,mpc5200b-psc-uart","fsl,mpc5200-psc-uart";
> +			reg = <0x2400 0x100>;
> +			interrupts = <2 3 0>;
> +		};
> +
> +		psc@2c00 {
> +			compatible = "fsl,mpc5200b-psc-uart","fsl,mpc5200-psc-uart";
> +			reg = <0x2c00 0x100>;
> +			interrupts = <2 4 0>;
> +		};
> +
> +		ethernet@3000 {
> +			compatible = "fsl,mpc5200b-fec","fsl,mpc5200-fec";
> +			reg = <0x3000 0x400>;
> +			local-mac-address = [ 00 00 00 00 00 00 ];
> +			interrupts = <2 5 0>;
> +			phy-handle = <&phy0>;
> +		};
> +
> +		mdio@3000 {
> +			#address-cells = <1>;
> +			#size-cells = <0>;
> +			compatible = "fsl,mpc5200b-mdio","fsl,mpc5200-mdio";
> +			reg = <0x3000 0x400>;
> +			interrupts = <2 5 0>;
> +
> +			phy0: ethernet-phy@1f {
> +				reg = <0x1f>;
> +				interrupts = <1 2 0>; /* IRQ 2 active low */
> +			};
> +		};
> +
> +		ata@3a00 {
> +			compatible = "fsl,mpc5200b-ata","fsl,mpc5200-ata";
> +			reg = <0x3a00 0x100>;
> +			interrupts = <2 7 0>;
> +		};
> +
> +		i2c@3d40 {
> +			#address-cells = <1>;
> +			#size-cells = <0>;
> +			compatible = "fsl,mpc5200b-i2c","fsl,mpc5200-i2c","fsl-i2c";
> +			reg = <0x3d40 0x40>;
> +			interrupts = <2 16 0>;
> +
> +			 hwmon@2e {
> +				compatible = "nsc,lm87";
> +				reg = <0x2e>;
> +			};
> +			 rtc@51 {
> +				compatible = "nxp,rtc8564";
> +				reg = <0x51>;
> +			};
> +		};
> +
> +		sram@8000 {
> +			compatible = "fsl,mpc5200b-sram","fsl,mpc5200-sram";
> +			reg = <0x8000 0x4000>;
> +		};
> +	};
> +
> +	localbus {
> +		compatible = "fsl,mpc5200b-lpb","simple-bus";
> +		#address-cells = <2>;
> +		#size-cells = <1>;
> +		ranges = <0 0 0xfe000000 0x02000000
> +			  1 0 0x62000000 0x00400000
> +			  2 0 0x64000000 0x00200000
> +			  3 0 0x66000000 0x01000000
> +			  6 0 0x68000000 0x01000000
> +			  7 0 0x6a000000 0x00000004
> +			 >;
> +
> +		flash@0,0 {
> +			compatible = "cfi-flash";
> +			reg = <0 0 0x02000000>;
> +			bank-width = <2>;
> +			#size-cells = <1>;
> +			#address-cells = <1>;
> +		};
> +		sram0@1,0 {
> +			compatible = "mtd-ram";
> +			reg = <1 0x00000 0x00400000>;
> +			bank-width = <2>;
> +		};
> +	};
> +
> +	pci@f0000d00 {
> +		#interrupt-cells = <1>;
> +		#size-cells = <2>;
> +		#address-cells = <3>;
> +		device_type = "pci";
> +		compatible = "fsl,mpc5200-pci";
> +		reg = <0xf0000d00 0x100>;
> +		interrupt-map-mask = <0xf800 0 0 7>;
> +		interrupt-map = <
> +				 /* IDSEL 0x16 */
> +				 0xc000 0 0 1 &mpc5200_pic 1 3 3
> +				 0xc000 0 0 2 &mpc5200_pic 1 3 3
> +				 0xc000 0 0 3 &mpc5200_pic 1 3 3
> +				 0xc000 0 0 4 &mpc5200_pic 1 3 3>;
> +		clock-frequency = <0>; /* From boot loader */
> +		interrupts = <2 8 0 2 9 0 2 10 0>;
> +		bus-range = <0 0>;
> +		ranges = <0x42000000 0 0x80000000 0x80000000 0 0x10000000
> +			  0x02000000 0 0x90000000 0x90000000 0 0x10000000
> +			  0x01000000 0 0x00000000 0xa0000000 0 0x01000000>;
> +	};
> +};
> diff --git a/arch/powerpc/platforms/52xx/mpc5200_simple.c b/arch/powerpc/platforms/52xx/mpc5200_simple.c
> index e36d6e2..192b4ff 100644
> --- a/arch/powerpc/platforms/52xx/mpc5200_simple.c
> +++ b/arch/powerpc/platforms/52xx/mpc5200_simple.c
> @@ -50,6 +50,7 @@ static void __init mpc5200_simple_setup_arch(void)
>  
>  /* list of the supported boards */
>  static const char *board[] __initdata = {
> +	"anonymous,a4m072",
>  	"intercontrol,digsy-mtc",
>  	"manroland,mucmc52",
>  	"manroland,uc101",
> -- 
> 1.7.5.4
> 

^ permalink raw reply

* Re: [PATCH] powerpc/85xx: fix memory controller compatible for edac
From: Grant Likely @ 2011-07-31  4:03 UTC (permalink / raw)
  To: Shaohui Xie; +Cc: mm-commits, kumar.gala, avorontsov, akpm, linuxppc-dev, davem
In-Reply-To: <1311659193-694-1-git-send-email-Shaohui.Xie@freescale.com>

On Tue, Jul 26, 2011 at 01:46:33PM +0800, Shaohui Xie wrote:
> compatible in dts has been changed, so driver need to update accordingly.
> 
> Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
> ---
> apply for http://git.kernel.org/pub/scm/linux/kernel/git/galak/powerpc.git
> 'next' branch.
> 
>  drivers/edac/mpc85xx_edac.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/edac/mpc85xx_edac.c b/drivers/edac/mpc85xx_edac.c
> index 13f6cc5..94c064a 100644
> --- a/drivers/edac/mpc85xx_edac.c
> +++ b/drivers/edac/mpc85xx_edac.c
> @@ -1253,7 +1253,7 @@ static struct of_device_id mpc85xx_mc_err_of_match[] = {
>  	{ .compatible = "fsl,p1020-memory-controller", },
>  	{ .compatible = "fsl,p1021-memory-controller", },
>  	{ .compatible = "fsl,p2020-memory-controller", },
> -	{ .compatible = "fsl,p4080-memory-controller", },
> +	{ .compatible = "fsl,qoriq-memory-controller", },

Are there any implementations in the field that depend on the p4080 value?

g.

^ permalink raw reply

* Re: GPIO IRQ on P1022
From: Felix Radensky @ 2011-07-31 14:56 UTC (permalink / raw)
  To: Tabi Timur-B04825; +Cc: linuxppc-dev@ozlabs.org
In-Reply-To: <4E355FB7.3030904@freescale.com>

Hi Timur,

On 07/31/2011 04:59 PM, Tabi Timur-B04825 wrote:
> Felix Radensky wrote:
>>       What happens when I load my driver is single execution of interrupt
>> handler
>>       followed by system freeze. Even if I call disable_irq() in interrupt
>> handler the
>>       system still freezes.
> I don't know anything about the GPIO layer, but I think you're going to
> need to debug this a little more.  Where exactly is the freeze?  Are you
> sure the interrupt handler is being called only once?  Perhaps you're not
> clearing the interrupt status and your handler is being called repeatedly?
>

It was verified with an oscilloscope that the interrupt handler clears the
interrupt. The interrupt line goes from low to high and stays there.
I have prints in the interrupt handler, and they appear only once.

It's difficult to say where it freezes. I've tried magic sysrq on serial
console, but got nothing.


Felix.

^ permalink raw reply

* Re: GPIO IRQ on P1022
From: Wolfgang Grandegger @ 2011-07-31 15:19 UTC (permalink / raw)
  To: Felix Radensky; +Cc: linuxppc-dev@ozlabs.org, Tabi Timur-B04825
In-Reply-To: <4E35309E.4000202@embedded-sol.com>

On 07/31/2011 12:38 PM, Felix Radensky wrote:
> Hi,
> 
> I'm running kernel 3.0 on a custom board based on Freescale P1022.
> The interrupt line of on-board FPGA is connected to GPIO2_9. FPGA
> IRQ is level, active low. The GPIOs are mapped like this:
> 
> GPIOs 160-191, /soc@ffe00000/gpio-controller@f200:
> 
> GPIOs 192-223, /soc@ffe00000/gpio-controller@f100:
> 
> GPIOs 224-255, /soc@ffe00000/gpio-controller@f000:
> 
> I've verified that pin mixing is done correctly, and the
> FPGA IRQ line is indeed configured as GPIO.
> 
> I have the following code in my driver:
> 
>     #define FPGA_IRQ_GPIO 169
> 
>     err = gpio_request(FPGA_IRQ_GPIO, "FPGA IRQ");
>     if (err) {
>         printk(KERN_ERR "Failed to request FPGA IRQ GPIO, err=%d\n", err);
>         goto out;
>     }
> 
>     gpio_direction_input(FPGA_IRQ_GPIO);
> 
>     irq = gpio_to_irq(FPGA_IRQ_GPIO);
>     if (irq < 0) {
>         printk(KERN_ERR "Failed to map FPGA GPIO to IRQ\n");
>         goto out;
>     }
> 
>     err = request_irq(irq, gsat_interrupt,
>               IRQF_TRIGGER_FALLING, DRVNAME, priv);
> 
>     Interrupt handler reads FPGA interrupt status register to clear
> interrupt
>     and exits.
> 
>     What happens when I load my driver is single execution of interrupt
> handler
>     followed by system freeze. Even if I call disable_irq() in interrupt
> handler the
>     system still freezes.

Try disable_irq_nosync() instead.

Wolfgang.

^ permalink raw reply

* Re: GPIO IRQ on P1022
From: Felix Radensky @ 2011-07-31 15:51 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: linuxppc-dev@ozlabs.org, Tabi Timur-B04825
In-Reply-To: <4E357285.8080708@grandegger.com>

Hi Wolfgang,

On 07/31/2011 06:19 PM, Wolfgang Grandegger wrote:
> On 07/31/2011 12:38 PM, Felix Radensky wrote:
>> Hi,
>>
>> I'm running kernel 3.0 on a custom board based on Freescale P1022.
>> The interrupt line of on-board FPGA is connected to GPIO2_9. FPGA
>> IRQ is level, active low. The GPIOs are mapped like this:
>>
>> GPIOs 160-191, /soc@ffe00000/gpio-controller@f200:
>>
>> GPIOs 192-223, /soc@ffe00000/gpio-controller@f100:
>>
>> GPIOs 224-255, /soc@ffe00000/gpio-controller@f000:
>>
>> I've verified that pin mixing is done correctly, and the
>> FPGA IRQ line is indeed configured as GPIO.
>>
>> I have the following code in my driver:
>>
>>      #define FPGA_IRQ_GPIO 169
>>
>>      err = gpio_request(FPGA_IRQ_GPIO, "FPGA IRQ");
>>      if (err) {
>>          printk(KERN_ERR "Failed to request FPGA IRQ GPIO, err=%d\n", err);
>>          goto out;
>>      }
>>
>>      gpio_direction_input(FPGA_IRQ_GPIO);
>>
>>      irq = gpio_to_irq(FPGA_IRQ_GPIO);
>>      if (irq<  0) {
>>          printk(KERN_ERR "Failed to map FPGA GPIO to IRQ\n");
>>          goto out;
>>      }
>>
>>      err = request_irq(irq, gsat_interrupt,
>>                IRQF_TRIGGER_FALLING, DRVNAME, priv);
>>
>>      Interrupt handler reads FPGA interrupt status register to clear
>> interrupt
>>      and exits.
>>
>>      What happens when I load my driver is single execution of interrupt
>> handler
>>      followed by system freeze. Even if I call disable_irq() in interrupt
>> handler the
>>      system still freezes.
> Try disable_irq_nosync() instead.
>
>

Thanks.  However this doesn't help either.

Felix.

^ permalink raw reply

* Re: GPIO IRQ on P1022
From: Wolfgang Grandegger @ 2011-07-31 17:49 UTC (permalink / raw)
  To: Felix Radensky; +Cc: linuxppc-dev@ozlabs.org, Tabi Timur-B04825
In-Reply-To: <4E357A1A.1080606@embedded-sol.com>

Hi Felix,

On 07/31/2011 05:51 PM, Felix Radensky wrote:
> Hi Wolfgang,
> 
> On 07/31/2011 06:19 PM, Wolfgang Grandegger wrote:
>> On 07/31/2011 12:38 PM, Felix Radensky wrote:
>>> Hi,
>>>
>>> I'm running kernel 3.0 on a custom board based on Freescale P1022.
>>> The interrupt line of on-board FPGA is connected to GPIO2_9. FPGA
>>> IRQ is level, active low. The GPIOs are mapped like this:

Here you say that it's a level sensitive interrupt but ...

>>> GPIOs 160-191, /soc@ffe00000/gpio-controller@f200:
>>>
>>> GPIOs 192-223, /soc@ffe00000/gpio-controller@f100:
>>>
>>> GPIOs 224-255, /soc@ffe00000/gpio-controller@f000:
>>>
>>> I've verified that pin mixing is done correctly, and the
>>> FPGA IRQ line is indeed configured as GPIO.
>>>
>>> I have the following code in my driver:
>>>
>>>      #define FPGA_IRQ_GPIO 169
>>>
>>>      err = gpio_request(FPGA_IRQ_GPIO, "FPGA IRQ");
>>>      if (err) {
>>>          printk(KERN_ERR "Failed to request FPGA IRQ GPIO, err=%d\n",
>>> err);
>>>          goto out;
>>>      }
>>>
>>>      gpio_direction_input(FPGA_IRQ_GPIO);
>>>
>>>      irq = gpio_to_irq(FPGA_IRQ_GPIO);
>>>      if (irq<  0) {
>>>          printk(KERN_ERR "Failed to map FPGA GPIO to IRQ\n");
>>>          goto out;
>>>      }
>>>
>>>      err = request_irq(irq, gsat_interrupt,
>>>                IRQF_TRIGGER_FALLING, DRVNAME, priv);

... you request an edge-triggered interrupt here.

Wolfgang.

^ permalink raw reply

* Re: GPIO IRQ on P1022
From: Felix Radensky @ 2011-07-31 19:28 UTC (permalink / raw)
  To: Wolfgang Grandegger; +Cc: linuxppc-dev@ozlabs.org, Tabi Timur-B04825
In-Reply-To: <4E3595B6.4010406@grandegger.com>

Hi Wolfgang,

On 07/31/2011 08:49 PM, Wolfgang Grandegger wrote:
> Hi Felix,
>
> On 07/31/2011 05:51 PM, Felix Radensky wrote:
>> Hi Wolfgang,
>>
>> On 07/31/2011 06:19 PM, Wolfgang Grandegger wrote:
>>> On 07/31/2011 12:38 PM, Felix Radensky wrote:
>>>> Hi,
>>>>
>>>> I'm running kernel 3.0 on a custom board based on Freescale P1022.
>>>> The interrupt line of on-board FPGA is connected to GPIO2_9. FPGA
>>>> IRQ is level, active low. The GPIOs are mapped like this:
> Here you say that it's a level sensitive interrupt but ...
>
>>>> GPIOs 160-191, /soc@ffe00000/gpio-controller@f200:
>>>>
>>>> GPIOs 192-223, /soc@ffe00000/gpio-controller@f100:
>>>>
>>>> GPIOs 224-255, /soc@ffe00000/gpio-controller@f000:
>>>>
>>>> I've verified that pin mixing is done correctly, and the
>>>> FPGA IRQ line is indeed configured as GPIO.
>>>>
>>>> I have the following code in my driver:
>>>>
>>>>       #define FPGA_IRQ_GPIO 169
>>>>
>>>>       err = gpio_request(FPGA_IRQ_GPIO, "FPGA IRQ");
>>>>       if (err) {
>>>>           printk(KERN_ERR "Failed to request FPGA IRQ GPIO, err=%d\n",
>>>> err);
>>>>           goto out;
>>>>       }
>>>>
>>>>       gpio_direction_input(FPGA_IRQ_GPIO);
>>>>
>>>>       irq = gpio_to_irq(FPGA_IRQ_GPIO);
>>>>       if (irq<   0) {
>>>>           printk(KERN_ERR "Failed to map FPGA GPIO to IRQ\n");
>>>>           goto out;
>>>>       }
>>>>
>>>>       err = request_irq(irq, gsat_interrupt,
>>>>                 IRQF_TRIGGER_FALLING, DRVNAME, priv);
> .. you request here an edge triggered interrupt.

Yes, that is correct. The mpc8xxx_gpio.c driver does not allow
level sensitive interrupts, so I had no choice but to specify
IRQF_TRIGGER_FALLING.

Felix.

^ permalink raw reply

* RE: [PATCH] powerpc/85xx: fix memory controller compatible for edac
From: Xie Shaohui-B21989 @ 2011-08-01  2:44 UTC (permalink / raw)
  To: Grant Likely
  Cc: mm-commits@vger.kernel.org, akpm@linux-foundation.org,
	avorontsov@mvista.com, Gala Kumar-B11780,
	linuxppc-dev@lists.ozlabs.org, davem@davemloft.net
In-Reply-To: <20110731040328.GK24334@ponder.secretlab.ca>

>From: Grant Likely [mailto:glikely@secretlab.ca] On Behalf Of Grant Likely
>Sent: Sunday, July 31, 2011 12:03 PM
>To: Xie Shaohui-B21989
>Cc: linuxppc-dev@lists.ozlabs.org; Gala Kumar-B11780; mm-
>commits@vger.kernel.org; avorontsov@mvista.com; davem@davemloft.net;
>akpm@linux-foundation.org
>Subject: Re: [PATCH] powerpc/85xx: fix memory controller compatible for
>edac
>
>On Tue, Jul 26, 2011 at 01:46:33PM +0800, Shaohui Xie wrote:
>> compatible in dts has been changed, so driver need to update accordingly.
>>
>> Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
>> ---
>> apply for
>> http://git.kernel.org/pub/scm/linux/kernel/git/galak/powerpc.git
>> 'next' branch.
>>
>>  drivers/edac/mpc85xx_edac.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/drivers/edac/mpc85xx_edac.c b/drivers/edac/mpc85xx_edac.c
>> index 13f6cc5..94c064a 100644
>> --- a/drivers/edac/mpc85xx_edac.c
>> +++ b/drivers/edac/mpc85xx_edac.c
>> @@ -1253,7 +1253,7 @@ static struct of_device_id mpc85xx_mc_err_of_match[] = {
>>  	{ .compatible = "fsl,p1020-memory-controller", },
>>  	{ .compatible = "fsl,p1021-memory-controller", },
>>  	{ .compatible = "fsl,p2020-memory-controller", },
>> -	{ .compatible = "fsl,p4080-memory-controller", },
>> +	{ .compatible = "fsl,qoriq-memory-controller", },
>
>Are there any implementations in the field that depend on the p4080 value?
>

[Xie Shaohui] The 'p4080' value was introduced by commit cd1542c8197

edac: mpc85xx: add support for new MPCxxx/Pxxxx EDAC controllers

Simply add proper IDs into the device table.


My patch is intended to keep the dts and the driver consistent.



Best Regards,
Shaohui Xie

^ permalink raw reply

* Re: kvm PCI assignment & VFIO ramblings
From: David Gibson @ 2011-08-01  2:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexey Kardashevskiy, kvm, Paul Mackerras,
	linux-pci@vger.kernel.org, Alex Williamson, Anthony Liguori,
	linuxppc-dev
In-Reply-To: <1311983933.8793.42.camel@pasglop>

On Sat, Jul 30, 2011 at 09:58:53AM +1000, Benjamin Herrenschmidt wrote:
[snip]
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specificed in PAPR) which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?

Not quite.  We already require the not-yet-upstream patches which add
guest-side (emulated) IOMMU support to qemu.  The approach we're using
for the passthrough (or at least will when I fix up my patches again)
is that we map all guest ram into the vfio iommu if and only if
there is no guest-visible iommu advertised in the qdev.

This kind of makes sense - if there is no iommu from the guest
perspective, the guest will expect to see all its physical memory 1:1
in DMA.

The hacky bit is that when there *is* a guest visible iommu, it's
assumed that whatever interface the guest iommu uses is somehow wired
up to vfio map/unmap calls.  For us at the moment, this means
passthrough devices must be assigned to a special (guest) pci
domain which wires up the paravirt iommu to the vfio iommu.

In theory under some circumstances, with full emu, you could wire up
an emulated guest iommu interface to a different host iommu
implementation via this mechanism.  However that wouldn't work if the
guest and host iommu capabilities are too different, and in any case
would require considerable extra abstraction work on the qemu guest
iommu code.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply

* Re: [RFC PATCH] powerpc: 85xx: Make e500/e500v2 depend on !E500MC
From: Baruch Siach @ 2011-08-01  4:59 UTC (permalink / raw)
  To: Scott Wood
  Cc: linuxppc-dev@lists.ozlabs.org, Gala Kumar-B11780,
	Tabi Timur-B04825
In-Reply-To: <20110728152033.7f5b4c10@schlenkerla.am.freescale.net>

Hi Scott,

On Thu, Jul 28, 2011 at 03:20:33PM -0500, Scott Wood wrote:
> On Thu, 28 Jul 2011 19:56:53 +0000
> Tabi Timur-B04825 <B04825@freescale.com> wrote:
> 
> > On Sun, Jun 19, 2011 at 11:56 PM, Baruch Siach <baruch@tkos.co.il> wrote:
> > > CONFIG_E500MC breaks e500/e500v2 systems. It defines L1_CACHE_SHIFT to 6, thus
> > > breaking clear_pages(), probably others too.
> > >
> > > Cc: Kumar Gala <galak@kernel.crashing.org>
> > > Signed-off-by: Baruch Siach <baruch@tkos.co.il>
> > > ---
> > > Is this the right approach?
> > 
> > It doesn't work for me.
> > 
> > I need something that if an e500v2 platform (e.g. the P1022DS) is
> > selected, then I won't be able to select any e500mc platforms (e.g.
> > P4080DS).  And if I don't select any e500v2 platforms, then I will be
> > able to select an e500mc platform.  This patch doesn't seem to do
> > that.
> > 
> > It might be necessary to split the entire menu into two parts, one for
> > e500v2 parts and one for e500mc parts.
> 
> How about making the "Processor Type" entry be either E500 or E500MC, both
> of which select PPC_85xx?

Thanks for the tip. A patch along these lines follows.

baruch

-- 
                                                     ~. .~   Tk Open Systems
=}------------------------------------------------ooO--U--Ooo------------{=
   - baruch@tkos.co.il - tel: +972.2.679.5364, http://www.tkos.co.il -

^ permalink raw reply

* Re: [RFC PATCH] powerpc: 85xx: Make e500/e500v2 depend on !E500MC
From: Baruch Siach @ 2011-08-01  5:02 UTC (permalink / raw)
  To: Timur Tabi; +Cc: Kumar Gala, linuxppc-dev
In-Reply-To: <4E31C04D.7040700@freescale.com>

Hi Timur,

On Thu, Jul 28, 2011 at 03:02:21PM -0500, Timur Tabi wrote:
>  wrote:
> > On Sun, Jun 19, 2011 at 11:56 PM, Baruch Siach <baruch@tkos.co.il> wrote:
> >> CONFIG_E500MC breaks e500/e500v2 systems. It defines L1_CACHE_SHIFT to 6, thus
> >> breaking clear_pages(), probably others too.
> >>
> >> Cc: Kumar Gala <galak@kernel.crashing.org>
> >> Signed-off-by: Baruch Siach <baruch@tkos.co.il>
> >> ---
> >> Is this the right approach?
> > 
> > It doesn't work for me.
> 
> I also get this error if I try to build corenet32_smp_defconfig:
> 
> arch/powerpc/platforms/Kconfig.cputype:136:error: recursive dependency detected!
> arch/powerpc/platforms/Kconfig.cputype:136:	symbol PPC_E500MC is selected by
> P2040_RDB
> arch/powerpc/platforms/85xx/Kconfig:176:	symbol P2040_RDB depends on PPC_E500MC

Thanks for reporting. Where can I get this corenet32_smp_defconfig for 
testing?

baruch

-- 
                                                     ~. .~   Tk Open Systems
=}------------------------------------------------ooO--U--Ooo------------{=
   - baruch@tkos.co.il - tel: +972.2.679.5364, http://www.tkos.co.il -

^ permalink raw reply

* [PATCH] powerpc: 85xx: separate e500 from e500mc
From: Baruch Siach @ 2011-08-01  5:12 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Scott Wood, Baruch Siach, Timur Tabi
In-Reply-To: <20110801045938.GA5716@sapphire.tkos.co.il>

CONFIG_E500MC breaks e500/e500v2 systems. It defines L1_CACHE_SHIFT to 6, thus
breaking clear_pages(), probably others too.

This patch adds a new "Processor Type" entry for e500mc, and makes e500 systems
depend on PPC_E500.

Cc: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
---

Sending again with the correct list address. Sorry for the noise.

 arch/powerpc/platforms/85xx/Kconfig    |   12 +++++++++---
 arch/powerpc/platforms/Kconfig.cputype |   27 +++++++++++++++------------
 2 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/85xx/Kconfig b/arch/powerpc/platforms/85xx/Kconfig
index b6976e1..9530fca 100644
--- a/arch/powerpc/platforms/85xx/Kconfig
+++ b/arch/powerpc/platforms/85xx/Kconfig
@@ -13,6 +13,8 @@ if FSL_SOC_BOOKE
 
 if PPC32
 
+if PPC_E500
+
 config MPC8540_ADS
 	bool "Freescale MPC8540 ADS"
 	select DEFAULT_UIMAGE
@@ -155,10 +157,13 @@ config SBC8560
 	help
 	  This option enables support for the Wind River SBC8560 board
 
+endif # PPC_E500
+
+if PPC_E500MC
+
 config P3041_DS
 	bool "Freescale P3041 DS"
 	select DEFAULT_UIMAGE
-	select PPC_E500MC
 	select PHYS_64BIT
 	select SWIOTLB
 	select MPC8xxx_GPIO
@@ -169,7 +174,6 @@ config P3041_DS
 config P4080_DS
 	bool "Freescale P4080 DS"
 	select DEFAULT_UIMAGE
-	select PPC_E500MC
 	select PHYS_64BIT
 	select SWIOTLB
 	select MPC8xxx_GPIO
@@ -177,13 +181,15 @@ config P4080_DS
 	help
 	  This option enables support for the P4080 DS board
 
+endif # PPC_E500MC
+
 endif # PPC32
 
 config P5020_DS
 	bool "Freescale P5020 DS"
+	depends on PPC_E500MC
 	select DEFAULT_UIMAGE
 	select E500
-	select PPC_E500MC
 	select PHYS_64BIT
 	select SWIOTLB
 	select MPC8xxx_GPIO
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 2165b65..71e3cfb 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -11,13 +11,13 @@ choice
 	prompt "Processor Type"
 	depends on PPC32
 	help
-	  There are five families of 32 bit PowerPC chips supported.
+	  There are six families of 32 bit PowerPC chips supported.
 	  The most common ones are the desktop and server CPUs (601, 603,
 	  604, 740, 750, 74xx) CPUs from Freescale and IBM, with their
 	  embedded 512x/52xx/82xx/83xx/86xx counterparts.
-	  The other embeeded parts, namely 4xx, 8xx, e200 (55xx) and e500
-	  (85xx) each form a family of their own that is not compatible
-	  with the others.
+	  The other embedded parts, namely 4xx, 8xx, e200 (55xx), e500
+	  (85xx), and e500mc each form a family of their own that is not
+	  compatible with the others.
 
 	  If unsure, select 52xx/6xx/7xx/74xx/82xx/83xx/86xx.
 
@@ -25,10 +25,15 @@ config PPC_BOOK3S_32
 	bool "512x/52xx/6xx/7xx/74xx/82xx/83xx/86xx"
 	select PPC_FPU
 
-config PPC_85xx
-	bool "Freescale 85xx"
+config PPC_E500
+	bool "Freescale e500v1/e500v2 (85xx, P10xx, P20xx)"
+	select PPC_85xx
 	select E500
 
+config PPC_E500MC
+	bool "Freescale e500mc/e5500 (P30xx, P40xx, P50xx)"
+	select PPC_85xx
+
 config PPC_8xx
 	bool "Freescale 8xx"
 	select FSL_SOC
@@ -128,15 +133,13 @@ config TUNE_CELL
 config 8xx
 	bool
 
-config E500
+config PPC_85xx
+	bool
 	select FSL_EMB_PERFMON
 	select PPC_FSL_BOOK3E
-	bool
 
-config PPC_E500MC
-	bool "e500mc Support"
-	select PPC_FPU
-	depends on E500
+config E500
+	bool
 
 config PPC_FPU
 	bool
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH] powerpc: Move kdump default base address to half RMO size on 64bit
From: Anton Blanchard @ 2011-08-01  5:27 UTC (permalink / raw)
  To: benh, paulus; +Cc: linuxppc-dev


We are seeing boot failures on some very large boxes even with
commit b5416ca9f824 (powerpc: Move kdump default base address to
64MB on 64bit).

This patch halves the RMO so both kernels get about the same
amount of RMO memory. On large machines this region will be
at least 256MB, so each kernel will get 128MB.

We cap it at 256MB (small SLB size) since some early allocations need
to be in the bolted SLB region. We could relax this on machines with
1TB SLBs in a future patch.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux-powerpc/arch/powerpc/include/asm/kdump.h
===================================================================
--- linux-powerpc.orig/arch/powerpc/include/asm/kdump.h	2011-07-26 11:11:35.583436932 +1000
+++ linux-powerpc/arch/powerpc/include/asm/kdump.h	2011-07-26 11:17:13.159317079 +1000
@@ -3,17 +3,7 @@
 
 #include <asm/page.h>
 
-/*
- * If CONFIG_RELOCATABLE is enabled we can place the kdump kernel anywhere.
- * To keep enough space in the RMO for the first stage kernel on 64bit, we
- * place it at 64MB. If CONFIG_RELOCATABLE is not enabled we must place
- * the second stage at 32MB.
- */
-#if defined(CONFIG_RELOCATABLE) && defined(CONFIG_PPC64)
-#define KDUMP_KERNELBASE	0x4000000
-#else
 #define KDUMP_KERNELBASE	0x2000000
-#endif
 
 /* How many bytes to reserve at zero for kdump. The reserve limit should
  * be greater or equal to the trampoline's end address.
Index: linux-powerpc/arch/powerpc/kernel/machine_kexec.c
===================================================================
--- linux-powerpc.orig/arch/powerpc/kernel/machine_kexec.c	2011-07-26 11:10:27.932259619 +1000
+++ linux-powerpc/arch/powerpc/kernel/machine_kexec.c	2011-07-26 11:18:17.830444562 +1000
@@ -136,12 +136,16 @@ void __init reserve_crashkernel(void)
 	crashk_res.start = KDUMP_KERNELBASE;
 #else
 	if (!crashk_res.start) {
+#ifdef CONFIG_PPC64
 		/*
-		 * unspecified address, choose a region of specified size
-		 * can overlap with initrd (ignoring corruption when retained)
-		 * ppc64 requires kernel and some stacks to be in first segemnt
+		 * On 64bit we split the RMO in half but cap it at half of
+		 * a small SLB (128MB) since the crash kernel needs to place
+		 * itself and some stacks to be in the first segment.
 		 */
+		crashk_res.start = min(0x80000000ULL, (ppc64_rma_size / 2));
+#else
 		crashk_res.start = KDUMP_KERNELBASE;
+#endif
 	}
 
 	crash_base = PAGE_ALIGN(crashk_res.start);

^ permalink raw reply

* [PATCH] powerpc: Lack! of! ibm,io-events! not! that! important!
From: Anton Blanchard @ 2011-08-01  5:30 UTC (permalink / raw)
  To: benh, paulus; +Cc: linuxppc-dev


The ibm,io-events code is a bit verbose with its error messages.
Reverse the reporting so we only print when we successfully enable
I/O event interrupts.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux-powerpc/arch/powerpc/platforms/pseries/io_event_irq.c
===================================================================
--- linux-powerpc.orig/arch/powerpc/platforms/pseries/io_event_irq.c	2011-07-26 08:50:11.291231586 +1000
+++ linux-powerpc/arch/powerpc/platforms/pseries/io_event_irq.c	2011-07-26 08:51:39.992772071 +1000
@@ -212,17 +212,15 @@ static int __init ioei_init(void)
 	struct device_node *np;
 
 	ioei_check_exception_token = rtas_token("check-exception");
-	if (ioei_check_exception_token == RTAS_UNKNOWN_SERVICE) {
-		pr_warning("IO Event IRQ not supported on this system !\n");
+	if (ioei_check_exception_token == RTAS_UNKNOWN_SERVICE)
 		return -ENODEV;
-	}
+
 	np = of_find_node_by_path("/event-sources/ibm,io-events");
 	if (np) {
 		request_event_sources_irqs(np, ioei_interrupt, "IO_EVENT");
+		pr_info("IBM I/O event interrupts enabled\n");
 		of_node_put(np);
 	} else {
-		pr_err("io_event_irq: No ibm,io-events on system! "
-		       "IO Event interrupt disabled.\n");
 		return -ENODEV;
 	}
 	return 0;

^ permalink raw reply

* Re: [PATCH v2 2/4] powerpc, mpc52xx: add a4m072 board support
From: Heiko Schocher @ 2011-08-01  5:30 UTC (permalink / raw)
  To: Grant Likely; +Cc: devicetree-discuss, linuxppc-dev, Wolfgang Denk
In-Reply-To: <20110731040819.GM24334@ponder.secretlab.ca>

Hello Grant,

Grant Likely wrote:
> On Wed, Jun 22, 2011 at 12:39:10PM +0200, Heiko Schocher wrote:
>> Signed-off-by: Heiko Schocher <hs@denx.de>
>> cc: Grant Likely <grant.likely@secretlab.ca>
>> cc: devicetree-discuss@ozlabs.org
>> cc: Wolfgang Denk <wd@denx.de>
>> cc: Wolfram Sang <w.sang@pengutronix.de>
>> ---
>> For this patchseries following patch is needed:
>>
>> http://patchwork.ozlabs.org/patch/91919/
>>
>> Grant? Do you have some comments on that patch?
>>
>> changes for v2:
>>   add comment from Wolfram Sang:
>>   use mpc5200.dtsi
>>
>>  arch/powerpc/boot/dts/a4m072.dts             |  172 ++++++++++++++++++++++++++
>>  arch/powerpc/platforms/52xx/mpc5200_simple.c |    1 +
>>  2 files changed, 173 insertions(+), 0 deletions(-)
>>  create mode 100644 arch/powerpc/boot/dts/a4m072.dts
>>
>> diff --git a/arch/powerpc/boot/dts/a4m072.dts b/arch/powerpc/boot/dts/a4m072.dts
>> new file mode 100644
>> index 0000000..adb6746
>> --- /dev/null
>> +++ b/arch/powerpc/boot/dts/a4m072.dts
>> @@ -0,0 +1,172 @@
>> +/*
>> + * a4m072 board Device Tree Source
>> + *
>> + * Copyright (C) 2011 DENX Software Engineering GmbH
>> + * Heiko Schocher <hs@denx.de>
>> + *
>> + * Copyright (C) 2007 Semihalf
>> + * Marian Balakowicz <m8@semihalf.com>
>> + *
>> + * This program is free software; you can redistribute  it and/or modify it
>> + * under  the terms of  the GNU General  Public License as published by the
>> + * Free Software Foundation;  either version 2 of the  License, or (at your
>> + * option) any later version.
>> + */
>> +
>> +/include/ "mpc5200b.dtsi"
> 
> Ah, I missed this follow up patch.  Yes, this is better.

;-)

>> +
>> +/ {
>> +	model = "anonymous,a4m072";
>> +	compatible = "anonymous,a4m072";

The customer doesn't want their name to appear, so I decided to use
"anonymous" here ... what name should be used?

>> +
>> +	soc5200@f0000000 {
>> +		#address-cells = <1>;
>> +		#size-cells = <1>;
>> +		compatible = "fsl,mpc5200b-immr";
>> +		ranges = <0 0xf0000000 0x0000c000>;
>> +		reg = <0xf0000000 0x00000100>;
>> +		bus-frequency = <0>; /* From boot loader */
>> +		system-frequency = <0>; /* From boot loader */
>> +
>> +		cdm@200 {
>> +			fsl,ext_48mhz_en = <0x0>;
>> +			fsl,fd_enable = <0x01>;
>> +			fsl,fd_counters = <0xbbbb>;
> 
> Are these new properties documented?  They need to be.  Also,
> convention is to use '-' instead of '_' in property names.

Yes, see patch here:

>> For this patchseries following patch is needed:
>>
>> http://patchwork.ozlabs.org/patch/91919/

>> +		};
>> +
>> +		timer@600 {
>> +			compatible = "fsl,mpc5200b-gpt","fsl,mpc5200-gpt";
>> +			reg = <0x600 0x80>;
>> +			interrupts = <1 9 0>;
>> +			fsl,has-wdt;
>> +		};
> 
> Isn't this node already in the mpc5200b.dtsi file?

Yes, you are right, I will remove this.

> Otherwise, this patch looks pretty good.

Thanks for your review! I will wait for your comment on patch
http://patchwork.ozlabs.org/patch/91919/ and then rework these
2 patches.

bye,
Heiko
-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany

^ permalink raw reply

* For Power Arch, the Exception Vector always locate at 0x0100-0x0FFF physical address?
From: Wizard @ 2011-08-01  7:25 UTC (permalink / raw)
  To: linuxppc-dev

Hi

I am not sure whether this is the right place for this question.

According to the PEM, this physical area is reserved for the exception
vectors. Is this always true?
And how many vectors can it hold?

-- 
Wizard

^ permalink raw reply
