LinuxPPC-Dev Archive on lore.kernel.org
* Re: kvm PCI assignment & VFIO ramblings
From: Alex Williamson @ 2011-07-30 18:20 UTC
  To: Benjamin Herrenschmidt
  Cc: chrisw, Alexey Kardashevskiy, kvm, Paul Mackerras,
	linux-pci@vger.kernel.org, qemu-devel, David Gibson, aafabbri,
	iommu, Anthony Liguori, linuxppc-dev, benve
In-Reply-To: <1311983933.8793.42.camel@pasglop>

On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote:
> Hi folks !
> 
> So I promised Anthony I would try to summarize some of the comments &
> issues we have vs. VFIO after we've tried to use it for PCI pass-through
> on POWER. It's pretty long, there are various items with more or less
> impact, some of it is easily fixable, some are API issues, and we'll
> probably want to discuss them separately, but for now here's a brain
> dump.

Thanks Ben.  For those wondering what happened to VFIO and where it
lives now, Tom Lyon turned it over to me.  I've been continuing to hack
and bug fix and prep it for upstream.  My trees are here:

git://github.com/awilliam/linux-vfio.git vfio
git://github.com/awilliam/qemu-vfio.git vfio

I was hoping we were close to being ready for an upstream push, but we
obviously need to work through the issues Ben and company have been
hitting.

> David, Alexei, please make sure I haven't missed anything :-)
> 
> * Granularity of pass-through
> 
> So let's first start with what is probably the main issue and the most
> contentious, which is the problem of dealing with the various
> constraints which define the granularity of pass-through, along with
> exploiting features like the VTd iommu domains.
> 
> For the sake of clarity, let me first talk a bit about the "granularity"
> issue I've mentioned above.
> 
> There are various constraints that can/will force several devices to be
> "owned" by the same guest and on the same side of the host/guest
> boundary. This is generally because some kind of HW resource is shared
> and thus not doing so would break the isolation barrier and enable a
> guest to disrupt the operations of the host and/or another guest.
> 
> Some of those constraints are well known, such as shared interrupts. Some
> are more subtle, for example, if a PCIe->PCI bridge exists in the system,
> there is no way for the iommu to identify transactions from devices
> coming from the PCI segment of that bridge with a granularity other than
> "behind the bridge". So typically an EHCI/OHCI/OHCI combo (a classic)
> behind such a bridge must be treated as a single "entity" for
> pass-through purposes.

On x86, the USB controllers don't typically live behind a PCIe-to-PCI
bridge, so don't suffer the source identifier problem, but they do often
share an interrupt.  But even then, we can count on most modern devices
supporting PCI2.3, and thus the DisINTx feature, which allows us to
share interrupts.  In any case, yes, it's more rare but we need to know
how to handle devices behind PCI bridges.  However I disagree that we
need to assign all the devices behind such a bridge to the guest.
There's a difference between removing the device from the host and
exposing the device to the guest.  If I have a NIC and HBA behind a
bridge, it's perfectly reasonable that I might only assign the NIC to
the guest, but as you describe, we then need to prevent the host, or any
other guest from making use of the HBA.
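
To make the distinction concrete, here is a rough sketch in Python (purely
illustrative, not VFIO code). The topology table, the device addresses and
the function names are all made up for the example. It groups devices the
IOMMU cannot tell apart, then splits a touched group into what is exposed
to the guest and what is merely taken away from the host.

```python
from collections import defaultdict

# Hypothetical topology: device address -> conventional-PCI bridge it sits
# behind, or None if the IOMMU can see the device's own source ID.
TOPOLOGY = {
    "0000:01:00.0": None,            # PCIe NIC, individually identifiable
    "0000:05:01.0": "0000:04:00.0",  # NIC behind a PCIe-to-PCI bridge
    "0000:05:02.0": "0000:04:00.0",  # HBA behind the same bridge
}

def isolation_groups(topology):
    """Group devices that the IOMMU cannot tell apart."""
    groups = defaultdict(list)
    for dev, bridge in sorted(topology.items()):
        # Devices behind a bridge are keyed by the bridge; others stand alone.
        groups[bridge if bridge is not None else dev].append(dev)
    return dict(groups)

def plan(topology, assigned):
    """Split each touched group into devices exposed to the guest and
    devices that must merely be taken away from the host."""
    exposed, held = [], []
    for members in isolation_groups(topology).values():
        if any(d in assigned for d in members):
            for d in members:
                (exposed if d in assigned else held).append(d)
    return exposed, held

exposed, held = plan(TOPOLOGY, {"0000:05:01.0"})
print("expose to guest:", exposed)
print("hold on host   :", held)
```

Assigning only the NIC behind the bridge leaves the HBA in the "held" list:
removed from the host, but never shown to the guest.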

> In IBM POWER land, we call this a "partitionable endpoint" (the term
> "endpoint" here is historic, such a PE can be made of several PCIe
> "endpoints"). I think "partitionable" is a pretty good name tho to
> represent the constraints, so I'll call this a "partitionable group"
> from now on. 
> 
> Other examples of such HW imposed constraints can be a shared iommu with
> no filtering capability (some older POWER hardware which we might want
> to support fall into that category, each PCI host bridge is its own
> domain but doesn't have a finer granularity... however those machines
> tend to have a lot of host bridges :)
> 
> If we are ever going to consider applying some of this to non-PCI
> devices (see the ongoing discussions here), then we will be faced with
> the craziness of embedded designers which probably means all sort of new
> constraints we can't even begin to think about
> 
> This leads me to those initial conclusions:
> 
> - The -minimum- granularity of pass-through is not always a single
> device and not always under SW control

But IMHO, we need to preserve the granularity of exposing a device to a
guest as a single device.  That might mean some devices are held hostage
by an agent on the host.

> - Having a magic heuristic in libvirt to figure out those constraints is
> WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
> knowledge of PCI resource management and getting it wrong in many many
> cases, something that took years to fix essentially by ripping it all
> out. This is kernel knowledge and thus we need the kernel to expose in a
> way or another what those constraints are, what those "partitionable
> groups" are.
> 
> - That does -not- mean that we cannot specify for each individual device
> within such a group where we want to put it in qemu (what devfn etc...).
> As long as there is a clear understanding that the "ownership" of the
> device goes with the group, this is somewhat orthogonal to how they are
> represented in qemu. (Not completely... if the iommu is exposed to the
> guest, via paravirt for example, some of these constraints must be
> exposed but I'll talk about that more later).

Or we can choose not to expose all of the devices in the group to the
guest?

> The interface currently proposed for VFIO (and associated uiommu)
> doesn't handle that problem at all. Instead, it is entirely centered
> around a specific "feature" of the VTd iommu's for creating arbitrary
> domains with arbitrary devices (tho those devices -do- have the same
> constraints exposed above, don't try to put 2 legacy PCI devices behind
> the same bridge into 2 different domains !), but the API totally ignores
> the problem, leaves it to libvirt "magic foo" and focuses on something
> that is both quite secondary in the grand scheme of things, and quite
> x86 VTd specific in the implementation and API definition.

To be fair, libvirt's "magic foo" is built out of the necessity that
nobody else is defining the rules.

> Now, I'm not saying these programmable iommu domains aren't a nice
> feature and that we shouldn't exploit them when available, but as it is,
> it is too much a central part of the API.
> 
> I'll talk a little bit more about recent POWER iommu's here to
> illustrate where I'm coming from with my idea of groups:
> 
> On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
> of domain and a per-RID filtering. However it differs from VTd in a few
> ways:
> 
> The "domains" (aka PEs) encompass more than just an iommu filtering
> scheme. The MMIO space and PIO space are also segmented, and those
> segments assigned to domains. Interrupts (well, MSI ports at least) are
> assigned to domains. Inbound PCIe error messages are targeted to
> domains, etc...
> 
> Basically, the PEs provide a very strong isolation feature which
> includes errors, and has the ability to immediately "isolate" a PE on
> the first occurrence of an error. For example, if an inbound PCIe error
> is signaled by a device on a PE or such a device does a DMA to a
> non-authorized address, the whole PE gets into error state. All
> subsequent stores (both DMA and MMIO) are swallowed and reads return all
> 1's, interrupts are blocked. This is designed to prevent any propagation
> of bad data, which is a very important feature in large high reliability
> systems.
> 
> Software then has the ability to selectively turn back on MMIO and/or
> DMA, perform diagnostics, reset devices etc...
> 
> Because the domains encompass more than just DMA, but also segment the
> MMIO space, it is not practical at all to dynamically reconfigure them
> at runtime to "move" devices into domains. The firmware or early kernel
> code (it depends) will assign devices BARs using an algorithm that keeps
> them within PE segment boundaries, etc....
> 
> Additionally (and this is indeed a "restriction" compared to VTd, though
> I expect our future IO chips to lift it to some extent), PE don't get
> separate DMA address spaces. There is one 64-bit DMA address space per
> PCI host bridge, and it is 'segmented' with each segment being assigned
> to a PE. Due to the way PE assignment works in hardware, it is not
> practical to make several devices share a segment unless they are on the
> same bus. Also the resulting limit in the amount of 32-bit DMA space a
> device can access means that it's impractical to put too many devices in
> a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
> more about that later).
> 
> The above essentially extends the granularity requirement (or rather is
> another factor defining what the granularity of partitionable entities
> is). You can think of it as "pre-existing" domains.
> 
> I believe the way to solve that is to introduce a kernel interface to
> expose those "partitionable entities" to userspace. In addition, it
> occurs to me that the ability to manipulate VTd domains essentially
> boils down to manipulating those groups (creating larger ones with
> individual components).
> 
> I like the idea of defining / playing with those groups statically
> (using a command line tool or sysfs, possibly having a config file
> defining them in a persistent way) rather than having their lifetime
> tied to a uiommu file descriptor.
> 
> It also makes it a LOT easier to have a channel to manipulate
> platform/arch specific attributes of those domains if any.
> 
> So we could define an API or representation in sysfs that exposes what
> the partitionable entities are, and we may add to it an API to
> manipulate them. But we don't have to and I'm happy to keep the
> additional SW grouping you can do on VTd as a separate "add-on" API
> (tho I don't like at all the way it works with uiommu). However, qemu
> needs to know what the grouping is regardless of the domains, and it's
> not nice if it has to manipulate two different concepts here so
> eventually those "partitionable entities" from a qemu standpoint must
> look like domains.
> 
> My main point is that I don't want the "knowledge" here to be in libvirt
> or qemu. In fact, I want to be able to do something as simple as passing
> a reference to a PE to qemu (sysfs path ?) and have it just pickup all
> the devices in there and expose them to the guest.
> 
> This can be done in a way that isn't PCI specific as well (the
> definition of the groups and what is grouped would obviously be
> somewhat bus specific and handled by platform code in the kernel).
> 
> Maybe something like /sys/devgroups ? This probably warrants involving
> more kernel people into the discussion.

I don't yet buy into passing groups to qemu since I don't buy into the
idea of always exposing all of those devices to qemu.  Would it be
sufficient to expose iommu nodes in sysfs that link to the devices
behind them and describe properties and capabilities of the iommu
itself?  More on this at the end.

> * IOMMU
> 
> Now more on iommu. I've described I think in enough details how ours
> work, there are others, I don't know what freescale or ARM are doing,
> sparc doesn't quite work like VTd either, etc...
> 
> The main problem isn't that much the mechanics of the iommu but really
> how it's exposed (or not) to guests.
> 
> VFIO here is basically designed for one and only one thing: expose the
> entire guest physical address space to the device more/less 1:1.
> 
> This means:
> 
>   - It only works with iommu's that provide complete DMA address spaces
> to devices. Won't work with a single 'segmented' address space like we
> have on POWER.
> 
>   - It requires the guest to be pinned. Pass-through -> no more swap
> 
>   - The guest cannot make use of the iommu to deal with 32-bit DMA
> devices, thus a guest with more than a few G of RAM (I don't know the
> exact limit on x86, depends on your IO hole I suppose), and you end up
> back to swiotlb & bounce buffering.
> 
>   - It doesn't work for POWER server anyways because of our need to
> provide a paravirt iommu interface to the guest since that's how pHyp
> works today and how existing OSes expect to operate.
> 
> Now some of this can be fixed with tweaks, and we've started doing it
> (we have a working pass-through using VFIO, forgot to mention that, it's
> just that we don't like what we had to do to get there).

This is a result of wanting to support *unmodified* x86 guests.  We
don't have the luxury of having a predefined pvDMA spec that all x86
OSes adhere to.  The 32bit problem is unfortunate, but the priority use
case for assigning devices to guests is high performance I/O, which
usually entails modern, 64bit hardware.  I'd like to see us get to the
point of having emulated IOMMU hardware on x86, which could then be
backed by VFIO, but for now guest pinning is the most practical and
useful.
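
As a toy contrast between the two models discussed here (a sketch only, in
Python; the page size, addresses and class names are assumptions picked for
the example, not anything VFIO defines): the first function maps all of
guest memory up front, the second object only accepts mappings inside a
fixed DMA window and fills its table on request.

```python
PAGE = 4096  # assumed IOMMU page size for this toy model

def map_whole_guest(guest_ram_bytes, host_base):
    """x86-style model: pin all guest RAM and map guest-physical == IOVA."""
    return {gpa: host_base + gpa for gpa in range(0, guest_ram_bytes, PAGE)}

class DmaWindow:
    """Paravirt-style model: a fixed IOVA window filled on guest request."""
    def __init__(self, start, size):
        self.start, self.size, self.table = start, size, {}

    def map(self, iova, host_addr):
        if not (self.start <= iova < self.start + self.size):
            raise ValueError("IOVA outside the DMA window")
        self.table[iova] = host_addr

    def unmap(self, iova):
        self.table.pop(iova, None)

static = map_whole_guest(16 * PAGE, host_base=0x4000_0000)
window = DmaWindow(start=0x0, size=4 * PAGE)
window.map(0x1000, 0x4000_3000)
print(len(static), "pages pinned up front;", len(window.table), "mapped on demand")
```

The first model needs every guest page resident before the device can run;
the second needs a guest that knows how to ask for each mapping.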

> Basically, what we do today is:
> 
> - We add an ioctl to VFIO to expose to qemu the segment information. IE.
> What is the DMA address and size of the DMA "window" usable for a given
> device. This is a tweak, that should really be handled at the "domain"
> level.
> 
> That current hack won't work well if two devices share an iommu. Note
> that we have an additional constraint here due to our paravirt
> interfaces (specified in PAPR) which is that PE domains must have a
> common parent. Basically, pHyp makes them look like a PCIe host bridge
> per domain in the guest. I think that's a pretty good idea and qemu
> might want to do the same.
> 
> - We hack out the currently unconditional mapping of the entire guest
> space in the iommu. Something will have to be done to "decide" whether
> to do that or not ... qemu argument -> ioctl ?
> 
> - We hook up the paravirt call to insert/remove a translation from the
> iommu to the VFIO map/unmap ioctl's.
> 
> This limps along but it's not great. Some of the problems are:
> 
> - I've already mentioned, the domain problem again :-) 
> 
> - Performance sucks of course, the vfio map ioctl wasn't meant for that
> and has quite a bit of overhead. However we'll want to do the paravirt
> call directly in the kernel eventually ...
> 
>   - ... which isn't trivial to get back to our underlying arch specific
> iommu object from there. We'll probably need a set of arch specific
> "sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
> link them to the real thing kernel-side.
> 
> - PAPR (the specification of our paravirt interface and the expectation
> of current OSes) wants iommu pages to be 4k by default, regardless of
> the kernel host page size, which makes things a bit tricky since our
> enterprise host kernels have a 64k base page size. Additionally, we have
> new PAPR interfaces that we want to exploit, to allow the guest to
> create secondary iommu segments (in 64-bit space), which can be used
> (under guest control) to do things like map the entire guest (here it
> is :-) or use larger iommu page sizes (if permitted by the host kernel,
> in our case we could allow 64k iommu page size with a 64k host kernel).
> 
> The above means we need arch specific APIs. So arch specific vfio
> ioctl's, either that or kvm ones going to vfio or something ... the
> current structure of vfio/kvm interaction doesn't make it easy.

FYI, we also have large page support for x86 VT-d, but it seems to only
be opportunistic right now.  I'll try to come back to the rest of this
below.

> * IO space
> 
> On most (if not all) non-x86 archs, each PCI host bridge provides a
> completely separate PCI address space. Qemu doesn't deal with that very
> well. For MMIO it can be handled since those PCI address spaces are
> "remapped" holes in the main CPU address space so devices can be
> registered by using BAR + offset of that window in qemu MMIO mapping.
> 
> For PIO things get nasty. We have totally separate PIO spaces and qemu
> doesn't seem to like that. We can try to play the offset trick as well,
> we haven't tried yet, but basically that's another one to fix. Not a
> huge deal I suppose but heh ...
> 
> Also our next generation chipset may drop support for PIO completely.
> 
> On the other hand, because PIO is just a special range of MMIO for us,
> we can do normal pass-through on it and don't need any of the emulation
> done by qemu.

Maybe we can add mmap support to PIO regions on non-x86.

>   * MMIO constraints
> 
> The QEMU side VFIO code hard wires various constraints that are entirely
> based on various requirements you decided you have on x86 but don't
> necessarily apply to us :-)
> 
> Due to our paravirt nature, we don't need to masquerade the MSI-X table
> for example. At all. If the guest configures crap into it, too bad, it
> can only shoot itself in the foot since the host bridge enforces
> validation anyways as I explained earlier. Because it's all paravirt, we
> don't need to "translate" the interrupt vectors & addresses, the guest
> will call hypercalls to configure things anyways.

With interrupt remapping, we can allow the guest access to the MSI-X
table, but since that takes the host out of the loop, there's
effectively no way for the guest to correctly program it directly by
itself.

> We don't need to prevent MMIO pass-through for small BARs at all. This
> should be some kind of capability or flag passed by the arch. Our
> segmentation of the MMIO domain means that we can give entire segments
> to the guest and let it access anything in there (those segments are a
> multiple of the page size always). Worst case it will access outside of
> a device BAR within a segment and will cause the PE to go into error
> state, shooting itself in the foot, there is no risk of side effect
> outside of the guest boundaries.

Sure, this could be some kind of capability flag, maybe even implicit in
certain configurations.

> In fact, we don't even need to emulate BAR sizing etc... in theory. Our
> paravirt guests expect the BARs to have been already allocated for them
> by the firmware and will pick up the addresses from the device-tree :-)
> 
> Today we use a "hack", putting all 0's in there and triggering the linux
> code path to reassign unassigned resources (which will use BAR
> emulation) but that's not what we are -supposed- to do. Not a big deal
> and having the emulation there won't -hurt- us, it's just that we don't
> really need any of it.
> 
> We have a small issue with ROMs. Our current KVM only works with huge
> pages for guest memory but that is being fixed. So the way qemu maps the
> ROM copy into the guest address space doesn't work. It might be handy
> anyways to have a way for qemu to use MMIO emulation for ROM access as a
> fallback. I'll look into it.

So that means ROMs don't work for you on emulated devices either?  The
reason we read it once and map it into the guest is because Michael
Tsirkin found a section in the PCI spec that indicates devices can share
address decoders between BARs and ROM.  This means we can't just leave
the enabled bit set in the ROM BAR, because it could actually disable an
address decoder for a regular BAR.  We could slow-map the actual ROM,
enabling it around each read, but shadowing it seemed far more
efficient.
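
A small Python model of the trade-off, for illustration only. The device
class is invented, and the shared-decoder behavior is simply assumed (BAR0
stops responding while the ROM is enabled); real devices differ.

```python
class ToyDevice:
    """Toy device whose ROM shares an address decoder with BAR0 (assumed)."""
    def __init__(self, rom):
        self._rom = bytes(rom)
        self.rom_enabled = False

    def read_bar0(self, off):
        # While the ROM owns the shared decoder, BAR0 does not respond.
        return 0xFF if self.rom_enabled else off & 0xFF

    def read_rom(self, off):
        return self._rom[off] if self.rom_enabled else 0xFF

def shadow_rom(dev, size):
    """Enable the ROM once, copy it, and disable it again."""
    dev.rom_enabled = True
    try:
        return bytes(dev.read_rom(i) for i in range(size))
    finally:
        dev.rom_enabled = False

def slow_read_rom(dev, off):
    """Alternative: toggle the enable bit around every single read."""
    dev.rom_enabled = True
    try:
        return dev.read_rom(off)
    finally:
        dev.rom_enabled = False

dev = ToyDevice(b"\x55\xaa\x10")
shadow = shadow_rom(dev, 3)
print(shadow.hex(), dev.rom_enabled, dev.read_bar0(0x42))
```

Shadowing touches the enable bit twice in total; the slow-map variant
touches it twice per read, and BAR0 is unreachable during each of those
windows.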

>   * EEH
> 
> This is the name of those fancy error handling & isolation features I
> mentioned earlier. To some extent it's a superset of AER, but we don't
> generally expose AER to guests (or even the host), it's swallowed by
> firmware into something else that provides a superset (well mostly) of
> the AER information, and allow us to do those additional things like
> isolating/de-isolating, reset control etc...
> 
> Here too, we'll need arch specific APIs through VFIO. Not necessarily a
> huge deal, I mention it for completeness.

We expect to do AER via the VFIO netlink interface, which, even though
it's bashed below, would be quite extensible to supporting different
kinds of errors.

>    * Misc
> 
> There's lots of small bits and pieces... in no special order:
> 
>  - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
> netlink and a bit of ioctl's ... it's not like there's something
> fundamentally better for netlink vs. ioctl... it really depends what
> you are doing, and in this case I fail to see what netlink brings you
> other than bloat and more stupid userspace library deps.

The netlink interface is primarily for host->guest signaling.  I've only
implemented the remove command (since we're lacking a pcie-host in qemu
to do AER), but it seems to work quite well.  If you have suggestions
for how else we might do it, please let me know.  This seems to be the
sort of thing netlink is supposed to be used for.

>  - I don't like too much the fact that VFIO provides yet another
> different API to do what we already have at least 2 kernel APIs for, ie,
> BAR mapping and config space access. At least it should be better at
> using the backend infrastructure of the 2 others (sysfs & procfs). I
> understand it wants to filter in some case (config space) and -maybe-
> yet another API is the right way to go but allow me to have my doubts.

The use of PCI sysfs is actually one of my complaints about current
device assignment.  To do assignment with an unprivileged guest we need
to open the PCI sysfs config file for it, then change ownership on a
handful of other PCI sysfs files, then there's this other pci-stub thing
to maintain ownership, but the kvm ioctls don't actually require it and
can grab onto any free device...  We are duplicating some of that in
VFIO, but we also put the ownership of the device behind a single device
file.  We do have the uiommu problem that we can't give an unprivileged
user ownership of that, but your usage model may actually make that
easier.  More below...

> One thing I thought about but you don't seem to like it ... was to use
> the need to represent the partitionable entity as groups in sysfs that I
> talked about earlier. Those could have per-device subdirs with the usual
> config & resource files, same semantic as the ones in the real device,
> but when accessed via the group they get filtering. I might or might not
> be practical in the end, tbd, but it would allow apps using a slightly
> modified libpci for example to exploit some of this.

I may be tainted by our disagreement that all the devices in a group
need to be exposed to the guest and qemu could just take a pointer to a
sysfs directory.  That seems very unlike qemu and pushes more of the
policy into qemu, which seems like the wrong direction.

>  - The qemu vfio code hooks directly into ioapic ... of course that
> won't fly with anything !x86

I spent a lot of time looking for an architecture neutral solution here,
but I don't think it exists.  Please prove me wrong.  The problem is
that we have to disable INTx on an assigned device after it fires (VFIO
does this automatically).  If we don't do this, a non-responsive or
malicious guest could sit on the interrupt, causing it to fire
repeatedly as a DoS on the host.  The only indication that we can rely
on to re-enable INTx is when the guest CPU writes an EOI to the APIC.
We can't just wait for device accesses because a) the device CSRs are
(hopefully) direct mapped and we'd have to slow map them or attempt to
do some kind of dirty logging to detect when they're accessed, and b) what
constitutes an interrupt service is device specific.

That means we need to figure out how PCI interrupt 'A' (or B...)
translates to a GSI (Global System Interrupt - ACPI definition, but
hopefully a generic concept).  That GSI identifies a pin on an IOAPIC,
which will also see the APIC EOI.  And just to spice things up, the
guest can change the PCI to GSI mappings via ACPI.  I think the set of
callbacks I've added are generic (maybe I left ioapic in the name), but
yes they do need to be implemented for other architectures.  Patches
appreciated from those with knowledge of the systems and/or access to
device specs.  This is the only reason that I make QEMU VFIO only build
for x86.
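
The mask-on-fire, unmask-on-EOI cycle described above can be modeled in a
few lines of Python. This is only a sketch of the state machine; the class
and method names are invented and nothing here is taken from the VFIO or
QEMU sources.

```python
class AssignedIntx:
    """Toy model of one level-triggered INTx line owned by a guest."""
    def __init__(self):
        self.masked = False      # stands in for DisINTx in the command register
        self.injected = 0        # interrupts forwarded to the guest

    def device_asserts(self):
        """Host handler: mask first, then forward to the guest."""
        if self.masked:
            return False         # line is held off, nothing reaches the host
        self.masked = True
        self.injected += 1
        return True

    def guest_eoi(self):
        """Guest wrote an EOI for the matching GSI: re-enable the line."""
        self.masked = False

line = AssignedIntx()
line.device_asserts()            # delivered, line now masked
line.device_asserts()            # guest has not EOI'd: suppressed
line.guest_eoi()
line.device_asserts()            # delivered again
print("interrupts injected:", line.injected)
```

A guest that never issues the EOI only ever costs the host one interrupt,
which is the whole point of masking before forwarding.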

>  - The various "objects" dealt with here, -especially- interrupts and
> iommu, need a better in-kernel API so that fast in-kernel emulation can
> take over from qemu based emulation. The way we need to do some of this
> on POWER differs from x86. We can elaborate later, it's not necessarily
> a killer either but essentially we'll take the bulk of interrupt
> handling away from VFIO to the point where it won't see any of it at
> all.

The plan for x86 is to connect VFIO eventfds directly to KVM irqfds and
bypass QEMU.  This is exactly what VHOST does today and fairly trivial
to enable for MSI once we get it merged.  INTx would require us to be
able to define a level triggered irqfd in KVM and it's not yet clear if
we care that much about INTx performance.

We don't currently have a plan for accelerating IOMMU access since our
current usage model doesn't need one.  We also need to consider MSI-X
table acceleration for x86.  I hope we'll be able to use the new KVM
ioctls for this.

>   - Non-PCI devices. That's a hot topic for embedded. I think the vast
> majority here is platform devices. There's quite a bit of vfio that
> isn't intrinsically PCI specific. We could have an in-kernel platform
> driver like we have an in-kernel PCI driver to attach to. The mapping of
> resources to userspace is rather generic, as goes for interrupts. I
> don't know whether that idea can be pushed much further, I don't have
> the bandwidth to look into it much at this point, but maybe it would be
> possible to refactor vfio a bit to better separate what is PCI specific
> to what is not. The idea would be to move the PCI specific bits to
> inside the "placeholder" PCI driver, and same goes for platform bits.
> "generic" ioctl's go to VFIO core, anything that doesn't handle, it
> passes them to the driver which allows the PCI one to handle things
> differently than the platform one, maybe an amba one while at it,
> etc.... just a thought, I haven't gone into the details at all.

This is on my radar, but I don't have a good model for it either.  I
suspect there won't be a whole lot left of VFIO if we make all the PCI
bits optional.  The right approach might be to figure out what's missing
between UIO and VFIO for non-PCI, implement that as a driver, then see
if we can base VFIO on using that for MMIO/PIO/INTx, leaving config and
MSI as a VFIO layer on top of the new UIO driver.

> I think that's all I had on my plate today, it's a long enough email
> anyway :-) Anthony suggested we put that on a wiki, I'm a bit
> wiki-disabled myself so he proposed to pickup my email and do that. We
> should probably discuss the various items in here separately as
> different threads to avoid too much confusion.
> 
> One other thing we should do on our side is publish somewhere our
> current hacks to get you an idea of where we are going and what we had
> to do (code speaks more than words). We'll try to do that asap, possibly
> next week.
> 
> Note that I'll be on/off the next few weeks, travelling and doing
> bringup. So expect latency in my replies.

Thanks for the write up, I think it will be good to let everyone digest
it before we discuss this at KVM forum.

Rather than your "groups" idea, I've been mulling over whether we can
just expose the dependencies, configuration, and capabilities in sysfs
and build qemu commandlines to describe it.  For instance, if we simply
start with creating iommu nodes in sysfs, we could create links under
each iommu directory to the devices behind them.  Some kind of
capability file could define properties like whether it's page table
based or fixed iova window or the granularity of mapping the devices
behind it.  Once we have that, we could probably make uiommu attach to
each of those nodes.
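
For illustration, a Python sketch of what reading such a node might look
like. The directory layout, the "capabilities" file name and its key=value
format are all hypothetical (no such sysfs interface exists yet), so the
example builds a fake tree in a temporary directory rather than touching
/sys.

```python
import os
import tempfile

def build_fake_sysfs(root):
    """Create a made-up layout: <root>/iommu/iommu7/{capabilities,devices/*}."""
    node = os.path.join(root, "iommu", "iommu7")
    os.makedirs(os.path.join(node, "devices"))
    os.makedirs(os.path.join(root, "devices"))
    with open(os.path.join(node, "capabilities"), "w") as f:
        f.write("type=fixed-window\ngranularity=group\n")
    for addr in ("0000:05:01.0", "0000:05:02.0"):
        target = os.path.join(root, "devices", addr)
        os.mkdir(target)
        # Each iommu node links to the devices behind it.
        os.symlink(target, os.path.join(node, "devices", addr))
    return node

def describe(node):
    """Read the capability file and the device links of one iommu node."""
    with open(os.path.join(node, "capabilities")) as f:
        caps = dict(line.strip().split("=", 1) for line in f if "=" in line)
    devices = sorted(os.listdir(os.path.join(node, "devices")))
    return caps, devices

with tempfile.TemporaryDirectory() as root:
    caps, devices = describe(build_fake_sysfs(root))
print(caps, devices)
```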

That means we know /dev/uiommu7 (random example) is our access to a
specific iommu with a given set of devices behind it.  If that iommu is
a PE (via those capability files), then a user space entity (trying hard
not to call it libvirt) can unbind all those devices from the host,
maybe bind the ones it wants to assign to a guest to vfio and bind the
others to pci-stub for safe keeping.  If you trust a user with
everything in a PE, bind all the devices to VFIO, chown all
the /dev/vfioX entries for those devices, and the /dev/uiommuX device.
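
A dry-run sketch of that rebinding step, in Python. It only builds the list
of sysfs writes and prints them as shell-style lines; nothing is written.
The unbind and bind paths are the usual PCI driver-model files, but the
"vfio" driver name is an assumption, and the new_id registration and the
chown step are left out to keep it short.

```python
def bind_plan(pe_devices, assigned, guest_driver="vfio", stub_driver="pci-stub"):
    """Return the sysfs writes, as (path, value) pairs, for one PE.

    Dry run only: nothing is written. The "vfio" driver name is an
    assumption; new_id registration and the chown step are left out.
    """
    writes = []
    for dev in pe_devices:
        driver = guest_driver if dev in assigned else stub_driver
        # Detach from whatever host driver currently owns the device ...
        writes.append((f"/sys/bus/pci/devices/{dev}/driver/unbind", dev))
        # ... then hand it to the driver chosen for it.
        writes.append((f"/sys/bus/pci/drivers/{driver}/bind", dev))
    return writes

for path, value in bind_plan(["0000:05:01.0", "0000:05:02.0"], {"0000:05:01.0"}):
    print(f"echo {value} > {path}")
```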

We might then come up with qemu command lines to describe interesting
configurations, such as:

-device iommu,model=PAPR,uiommu=/dev/uiommu7,id=iommu.0 \
-device pci-bus,...,iommu=iommu.0,id=pci.0 \
-device vfio,host=ssss:bb:dd.f,bus=pci.0,addr=dd.f,id=hostdev0

The userspace entity would obviously need to put things in the same PE
in the right place, but it doesn't seem to take a lot of sysfs info to
get that right.

Today we do DMA mapping via the VFIO device because the capabilities of
the IOMMU domains change depending on which devices are connected (for
VT-d, the least common denominator of the IOMMUs in play).  Forcing the
DMA mappings through VFIO naturally forces the call order.  If we moved
to something like above, we could switch the DMA mapping to the uiommu
device, since the IOMMU would have fixed capabilities.
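
To show what "least common denominator" means in practice, a Python sketch
with invented capability fields (address width, page sizes, snoop control)
and invented values. The real VT-d capability set is richer; this only
illustrates why the result depends on which IOMMUs are in play.

```python
from functools import reduce

def common_caps(a, b):
    """Combine two capability sets into what both IOMMUs can honor."""
    return {
        "addr_bits": min(a["addr_bits"], b["addr_bits"]),
        "page_sizes": a["page_sizes"] & b["page_sizes"],
        "snoop_control": a["snoop_control"] and b["snoop_control"],
    }

def domain_caps(iommus):
    """Least common denominator across every IOMMU in the domain."""
    return reduce(common_caps, iommus)

# Hypothetical capability values for two IOMMU units.
iommu_a = {"addr_bits": 48, "page_sizes": {4096, 2 << 20}, "snoop_control": True}
iommu_b = {"addr_bits": 39, "page_sizes": {4096}, "snoop_control": False}

print(domain_caps([iommu_a]))
print(domain_caps([iommu_a, iommu_b]))
```

Adding a device behind the second unit shrinks what the domain can offer,
which is why the mappings currently have to come after the devices are
attached.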

What gaps would something like this leave for your IOMMU granularity
problems?  I'll need to think through how it works when we don't want to
expose the iommu to the guest, maybe a model=none (default) that doesn't
need to be connected to a pci bus and maps all guest memory.  Thanks,

Alex


* Re: HELP:PowerPc-Linux kernel
From: Gary Thomas @ 2011-07-30 12:28 UTC
  To: naresh.kamboju; +Cc: scottwood, vijay.t.nikam, cort, linuxppc-dev
In-Reply-To: <35CC4C9595855B43903A67B297EFA8E3C546BC@HYD-MKD-MBX01.wipro.com>

On 2011-07-30 06:21, naresh.kamboju@wipro.com wrote:
> Hi All,
>
> I have started working on powerpc board bring up. I have prepared dtb file and booted linux kernel with my debug statement.
> Problem:
> I could not see anything on the serial console. By using the emulator I can read __log_buf and found below info.
>
> How can I initialize the serial console?
> Here "ttyCPM0 at MMIO map 0xc504aa00 mem 0x0 (irq = 40) is a CPM UART" is detected by the kernel, whereas we generally pass ttyS0 in the boot args.
> May I know the relation between ttyCPM0 and ttyS0?
> How can I see kernel boot console on the serial port? It would be helpful if you share any workarounds.

Did you try passing 'console=ttyCPM0' to the bootargs?

Also, 2.6.21 is truly ancient.  Why not try a more recent kernel, especially if
you are just getting started?
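
A quick Python sketch of that change, applied to a shortened copy of the
command line from your log (the nfsroot part is left out). Keeping the
",9600n8" options is an assumption about how the CPM UART is set up; adjust
as needed.

```python
def set_console(cmdline, console):
    """Replace every console= argument in a kernel command line."""
    kept = [arg for arg in cmdline.split() if not arg.startswith("console=")]
    return " ".join(kept + [f"console={console}"])

old = "mem=64M console=ttyS0,9600n8 root=/dev/nfs rw"
print(set_console(old, "ttyCPM0,9600n8"))
```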

>
> Below print out is from emulator by reading __log_buf and parsed as readable log.
>
> <6>Using MPC82xx ADS machine description
> .<3>Initializing container subsys cpu.
> <5>Linux version 2.6.21.7-hrt1-cfs-v22-grsec-WR2.0bl_cgl (vanga@linux) (gcc version 4.1.2 (Wind River Linux Sourcery G++ 4.1-91)) #18 Sat Jul 30 14:39:06 IST 2011
> .<7>Entering add_active_range(0, 0, 16384) 0 entries of 256 used
> .<6>No memory reg property [1] in devicetree
> .<7>Top of RAM: 0x4000000, Total RAM: 0x4000000
> .<7>Memory hole size: 0MB
> .<4>Zone PFN ranges:
> .<4>   DMA             0 ->     16384
> .<4>   Normal      16384 ->     16384
> .<4>early_node_map[1] active PFN ranges
> .<4>     0:        0 ->     16384
> .<7>On node 0 totalpages: 16384
> .<7>   DMA zone: 128 pages used for memmap
> .<7>   DMA zone: 0 pages reserved
> .<7>   DMA zone: 16256 pages, LIFO batch:3
> .<7>   Normal zone: 0 pages used for memmap.<4>Built 1 zonelists.  Total pages: 16256.
> <5>Kernel command line: mem=64M console=ttyS0,9600n8 root=/dev/nfs rw nfsroot=172.16.50.152:/home/export,nolock,rsize=1024,wsize=1024.
> <5>---after parse_early_param------- .
> <5>---after parse_args------- .
> <5>entered    sort_main_extable
> <5>exit sort_main_extable
> <5>---after sort_main_extable()------- .
> <5>---after trap_init()------- .
> <5>---after rcu_init()-------
> .<6>No pci node on device tree.
> <5>---after init_IRQ()-------
> .<4>PID hash table entries: 256 (order: 8, 1024    bytes).
> <5>---after pidhash_init------- .
> <5>---after init_timers------- .
> <5>---after hrtimers_init------- .
> <5>---after softirq_init------- .
> <5>---after timekeeping_init-------
> .<7>time_init: decrementer frequency = 16.675000 MHz
> .<7>time_init: processor fre   quency   = 166.750000 MHz.
> <5>---after time_init------- .
> <5>---after profile_init------- .
> <5>---after early_boot_irqs_on------- .
> <5>---after local_irq_enable-------
> .<6>---entered in cpm_uart_console_init -- .
> <5>---after console_init------- .
> <5>---after pan   ic check------- .
> <5>---after lockdep_info------- .
> <5>---after locking_selftest-------
> .<4>Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
> .<4>Inode-cache hash table entries: 4096 (order: 2, 16384 bytes).
> <5>---after vfs_caches_init_early-------    .
> <5>---cpuset_init_early-------
> .<6>Memory: 61612k/65536k available (2848k kernel code, 3860k reserved, 84k data, 276k bss, 152k init).
> <5>---after kmem_cache_init------- .
> <5>---after locking_selftest------- .
> <5>---after radix_tree_init------- .
> <5>---after                       memleak_init------- .
> <5>---after setup_per_cpu_pageset------- .
> <5>---after numa_policy_init------- .
> <5>---after late_time_init------
> .<7>Calibrating delay loop... 33.28 BogoMIPS (lpj=66560).
> <5>---after calibrate_delay------- .
> <5>---after pidmap_init------   - .
> <5>---after pgtable_cache_init------- .
> <5>---after prio_tree_init------- .
> <5>---after anon_vma_init------- .
> <5>---after fork_init------- .
> <5>---after proc_caches_init------- .
> <5>---after buffer_init------- .
> <5>---after unnamed_dev_init------- .
> <5>---aft   er key_init------- .
> <5>---after security_init-------
> .<4>Mount-cache hash table entries: 512.
> <5>---after vfs_caches_init------- .
> <5>---after signals_init------- .
> <5>---after page_writeback_init-------
> .<3>Initializing container subsys cpuacct
> .<3>Initializi   ng container subsys debug.
> <5>---after container_init------- .
> <5>---after cpuset_init------- .
> <5>---after taskstats_init_early------ .
> <5>---after delayacct_init------- .
> <5>---after check_bugs------- .
> <5>---after acpi_early_init-------
> .<6>-------entry rest_   init--------
> .<6>-------kernel_thread --------
> .<6>----------after numa_default_policy---------
> .<6>-----unlock_kernel--------
> .<6>------------init_idle_bootup_task---------
> .<6>--------preempt_enable_no_resched------
> .<6>NET: Registered protocol family 16
> .<6>PC   I: Probing PCI hardware
> .<6>Generic PHY: Registered new driver
> .<6>NET: Registered protocol family 2
> .<6>------------after schedule---------
> .<6>----------preempt_disable-------.
> <5>-----entered cpu_idle -------
> <5>-----entered cpu_idle  set_thread_flag -------I   P route cache hash table entries: 1024 (order: 0, 4096 bytes)
> .<4>TCP established hash table entries: 2048 (order: 2, 16384 bytes)
> .<4>TCP bind hash table entries: 2048 (order: 1, 8192 bytes)
> .<6>TCP: Hash tables configured (established 2048 bind 2048)
> .<6>TCP    reno registered
> .<6>JFS: nTxBlock = 481, nTxLock = 3854
> .<6>Time: timebase clocksource has been installed.
> .<6>Switched to high resolution mode on CPU 0
> .<6>Registering unionfs 2.1.6 (for 2.6.21.7)
> .<6>io scheduler noop registered
> .<6>Generic RTC Driver v1.07
> .<    3>i8042.c: No controller found.
> .<6>Serial: CPM driver $Revision: 0.02 $
> .<6>--cpm_uart_init()---dev = C07E6C08
> .<6>--uart_register_driver() ---ret = 0--
> .<6>cpm_uart_drv_probe: Adding CPM UART 0
> .<6>CPM uart[0]:config_port
> .<6>:CPM uart[0]:request port
> .<6>pinfo->sccp->scc_sccm
> .<6>CPM uart[0]:allocbuf
> .<6>CPM uart[0]:init_scc
> .<6>ttyCPM0 at MMIO map 0xc504aa00 mem 0x0 (irq = 40) is a CPM UART
> .<4>RAMDISK driver initialized: 16 RAM disks of 32768K size 1024 blocksize
> .<4>Default I/O scheduler not found, using no-   op
> .<4>Default I/O scheduler not found, using no-op
> .<4>Default I/O scheduler not found, using no-op
> .<4>Default I/O scheduler not found, using no-op
> .<4>Default I/O scheduler not found, using no-op
> .<4>Default I/O scheduler not found, using no-op
> .<4>Default I/O scheduler not found, using no-op
> .<4>Default I/O scheduler not found, using no-op
> .<4>Default I/O scheduler not found, using no-op
> .<4>Default I/O scheduler not found, using no-op
> .<4>Default I/O scheduler not found, using no-op
> .<4>Default I/O scheduler not found, using no-op
> .<4>Default I/O scheduler not found, using no-op
> .<4>Default I/O scheduler not found, using no-op
> .<4>Default I/O scheduler not found, using no-op
> .<4>Default I/O scheduler not found, using no-op
> .<6>nbd: registered device at major 43
> .<6>Broa   dcom BCM5411: Registered new driver
> .<6>Broadcom BCM5421: Registered new driver
> .<6>Broadcom BCM5461: Registered new driver
> .<6>fs_enet.c:v1.0 (Aug 8, 2005)
> .<3>BB MII Bus: Cannot register as MDIO bus
> .<4>fsl-bb-mdio: probe of fsl-bb-mdio.0 failed with error -1
> .<3>BB MII Bus: Cannot register as MDIO bus
> .<4>fsl-bb-mdio: probe of fsl-bb-mdio.1 failed with error -1
> .<6>No memory reg property [1] in devicetree
> .<6>No memory reg property [1] in devicetree
> .<6>i2c /dev entries driver.
> <5>physmap platform flash device: 020   00000 at fe000000
> .<6>physmap-flash.0: Found 1 x16 devices at 0x0 in 16-bit bank.
> <5>Support for command set 0002 not present
> .<4>gen_probe: No supported Vendor Command Set found
> .<3>physmap-flash physmap-flash.0: map_probe failed
> .<6>TCP cubic registered
> .<6>Initializing XFRM netlink socket
> .<6>NET: Registered protocol family 1
> .<6>NET: Registered protocol family 17
> .<6>NET: Registered protocol family 15
> .<6>802.1Q VLAN Support v1.8 Ben Greear<greearb@candelatech.com>
> .<6>All bugs added by David S. Miller<davem@red   hat.com>.
> <5>Looking up port of RPC 100003/2 on 172.16.50.152....
>
>
> Best regards
> Naresh Kamboju
> -----Original Message-----
> From: Vijay Nikam [mailto:vijay.t.nikam@gmail.com]
> Sent: Thursday, July 28, 2011 10:16 AM
> To: Naresh Kamboju (WT01 - GMT-Telecom Equipment)
> Cc: linuxppc-dev@lists.ozlabs.org; cort@fsmlabs.com; linas@austin.ibm.com; hollis@austin.ibm.com
> Subject: Re: HELP:PowerPc-Linux kernel
>
> Hello,
>
> Start with looking at the configuration of the board done which is
> similar to yours
> or based on the same CPU as yours. It is important to know role of
> device tree so
> read the documentation and understand the syntax and concept of device
> tree. Once
> the complete concept is understood then you should start the
> configuration and achieve
> successful creation of kernel image.
>
> Take a step forward and do some hands on. If any problem occurs then
> post for specific help,
> as porting itself is a big task and doesn't really have straightforward steps.
> Good Luck
>
> Kind Regards,
> Vijay Nikam
>
> On Wed, Jul 27, 2011 at 8:33 PM,<naresh.kamboju@wipro.com>  wrote:
>>
>> Hi,
>>
>>
>>
>> I have take up the new assignment  Board bring up activity with Linux kernel on PowerPC MPC8272.
>>
>> I have been searching Linux bring up on PowerPC processor in Google and IBM wiki and not found good stuff.
>>
>> It would be more helpful for me if you could share related documents.
>>
>>
>>
>> Best regards
>>
>> Naresh Kamboju
>>
>>
>>
>> Please do not print this email unless it is absolutely necessary.
>>
>> The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments.
>>
>> WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email.
>>
>> www.wipro.com
>>
>> _______________________________________________
>> Linuxppc-dev mailing list
>> Linuxppc-dev@lists.ozlabs.org
>> https://lists.ozlabs.org/listinfo/linuxppc-dev
>

-- 
------------------------------------------------------------
Gary Thomas                 |  Consulting for the
MLB Associates              |    Embedded world
------------------------------------------------------------

^ permalink raw reply

* RE: HELP:PowerPc-Linux kernel
From: naresh.kamboju @ 2011-07-30 12:21 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: scottwood, vijay.t.nikam, cort
In-Reply-To: <CAGn8Sby3Swk8zXxDXnXDBM+bDn=RV2cmvqvwXZM+05P=Z1FMAg@mail.gmail.com>

Hi All,

I have started working on powerpc board bring up. I have prepared dtb file and booted linux kernel with my debug statement.
Problem:
I could not see anything on the serial console. By using the emulator I can read __log_buf and found below info.

How can I initialize the serial console?
Here "ttyCPM0 at MMIO map 0xc504aa00 mem 0x0 (irq = 40) is a CPM UART" is detected by kernel and where we generally pass boot args as ttyS0.
May I know the relation between ttyCPM0 and ttyS0?
How can I see kernel boot console on the serial port? It would be helpful if you share any workarounds.

Below print out is from emulator by reading __log_buf and parsed as readable log.

<6>Using MPC82xx ADS machine description
.<3>Initializing container subsys cpu.
<5>Linux version 2.6.21.7-hrt1-cfs-v22-grsec-WR2.0bl_cgl (vanga@linux) (gcc version 4.1.2 (Wind River Linux Sourcery G++ 4.1-91)) #18 Sat Jul 30 14:39:06 IST 2011
.<7>Entering add_active_range(0, 0, 16384) 0 entries of 256 used
.<6>No memory reg property [1] in devicetree
.<7>Top of RAM: 0x4000000, Total RAM: 0x4000000
.<7>Memory hole size: 0MB
.<4>Zone PFN ranges:
.<4>  DMA             0 ->    16384
.<4>  Normal      16384 ->    16384
.<   4>early_node_map[1] active PFN ranges
.<4>    0:        0 ->    16384
.<7>On node 0 totalpages: 16384
.<7>  DMA zone: 128 pages used for memmap
.<7>  DMA zone: 0 pages reserved
.<7>  DMA zone: 16256 pages, LIFO batch:3
.<7>  Normal zone: 0 pages used for memmap.   <4>Built 1 zonelists.  Total pages: 16256.
<5>Kernel command line: mem=64M console=ttyS0,9600n8 root=/dev/nfs rw nfsroot=172.16.50.152:/home/export,nolock,rsize=1024,wsize=1024.
<5>---after parse_early_param------- .
<5>---after parse_args------- .
<5>entered    sort_main_extable  
<5>exit sort_main_extable  
<5>---after sort_main_extable()------- .
<5>---after trap_init()------- .
<5>---after rcu_init()------- 
.<6>No pci node on device tree.
<5>---after init_IRQ()------- 
.<4>PID hash table entries: 256 (order: 8, 1024    bytes).
<5>---after pidhash_init------- .
<5>---after init_timers------- .
<5>---after hrtimers_init------- .
<5>---after softirq_init------- .
<5>---after timekeeping_init------- 
.<7>time_init: decrementer frequency = 16.675000 MHz
.<7>time_init: processor fre   quency   = 166.750000 MHz.
<5>---after time_init------- .
<5>---after profile_init------- .
<5>---after early_boot_irqs_on------- .
<5>---after local_irq_enable------- 
.<6>---entered in cpm_uart_console_init -- .
<5>---after console_init------- .
<5>---after pan   ic check------- .
<5>---after lockdep_info------- .
<5>---after locking_selftest------- 
.<4>Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
.<4>Inode-cache hash table entries: 4096 (order: 2, 16384 bytes).
<5>---after vfs_caches_init_early-------    .
<5>---cpuset_init_early------- 
.<6>Memory: 61612k/65536k available (2848k kernel code, 3860k reserved, 84k data, 276k bss, 152k init).
<5>---after kmem_cache_init------- .
<5>---after locking_selftest------- .
<5>---after radix_tree_init------- .
<5>---after                       memleak_init------- .
<5>---after setup_per_cpu_pageset------- .
<5>---after numa_policy_init------- .
<5>---after late_time_init------ 
.<7>Calibrating delay loop... 33.28 BogoMIPS (lpj=66560).
<5>---after calibrate_delay------- .
<5>---after pidmap_init------   - .
<5>---after pgtable_cache_init------- .
<5>---after prio_tree_init------- .
<5>---after anon_vma_init------- .
<5>---after fork_init------- .
<5>---after proc_caches_init------- .
<5>---after buffer_init------- .
<5>---after unnamed_dev_init------- .
<5>---aft   er key_init------- .
<5>---after security_init------- 
.<4>Mount-cache hash table entries: 512.
<5>---after vfs_caches_init------- .
<5>---after signals_init------- .
<5>---after page_writeback_init------- 
.<3>Initializing container subsys cpuacct
.<3>Initializi   ng container subsys debug.
<5>---after container_init------- .
<5>---after cpuset_init------- .
<5>---after taskstats_init_early------ .
<5>---after delayacct_init------- .
<5>---after check_bugs------- .
<5>---after acpi_early_init------- 
.<6>-------entry rest_   init--------
.<6>-------kernel_thread --------
.<6>----------after numa_default_policy---------
.<6>-----unlock_kernel--------
.<6>------------init_idle_bootup_task---------
.<6>--------preempt_enable_no_resched------
.<6>NET: Registered protocol family 16
.<6>PC   I: Probing PCI hardware
.<6>Generic PHY: Registered new driver
.<6>NET: Registered protocol family 2
.<6>------------after schedule---------
.<6>----------preempt_disable-------.
<5>-----entered cpu_idle -------
<5>-----entered cpu_idle  set_thread_flag -------I   P route cache hash table entries: 1024 (order: 0, 4096 bytes)
.<4>TCP established hash table entries: 2048 (order: 2, 16384 bytes)
.<4>TCP bind hash table entries: 2048 (order: 1, 8192 bytes)
.<6>TCP: Hash tables configured (established 2048 bind 2048)
.<6>TCP    reno registered
.<6>JFS: nTxBlock = 481, nTxLock = 3854
.<6>Time: timebase clocksource has been installed.
.<6>Switched to high resolution mode on CPU 0
.<6>Registering unionfs 2.1.6 (for 2.6.21.7)
.<6>io scheduler noop registered
.<6>Generic RTC Driver v1.07
.<   3>i8042.c: No controller found.
.<6>Serial: CPM driver $Revision: 0.02 $
.<6>--cpm_uart_init()---dev = C07E6C08
.<6>--uart_register_driver() ---ret = 0--
.<6>cpm_uart_drv_probe: Adding CPM UART 0
.<6>CPM uart[0]:config_port
.<6>:CPM uart[0]:request port
.<6>pinfo->sccp->scc_sccm 
.<6>CPM uart[0]:allocbuf
.<6>CPM uart[0]:init_scc
.<6>ttyCPM0 at MMIO map 0xc504aa00 mem 0x0 (irq = 40) is a CPM UART
.<4>RAMDISK driver initialized: 16 RAM disks of 32768K size 1024 blocksize
.<4>Default I/O scheduler not found, using no-   op
.<4>Default I/O scheduler not found, using no-op
.<4>Default I/O scheduler not found, using no-op
.<4>Default I/O scheduler not found, using no-op
.<4>Default I/O scheduler not found, using no-op
.<4>Default I/O scheduler not found, using no-op
.<4>Default I/O scheduler not found, using no-op
.<4>Default I/O scheduler not found, using no-op
.<4>Default I/O scheduler not found, using no-op
.<4>Default I/O scheduler not found, using no-op
.<4>Default I/O scheduler not found, using no-op
.<4>Default I/O scheduler not found, using no-op
.<4>Default I/O scheduler not found, using no-op
.<4>Default I/O scheduler not found, using no-op
.<4>Default I/O scheduler not found, using no-op
.<4>Default I/O scheduler not found, using no-op
.<6>nbd: registered device at major 43
.<6>Broa   dcom BCM5411: Registered new driver
.<6>Broadcom BCM5421: Registered new driver
.<6>Broadcom BCM5461: Registered new driver
.<6>fs_enet.c:v1.0 (Aug 8, 2005)
.<3>BB MII Bus: Cannot register as MDIO bus
.<4>fsl-bb-mdio: probe of fsl-bb-mdio.0 failed with error -1   
.<3>BB MII Bus: Cannot register as MDIO bus
.<4>fsl-bb-mdio: probe of fsl-bb-mdio.1 failed with error -1
.<6>No memory reg property [1] in devicetree
.<6>No memory reg property [1] in devicetree
.<6>i2c /dev entries driver.
<5>physmap platform flash device: 020   00000 at fe000000
.<6>physmap-flash.0: Found 1 x16 devices at 0x0 in 16-bit bank.
<5>Support for command set 0002 not present
.<4>gen_probe: No supported Vendor Command Set found
.<3>physmap-flash physmap-flash.0: map_probe failed
.<6>TCP cubic registered
.<6>Initializing XFRM netlink socket
.<6>NET: Registered protocol family 1
.<6>NET: Registered protocol family 17
.<6>NET: Registered protocol family 15
.<6>802.1Q VLAN Support v1.8 Ben Greear <greearb@candelatech.com>
.<6>All bugs added by David S. Miller <davem@red   hat.com>.
<5>Looking up port of RPC 100003/2 on 172.16.50.152....


Best regards
Naresh Kamboju 
-----Original Message-----
From: Vijay Nikam [mailto:vijay.t.nikam@gmail.com] 
Sent: Thursday, July 28, 2011 10:16 AM
To: Naresh Kamboju (WT01 - GMT-Telecom Equipment)
Cc: linuxppc-dev@lists.ozlabs.org; cort@fsmlabs.com; linas@austin.ibm.com; hollis@austin.ibm.com
Subject: Re: HELP:PowerPc-Linux kernel

Hello,

Start with looking at the configuration of the board done which is
similar to yours
or based on the same CPU as yours. It is important to know role of
device tree so
read the documentation and understand the syntax and concept of device
tree. Once
the complete concept is understood then you should start the
configuration and achieve
successful creation of kernel image.

Take a step forward and do some hands on. If any problem occurs then
post for specific help,
as porting itself is a big task and doesn't really have straightforward steps.
Good Luck

Kind Regards,
Vijay Nikam

On Wed, Jul 27, 2011 at 8:33 PM, <naresh.kamboju@wipro.com> wrote:
>
> Hi,
>
>
>
> I have take up the new assignment  Board bring up activity with Linux kernel on PowerPC MPC8272.
>
> I have been searching Linux bring up on PowerPC processor in Google and IBM wiki and not found good stuff.
>
> It would be more helpful for me if you could share related documents.
>
>
>
> Best regards
>
> Naresh Kamboju
>
>
>


^ permalink raw reply

* kvm PCI assignment & VFIO ramblings
From: Benjamin Herrenschmidt @ 2011-07-29 23:58 UTC (permalink / raw)
  To: kvm
  Cc: Alexey Kardashevskiy, Paul Mackerras, linux-pci@vger.kernel.org,
	David Gibson, Alex Williamson, Anthony Liguori, linuxppc-dev

Hi folks !

So I promised Anthony I would try to summarize some of the comments &
issues we have vs. VFIO after we've tried to use it for PCI pass-through
on POWER. It's pretty long, there are various items with more or less
impact, some of it is easily fixable, some are API issues, and we'll
probably want to discuss them separately, but for now here's a brain
dump.

David, Alexei, please make sure I haven't missed anything :-)

* Granularity of pass-through

So let's first start with what is probably the main issue and the most
contentious, which is the problem of dealing with the various
constraints which define the granularity of pass-through, along with
exploiting features like the VTd iommu domains.

For the sake of clarity, let me first talk a bit about the "granularity"
issue I've mentioned above.

There are various constraints that can/will force several devices to be
"owned" by the same guest and on the same side of the host/guest
boundary. This is generally because some kind of HW resource is shared
and thus not doing so would break the isolation barrier and enable a
guest to disrupt the operations of the host and/or another guest.

Some of those constraints are well known, such as shared interrupts. Some
are more subtle, for example, if a PCIe->PCI bridge exists in the system,
there is no way for the iommu to identify transactions from devices
coming from the PCI segment of that bridge with a granularity other than
"behind the bridge". So typically an EHCI/OHCI/OHCI combo (a classic)
behind such a bridge must be treated as a single "entity" for
pass-through purposes.

In IBM POWER land, we call this a "partitionable endpoint" (the term
"endpoint" here is historic, such a PE can be made of several PCIe
"endpoints"). I think "partitionable" is a pretty good name tho to
represent the constraints, so I'll call this a "partitionable group"
from now on. 
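
As an illustration of the bridge constraint, here is a minimal sketch that
derives "partitionable groups" from a device topology. The topology, the
device names and the `partitionable_groups` helper are all hypothetical, and
the only rule modelled is the PCIe->PCI bridge one described above; real
platforms add further constraints (shared interrupts, shared iommu, etc...).

```python
from collections import defaultdict


def partitionable_groups(parent_of, bridge_kind):
    """Group devices by the topmost PCIe-to-PCI bridge above them, if any.

    parent_of:   device or bridge name -> parent name (None at the root)
    bridge_kind: bridge name -> kind string; anything absent is an endpoint
    """
    groups = defaultdict(list)
    for dev in sorted(parent_of):
        if dev in bridge_kind:
            continue  # bridges are not assignable endpoints in this sketch
        key = dev  # by default a device forms a group of its own
        node = parent_of[dev]
        while node is not None:
            if bridge_kind.get(node) == "pcie-to-pci":
                # The iommu only sees "behind the bridge" for this device.
                key = node
            node = parent_of.get(node)
        groups[key].append(dev)
    return dict(groups)


# Hypothetical topology: one PCIe NIC, and a USB combo behind a PCIe->PCI bridge.
parent_of = {
    "root0": None,
    "nic0": "root0",
    "br0": "root0",
    "ohci0": "br0",
    "ohci1": "br0",
    "ehci0": "br0",
}
bridge_kind = {"root0": "pcie-root", "br0": "pcie-to-pci"}

groups = partitionable_groups(parent_of, bridge_kind)
print(groups)
```

The NIC ends up alone in its group while the three USB functions share one,
which is the "single entity" behaviour described for the bridge case.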

Other examples of such HW imposed constraints can be a shared iommu with
no filtering capability (some older POWER hardware which we might want
to support falls into that category, each PCI host bridge is its own
domain but doesn't have a finer granularity... however those machines
tend to have a lot of host bridges :)

If we are ever going to consider applying some of this to non-PCI
devices (see the ongoing discussions here), then we will be faced with
the craziness of embedded designers which probably means all sorts of new
constraints we can't even begin to think about.

This leads me to those initial conclusions:

- The -minimum- granularity of pass-through is not always a single
device and not always under SW control

- Having a magic heuristic in libvirt to figure out those constraints is
WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel
knowledge of PCI resource management and getting it wrong in many many
cases, something that took years to fix essentially by ripping it all
out. This is kernel knowledge and thus we need the kernel to expose in a
way or another what those constraints are, what those "partitionable
groups" are.

- That does -not- mean that we cannot specify for each individual device
within such a group where we want to put it in qemu (what devfn etc...).
As long as there is a clear understanding that the "ownership" of the
device goes with the group, this is somewhat orthogonal to how they are
represented in qemu. (Not completely... if the iommu is exposed to the
guest, via paravirt for example, some of these constraints must be
exposed but I'll talk about that more later).

The interface currently proposed for VFIO (and associated uiommu)
doesn't handle that problem at all. Instead, it is entirely centered
around a specific "feature" of the VTd iommu's for creating arbitrary
domains with arbitrary devices (tho those devices -do- have the same
constraints exposed above, don't try to put 2 legacy PCI devices behind
the same bridge into 2 different domains !), but the API totally ignores
the problem, leaves it to libvirt "magic foo" and focuses on something
that is both quite secondary in the grand scheme of things, and quite
x86 VTd specific in the implementation and API definition.

Now, I'm not saying these programmable iommu domains aren't a nice
feature and that we shouldn't exploit them when available, but as it is,
it is too much a central part of the API.

I'll talk a little bit more about recent POWER iommu's here to
illustrate where I'm coming from with my idea of groups:

On p7ioc (the IO chip used on recent P7 machines), there -is- a concept
of domain and a per-RID filtering. However it differs from VTd in a few
ways:

The "domains" (aka PEs) encompass more than just an iommu filtering
scheme. The MMIO space and PIO space are also segmented, and those
segments assigned to domains. Interrupts (well, MSI ports at least) are
assigned to domains. Inbound PCIe error messages are targeted to
domains, etc...

Basically, the PEs provide a very strong isolation feature which
includes errors, and has the ability to immediately "isolate" a PE on
the first occurrence of an error. For example, if an inbound PCIe error
is signaled by a device on a PE or such a device does a DMA to a
non-authorized address, the whole PE gets into error state. All
subsequent stores (both DMA and MMIO) are swallowed and reads return all
1's, interrupts are blocked. This is designed to prevent any propagation
of bad data, which is a very important feature in large high reliability
systems.

Software then has the ability to selectively turn back on MMIO and/or
DMA, perform diagnostics, reset devices etc...
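
The freeze behaviour can be modelled in a few lines. This is a toy sketch only:
the class, its method names and the 32-bit register width are assumptions made
for illustration, and keeping interrupts blocked until both MMIO and DMA are
re-enabled is a guess, since the text above does not say when they resume.

```python
class PartitionableEndpoint:
    """Toy model of a PE that isolates itself on the first error."""

    ALL_ONES = 0xFFFFFFFF  # assumed 32-bit register width

    def __init__(self):
        self.mmio_enabled = True
        self.dma_enabled = True
        self.regs = {}

    def report_error(self):
        # First error: the whole PE goes into error state.
        self.mmio_enabled = False
        self.dma_enabled = False

    def mmio_write(self, addr, value):
        if not self.mmio_enabled:
            return  # store is silently swallowed
        self.regs[addr] = value & self.ALL_ONES

    def mmio_read(self, addr):
        if not self.mmio_enabled:
            return self.ALL_ONES  # reads return all 1's while frozen
        return self.regs.get(addr, 0)

    def dma_allowed(self):
        return self.dma_enabled

    def interrupts_delivered(self):
        # Assumption: interrupts stay blocked while any path is frozen.
        return self.mmio_enabled and self.dma_enabled

    def enable_mmio(self):
        self.mmio_enabled = True

    def enable_dma(self):
        self.dma_enabled = True


pe = PartitionableEndpoint()
pe.mmio_write(0x10, 5)
pe.report_error()
pe.mmio_write(0x10, 7)           # swallowed
frozen_read = pe.mmio_read(0x10)
pe.enable_mmio()                 # software selectively re-enables MMIO
recovered_read = pe.mmio_read(0x10)
print(hex(frozen_read), recovered_read, pe.dma_allowed())
```

Re-enabling MMIO while DMA stays off matches the "selectively turn back on"
step: diagnostics can read device state without the device being able to
scribble on memory.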

Because the domains encompass more than just DMA, but also segment the
MMIO space, it is not practical at all to dynamically reconfigure them
at runtime to "move" devices into domains. The firmware or early kernel
code (it depends) will assign devices BARs using an algorithm that keeps
them within PE segment boundaries, etc....

Additionally (and this is indeed a "restriction" compared to VTd, though
I expect our future IO chips to lift it to some extent), PEs don't get
separate DMA address spaces. There is one 64-bit DMA address space per
PCI host bridge, and it is 'segmented' with each segment being assigned
to a PE. Due to the way PE assignment works in hardware, it is not
practical to make several devices share a segment unless they are on the
same bus. Also the resulting limit in the amount of 32-bit DMA space a
device can access means that it's impractical to put too many devices in
a PE anyways. (This is clearly designed for paravirt iommu, I'll talk
more about that later).
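
To show what a 'segmented' DMA address space implies, here is a minimal sketch
that computes the window a PE would own. Equal-sized segments, the 2 GiB
32-bit region and the count of 128 segments are all made-up numbers for
illustration, not p7ioc parameters, and both helpers are hypothetical.

```python
def pe_dma_window(pe_index, space_size, num_segments):
    """Return (base, size) of the DMA segment owned by a PE.

    Assumes the host bridge's DMA space is cut into equal segments and
    that segment N belongs to PE N.
    """
    if not 0 <= pe_index < num_segments:
        raise ValueError("no such segment")
    seg_size = space_size // num_segments
    return pe_index * seg_size, seg_size


def dma_in_window(addr, length, window):
    """True if [addr, addr + length) lies entirely inside the PE's window."""
    base, size = window
    return length > 0 and base <= addr and addr + length <= base + size


# Hypothetical: 2 GiB of 32-bit DMA space shared by 128 PEs -> 16 MiB each.
window = pe_dma_window(3, 2 << 30, 128)
print(hex(window[0]), hex(window[1]))
print(dma_in_window(0x3000000, 4096, window))   # inside PE 3's segment
print(dma_in_window(0x2FFF000, 8192, window))   # straddles into PE 2's segment
```

With numbers like these, each device sees only a small slice of 32-bit DMA
space, which is why putting many devices in one PE quickly becomes impractical.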

The above essentially extends the granularity requirement (or rather is
another factor defining what the granularity of partitionable entities
is). You can think of it as "pre-existing" domains.

I believe the way to solve that is to introduce a kernel interface to
expose those "partitionable entities" to userspace. In addition, it
occurs to me that the ability to manipulate VTd domains essentially
boils down to manipulating those groups (creating larger ones with
individual components).

I like the idea of defining / playing with those groups statically
(using a command line tool or sysfs, possibly having a config file
defining them in a persistent way) rather than having their lifetime
tied to a uiommu file descriptor.

It also makes it a LOT easier to have a channel to manipulate
platform/arch specific attributes of those domains if any.

So we could define an API or representation in sysfs that exposes what
the partitionable entities are, and we may add to it an API to
manipulate them. But we don't have to and I'm happy to keep the
additional SW grouping you can do on VTd as a separate "add-on" API
(tho I don't like at all the way it works with uiommu). However, qemu
needs to know what the grouping is regardless of the domains, and it's
not nice if it has to manipulate two different concepts here so
eventually those "partitionable entities" from a qemu standpoint must
look like domains.

My main point is that I don't want the "knowledge" here to be in libvirt
or qemu. In fact, I want to be able to do something as simple as passing
a reference to a PE to qemu (sysfs path ?) and have it just pick up all
the devices in there and expose them to the guest.

This can be done in a way that isn't PCI specific as well (the
definition of the groups and what is grouped would obviously be
somewhat bus specific and handled by platform code in the kernel).

Maybe something like /sys/devgroups ? This probably warrants involving
more kernel people into the discussion.
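
As a rough sketch of what "pass a sysfs path and pick up all the devices"
could look like from userspace: the directory layout below (one sub-directory
or symlink per device inside the group directory, plus plain attribute files)
is purely hypothetical, since no such /sys/devgroups interface exists yet. The
demo runs against a temporary directory standing in for sysfs.

```python
import os
import tempfile


def devices_in_group(group_path):
    """Return the sorted device names found in a group directory.

    Assumes each device shows up as a directory (or a symlink to one);
    plain files are treated as group attributes and skipped.
    """
    return sorted(
        name
        for name in os.listdir(group_path)
        if os.path.isdir(os.path.join(group_path, name))
    )


# Demo against a throwaway tree standing in for the hypothetical sysfs layout.
with tempfile.TemporaryDirectory() as root:
    group = os.path.join(root, "devgroups", "pe3")
    for dev in ("0000:01:00.0", "0000:01:00.1"):
        os.makedirs(os.path.join(group, dev))
    # A plain attribute file in the group directory is not a device.
    with open(os.path.join(group, "dma_window"), "w") as f:
        f.write("0x3000000 0x1000000\n")
    demo_devices = devices_in_group(group)

print(demo_devices)
```

The point is that qemu would not need any knowledge of bridges or iommus: the
kernel decides what is in the group and userspace just enumerates it.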

* IOMMU

Now more on iommu. I've described I think in enough detail how ours
work, there are others, I don't know what freescale or ARM are doing,
sparc doesn't quite work like VTd either, etc...

The main problem isn't that much the mechanics of the iommu but really
how it's exposed (or not) to guests.

VFIO here is basically designed for one and only one thing: expose the
entire guest physical address space to the device more/less 1:1.

This means:

  - It only works with iommu's that provide complete DMA address spaces
to devices. Won't work with a single 'segmented' address space like we
have on POWER.

  - It requires the guest to be pinned. Pass-through -> no more swap

  - The guest cannot make use of the iommu to deal with 32-bit DMA
devices, thus with a guest that has more than a few G of RAM (I don't
know the exact limit on x86, depends on your IO hole I suppose), you end
up back at swiotlb & bounce buffering.

  - It doesn't work for POWER server anyways because of our need to
provide a paravirt iommu interface to the guest since that's how pHyp
works today and how existing OSes expect to operate.

Now some of this can be fixed with tweaks, and we've started doing it
(we have a working pass-through using VFIO, forgot to mention that, it's
just that we don't like what we had to do to get there).

Basically, what we do today is:

- We add an ioctl to VFIO to expose to qemu the segment information. IE.
What is the DMA address and size of the DMA "window" usable for a given
device. This is a tweak, that should really be handled at the "domain"
level.

That current hack won't work well if two devices share an iommu. Note
that we have an additional constraint here due to our paravirt
interfaces (specified in PAPR) which is that PE domains must have a
common parent. Basically, pHyp makes them look like a PCIe host bridge
per domain in the guest. I think that's a pretty good idea and qemu
might want to do the same.

- We hack out the currently unconditional mapping of the entire guest
space in the iommu. Something will have to be done to "decide" whether
to do that or not ... qemu argument -> ioctl ?

- We hook up the paravirt call to insert/remove a translation from the
iommu to the VFIO map/unmap ioctl's.

This limps along but it's not great. Some of the problems are:

- I've already mentioned, the domain problem again :-) 

- Performance sucks of course, the vfio map ioctl wasn't meant for that
and has quite a bit of overhead. However we'll want to do the paravirt
call directly in the kernel eventually ...

  - ... which isn't trivial to get back to our underlying arch specific
iommu object from there. We'll probably need a set of arch specific
"sideband" ioctl's to "register" our paravirt iommu "bus numbers" and
link them to the real thing kernel-side.

- PAPR (the specification of our paravirt interface and the expectation
of current OSes) wants iommu pages to be 4k by default, regardless of
the kernel host page size, which makes things a bit tricky since our
enterprise host kernels have a 64k base page size. Additionally, we have
new PAPR interfaces that we want to exploit, to allow the guest to
create secondary iommu segments (in 64-bit space), which can be used
(under guest control) to do things like map the entire guest (here it
is :-) or use larger iommu page sizes (if permitted by the host kernel,
in our case we could allow 64k iommu page size with a 64k host kernel).
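
To illustrate the page size mismatch in the last item: with 4k iommu
pages and a 64k host page, one host page is covered by 16 iommu entries.
A minimal sketch of the arithmetic follows; the macro and function names
are invented for this example and do not come from PAPR or the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define IOMMU_PAGE_SHIFT 12u	/* 4k iommu pages, the PAPR default per the text */
#define HOST_PAGE_SHIFT  16u	/* 64k host base page size, per the text */

/* Number of iommu pages covered by one host page. */
static inline uint64_t iommu_pages_per_host_page(void)
{
	return UINT64_C(1) << (HOST_PAGE_SHIFT - IOMMU_PAGE_SHIFT);
}

/* Host page frame number that contains a given address. */
static inline uint64_t host_pfn(uint64_t addr)
{
	return addr >> HOST_PAGE_SHIFT;
}

/* Index of the 4k iommu page inside its 64k host page. */
static inline uint64_t iommu_subpage(uint64_t addr)
{
	return (addr >> IOMMU_PAGE_SHIFT) & (iommu_pages_per_host_page() - 1);
}
```

So a guest request for a single 4k translation only refers to a slice of
a host page, while the host can only pin memory in whole 64k pages, which
is part of what makes this tricky.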

The above means we need arch specific APIs. So arch specific vfio
ioctl's, either that or kvm ones going to vfio or something ... the
current structure of vfio/kvm interaction doesn't make it easy.

* IO space

On most (if not all) non-x86 archs, each PCI host bridge provides a
completely separate PCI address space. Qemu doesn't deal with that very
well. For MMIO it can be handled since those PCI address spaces are
"remapped" holes in the main CPU address space so devices can be
registered by using BAR + offset of that window in qemu MMIO mapping.
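
The "BAR + offset of that window" trick can be sketched as follows. This
is an illustration under assumptions: the structure, its fields and the
function name are hypothetical and are not qemu's API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one host bridge's MMIO window. */
struct phb_mmio_window {
	uint64_t cpu_base;	/* where the window sits in CPU address space */
	uint64_t pci_base;	/* first PCI bus address the window covers */
	uint64_t size;		/* window length in bytes */
};

/*
 * Translate a PCI bus address (e.g. a BAR value) into the CPU address at
 * which the device would be registered. Returns 0 on success, -1 if the
 * address falls outside the window.
 */
static int bar_to_cpu_addr(const struct phb_mmio_window *w, uint64_t bar,
			   uint64_t *cpu_addr)
{
	if (bar < w->pci_base || bar - w->pci_base >= w->size)
		return -1;

	*cpu_addr = w->cpu_base + (bar - w->pci_base);
	return 0;
}
```

Each host bridge gets its own window, so two devices with the same BAR
value behind different bridges still end up at different CPU addresses.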

For PIO things get nasty. We have totally separate PIO spaces and qemu
doesn't seem to like that. We can try to play the offset trick as well,
we haven't tried yet, but basically that's another one to fix. Not a
huge deal I suppose but heh ...

Also our next generation chipset may drop support for PIO completely.

On the other hand, because PIO is just a special range of MMIO for us,
we can do normal pass-through on it and don't need any of the emulation
done by qemu.

* MMIO constraints

The QEMU side VFIO code hard wires various constraints that are entirely
based on various requirements you decided you have on x86 but don't
necessarily apply to us :-)

Due to our paravirt nature, we don't need to masquerade the MSI-X table
for example. At all. If the guest configures crap into it, too bad, it
can only shoot itself in the foot since the host bridge enforces
validation anyways as I explained earlier. Because it's all paravirt, we
don't need to "translate" the interrupt vectors & addresses, the guest
will call hypercalls to configure things anyways.

We don't need to prevent MMIO pass-through for small BARs at all. This
should be some kind of capability or flag passed by the arch. Our
segmentation of the MMIO domain means that we can give entire segments
to the guest and let it access anything in there (those segments are a
multiple of the page size always). Worst case it will access outside of
a device BAR within a segment and will cause the PE to go into error
state, shooting itself in the foot, there is no risk of side effect
outside of the guest boundaries.

In fact, we don't even need to emulate BAR sizing etc... in theory. Our
paravirt guests expect the BARs to have been already allocated for them
by the firmware and will pick up the addresses from the device-tree :-)

Today we use a "hack", putting all 0's in there and triggering the linux
code path to reassign unassigned resources (which will use BAR
emulation) but that's not what we are -supposed- to do. Not a big deal
and having the emulation there won't -hurt- us, it's just that we don't
really need any of it.

We have a small issue with ROMs. Our current KVM only works with huge
pages for guest memory but that is being fixed. So the way qemu maps the
ROM copy into the guest address space doesn't work. It might be handy
anyways to have a way for qemu to use MMIO emulation for ROM access as a
fallback. I'll look into it.

* EEH

This is the name of those fancy error handling & isolation features I
mentioned earlier. To some extent it's a superset of AER, but we don't
generally expose AER to guests (or even the host), it's swallowed by
firmware into something else that provides a superset (well mostly) of
the AER information, and allows us to do those additional things like
isolating/de-isolating, reset control etc...

Here too, we'll need arch specific APIs through VFIO. Not necessarily a
huge deal, I mention it for completeness.

* Misc

There's lots of small bits and pieces... in no special order:

 - netlink ? WTF ! Seriously, we don't need a hybrid API with a bit of
netlink and a bit of ioctl's ... it's not like there's something
fundamentally better for netlink vs. ioctl... it really depends what
you are doing, and in this case I fail to see what netlink brings you
other than bloat and more stupid userspace library deps.

 - I don't like too much the fact that VFIO provides yet another
different API to do what we already have at least 2 kernel APIs for, ie,
BAR mapping and config space access. At least it should be better at
using the backend infrastructure of the 2 others (sysfs & procfs). I
understand it wants to filter in some case (config space) and -maybe-
yet another API is the right way to go but allow me to have my doubts.

One thing I thought about but you don't seem to like it ... was to use
the need to represent the partitionable entity as groups in sysfs that I
talked about earlier. Those could have per-device subdirs with the usual
config & resource files, same semantic as the ones in the real device,
but when accessed via the group they get filtering. It might or might not
be practical in the end, tbd, but it would allow apps using a slightly
modified libpci for example to exploit some of this.

 - The qemu vfio code hooks directly into ioapic ... of course that
won't fly with anything !x86

 - The various "objects" dealt with here, -especially- interrupts and
iommu, need a better in-kernel API so that fast in-kernel emulation can
take over from qemu based emulation. The way we need to do some of this
on POWER differs from x86. We can elaborate later, it's not necessarily
a killer either but essentially we'll take the bulk of interrupt
handling away from VFIO to the point where it won't see any of it at
all.

 - Non-PCI devices. That's a hot topic for embedded. I think the vast
majority here is platform devices. There's quite a bit of vfio that
isn't intrinsically PCI specific. We could have an in-kernel platform
driver like we have an in-kernel PCI driver to attach to. The mapping of
resources to userspace is rather generic, as goes for interrupts. I
don't know whether that idea can be pushed much further, I don't have
the bandwidth to look into it much at this point, but maybe it would be
possible to refactor vfio a bit to better separate what is PCI specific
to what is not. The idea would be to move the PCI specific bits to
inside the "placeholder" PCI driver, and same goes for platform bits.
"generic" ioctl's go to VFIO core, anything it doesn't handle, it
passes to the driver, which allows the PCI one to handle things
differently than the platform one, maybe an amba one while at it,
etc.... just a thought, I haven't gone into the details at all.

I think that's all I had on my plate today, it's a long enough email
anyway :-) Anthony suggested we put that on a wiki, I'm a bit
wiki-disabled myself so he proposed to pickup my email and do that. We
should probably discuss the various items in here separately as
different threads to avoid too much confusion.

One other thing we should do on our side is publish somewhere our
current hacks to give you an idea of where we are going and what we had
to do (code speaks more than words). We'll try to do that asap, possibly
next week.

Note that I'll be on/off the next few weeks, travelling and doing
bringup. So expect latency in my replies.

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC PATCH] powerpc: 85xx: Make e500/e500v2 depend on !E500MC
From: Baruch Siach @ 2011-07-29  7:23 UTC (permalink / raw)
  To: Tabi Timur-B04825; +Cc: Gala Kumar-B11780, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <CAOZdJXVsJUrCaNVHjhLkL=-GwAr79S9xQdoRLvjFG--k1jdRvw@mail.gmail.com>

Hi Tabi,

On Thu, Jul 28, 2011 at 07:56:53PM +0000, Tabi Timur-B04825 wrote:
> On Sun, Jun 19, 2011 at 11:56 PM, Baruch Siach <baruch@tkos.co.il> wrote:
> > CONFIG_E500MC breaks e500/e500v2 systems. It defines L1_CACHE_SHIFT to 6, thus
> > breaking clear_pages(), probably others too.
> >
> > Cc: Kumar Gala <galak@kernel.crashing.org>
> > Signed-off-by: Baruch Siach <baruch@tkos.co.il>
> > ---
> > Is this the right approach?
> 
> It doesn't work for me.
> 
> I need something that if an e500v2 platform (e.g. the P1022DS) is
> selected, then I won't be able to select any e500mc platforms (e.g.
> P4080DS).  And if I don't select any e500v2 platforms, then I will be
> able to select an e500mc platform.  This patch doesn't seem to do
> that.

The source of the trouble seems to be the user selectable CONFIG_PPC_E500MC 
with the misleading "e500mc Support" description. I'll try to post something 
better next week.

> It might be necessary to split the entire menu into two parts, one for
> e500v2 parts and one for e500mc parts.

baruch

-- 
                                                     ~. .~   Tk Open Systems
=}------------------------------------------------ooO--U--Ooo------------{=
   - baruch@tkos.co.il - tel: +972.2.679.5364, http://www.tkos.co.il -

^ permalink raw reply

* Re: [RFC PATCH] powerpc: 85xx: Make e500/e500v2 depend on !E500MC
From: Scott Wood @ 2011-07-28 20:20 UTC (permalink / raw)
  To: Tabi Timur-B04825
  Cc: Baruch Siach, linuxppc-dev@lists.ozlabs.org, Gala Kumar-B11780
In-Reply-To: <CAOZdJXVsJUrCaNVHjhLkL=-GwAr79S9xQdoRLvjFG--k1jdRvw@mail.gmail.com>

On Thu, 28 Jul 2011 19:56:53 +0000
Tabi Timur-B04825 <B04825@freescale.com> wrote:

> On Sun, Jun 19, 2011 at 11:56 PM, Baruch Siach <baruch@tkos.co.il> wrote:
> > CONFIG_E500MC breaks e500/e500v2 systems. It defines L1_CACHE_SHIFT to 6, thus
> > breaking clear_pages(), probably others too.
> >
> > Cc: Kumar Gala <galak@kernel.crashing.org>
> > Signed-off-by: Baruch Siach <baruch@tkos.co.il>
> > ---
> > Is this the right approach?
> 
> It doesn't work for me.
> 
> I need something that if an e500v2 platform (e.g. the P1022DS) is
> selected, then I won't be able to select any e500mc platforms (e.g.
> P4080DS).  And if I don't select any e500v2 platforms, then I will be
> able to select an e500mc platform.  This patch doesn't seem to do
> that.
> 
> It might be necessary to split the entire menu into two parts, one for
> e500v2 parts and one for e500mc parts.
> 

How about making the "Processor Type" entry be either E500 or E500MC, both
of which select PPC_85xx?

-Scott

^ permalink raw reply

* Re: [RFC PATCH] powerpc: 85xx: Make e500/e500v2 depend on !E500MC
From: Timur Tabi @ 2011-07-28 20:02 UTC (permalink / raw)
  To: Baruch Siach; +Cc: Kumar Gala, linuxppc-dev
In-Reply-To: <CAOZdJXVsJUrCaNVHjhLkL=-GwAr79S9xQdoRLvjFG--k1jdRvw@mail.gmail.com>

 wrote:
> On Sun, Jun 19, 2011 at 11:56 PM, Baruch Siach <baruch@tkos.co.il> wrote:
>> CONFIG_E500MC breaks e500/e500v2 systems. It defines L1_CACHE_SHIFT to 6, thus
>> breaking clear_pages(), probably others too.
>>
>> Cc: Kumar Gala <galak@kernel.crashing.org>
>> Signed-off-by: Baruch Siach <baruch@tkos.co.il>
>> ---
>> Is this the right approach?
> 
> It doesn't work for me.

I also get this error if I try to build corenet32_smp_defconfig:

arch/powerpc/platforms/Kconfig.cputype:136:error: recursive dependency detected!
arch/powerpc/platforms/Kconfig.cputype:136:	symbol PPC_E500MC is selected by
P2040_RDB
arch/powerpc/platforms/85xx/Kconfig:176:	symbol P2040_RDB depends on PPC_E500MC

-- 
Timur Tabi
Linux kernel developer at Freescale

^ permalink raw reply

* Re: [RFC PATCH] powerpc: 85xx: Make e500/e500v2 depend on !E500MC
From: Tabi Timur-B04825 @ 2011-07-28 19:56 UTC (permalink / raw)
  To: Baruch Siach; +Cc: Gala Kumar-B11780, linuxppc-dev@lists.ozlabs.org
In-Reply-To: <fef1cfc9a418ec5aa3302915dcb392882f7dd5d2.1308545584.git.baruch@tkos.co.il>

On Sun, Jun 19, 2011 at 11:56 PM, Baruch Siach <baruch@tkos.co.il> wrote:
> CONFIG_E500MC breaks e500/e500v2 systems. It defines L1_CACHE_SHIFT to 6, thus
> breaking clear_pages(), probably others too.
>
> Cc: Kumar Gala <galak@kernel.crashing.org>
> Signed-off-by: Baruch Siach <baruch@tkos.co.il>
> ---
> Is this the right approach?

It doesn't work for me.

I need something that if an e500v2 platform (e.g. the P1022DS) is
selected, then I won't be able to select any e500mc platforms (e.g.
P4080DS).  And if I don't select any e500v2 platforms, then I will be
able to select an e500mc platform.  This patch doesn't seem to do
that.

It might be necessary to split the entire menu into two parts, one for
e500v2 parts and one for e500mc parts.

-- 
Timur Tabi
Linux kernel developer at Freescale

^ permalink raw reply

* Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
From: David Howells @ 2011-07-28 10:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tony.luck, Mike Frysinger, Shan Hai, linux-kernel, cmetcalf,
	dhowells, paulus, uclinux-dist-devel, tglx, walken, linuxppc-dev,
	akpm
In-Reply-To: <1311761831.24752.413.camel@twins>

Peter Zijlstra <peterz@infradead.org> wrote:

> Subject: mm: Fix fixup_user_fault() for MMU=n 
> 
> In commit 2efaca927 ("mm/futex: fix futex writes on archs with SW
> tracking of dirty & young") we forgot about MMU=n. This patch fixes
> that.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Acked-by: David Howells <dhowells@redhat.com>

^ permalink raw reply

* Re: HELP:PowerPc-Linux kernel
From: Vijay Nikam @ 2011-07-28  6:27 UTC (permalink / raw)
  To: MJ embd; +Cc: naresh.kamboju, linuxppc-dev, cort, linas, hollis
In-Reply-To: <CAPUj1OOqRp-4h=7=x277P0cPSUJT4eCaxfCo0AAeA810hrb8Rg@mail.gmail.com>

Yes

/Vijay Nikam

On Thu, Jul 28, 2011 at 11:00 AM, MJ embd <mj.embd@gmail.com> wrote:
> Have you ever worked on device trees before?
>
> On 7/28/11, Vijay Nikam <vijay.t.nikam@gmail.com> wrote:
>> Hello,
>>
>> Start with looking at the configuration of the board done which is
>> similar to yours
>> or based on the same CPU as yours. It is important to know role of
>> device tree so
>> read the documentation and understand the syntax and concept of device
>> tree. Once
>> the complete concept is understood then you should start the
>> configuration and achieve
>> successful creation of kernel image.
>>
>> Take a step forward and do some hands on. If any problem occurs then
>> post for specific help,
>> as porting itself is a big task and doesn't really have straightforward
>> steps.
>> Good Luck
>>
>> Kind Regards,
>> Vijay Nikam
>>
>> On Wed, Jul 27, 2011 at 8:33 PM, <naresh.kamboju@wipro.com> wrote:
>>>
>>> Hi,
>>>
>>>
>>>
>>> I have taken up the new assignment Board bring up activity with Linux
>>> kernel on PowerPC MPC8272.
>>>
>>> I have been searching Linux bring up on PowerPC processor in Google and
>>> IBM wiki and not found good stuff.
>>>
>>> It would be more helpful for me if you could share related documents.
>>>
>>>
>>>
>>> Best regards
>>>
>>> Naresh Kamboju
>>>
>>>
>>>
>>> Please do not print this email unless it is absolutely necessary.
>>>
>>> The information contained in this electronic message and any attachments
>>> to this message are intended for the exclusive use of the addressee(s) and
>>> may contain proprietary, confidential or privileged information. If you
>>> are not the intended recipient, you should not disseminate, distribute or
>>> copy this e-mail. Please notify the sender immediately and destroy all
>>> copies of this message and any attachments.
>>>
>>> WARNING: Computer viruses can be transmitted via email. The recipient
>>> should check this email and any attachments for the presence of viruses.
>>> The company accepts no liability for any damage caused by any virus
>>> transmitted by this email.
>>>
>>> www.wipro.com
>>>
>>> _______________________________________________
>>> Linuxppc-dev mailing list
>>> Linuxppc-dev@lists.ozlabs.org
>>> https://lists.ozlabs.org/listinfo/linuxppc-dev
>> _______________________________________________
>> Linuxppc-dev mailing list
>> Linuxppc-dev@lists.ozlabs.org
>> https://lists.ozlabs.org/listinfo/linuxppc-dev
>>
>
>
> --
> -mj
>

^ permalink raw reply

* Re: HELP:PowerPc-Linux kernel
From: MJ embd @ 2011-07-28  5:30 UTC (permalink / raw)
  To: Vijay Nikam; +Cc: naresh.kamboju, linuxppc-dev, cort, linas, hollis
In-Reply-To: <CAGn8Sby3Swk8zXxDXnXDBM+bDn=RV2cmvqvwXZM+05P=Z1FMAg@mail.gmail.com>

Have you ever worked on device trees before?

On 7/28/11, Vijay Nikam <vijay.t.nikam@gmail.com> wrote:
> Hello,
>
> Start with looking at the configuration of the board done which is
> similar to yours
> or based on the same CPU as yours. It is important to know role of
> device tree so
> read the documentation and understand the syntax and concept of device
> tree. Once
> the complete concept is understood then you should start the
> configuration and achieve
> successful creation of kernel image.
>
> Take a step forward and do some hands on. If any problem occurs then
> post for specific help,
> as porting itself is a big task and doesn't really have straightforward
> steps.
> Good Luck
>
> Kind Regards,
> Vijay Nikam
>
> On Wed, Jul 27, 2011 at 8:33 PM, <naresh.kamboju@wipro.com> wrote:
>>
>> Hi,
>>
>>
>>
>> I have taken up the new assignment Board bring up activity with Linux
>> kernel on PowerPC MPC8272.
>>
>> I have been searching Linux bring up on PowerPC processor in Google and
>> IBM wiki and not found good stuff.
>>
>> It would be more helpful for me if you could share related documents.
>>
>>
>>
>> Best regards
>>
>> Naresh Kamboju
>>
>>
>>
>> Please do not print this email unless it is absolutely necessary.
>>
>> The information contained in this electronic message and any attachments
>> to this message are intended for the exclusive use of the addressee(s) and
>> may contain proprietary, confidential or privileged information. If you
>> are not the intended recipient, you should not disseminate, distribute or
>> copy this e-mail. Please notify the sender immediately and destroy all
>> copies of this message and any attachments.
>>
>> WARNING: Computer viruses can be transmitted via email. The recipient
>> should check this email and any attachments for the presence of viruses.
>> The company accepts no liability for any damage caused by any virus
>> transmitted by this email.
>>
>> www.wipro.com
>>
>> _______________________________________________
>> Linuxppc-dev mailing list
>> Linuxppc-dev@lists.ozlabs.org
>> https://lists.ozlabs.org/listinfo/linuxppc-dev
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
>


-- 
-mj

^ permalink raw reply

* Re: HELP:PowerPc-Linux kernel
From: Vijay Nikam @ 2011-07-28  4:45 UTC (permalink / raw)
  To: naresh.kamboju; +Cc: linas, cort, linuxppc-dev, hollis
In-Reply-To: <35CC4C9595855B43903A67B297EFA8E3C544FD@HYD-MKD-MBX01.wipro.com>

Hello,

Start with looking at the configuration of the board done which is
similar to yours
or based on the same CPU as yours. It is important to know role of
device tree so
read the documentation and understand the syntax and concept of device
tree. Once
the complete concept is understood then you should start the
configuration and achieve
successful creation of kernel image.

Take a step forward and do some hands on. If any problem occurs then
post for specific help,
as porting itself is a big task and doesn't really have straightforward steps.
Good Luck

Kind Regards,
Vijay Nikam

On Wed, Jul 27, 2011 at 8:33 PM, <naresh.kamboju@wipro.com> wrote:
>
> Hi,
>
>
>
> I have taken up the new assignment Board bring up activity with Linux kernel on PowerPC MPC8272.
>
> I have been searching Linux bring up on PowerPC processor in Google and IBM wiki and not found good stuff.
>
> It would be more helpful for me if you could share related documents.
>
>
>
> Best regards
>
> Naresh Kamboju
>
>
>
> Please do not print this email unless it is absolutely necessary.
>
> The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments.
>
> WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email.
>
> www.wipro.com
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply

* Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
From: Mike Frysinger @ 2011-07-28  0:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: tony.luck, Shan Hai, Peter Zijlstra, linux-kernel, cmetcalf,
	David Howells, paulus, uclinux-dist-devel, tglx, walken,
	linuxppc-dev, akpm
In-Reply-To: <1311762043.25044.679.camel@pasglop>

On Wed, Jul 27, 2011 at 03:20, Benjamin Herrenschmidt wrote:
> Hoping the BUG() isn't trippable by userspace but then it's no mmu, it's
> not like we care what userspace can do right :-)

side note ... common misconception that "no mmu" == "no memory
protection".  a few of the nommu processors have memory protection,
just no virtual<->physical translation.

thanks for the patch !
-mike

^ permalink raw reply

* [PATCH] [6/99] seqlock: Don't smp_rmb in seqlock reader spin loop
From: Andi Kleen @ 2011-07-27 21:48 UTC (permalink / raw)
  To: miltonm, linuxppc-dev, torvalds, andi, npiggin, benh, anton,
	paulmck, eric.dumazet, ak, tglx, gregkh, linux-kernel, stable,
	tim.bird
In-Reply-To: <20110727247.325703029@firstfloor.org>

2.6.35-longterm review patch.  If anyone has any objections, please let me know.

------------------
From: Milton Miller <miltonm@bga.com>

commit 5db1256a5131d3b133946fa02ac9770a784e6eb2 upstream.

Move the smp_rmb after cpu_relax loop in read_seqlock and add
ACCESS_ONCE to make sure the test and return are consistent.

A multi-threaded core in the lab didn't like the update
from 2.6.35 to 2.6.36, to the point it would hang during
boot when multiple threads were active.  Bisection showed
af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867 (clockevents:
Remove the per cpu tick skew) as the culprit and it is
supported with stack traces showing xtime_lock waits including
tick_do_update_jiffies64 and/or update_vsyscall.

Experimentation showed the combination of cpu_relax and smp_rmb
was significantly slowing the progress of other threads sharing
the core, and this patch is effective in avoiding the hang.

A theory is the rmb is affecting the whole core while the
cpu_relax is causing a resource rebalance flush, together they
cause an interference cadence that is unbroken when the seqlock
reader has interrupts disabled.

At first I was confused why the refactor in
3c22cd5709e8143444a6d08682a87f4c57902df3 (kernel: optimise
seqlock) didn't affect this patch application, but after some
study that affected seqcount not seqlock. The new seqcount was
not factored back into the seqlock.  I defer that to the future.

While the removal of the timer interrupt offset created
contention for the xtime lock while a cpu does the
additional work to update the system clock, the seqlock
implementation with the tight rmb spin loop goes back much
further, and is just waiting for the right trigger.

Signed-off-by: Milton Miller <miltonm@bga.com>
Cc: <linuxppc-dev@lists.ozlabs.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/%3Cseqlock-rmb%40mdm.bga.com%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 include/linux/seqlock.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.35.y/include/linux/seqlock.h
===================================================================
--- linux-2.6.35.y.orig/include/linux/seqlock.h
+++ linux-2.6.35.y/include/linux/seqlock.h
@@ -88,12 +88,12 @@ static __always_inline unsigned read_seq
 	unsigned ret;
 
 repeat:
-	ret = sl->sequence;
-	smp_rmb();
+	ret = ACCESS_ONCE(sl->sequence);
 	if (unlikely(ret & 1)) {
 		cpu_relax();
 		goto repeat;
 	}
+	smp_rmb();
 
 	return ret;
 }
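
For readers without the kernel tree at hand, the resulting reader loop can
be approximated in portable C11. This is a sketch under assumptions, not
the kernel code: the type and function names are made up, the relaxed
atomic load stands in for ACCESS_ONCE(), the acquire fence stands in for
smp_rmb() (the two are not exactly equivalent), and the cpu_relax()
stand-in is an empty placeholder.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the sequence counter inside a seqlock. */
struct seq_sketch {
	atomic_uint sequence;
};

/* Placeholder for cpu_relax(); a real port would use an arch hint. */
static inline void cpu_relax_sketch(void)
{
}

/* Spin while a writer is active (odd count), then order later reads. */
static unsigned read_seqbegin_sketch(struct seq_sketch *sl)
{
	unsigned ret;

	for (;;) {
		ret = atomic_load_explicit(&sl->sequence, memory_order_relaxed);
		if (!(ret & 1u))
			break;
		cpu_relax_sketch();
	}
	/* Barrier only once, after leaving the spin loop. */
	atomic_thread_fence(memory_order_acquire);

	return ret;
}

/* Nonzero if a writer ran since read_seqbegin_sketch() returned start. */
static int read_seqretry_sketch(struct seq_sketch *sl, unsigned start)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&sl->sequence, memory_order_relaxed) != start;
}
```

The point of the patch is visible in the loop: the barrier is issued once
after the sequence is seen even, instead of on every spin iteration.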

^ permalink raw reply

* Re: HELP:PowerPc-Linux kernel
From: Scott Wood @ 2011-07-27 20:27 UTC (permalink / raw)
  To: naresh.kamboju; +Cc: linuxppc-dev
In-Reply-To: <35CC4C9595855B43903A67B297EFA8E3C544FD@HYD-MKD-MBX01.wipro.com>

On Wed, 27 Jul 2011 20:33:54 +0530
<naresh.kamboju@wipro.com> wrote:

> Hi,
> 
>  
> 
> I have taken up the new assignment  Board bring up activity with Linux
> kernel on PowerPC MPC8272.
> 
> I have been searching Linux bring up on PowerPC processor in Google and
> IBM wiki and not found good stuff.
> 
> It would be more helpful for me if you could share related documents.

Look at the support in current kernels for 82xx-based boards such as
mpc8272ads.  Read the documentation on device trees
(Documentation/devicetree, devicetree.org, ePAPR).

-Scott

^ permalink raw reply

* [PATCH] ppc: Remove duplicate definition of PV_POWER7
From: Peter Zijlstra @ 2011-07-27 17:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, Anton Blanchard

One definition of PV_POWER7 seems enough to me.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/powerpc/include/asm/reg.h |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index c5cae0d..fedf93b 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -986,7 +986,6 @@
 #define PV_970		0x0039
 #define PV_POWER5	0x003A
 #define PV_POWER5p	0x003B
-#define PV_POWER7	0x003F
 #define PV_970FX	0x003C
 #define PV_POWER6	0x003E
 #define PV_POWER7	0x003F

^ permalink raw reply related

* HELP:PowerPc-Linux kernel
From: naresh.kamboju @ 2011-07-27 15:03 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: cort, linas, hollis

[-- Attachment #1: Type: text/plain, Size: 1080 bytes --]

Hi,



I have taken up the new assignment  Board bring up activity with Linux
kernel on PowerPC MPC8272.

I have been searching Linux bring up on PowerPC processor in Google and
IBM wiki and not found good stuff.

It would be more helpful for me if you could share related documents.



Best regards

Naresh Kamboju




Please do not print this email unless it is absolutely necessary. 

The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments. 

WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email. 

www.wipro.com

[-- Attachment #2: Type: text/html, Size: 5835 bytes --]

^ permalink raw reply

* [PATCH] PSeries: Cancel RTAS event scan before firmware flash
From: Ravi K. Nittala @ 2011-07-27 12:09 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: antonb, subrata.modak, mikey, sbest, suzuki, ranittal,
	divya.vikas

The firmware flash update is conducted using an RTAS call, that is serialized
by lock_rtas() which uses spin_lock. rtasd keeps scanning for the RTAS events
generated on the machine. This is performed via a delayed workqueue, invoking
an RTAS call to scan the events.

The flash update takes a while to complete and during this time, any other
RTAS call has to wait. In this case, rtas_event_scan() waits for a long time
on the spin_lock resulting in a soft lockup.

Approaches to fix the issue :

Approach 1: Stop all the other CPUs before we start flashing the firmware.

Before the rtas firmware update starts, all other CPUs should be stopped,
which means no other CPU should be in lock_rtas(). We do not want other CPUs
to execute while the FW update is in progress, and the system will be rebooted
anyway after the update.

--- arch/powerpc/kernel/setup-common.c.orig    2011-07-01 22:41:12.952507971 -0400
+++ arch/powerpc/kernel/setup-common.c    2011-07-01 22:48:31.182507915 -0400
@@ -109,11 +109,12 @@ void machine_shutdown(void)
  void machine_restart(char *cmd)
  {
      machine_shutdown();
-    if (ppc_md.restart)
-        ppc_md.restart(cmd);
  #ifdef CONFIG_SMP
-    smp_send_stop();
+        smp_send_stop();
  #endif
+    if (ppc_md.restart)
+        ppc_md.restart(cmd);
+
      printk(KERN_EMERG "System Halted, OK to turn off power\n");
      local_irq_disable();
      while (1) ;

Problems with this approach:
Stopping the CPUs suddenly may cause other serious problems depending on what
was running on them. Hence, this approach cannot be considered.


Approach 2: Cancel the rtas_event_scan work before starting the firmware flash.

Just before the flash update is performed, the queued rtas_event_scan() work
item is cancelled from the work queue so that there is no other RTAS call
issued while the flash is in progress. After the flash completes, the system
reboots and the rtas_event_scan() is rescheduled.

Approach 2 looks to be a better solution than Approach 1. Kindly let us know
your thoughts. Patch attached.


Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
Signed-off-by: Ravi Nittala <ravi.nittala@in.ibm.com>


---
 arch/powerpc/include/asm/rtas.h  |    2 ++
 arch/powerpc/kernel/rtas_flash.c |    6 ++++++
 arch/powerpc/kernel/rtasd.c      |    6 ++++++
 3 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 58625d1..3f26f87 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -245,6 +245,8 @@ extern int early_init_dt_scan_rtas(unsigned long node,
 
 extern void pSeries_log_error(char *buf, unsigned int err_type, int fatal);
 
+extern bool rtas_cancel_event_scan(void);
+
 /* Error types logged.  */
 #define ERR_FLAG_ALREADY_LOGGED	0x0
 #define ERR_FLAG_BOOT		0x1 	/* log was pulled from NVRAM on boot */
diff --git a/arch/powerpc/kernel/rtas_flash.c b/arch/powerpc/kernel/rtas_flash.c
index e037c74..4174b4b 100644
--- a/arch/powerpc/kernel/rtas_flash.c
+++ b/arch/powerpc/kernel/rtas_flash.c
@@ -568,6 +568,12 @@ static void rtas_flash_firmware(int reboot_type)
 	}
 
 	/*
+	 * Just before starting the firmware flash, cancel the event scan work
+	 * to avoid any soft lockup issues.
+	 */
+	rtas_cancel_event_scan();
+
+	/*
 	 * NOTE: the "first" block must be under 4GB, so we create
 	 * an entry with no data blocks in the reserved buffer in
 	 * the kernel data segment.
diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
index 481ef06..e8f03fa 100644
--- a/arch/powerpc/kernel/rtasd.c
+++ b/arch/powerpc/kernel/rtasd.c
@@ -472,6 +472,12 @@ static void start_event_scan(void)
 				 &event_scan_work, event_scan_delay);
 }
 
+/* Cancel the rtas event scan work */
+bool rtas_cancel_event_scan(void)
+{
+	return cancel_delayed_work_sync(&event_scan_work);
+}
+
 static int __init rtas_init(void)
 {
 	struct proc_dir_entry *entry;

^ permalink raw reply related

* Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
From: Benjamin Herrenschmidt @ 2011-07-27 10:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tony.luck, Mike Frysinger, Shan Hai, linux-kernel, cmetcalf,
	David Howells, paulus, uclinux-dist-devel, tglx, walken,
	linuxppc-dev, akpm
In-Reply-To: <1311761831.24752.413.camel@twins>

On Wed, 2011-07-27 at 12:17 +0200, Peter Zijlstra wrote:
> On Wed, 2011-07-27 at 11:09 +0100, David Howells wrote:
> > Can you inline this for the NOMMU case please?
> 
> ---
> Subject: mm: Fix fixup_user_fault() for MMU=n 
> 
> In commit 2efaca927 ("mm/futex: fix futex writes on archs with SW
> tracking of dirty & young") we forgot about MMU=n. This patch fixes
> that.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Hoping the BUG() isn't trippable by userspace but then it's no mmu, it's
not like we care what userspace can do right :-)

Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

Thanks !

Cheers,
Ben.

> ---
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -962,6 +962,8 @@ int invalidate_inode_page(struct page *p
>  #ifdef CONFIG_MMU
>  extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long address, unsigned int flags);
> +extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> +			    unsigned long address, unsigned int fault_flags);
>  #else
>  static inline int handle_mm_fault(struct mm_struct *mm,
>  			struct vm_area_struct *vma, unsigned long address,
> @@ -971,6 +973,14 @@ static inline int handle_mm_fault(struct
>  	BUG();
>  	return VM_FAULT_SIGBUS;
>  }
> +static inline int fixup_user_fault(struct task_struct *tsk, 
> +		struct mm_struct *mm, unsigned long address,
> +		unsigned int fault_flags)
> +{
> +	/* should never happen if there's no MMU */
> +	BUG();
> +	return -EFAULT;
> +}
>  #endif
>  
>  extern int make_pages_present(unsigned long addr, unsigned long end);
> @@ -988,8 +998,6 @@ int get_user_pages(struct task_struct *t
>  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  			struct page **pages);
>  struct page *get_dump_page(unsigned long addr);
> -extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> -			    unsigned long address, unsigned int fault_flags);
>  
>  extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
>  extern void do_invalidatepage(struct page *page, unsigned long offset);

^ permalink raw reply

* Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
From: Peter Zijlstra @ 2011-07-27 10:17 UTC (permalink / raw)
  To: David Howells
  Cc: tony.luck, Mike Frysinger, Shan Hai, linux-kernel, cmetcalf,
	paulus, uclinux-dist-devel, tglx, walken, linuxppc-dev, akpm
In-Reply-To: <20368.1311761379@redhat.com>

On Wed, 2011-07-27 at 11:09 +0100, David Howells wrote:
> Can you inline this for the NOMMU case please?

---
Subject: mm: Fix fixup_user_fault() for MMU=n

In commit 2efaca927 ("mm/futex: fix futex writes on archs with SW
tracking of dirty & young") we forgot about MMU=n. This patch fixes
that.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -962,6 +962,8 @@ int invalidate_inode_page(struct page *p
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags);
+extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+			    unsigned long address, unsigned int fault_flags);
 #else
 static inline int handle_mm_fault(struct mm_struct *mm,
 			struct vm_area_struct *vma, unsigned long address,
@@ -971,6 +973,14 @@ static inline int handle_mm_fault(struct
 	BUG();
 	return VM_FAULT_SIGBUS;
 }
+static inline int fixup_user_fault(struct task_struct *tsk,
+		struct mm_struct *mm, unsigned long address,
+		unsigned int fault_flags)
+{
+	/* should never happen if there's no MMU */
+	BUG();
+	return -EFAULT;
+}
 #endif
 
 extern int make_pages_present(unsigned long addr, unsigned long end);
@@ -988,8 +998,6 @@ int get_user_pages(struct task_struct *t
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
 struct page *get_dump_page(unsigned long addr);
-extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
-			    unsigned long address, unsigned int fault_flags);
 
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
 extern void do_invalidatepage(struct page *page, unsigned long offset);

^ permalink raw reply

* Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
From: David Howells @ 2011-07-27 10:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tony.luck, Mike Frysinger, Shan Hai, linux-kernel, cmetcalf,
	dhowells, paulus, uclinux-dist-devel, tglx, walken, linuxppc-dev,
	akpm
In-Reply-To: <1311757190.24752.406.camel@twins>

Peter Zijlstra <peterz@infradead.org> wrote:

> > What should nommu do anyways ? it's not like there's much it can do
> > right ? It should never even hit the fault path to start with ...
> 
> Something like the below makes a nommu arm config build.. David, is this
> indeed the correct thing to do for nommu?
> 
> ---
> Index: linux-2.6/mm/nommu.c
> ===================================================================
> --- linux-2.6.orig/mm/nommu.c
> +++ linux-2.6/mm/nommu.c
> @@ -190,6 +190,12 @@ int get_user_pages(struct task_struct *t
>  }
>  EXPORT_SYMBOL(get_user_pages);
>  
> +int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
> +		     unsigned long address, unsigned int fault_flags)
> +{
> +	BUG(); /* nommu should never call this */
> +}
> +
>  /**
>   * follow_pfn - look up PFN at a user virtual address
>   * @vma: memory mapping

Or perhaps send SEGV?  Can 'address' be bad at this point?

Can you inline this for the NOMMU case please?

David
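
For readers unfamiliar with the pattern being requested, a minimal userspace
sketch of an inlined stub for the no-MMU case follows. It is not the kernel
code: the name fixup_user_fault_sketch, the reduced two-argument signature,
and the counting SKETCH_BUG() macro are assumptions made so that the example
compiles and can be checked on its own.

```c
#include <errno.h>

/* Hypothetical stand-in for the kernel's BUG(): it only counts hits
 * instead of halting, so the stub can be exercised in userspace. */
static int sketch_bug_hits;
#define SKETCH_BUG() (sketch_bug_hits++)

#ifdef CONFIG_MMU
/* With an MMU, the real implementation would live in another file.
 * (Not provided in this sketch; build without CONFIG_MMU defined.) */
int fixup_user_fault_sketch(unsigned long address, unsigned int fault_flags);
#else
/* Without an MMU there is no fault path to fix up, so the header
 * supplies an inline stub that flags the call and reports failure. */
static inline int fixup_user_fault_sketch(unsigned long address,
					  unsigned int fault_flags)
{
	(void)address;
	(void)fault_flags;
	SKETCH_BUG();	/* should never happen if there's no MMU */
	return -EFAULT;
}
#endif
```

Keeping the stub in the header means no-MMU builds need no extra out-of-line
function, and any unexpected caller is caught at the call site.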

^ permalink raw reply

* [PATCHv4 01/11] atomic: add *_dec_not_zero
From: Sven Eckelmann @ 2011-07-27  9:47 UTC (permalink / raw)
  To: linux-arch
  Cc: linux-m32r-ja, linux-mips, linux-ia64, linux-doc, H. Peter Anvin,
	Heiko Carstens, Randy Dunlap, Paul Mackerras, Helge Deller,
	sparclinux, Sven Eckelmann, linux-s390, Russell King,
	user-mode-linux-devel, Richard Weinberger, Hirokazu Takata, x86,
	James E.J. Bottomley, Ingo Molnar, Matt Turner, Fenghua Yu,
	Arnd Bergmann, Jeff Dike, Chris Metcalf, linux-m32r,
	Ivan Kokshaysky, Thomas Gleixner, linux-arm-kernel,
	Richard Henderson, Tony Luck, linux-parisc, linux-kernel,
	Ralf Baechle, Kyle McMartin, linux-alpha, Martin Schwidefsky,
	linux390, linuxppc-dev, David S. Miller

Introduce an *_dec_not_zero operation.  Make this a special case of
*_add_unless because batman-adv uses atomic_dec_not_zero in different
places like re-broadcast queue or aggregation queue management. There
are other non-final patches which may also want to use this macro.

Reported-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-m32r@ml.linux-m32r.org
Cc: linux-m32r-ja@ml.linux-m32r.org
Cc: linux-mips@linux-mips.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: user-mode-linux-devel@lists.sourceforge.net
---
David S. Miller recommended this change in
 https://lists.open-mesh.org/pipermail/b.a.t.m.a.n/2011-May/004560.html

Arnd Bergmann wanted to apply it in 201106172320.26476.arnd@arndb.de

... and then Arun Sharma created a big merge conflict with
https://lkml.org/lkml/2011/6/6/430

I don't think that it is a good idea to assume that everyone still agrees
with the patch after I've rewritten it.
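
As an illustration of the semantics only, here is a userspace sketch of
"decrement unless zero" expressed through an add-unless helper. It uses C11
<stdatomic.h> rather than the kernel's atomic_t, and the sketch_* names are
hypothetical; memory-ordering details of the kernel primitives are not
modelled.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Add @a to *v unless *v currently equals @u.
 * Returns true if the addition was performed. */
static bool sketch_add_unless(atomic_int *v, int a, int u)
{
	int old = atomic_load(v);

	while (old != u) {
		/* On failure, 'old' is refreshed with the current value
		 * and the comparison against @u is repeated. */
		if (atomic_compare_exchange_weak(v, &old, old + a))
			return true;
	}
	return false;
}

/* The two special cases, mirroring the inc/dec pairing in the patch. */
#define sketch_inc_not_zero(v)	sketch_add_unless((v), 1, 0)
#define sketch_dec_not_zero(v)	sketch_add_unless((v), -1, 0)
```

A caller can use the return value to tell whether it actually took a slot
from a counter, for example a queue-slot budget that must never go negative.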

 Documentation/atomic_ops.txt       |    1 +
 arch/alpha/include/asm/atomic.h    |    1 +
 arch/alpha/include/asm/local.h     |    1 +
 arch/arm/include/asm/atomic.h      |    1 +
 arch/ia64/include/asm/atomic.h     |    1 +
 arch/m32r/include/asm/local.h      |    1 +
 arch/mips/include/asm/atomic.h     |    1 +
 arch/mips/include/asm/local.h      |    1 +
 arch/parisc/include/asm/atomic.h   |    1 +
 arch/powerpc/include/asm/atomic.h  |    1 +
 arch/powerpc/include/asm/local.h   |    1 +
 arch/s390/include/asm/atomic.h     |    1 +
 arch/sparc/include/asm/atomic_64.h |    1 +
 arch/tile/include/asm/atomic_32.h  |    1 +
 arch/tile/include/asm/atomic_64.h  |    1 +
 arch/um/sys-i386/atomic64_cx8_32.S |   28 ++++++++++++++++++++++++++++
 arch/x86/include/asm/atomic64_32.h |   12 ++++++++++++
 arch/x86/include/asm/atomic64_64.h |    1 +
 arch/x86/include/asm/local.h       |    1 +
 arch/x86/lib/atomic64_32.c         |    4 ++++
 arch/x86/lib/atomic64_386_32.S     |   21 +++++++++++++++++++++
 arch/x86/lib/atomic64_cx8_32.S     |   28 ++++++++++++++++++++++++++++
 include/asm-generic/atomic-long.h  |    2 ++
 include/asm-generic/atomic64.h     |    1 +
 include/asm-generic/local.h        |    1 +
 include/asm-generic/local64.h      |    2 ++
 include/linux/atomic.h             |    9 +++++++++
 27 files changed, 125 insertions(+), 0 deletions(-)

diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt
index 3bd585b..1eec221 100644
--- a/Documentation/atomic_ops.txt
+++ b/Documentation/atomic_ops.txt
@@ -190,6 +190,7 @@ atomic_add_unless requires explicit memory barriers around the operation
 unless it fails (returns 0).
 
 atomic_inc_not_zero, equivalent to atomic_add_unless(v, 1, 0)
+atomic_dec_not_zero, equivalent to atomic_add_unless(v, -1, 0)
 
 
 If a caller requires memory barrier semantics around an atomic_t
diff --git a/arch/alpha/include/asm/atomic.h b/arch/alpha/include/asm/atomic.h
index 640f909..09d1571 100644
--- a/arch/alpha/include/asm/atomic.h
+++ b/arch/alpha/include/asm/atomic.h
@@ -225,6 +225,7 @@ static __inline__ int atomic64_add_unless(atomic64_t *v, long a, long u)
 }
 
 #define atomic64_inc_not_zero(v) atomic64_add_unless((v), 1, 0)
+#define atomic64_dec_not_zero(v) atomic64_add_unless((v), -1, 0)
 
 #define atomic_add_negative(a, v) (atomic_add_return((a), (v)) < 0)
 #define atomic64_add_negative(a, v) (atomic64_add_return((a), (v)) < 0)
diff --git a/arch/alpha/include/asm/local.h b/arch/alpha/include/asm/local.h
index 9c94b84..51eb678 100644
--- a/arch/alpha/include/asm/local.h
+++ b/arch/alpha/include/asm/local.h
@@ -79,6 +79,7 @@ static __inline__ long local_sub_return(long i, local_t * l)
 	c != (u);						\
 })
 #define local_inc_not_zero(l) local_add_unless((l), 1, 0)
+#define local_dec_not_zero(l) local_add_unless((l), -1, 0)
 
 #define local_add_negative(a, l) (local_add_return((a), (l)) < 0)
 
diff --git a/arch/arm/include/asm/atomic.h b/arch/arm/include/asm/atomic.h
index 86976d0..80ed975 100644
--- a/arch/arm/include/asm/atomic.h
+++ b/arch/arm/include/asm/atomic.h
@@ -458,6 +458,7 @@ static inline int atomic64_add_unless(atomic64_t *v, u64 a, u64 u)
 #define atomic64_dec_return(v)		atomic64_sub_return(1LL, (v))
 #define atomic64_dec_and_test(v)	(atomic64_dec_return((v)) == 0)
 #define atomic64_inc_not_zero(v)	atomic64_add_unless((v), 1LL, 0LL)
+#define atomic64_dec_not_zero(v)	atomic64_add_unless((v), -1LL, 0LL)
 
 #endif /* !CONFIG_GENERIC_ATOMIC64 */
 #endif
diff --git a/arch/ia64/include/asm/atomic.h b/arch/ia64/include/asm/atomic.h
index 3fad89e..af6e9b2 100644
--- a/arch/ia64/include/asm/atomic.h
+++ b/arch/ia64/include/asm/atomic.h
@@ -122,6 +122,7 @@ static __inline__ long atomic64_add_unless(atomic64_t *v, long a, long u)
 }
 
 #define atomic64_inc_not_zero(v) atomic64_add_unless((v), 1, 0)
+#define atomic64_dec_not_zero(v) atomic64_add_unless((v), -1, 0)
 
 #define atomic_add_return(i,v)						\
 ({									\
diff --git a/arch/m32r/include/asm/local.h b/arch/m32r/include/asm/local.h
index 734bca8..d536082 100644
--- a/arch/m32r/include/asm/local.h
+++ b/arch/m32r/include/asm/local.h
@@ -272,6 +272,7 @@ static inline int local_add_unless(local_t *l, long a, long u)
 }
 
 #define local_inc_not_zero(l) local_add_unless((l), 1, 0)
+#define local_dec_not_zero(l) local_add_unless((l), -1, 0)
 
 static inline void local_clear_mask(unsigned long  mask, local_t *addr)
 {
diff --git a/arch/mips/include/asm/atomic.h b/arch/mips/include/asm/atomic.h
index 1d93f81..babb043 100644
--- a/arch/mips/include/asm/atomic.h
+++ b/arch/mips/include/asm/atomic.h
@@ -697,6 +697,7 @@ static __inline__ int atomic64_add_unless(atomic64_t *v, long a, long u)
 }
 
 #define atomic64_inc_not_zero(v) atomic64_add_unless((v), 1, 0)
+#define atomic64_dec_not_zero(v) atomic64_add_unless((v), -1, 0)
 
 #define atomic64_dec_return(v) atomic64_sub_return(1, (v))
 #define atomic64_inc_return(v) atomic64_add_return(1, (v))
diff --git a/arch/mips/include/asm/local.h b/arch/mips/include/asm/local.h
index 94fde8d..0242256 100644
--- a/arch/mips/include/asm/local.h
+++ b/arch/mips/include/asm/local.h
@@ -137,6 +137,7 @@ static __inline__ long local_sub_return(long i, local_t * l)
 	c != (u);						\
 })
 #define local_inc_not_zero(l) local_add_unless((l), 1, 0)
+#define local_dec_not_zero(l) local_add_unless((l), -1, 0)
 
 #define local_dec_return(l) local_sub_return(1, (l))
 #define local_inc_return(l) local_add_return(1, (l))
diff --git a/arch/parisc/include/asm/atomic.h b/arch/parisc/include/asm/atomic.h
index b1dc71f..8a50234 100644
--- a/arch/parisc/include/asm/atomic.h
+++ b/arch/parisc/include/asm/atomic.h
@@ -334,6 +334,7 @@ static __inline__ int atomic64_add_unless(atomic64_t *v, long a, long u)
 }
 
 #define atomic64_inc_not_zero(v) atomic64_add_unless((v), 1, 0)
+#define atomic64_dec_not_zero(v) atomic64_add_unless((v), -1, 0)
 
 #endif /* !CONFIG_64BIT */
 
diff --git a/arch/powerpc/include/asm/atomic.h b/arch/powerpc/include/asm/atomic.h
index e2a4c26..c0131a6 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -468,6 +468,7 @@ static __inline__ int atomic64_add_unless(atomic64_t *v, long a, long u)
 }
 
 #define atomic64_inc_not_zero(v) atomic64_add_unless((v), 1, 0)
+#define atomic64_dec_not_zero(v) atomic64_add_unless((v), -1, 0)
 
 #endif /* __powerpc64__ */
 
diff --git a/arch/powerpc/include/asm/local.h b/arch/powerpc/include/asm/local.h
index b8da913..d182e34 100644
--- a/arch/powerpc/include/asm/local.h
+++ b/arch/powerpc/include/asm/local.h
@@ -134,6 +134,7 @@ static __inline__ int local_add_unless(local_t *l, long a, long u)
 }
 
 #define local_inc_not_zero(l) local_add_unless((l), 1, 0)
+#define local_dec_not_zero(l) local_add_unless((l), -1, 0)
 
 #define local_sub_and_test(a, l)	(local_sub_return((a), (l)) == 0)
 #define local_dec_and_test(l)		(local_dec_return((l)) == 0)
diff --git a/arch/s390/include/asm/atomic.h b/arch/s390/include/asm/atomic.h
index 8517d2a..92e7d5d 100644
--- a/arch/s390/include/asm/atomic.h
+++ b/arch/s390/include/asm/atomic.h
@@ -325,6 +325,7 @@ static inline long long atomic64_dec_if_positive(atomic64_t *v)
 #define atomic64_dec_return(_v)		atomic64_sub_return(1, _v)
 #define atomic64_dec_and_test(_v)	(atomic64_sub_return(1, _v) == 0)
 #define atomic64_inc_not_zero(v)	atomic64_add_unless((v), 1, 0)
+#define atomic64_dec_not_zero(v)	atomic64_add_unless((v), -1, 0)
 
 #define smp_mb__before_atomic_dec()	smp_mb()
 #define smp_mb__after_atomic_dec()	smp_mb()
diff --git a/arch/sparc/include/asm/atomic_64.h b/arch/sparc/include/asm/atomic_64.h
index 9f421df..94cf160 100644
--- a/arch/sparc/include/asm/atomic_64.h
+++ b/arch/sparc/include/asm/atomic_64.h
@@ -106,6 +106,7 @@ static inline long atomic64_add_unless(atomic64_t *v, long a, long u)
 }
 
 #define atomic64_inc_not_zero(v) atomic64_add_unless((v), 1, 0)
+#define atomic64_dec_not_zero(v) atomic64_add_unless((v), -1, 0)
 
 /* Atomic operations are already serializing */
 #define smp_mb__before_atomic_dec()	barrier()
diff --git a/arch/tile/include/asm/atomic_32.h b/arch/tile/include/asm/atomic_32.h
index c03349e..9cfafb3 100644
--- a/arch/tile/include/asm/atomic_32.h
+++ b/arch/tile/include/asm/atomic_32.h
@@ -233,6 +233,7 @@ static inline void atomic64_set(atomic64_t *v, u64 n)
 #define atomic64_dec_return(v)		atomic64_sub_return(1LL, (v))
 #define atomic64_dec_and_test(v)	(atomic64_dec_return((v)) == 0)
 #define atomic64_inc_not_zero(v)	atomic64_add_unless((v), 1LL, 0LL)
+#define atomic64_dec_not_zero(v)	atomic64_add_unless((v), -1LL, 0LL)
 
 /*
  * We need to barrier before modifying the word, since the _atomic_xxx()
diff --git a/arch/tile/include/asm/atomic_64.h b/arch/tile/include/asm/atomic_64.h
index 27fe667..9c22f50 100644
--- a/arch/tile/include/asm/atomic_64.h
+++ b/arch/tile/include/asm/atomic_64.h
@@ -141,6 +141,7 @@ static inline long atomic64_add_unless(atomic64_t *v, long a, long u)
 #define atomic64_add_negative(i, v)	(atomic64_add_return((i), (v)) < 0)
 
 #define atomic64_inc_not_zero(v)	atomic64_add_unless((v), 1, 0)
+#define atomic64_dec_not_zero(v)	atomic64_add_unless((v), -1, 0)
 
 /* Atomic dec and inc don't implement barrier, so provide them if needed. */
 #define smp_mb__before_atomic_dec()	smp_mb()
diff --git a/arch/um/sys-i386/atomic64_cx8_32.S b/arch/um/sys-i386/atomic64_cx8_32.S
index 1e901d3..a58a1d4 100644
--- a/arch/um/sys-i386/atomic64_cx8_32.S
+++ b/arch/um/sys-i386/atomic64_cx8_32.S
@@ -223,3 +223,31 @@ ENTRY(atomic64_inc_not_zero_cx8)
 	jmp 3b
 	CFI_ENDPROC
 ENDPROC(atomic64_inc_not_zero_cx8)
+
+ENTRY(atomic64_dec_not_zero_cx8)
+	CFI_STARTPROC
+	SAVE ebx
+
+	read64 %esi
+1:
+	testl %eax, %eax
+	je 4f
+2:
+	movl %eax, %ebx
+	movl %edx, %ecx
+	subl $1, %ebx
+	sbbl $0, %ecx
+	LOCK_PREFIX
+	cmpxchg8b (%esi)
+	jne 1b
+
+	movl $1, %eax
+3:
+	RESTORE ebx
+	ret
+4:
+	testl %edx, %edx
+	jne 2b
+	jmp 3b
+	CFI_ENDPROC
+ENDPROC(atomic64_dec_not_zero_cx8)
diff --git a/arch/x86/include/asm/atomic64_32.h b/arch/x86/include/asm/atomic64_32.h
index 24098aa..3cd4431 100644
--- a/arch/x86/include/asm/atomic64_32.h
+++ b/arch/x86/include/asm/atomic64_32.h
@@ -287,6 +287,18 @@ static inline int atomic64_inc_not_zero(atomic64_t *v)
 	return r;
 }
 
+
+static inline int atomic64_dec_not_zero(atomic64_t *v)
+{
+	int r;
+	asm volatile(ATOMIC64_ALTERNATIVE(dec_not_zero)
+		     : "=a" (r)
+		     : "S" (v)
+		     : "ecx", "edx", "memory"
+		     );
+	return r;
+}
+
 static inline long long atomic64_dec_if_positive(atomic64_t *v)
 {
 	long long r;
diff --git a/arch/x86/include/asm/atomic64_64.h b/arch/x86/include/asm/atomic64_64.h
index 017594d..93c9d8b 100644
--- a/arch/x86/include/asm/atomic64_64.h
+++ b/arch/x86/include/asm/atomic64_64.h
@@ -220,6 +220,7 @@ static inline int atomic64_add_unless(atomic64_t *v, long a, long u)
 }
 
 #define atomic64_inc_not_zero(v) atomic64_add_unless((v), 1, 0)
+#define atomic64_dec_not_zero(v) atomic64_add_unless((v), -1, 0)
 
 /*
  * atomic64_dec_if_positive - decrement by 1 if old value positive
diff --git a/arch/x86/include/asm/local.h b/arch/x86/include/asm/local.h
index 9cdae5d..2c8c92d 100644
--- a/arch/x86/include/asm/local.h
+++ b/arch/x86/include/asm/local.h
@@ -185,6 +185,7 @@ static inline long local_sub_return(long i, local_t *l)
 	c != (u);						\
 })
 #define local_inc_not_zero(l) local_add_unless((l), 1, 0)
+#define local_dec_not_zero(l) local_add_unless((l), -1, 0)
 
 /* On x86_32, these are no better than the atomic variants.
  * On x86-64 these are better than the atomic variants on SMP kernels
diff --git a/arch/x86/lib/atomic64_32.c b/arch/x86/lib/atomic64_32.c
index 042f682..7da05c3 100644
--- a/arch/x86/lib/atomic64_32.c
+++ b/arch/x86/lib/atomic64_32.c
@@ -24,6 +24,8 @@ long long atomic64_dec_if_positive_cx8(atomic64_t *v);
 EXPORT_SYMBOL(atomic64_dec_if_positive_cx8);
 int atomic64_inc_not_zero_cx8(atomic64_t *v);
 EXPORT_SYMBOL(atomic64_inc_not_zero_cx8);
+int atomic64_dec_not_zero_cx8(atomic64_t *v);
+EXPORT_SYMBOL(atomic64_dec_not_zero_cx8);
 int atomic64_add_unless_cx8(atomic64_t *v, long long a, long long u);
 EXPORT_SYMBOL(atomic64_add_unless_cx8);
 
@@ -54,6 +56,8 @@ long long atomic64_dec_if_positive_386(atomic64_t *v);
 EXPORT_SYMBOL(atomic64_dec_if_positive_386);
 int atomic64_inc_not_zero_386(atomic64_t *v);
 EXPORT_SYMBOL(atomic64_inc_not_zero_386);
+int atomic64_dec_not_zero_386(atomic64_t *v);
+EXPORT_SYMBOL(atomic64_dec_not_zero_386);
 int atomic64_add_unless_386(atomic64_t *v, long long a, long long u);
 EXPORT_SYMBOL(atomic64_add_unless_386);
 #endif
diff --git a/arch/x86/lib/atomic64_386_32.S b/arch/x86/lib/atomic64_386_32.S
index e8e7e0d..c78337b 100644
--- a/arch/x86/lib/atomic64_386_32.S
+++ b/arch/x86/lib/atomic64_386_32.S
@@ -181,6 +181,27 @@ ENDP
 #undef v
 
 #define v %esi
+BEGIN(dec_not_zero)
+	movl  (v), %eax
+	movl 4(v), %edx
+	testl %eax, %eax
+	je 3f
+1:
+	subl $1, %eax
+	sbbl $0, %edx
+	movl %eax,  (v)
+	movl %edx, 4(v)
+	movl $1, %eax
+2:
+	RET
+3:
+	testl %edx, %edx
+	jne 1b
+	jmp 2b
+ENDP
+#undef v
+
+#define v %esi
 BEGIN(dec_if_positive)
 	movl  (v), %eax
 	movl 4(v), %edx
diff --git a/arch/x86/lib/atomic64_cx8_32.S b/arch/x86/lib/atomic64_cx8_32.S
index 391a083..989638c 100644
--- a/arch/x86/lib/atomic64_cx8_32.S
+++ b/arch/x86/lib/atomic64_cx8_32.S
@@ -220,3 +220,31 @@ ENTRY(atomic64_inc_not_zero_cx8)
 	jmp 3b
 	CFI_ENDPROC
 ENDPROC(atomic64_inc_not_zero_cx8)
+
+ENTRY(atomic64_dec_not_zero_cx8)
+	CFI_STARTPROC
+	SAVE ebx
+
+	read64 %esi
+1:
+	testl %eax, %eax
+	je 4f
+2:
+	movl %eax, %ebx
+	movl %edx, %ecx
+	subl $1, %ebx
+	sbbl $0, %ecx
+	LOCK_PREFIX
+	cmpxchg8b (%esi)
+	jne 1b
+
+	movl $1, %eax
+3:
+	RESTORE ebx
+	ret
+4:
+	testl %edx, %edx
+	jne 2b
+	jmp 3b
+	CFI_ENDPROC
+ENDPROC(atomic64_dec_not_zero_cx8)
diff --git a/include/asm-generic/atomic-long.h b/include/asm-generic/atomic-long.h
index b7babf0..0fe75ab 100644
--- a/include/asm-generic/atomic-long.h
+++ b/include/asm-generic/atomic-long.h
@@ -130,6 +130,7 @@ static inline long atomic_long_add_unless(atomic_long_t *l, long a, long u)
 }
 
 #define atomic_long_inc_not_zero(l) atomic64_inc_not_zero((atomic64_t *)(l))
+#define atomic_long_dec_not_zero(l) atomic64_dec_not_zero((atomic64_t *)(l))
 
 #define atomic_long_cmpxchg(l, old, new) \
 	(atomic64_cmpxchg((atomic64_t *)(l), (old), (new)))
@@ -247,6 +248,7 @@ static inline long atomic_long_add_unless(atomic_long_t *l, long a, long u)
 }
 
 #define atomic_long_inc_not_zero(l) atomic_inc_not_zero((atomic_t *)(l))
+#define atomic_long_dec_not_zero(l) atomic_dec_not_zero((atomic_t *)(l))
 
 #define atomic_long_cmpxchg(l, old, new) \
 	(atomic_cmpxchg((atomic_t *)(l), (old), (new)))
diff --git a/include/asm-generic/atomic64.h b/include/asm-generic/atomic64.h
index b18ce4f..90ff9b1 100644
--- a/include/asm-generic/atomic64.h
+++ b/include/asm-generic/atomic64.h
@@ -38,5 +38,6 @@ extern int	 atomic64_add_unless(atomic64_t *v, long long a, long long u);
 #define atomic64_dec_return(v)		atomic64_sub_return(1LL, (v))
 #define atomic64_dec_and_test(v)	(atomic64_dec_return((v)) == 0)
 #define atomic64_inc_not_zero(v) 	atomic64_add_unless((v), 1LL, 0LL)
+#define atomic64_dec_not_zero(v)	atomic64_add_unless((v), -1LL, 0LL)
 
 #endif  /*  _ASM_GENERIC_ATOMIC64_H  */
diff --git a/include/asm-generic/local.h b/include/asm-generic/local.h
index 9ceb03b..fabf4f3 100644
--- a/include/asm-generic/local.h
+++ b/include/asm-generic/local.h
@@ -44,6 +44,7 @@ typedef struct
 #define local_xchg(l, n) atomic_long_xchg((&(l)->a), (n))
 #define local_add_unless(l, _a, u) atomic_long_add_unless((&(l)->a), (_a), (u))
 #define local_inc_not_zero(l) atomic_long_inc_not_zero(&(l)->a)
+#define local_dec_not_zero(l) atomic_long_dec_not_zero(&(l)->a)
 
 /* Non-atomic variants, ie. preemption disabled and won't be touched
  * in interrupt, etc.  Some archs can optimize this case well. */
diff --git a/include/asm-generic/local64.h b/include/asm-generic/local64.h
index 5980002..76acbe2 100644
--- a/include/asm-generic/local64.h
+++ b/include/asm-generic/local64.h
@@ -45,6 +45,7 @@ typedef struct {
 #define local64_xchg(l, n)	local_xchg((&(l)->a), (n))
 #define local64_add_unless(l, _a, u) local_add_unless((&(l)->a), (_a), (u))
 #define local64_inc_not_zero(l)	local_inc_not_zero(&(l)->a)
+#define local64_dec_not_zero(l)	local_dec_not_zero(&(l)->a)
 
 /* Non-atomic variants, ie. preemption disabled and won't be touched
  * in interrupt, etc.  Some archs can optimize this case well. */
@@ -83,6 +84,7 @@ typedef struct {
 #define local64_xchg(l, n)	atomic64_xchg((&(l)->a), (n))
 #define local64_add_unless(l, _a, u) atomic64_add_unless((&(l)->a), (_a), (u))
 #define local64_inc_not_zero(l)	atomic64_inc_not_zero(&(l)->a)
+#define local64_dec_not_zero(l)	atomic64_dec_not_zero(&(l)->a)
 
 /* Non-atomic variants, ie. preemption disabled and won't be touched
  * in interrupt, etc.  Some archs can optimize this case well. */
diff --git a/include/linux/atomic.h b/include/linux/atomic.h
index 42b77b5..ad2b750 100644
--- a/include/linux/atomic.h
+++ b/include/linux/atomic.h
@@ -27,6 +27,15 @@ static inline int atomic_add_unless(atomic_t *v, int a, int u)
 #define atomic_inc_not_zero(v)		atomic_add_unless((v), 1, 0)
 
 /**
+ * atomic_dec_not_zero - decrement unless the number is zero
+ * @v: pointer of type atomic_t
+ *
+ * Atomically decrements @v by 1, so long as @v is non-zero.
+ * Returns non-zero if @v was non-zero, and zero otherwise.
+ */
+#define atomic_dec_not_zero(v)		atomic_add_unless((v), -1, 0)
+
+/**
  * atomic_inc_not_zero_hint - increment if not null
  * @v: pointer of type atomic_t
  * @hint: probable value of the atomic before the increment
-- 
1.7.5.4
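
The 32-bit x86 assembly above tests both halves of the 64-bit value for zero
and then subtracts with borrow (subl/sbbl). A plain C model of that
arithmetic is sketched below. It is deliberately non-atomic and the function
name is hypothetical; how the real routines obtain atomicity is outside the
scope of this sketch.

```c
#include <stdint.h>

/* v[0] holds the low 32 bits, v[1] the high 32 bits.
 * Returns 1 if the value was non-zero and has been decremented,
 * 0 if it was zero and left untouched. */
static int sketch_dec_not_zero_halves(uint32_t v[2])
{
	uint32_t lo = v[0];
	uint32_t hi = v[1];
	uint32_t borrow;

	/* Zero only if both halves are zero. */
	if (lo == 0 && hi == 0)
		return 0;

	/* Subtracting 1 from the low half borrows only when it was 0. */
	borrow = (lo == 0);
	lo -= 1;
	hi -= borrow;

	v[0] = lo;
	v[1] = hi;
	return 1;
}
```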

^ permalink raw reply related

* Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
From: Peter Zijlstra @ 2011-07-27  8:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: tony.luck, Mike Frysinger, Shan Hai, linux-kernel, cmetcalf,
	dhowells, paulus, uclinux-dist-devel, tglx, walken, linuxppc-dev,
	akpm
In-Reply-To: <1311753513.25044.663.camel@pasglop>

On Wed, 2011-07-27 at 17:58 +1000, Benjamin Herrenschmidt wrote:

> What should nommu do anyways ? it's not like there's much it can do
> right ? It should never even hit the fault path to start with ...

Something like the below makes a nommu arm config build.. David, is this
indeed the correct thing to do for nommu?

---
Index: linux-2.6/mm/nommu.c
===================================================================
--- linux-2.6.orig/mm/nommu.c
+++ linux-2.6/mm/nommu.c
@@ -190,6 +190,12 @@ int get_user_pages(struct task_struct *t
 }
 EXPORT_SYMBOL(get_user_pages);
 
+int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
+		     unsigned long address, unsigned int fault_flags)
+{
+	BUG(); /* nommu should never call this */
+}
+
 /**
  * follow_pfn - look up PFN at a user virtual address
  * @vma: memory mapping

^ permalink raw reply

* Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
From: Benjamin Herrenschmidt @ 2011-07-27  7:58 UTC (permalink / raw)
  To: Mike Frysinger
  Cc: tony.luck, Peter Zijlstra, Shan Hai, Peter Zijlstra, linux-kernel,
	cmetcalf, dhowells, paulus, uclinux-dist-devel, tglx, walken,
	linuxppc-dev, akpm
In-Reply-To: <CAMjpGUdxpaYBFfKjBEPFOJohnKoXfagLUHvhrst+NembeabxcA@mail.gmail.com>

On Tue, 2011-07-26 at 23:50 -0700, Mike Frysinger wrote:
> On Mon, Jul 18, 2011 at 21:29, Benjamin Herrenschmidt wrote:
> > The futex code currently attempts to write to user memory within
> > a pagefault disabled section, and if that fails, tries to fix it
> > up using get_user_pages().
> >
> > This doesn't work on archs where the dirty and young bits are
> > maintained by software, since they will gate access permission
> > in the TLB, and will not be updated by gup().
> >
> > In addition, there's an expectation on some archs that a
> > spurious write fault triggers a local TLB flush, and that is
> > missing from the picture as well.
> >
> > I decided that adding those "features" to gup() would be too much
> > for this already too complex function, and instead added a new
> > simpler fixup_user_fault() which is essentially a wrapper around
> > handle_mm_fault() which the futex code can call.
> 
> unfortunately, this breaks all nommu ports.  you added
> fixup_user_fault() to mm/memory.c only which is not used by nommu

Argh. Andrew, do you want to send a fix ? I won't be able to do that
tonight, I have to go.

What should nommu do anyways ? it's not like there's much it can do
right ? It should never even hit the fault path to start with ...

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC/PATCH] mm/futex: Fix futex writes on archs with SW tracking of dirty & young
From: Mike Frysinger @ 2011-07-27  6:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: tony.luck, Peter Zijlstra, Shan Hai, Peter Zijlstra, linux-kernel,
	cmetcalf, dhowells, paulus, uclinux-dist-devel, tglx, walken,
	linuxppc-dev, akpm
In-Reply-To: <1311049762.25044.392.camel@pasglop>

On Mon, Jul 18, 2011 at 21:29, Benjamin Herrenschmidt wrote:
> The futex code currently attempts to write to user memory within
> a pagefault disabled section, and if that fails, tries to fix it
> up using get_user_pages().
>
> This doesn't work on archs where the dirty and young bits are
> maintained by software, since they will gate access permission
> in the TLB, and will not be updated by gup().
>
> In addition, there's an expectation on some archs that a
> spurious write fault triggers a local TLB flush, and that is
> missing from the picture as well.
>
> I decided that adding those "features" to gup() would be too much
> for this already too complex function, and instead added a new
> simpler fixup_user_fault() which is essentially a wrapper around
> handle_mm_fault() which the futex code can call.

unfortunately, this breaks all nommu ports.  you added
fixup_user_fault() to mm/memory.c only which is not used by nommu
logic.
-mike
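
The write, fix up, retry flow described in the quoted patch description can
be modelled in userspace roughly as follows. Everything here is a
hypothetical stand-in (struct fake_page, try_user_write,
fake_fixup_user_fault, futex_style_write); the point is only to show why a
fault-handler-style fixup that sets the software dirty bit lets the retried
write succeed.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of one user page whose write permission is
 * gated by a software-maintained dirty bit. */
struct fake_page {
	bool present;
	bool dirty;	/* writes fault until the fault path sets this */
	int value;
};

/* Stand-in for a user write attempted with page faults disabled. */
static int try_user_write(struct fake_page *pg, int val)
{
	if (!pg->present || !pg->dirty)
		return -EFAULT;
	pg->value = val;
	return 0;
}

/* Stand-in for the fixup step: behaves like a write fault handler,
 * so the dirty bit is updated as a side effect. */
static int fake_fixup_user_fault(struct fake_page *pg)
{
	if (!pg->present)
		return -EFAULT;
	pg->dirty = true;
	return 0;
}

/* Try the write; on -EFAULT run the fixup and retry. */
static int futex_style_write(struct fake_page *pg, int val)
{
	for (;;) {
		int ret = try_user_write(pg, val);

		if (ret != -EFAULT)
			return ret;
		ret = fake_fixup_user_fault(pg);
		if (ret)
			return ret;	/* genuinely bad address */
	}
}
```

If the fixup only pinned the page without touching the dirty bit, the loop in
this model would never make progress, which is the failure mode the
description attributes to the get_user_pages() based fixup.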

^ permalink raw reply

