LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [RFC PATCH 1/2] dma-mapping: Clean up dma_set_*mask() hooks
From: Robin Murphy @ 2018-07-06 14:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: m.szyprowski, iommu, linux-arm-kernel, linux-ia64, linuxppc-dev
In-Reply-To: <20180705193738.GB28905@lst.de>

On 05/07/18 20:37, Christoph Hellwig wrote:
> On Wed, Jul 04, 2018 at 06:50:11PM +0100, Robin Murphy wrote:
>> Arch-specific implementions for dma_set_{coherent_,}mask() currently
>> rely on an inconsistent mix of arch-defined Kconfig symbols and macro
>> overrides. Now that we have a nice centralised home for DMA API gubbins,
>> let's consolidate these loose ends under consistent config options.
>>
>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>> ---
>>
>> Here's hoping the buildbot comes by to point out what I've inevitably
>> missed, although I did check a cursory cross-compile of ppc64_defconfig
>> to iron out the obvious howlers.
> 
> The patch looks sensible to me, although I was hoping to get rid of these
> hooks in this or the next merge window as they are a horrible bad idea.
> 
>> The motivation here is that I'm looking at adding set_mask overrides
>> for arm64, and having discovered a bit of a mess it seemed prudent to
>> clean up before ingraining it any more.
> 
> What are you trying to do?  I really don't want to see more users of
> the hooks as they are are a horribly bad idea.

I really need to fix the ongoing problem we have where, due to funky 
integrations, devices suffer some downstream addressing limit (described 
by DT dma-ranges or ACPI IORT/_DMA) which we carefully set up in 
dma_configure(), but then just gets lost when the driver probes and 
innocently calls dma_set_mask() with something wider. I think it's 
effectively the generalised case of the VIA 32-bit quirk, if I 
understand that one correctly.

The approach that seemed to me to be safest is largely based on the one 
proposed in a thread from ages ago[1]; namely to make dma_configure() 
better at distinguishing firmware-specified masks from bus defaults, 
capture the firmware mask in dev->archdata during arch_setup_dma_ops(), 
then use the custom set_mask routines to ensure any subsequent updates 
never exceed that. It doesn't seem possible to make this work robustly 
without storing *some* additional per-device data, and for that archdata 
is a lesser evil than struct device itself. Plus even though it's not 
actually an arch-specific issue it feels like there's such a risk of 
breaking other platforms that I'm reticent to even try handling it 
entirely in generic code.

Robin.

[1] 
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1306507.html

^ permalink raw reply

* Re: [RFC PATCH 2/2] dma-mapping: Clean up dma_get_required_mask() hooks
From: Robin Murphy @ 2018-07-06 14:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: m.szyprowski, iommu, linux-arm-kernel, linux-ia64, linuxppc-dev
In-Reply-To: <20180705193846.GC28905@lst.de>

On 05/07/18 20:38, Christoph Hellwig wrote:
> On Wed, Jul 04, 2018 at 06:50:12PM +0100, Robin Murphy wrote:
>> As for the other mask-related hooks, standardise the arch override into
>> a Kconfig option, and also pull the generic implementation into the DMA
>> mapping code rather than having it hide away in the platform bus code.
> 
> Heh, I have a somewhat similar patch in my queue.  I didn't want it out
> because dma_get_required_mask is rather ill defined at the moment and
> I wanted to clean that up first.  But I guess I could apply this first
> and clean up later.
> 
> I just fear you might be wanting to add an arm64 user, so I'd really like
> to understand why and how.

Nah, this guy is just pure cleanup since it also fell out of my 'git 
grep ARCH_HAS_DMA'

Robin.

^ permalink raw reply

* NXP p1010se device trees only correct for P1010E/P1014E, not P1010/P1014 SoCs.
From: Tim Small @ 2018-07-06 14:50 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Scott Wood, Kumar Gala

Hello,

I contributed a patch to OpenWRT a couple of days ago to fix the device 
tree for a device which uses an NXP p1014 without the SEC4 module.  I 
was wondering about getting a similar fix applied upstream...

The device uses the P1014 (without SEC4 functionality), and includes:

fsl/p1010si-pre.dtsi and fsl/p1010si-post.dtsi

... in its device tree.  The latter pulls in leads to a node for 
soc@ffe00000/crypto@30000 - which then causes the CAAM modules to be 
used for crypto operations, when then fail at runtime.

AN4938 states that there are versions of both the p1010 and p1014 
without the SEC4 module.

The P1010 errata at:

https://media.digikey.com/pdf/PCNs/Freescale/P1010CE_RevL.pdf

(table 2 on page 2), says that the P1010 and P1014 don't have the SEC4 
module, only the P1010E and P1014E models do.

So, I think there should probably be device trees for p101xE (with SEC) 
and p101x (without SEC).

Any thoughts?  Not really sure how best to do that...  My openwrt patch 
just does a:

/delete-node/ crypto@30000;

after the p1010si-post.dtsi include.

Cheers,

Tim.
-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309

^ permalink raw reply

* Re: NXP p1010se device trees only correct for P1010E/P1014E, not P1010/P1014 SoCs.
From: Scott Wood @ 2018-07-06 18:41 UTC (permalink / raw)
  To: Tim Small, linuxppc-dev
In-Reply-To: <1ed6ce0c-f7df-615b-ffdb-8dab25b506f6@seoss.co.uk>

On Fri, 2018-07-06 at 15:50 +0100, Tim Small wrote:
> Hello,
> 
> I contributed a patch to OpenWRT a couple of days ago to fix the device 
> tree for a device which uses an NXP p1014 without the SEC4 module.  I 
> was wondering about getting a similar fix applied upstream...
> 
> The device uses the P1014 (without SEC4 functionality), and includes:
> 
> fsl/p1010si-pre.dtsi and fsl/p1010si-post.dtsi
> 
> ... in its device tree.  The latter pulls in leads to a node for 
> soc@ffe00000/crypto@30000 - which then causes the CAAM modules to be 
> used for crypto operations, when then fail at runtime.
> 
> AN4938 states that there are versions of both the p1010 and p1014 
> without the SEC4 module.
> 
> The P1010 errata at:
> 
> https://media.digikey.com/pdf/PCNs/Freescale/P1010CE_RevL.pdf
> 
> (table 2 on page 2), says that the P1010 and P1014 don't have the SEC4 
> module, only the P1010E and P1014E models do.
> 
> So, I think there should probably be device trees for p101xE (with SEC) 
> and p101x (without SEC).
> 
> Any thoughts?  Not really sure how best to do that...  My openwrt patch 
> just does a:
> 
> /delete-node/ crypto@30000;
> 
> after the p1010si-post.dtsi include.

U-Boot should already be removing the node on non-E chips -- see
ft_cpu_setup() in arch/powerpc/cpu/mpc85xx/fdt.c

-Scott

^ permalink raw reply

* Re: powerpc: 32BIT vs. 64BIT (PPC32 vs. PPC64)
From: Benjamin Herrenschmidt @ 2018-07-07  1:45 UTC (permalink / raw)
  To: Randy Dunlap, linux-kbuild
  Cc: Paul Mackerras, Michael Ellerman, Stephen Rothwell, linuxppc-dev
In-Reply-To: <3f906b03-98ff-c081-4e19-b490f0b35c51@infradead.org>

On Thu, 2018-07-05 at 14:30 -0700, Randy Dunlap wrote:
> Hi,
> 
> Is there a good way (or a shortcut) to do something like:
> 
> $ make ARCH=powerpc O=PPC32 [other_options] allmodconfig
>   to get a PPC32/32BIT allmodconfig
> 
> and also be able to do:
> 
> $make ARCH=powerpc O=PPC64 [other_options] allmodconfig
>   to get a PPC64/64BIT allmodconfig?

Hrm... O= is for the separate build dir, so there much be something
else.

You mean having ARCH= aliases like ppc/ppc32 and ppc64 ?

That would be a matter of overriding some .config defaults I suppose, I
don't know how this is done on other archs.

I see the aliasing trick in the Makefile but that's about it.

> Note that arch/x86, arch/sh, and arch/sparc have ways to do
> some flavor(s) of this (from Documentation/kbuild/kbuild.txt;
> sh and sparc based on a recent "fix" patch from me):

I fail to see what you are actually talking about here ... sorry. Do
you have concrete examples on x86 or sparc ? From what I can tell the
"i386" or "sparc32/sparc64" aliases just change SRCARCH in Makefile and
32 vs 64-bit is just a Kconfig option...

> x86: i386 for 32 bit, x86_64 for 64 bit
> sh: sh for 32 bit, sh64 for 64 bit
> sparc: sparc32 for 32 bit, sparc64 for 64 bit

^ permalink raw reply

* Re: [PATCHv3 2/4] drivers/base: utilize device tree info to shutdown devices
From: Pingfan Liu @ 2018-07-07  4:02 UTC (permalink / raw)
  To: rjw
  Cc: linux-kernel, Greg Kroah-Hartman, Rafael J . Wysocki,
	Grygorii Strashko, Christoph Hellwig, Bjorn Helgaas, Dave Young,
	linux-pci, linuxppc-dev
In-Reply-To: <2067910.hkxRV6zLYm@aspire.rjw.lan>

On Fri, Jul 6, 2018 at 5:54 PM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> On Friday, July 6, 2018 5:02:15 AM CEST Pingfan Liu wrote:
> > On Thu, Jul 5, 2018 at 6:13 PM Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >
> > > On Tuesday, July 3, 2018 8:50:40 AM CEST Pingfan Liu wrote:
> > > > commit 52cdbdd49853 ("driver core: correct device's shutdown order")
> > > > places an assumption of supplier<-consumer order on the process of probe.
> > > > But it turns out to break down the parent <- child order in some scene.
> > > > E.g in pci, a bridge is enabled by pci core, and behind it, the devices
> > > > have been probed. Then comes the bridge's module, which enables extra
> > > > feature(such as hotplug) on this bridge. This will break the
> > > > parent<-children order and cause failure when "kexec -e" in some scenario.
> > > >
> > > > The detailed description of the scenario:
> > > > An IBM Power9 machine on which, two drivers portdrv_pci and shpchp(a mod)
> > > > match the PCI_CLASS_BRIDGE_PCI, but neither of them success to probe due
> > > > to some issue. For this case, the bridge is moved after its children in
> > > > devices_kset. Then, when "kexec -e", a ata-disk behind the bridge can not
> > > > write back buffer in flight due to the former shutdown of the bridge which
> > > > clears the BusMaster bit.
> > > >
> > > > It is a little hard to impose both "parent<-child" and "supplier<-consumer"
> > > > order on devices_kset. Take the following scene:
> > > > step0: before a consumer's probing, (note child_a is supplier of consumer_a)
> > > >   [ consumer-X, child_a, ...., child_z] [... consumer_a, ..., consumer_z, ...] supplier-X
> > > >                                          ^^^^^^^^^^ affected range ^^^^^^^^^^
> > > > step1: when probing, moving consumer-X after supplier-X
> > > >   [ child_a, ...., child_z] [.... consumer_a, ..., consumer_z, ...] supplier-X, consumer-X
> > > > step2: the children of consumer-X should be re-ordered to maintain the seq
> > > >   [... consumer_a, ..., consumer_z, ....] supplier-X  [consumer-X, child_a, ...., child_z]
> > > > step3: the consumer_a should be re-ordered to maintain the seq
> > > >   [... consumer_z, ...] supplier-X [ consumer-X, child_a, consumer_a ..., child_z]
> > > >
> > > > It requires two nested recursion to drain out all out-of-order item in
> > > > "affected range". To avoid such complicated code, this patch suggests
> > > > to utilize the info in device tree, instead of using the order of
> > > > devices_kset during shutdown. It iterates the device tree, and firstly
> > > > shutdown a device's children and consumers. After this patch, the buggy
> > > > commit is hollow and left to clean.
> > > >
> > > > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > Cc: Grygorii Strashko <grygorii.strashko@ti.com>
> > > > Cc: Christoph Hellwig <hch@infradead.org>
> > > > Cc: Bjorn Helgaas <helgaas@kernel.org>
> > > > Cc: Dave Young <dyoung@redhat.com>
> > > > Cc: linux-pci@vger.kernel.org
> > > > Cc: linuxppc-dev@lists.ozlabs.org
> > > > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > > > ---
> > > >  drivers/base/core.c    | 48 +++++++++++++++++++++++++++++++++++++++++++-----
> > > >  include/linux/device.h |  1 +
> > > >  2 files changed, 44 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > > > index a48868f..684b994 100644
> > > > --- a/drivers/base/core.c
> > > > +++ b/drivers/base/core.c
> > > > @@ -1446,6 +1446,7 @@ void device_initialize(struct device *dev)
> > > >       INIT_LIST_HEAD(&dev->links.consumers);
> > > >       INIT_LIST_HEAD(&dev->links.suppliers);
> > > >       dev->links.status = DL_DEV_NO_DRIVER;
> > > > +     dev->shutdown = false;
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(device_initialize);
> > > >
> > > > @@ -2811,7 +2812,6 @@ static void __device_shutdown(struct device *dev)
> > > >        * lock is to be held
> > > >        */
> > > >       parent = get_device(dev->parent);
> > > > -     get_device(dev);
> > >
> > > Why is the get_/put_device() not needed any more?
> > >
> > They are moved upper layer into device_for_each_child_shutdown().
> > Since there is lock breakage in __device_shutdown(), resorting to
> > ref++ to protect the ancestor.  And I think the
> > get_device(dev->parent) can be deleted either.
>
> Wouldn't that break USB?
>
Sorry, I can not figure out. Is USB not modeled up-to-down? This
recursion can handle the up-to-down ref issue automatically, due to
the nature of device tree. Any hints? Thanks.

> > > >       /*
> > > >        * Make sure the device is off the kset list, in the
> > > >        * event that dev->*->shutdown() doesn't remove it.
> > > > @@ -2842,23 +2842,60 @@ static void __device_shutdown(struct device *dev)
> > > >                       dev_info(dev, "shutdown\n");
> > > >               dev->driver->shutdown(dev);
> > > >       }
> > > > -
> > > > +     dev->shutdown = true;
> > > >       device_unlock(dev);
> > > >       if (parent)
> > > >               device_unlock(parent);
> > > >
> > > > -     put_device(dev);
> > > >       put_device(parent);
> > > >       spin_lock(&devices_kset->list_lock);
> > > >  }
> > > >
> > > > +/* shutdown dev's children and consumer firstly, then itself */
> > > > +static int device_for_each_child_shutdown(struct device *dev)
> > >
> > > Confusing name.
> > >
> > > What about device_shutdown_subordinate()?
> > >
> > Fine. My understanding of words is not exact.
> >
> > > > +{
> > > > +     struct klist_iter i;
> > > > +     struct device *child;
> > > > +     struct device_link *link;
> > > > +
> > > > +     /* already shutdown, then skip this sub tree */
> > > > +     if (dev->shutdown)
> > > > +             return 0;
> > > > +
> > > > +     if (!dev->p)
> > > > +             goto check_consumers;
> > > > +
> > > > +     /* there is breakage of lock in __device_shutdown(), and the redundant
> > > > +      * ref++ on srcu protected consumer is harmless since shutdown is not
> > > > +      * hot path.
> > > > +      */
> > > > +     get_device(dev);
> > > > +
> > > > +     klist_iter_init(&dev->p->klist_children, &i);
> > > > +     while ((child = next_device(&i)))
> > > > +             device_for_each_child_shutdown(child);
> > >
> > > Why don't you use device_for_each_child() here?
> > >
> > OK, I will try use it.
>
> Well, hold on.
>
> > > > +     klist_iter_exit(&i);
> > > > +
> > > > +check_consumers:
> > > > +     list_for_each_entry_rcu(link, &dev->links.consumers, s_node) {
> > > > +             if (!link->consumer->shutdown)
> > > > +                     device_for_each_child_shutdown(link->consumer);
> > > > +     }
> > > > +
> > > > +     __device_shutdown(dev);
> > > > +     put_device(dev);
> > >
> > > Possible reference counter imbalance AFAICS.
> > >
> > Yes, get_device() should be ahead of "if (!dev->p)". Is anything  else I miss?
>
> Yes, that's it.
>
> > > > +     return 0;
> > > > +}
> > >
> > > Well, instead of doing this dance, we might as well walk dpm_list here as it
> > > is in the right order.
> > >
> > Sorry, do you mean that using the same way to manage the dpm_list?
>
> No, I mean to use dpm_list instead of devices_kset for shutdown.
>
> They should be in the same order anyway if all is correct.
>
Yes, the dpm_list and devices_kset contains the same info. But can we
make the shutdown as the first step, it is more simple and easy to
verify on different ARCH. Then hunting for the solution of pm.

> > > Of course, that would require dpm_list to be available for CONFIG_PM unset,
> > > but it may be a better approach long term.
> > >
> > > > +
> > > >  /**
> > > >   * device_shutdown - call ->shutdown() on each device to shutdown.
> > > >   */
> > > >  void device_shutdown(void)
> > > >  {
> > > >       struct device *dev;
> > > > +     int idx;
> > > >
> > > > +     idx = device_links_read_lock();
> > > >       spin_lock(&devices_kset->list_lock);
> > > >       /*
> > > >        * Walk the devices list backward, shutting down each in turn.
> > > > @@ -2866,11 +2903,12 @@ void device_shutdown(void)
> > > >        * devices offline, even as the system is shutting down.
> > > >        */
> > > >       while (!list_empty(&devices_kset->list)) {
> > > > -             dev = list_entry(devices_kset->list.prev, struct device,
> > > > +             dev = list_entry(devices_kset->list.next, struct device,
> > > >                               kobj.entry);
> > > > -             __device_shutdown(dev);
> > > > +             device_for_each_child_shutdown(dev);
> > > >       }
> > > >       spin_unlock(&devices_kset->list_lock);
> > > > +     device_links_read_unlock(idx);
> > > >  }
> > > >
> > > >  /*
> > > > diff --git a/include/linux/device.h b/include/linux/device.h
> > > > index 055a69d..8a0f784 100644
> > > > --- a/include/linux/device.h
> > > > +++ b/include/linux/device.h
> > > > @@ -1003,6 +1003,7 @@ struct device {
> > > >       bool                    offline:1;
> > > >       bool                    of_node_reused:1;
> > > >       bool                    dma_32bit_limit:1;
> > > > +     bool                    shutdown:1; /* one direction: false->true */
> > > >  };
> > > >
> > > >  static inline struct device *kobj_to_dev(struct kobject *kobj)
> > > >
> > >
> > > If the device_kset_move_last() in really_probe() is the only problem,
> > > I'd rather try to fix that one in the first place.
> > >
> > > Why is it needed?
> > >
> > I had tried, but it turns out not easy to archive. The code is
> > https://patchwork.kernel.org/patch/10485195/. And I make a detailed
> > description of the algorithm in this patch's commit log. To be more
> > detailed, we face the potential out of order issue in really_probe()
> > like : 0th. [ consumer-X, child_a, ...., child_z] [... consumer_a,
> > ..., consumer_z, ...] supplier-X //(note child_a is supplier of
> > consumer_a).  To address all the potential out of order item in the
> > affected section [... consumer_a, ..., consumer_z, ...],  it will
> > incur two nested recursions.  1st, moving  consumer-X and its
> > descendants after supplier-X,  2nd, moving consumer_a after child_a,
> > 3rd. the 2nd step may pose the same situation of 0th.  Besides the two
> > interleaved recursion,  the breakage of spin lock requires more effort
> > to protect the item from disappearing in linked-list  (which I did not
> > implement in the https://patchwork.kernel.org/patch/10485195/). Hence
> > I turn to this cheap method.
>
> So I think that we simply need to drop the devices_kset_move_last() call
> from really_probe() as it is plain incorrect and the use case for it is
> questionable at best.
>
See the reply on different mail, I think there is other issue with the
current solution besides really_probe->devices_kset_move_last

Thanks,
Pingfan

^ permalink raw reply

* Re: [PATCHv3 0/4] drivers/base: bugfix for supplier<-consumer ordering in device_kset
From: Pingfan Liu @ 2018-07-07  4:24 UTC (permalink / raw)
  To: rafael
  Cc: lukas, linux-kernel, Greg Kroah-Hartman, Grygorii Strashko,
	Christoph Hellwig, Bjorn Helgaas, Dave Young, linux-pci,
	linuxppc-dev, linux-pm, kishon
In-Reply-To: <CAFgQCTugu5XO06_xXb4ihkdfGJxtFX0813g2J-Jk2JeYtPP1vA@mail.gmail.com>

On Fri, Jul 6, 2018 at 9:55 PM Pingfan Liu <kernelfans@gmail.com> wrote:
>
> On Fri, Jul 6, 2018 at 4:47 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Fri, Jul 6, 2018 at 10:36 AM, Lukas Wunner <lukas@wunner.de> wrote:
> > > [cc += Kishon Vijay Abraham]
> > >
> > > On Thu, Jul 05, 2018 at 11:18:28AM +0200, Rafael J. Wysocki wrote:
> > >> OK, so calling devices_kset_move_last() from really_probe() clearly is
> > >> a mistake.
> > >>
> > >> I'm not really sure what the intention of it was as the changelog of
> > >> commit 52cdbdd49853d doesn't really explain that (why would it be
> > >> insufficient without that change?)
> > >
> > > It seems 52cdbdd49853d fixed an issue with boards which have an MMC
> > > whose reset pin needs to be driven high on shutdown, lest the MMC
> > > won't be found on the next boot.
> > >
> > > The boards' devicetrees use a kludge wherein the reset pin is modelled
> > > as a regulator.  The regulator is enabled when the MMC probes and
> > > disabled on driver unbind and shutdown.  As a result, the pin is driven
> > > low on shutdown and the MMC is not found on the next boot.
> > >
> > > To fix this, another kludge was invented wherein the GPIO expander
> > > driving the reset pin unconditionally drives all its pins high on
> > > shutdown, see pcf857x_shutdown() in drivers/gpio/gpio-pcf857x.c
> > > (commit adc284755055, "gpio: pcf857x: restore the initial line state
> > > of all pcf lines").
> > >
> > > For this kludge to work, the GPIO expander's ->shutdown hook needs to
> > > be executed after the MMC expander's ->shutdown hook.
> > >
> > > Commit 52cdbdd49853d achieved that by reordering devices_kset according
> > > to the probe order.  Apparently the MMC probes after the GPIO expander,
> > > possibly because it returns -EPROBE_DEFER if the vmmc regulator isn't
> > > available yet, see mmc_regulator_get_supply().
> > >
> > > Note, I'm just piecing the information together from git history,
> > > I'm not responsible for these kludges.  (I'm innocent!)
> >
> > Sure enough. :-)
> >
> > In any case, calling devices_kset_move_last() in really_probe() is
> > plain broken and if its only purpose was to address a single, arguably
> > kludgy, use case, let's just get rid of it in the first place IMO.
> >
> Yes, if it is only used for a single use case.
>
Think it again, I saw other potential issue with the current code.
device_link_add->device_reorder_to_tail() can break the
"supplier<-consumer" order. During moving children after parent's
supplier, it ignores the order of child's consumer.  Beside this,
essentially both devices_kset_move_after/_before() and
device_pm_move_after/_before() expose  the shutdown order to the
indirect caller,  and we can not expect that the caller can not handle
it correctly. It should be a job of drivers core.  It is hard to
extract high dimension info and pack them into one dimension
linked-list. And in theory, it is warranted that the shutdown seq is
correct by using device tree info. More important, it is cheap with
the data structure in hand. So I think it is time to resolve the issue
once for all.

Thanks and regards,
Pingfan

^ permalink raw reply

* Re: powerpc: 32BIT vs. 64BIT (PPC32 vs. PPC64)
From: Randy Dunlap @ 2018-07-07  4:58 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linux-kbuild
  Cc: Paul Mackerras, Michael Ellerman, Stephen Rothwell, linuxppc-dev
In-Reply-To: <a3c9b9ee9ec116390c47dca65fcf14f138eccd3f.camel@kernel.crashing.org>

On 07/06/2018 06:45 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-07-05 at 14:30 -0700, Randy Dunlap wrote:
>> Hi,
>>
>> Is there a good way (or a shortcut) to do something like:
>>
>> $ make ARCH=powerpc O=PPC32 [other_options] allmodconfig
>>   to get a PPC32/32BIT allmodconfig
>>
>> and also be able to do:
>>
>> $make ARCH=powerpc O=PPC64 [other_options] allmodconfig
>>   to get a PPC64/64BIT allmodconfig?
> 
> Hrm... O= is for the separate build dir, so there much be something
> else.
> 
> You mean having ARCH= aliases like ppc/ppc32 and ppc64 ?

Yes.

> That would be a matter of overriding some .config defaults I suppose, I
> don't know how this is done on other archs.
> 
> I see the aliasing trick in the Makefile but that's about it.
> 
>> Note that arch/x86, arch/sh, and arch/sparc have ways to do
>> some flavor(s) of this (from Documentation/kbuild/kbuild.txt;
>> sh and sparc based on a recent "fix" patch from me):
> 
> I fail to see what you are actually talking about here ... sorry. Do
> you have concrete examples on x86 or sparc ? From what I can tell the
> "i386" or "sparc32/sparc64" aliases just change SRCARCH in Makefile and
> 32 vs 64-bit is just a Kconfig option...

Yes, your summary is mostly correct.

I'm just looking for a way to do cross-compile builds that are close to
ppc32 allmodconfig and ppc64 allmodconfig.


>> x86: i386 for 32 bit, x86_64 for 64 bit
>> sh: sh for 32 bit, sh64 for 64 bit
>> sparc: sparc32 for 32 bit, sparc64 for 64 bit

I saw the powerpc merge-config targets after I sent this original email.
I can do my own homebrew that I prefer over those, though.


thanks,
-- 
~Randy

^ permalink raw reply

* [PATCH] KVM: PPC: Book3S HV: add of_node_put() in success path
From: Nicholas Mc Guire @ 2018-07-07  6:53 UTC (permalink / raw)
  To: Author: Paul Mackerras
  Cc: Benjamin Herrenschmidt, Michael Ellerman, kvm-ppc, linuxppc-dev,
	linux-kernel, Nicholas Mc Guire

 The call to of_find_compatible_node() is returning a pointer with
incremented refcount so it must be explicitly decremented after the
last use. As here it is only being used for checking of node presence
but the result is not actually used in the success path it can be
dropped immediately.

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Fixes: commit f725758b899f ("KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9")
---
Problem found by experimental coccinelle script

Patch was compiletested with: ppc64_defconfig (implies CONFIG_KVM=y,
CONFIG_KVM_BOOK3S_64_HV=m)
with many sparse warnings though not related to the proposed change

Patch is against 4.18-rc3 (localversion-next is next-20180705)

 arch/powerpc/kvm/book3s_hv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index ee4a885..8680fb9 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4561,6 +4561,8 @@ static int kvmppc_book3s_init_hv(void)
 			pr_err("KVM-HV: Cannot determine method for accessing XICS\n");
 			return -ENODEV;
 		}
+		/* presence of intc confirmed - node can be dropped again */
+		of_node_put(np);
 	}
 #endif
 
-- 
2.1.4

^ permalink raw reply related

* [PATCH] KVM: PPC: Book3S HV: fix constant size warning
From: Nicholas Mc Guire @ 2018-07-07  9:07 UTC (permalink / raw)
  To: Author: Paul Mackerras
  Cc: Benjamin Herrenschmidt, Michael Ellerman, kvm-ppc, linuxppc-dev,
	linux-kernel, Nicholas Mc Guire

 The constants are 64bit but not explicitly declared UL resulting
in sparse warnings. Fixed by declaring the constants UL.

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
---

sparse fallout from compile checking book3s_hv.c:
arch/powerpc/kvm/book3s_hv.c:141:9: warning: constant 0x164520C62609AECA is so big it is long
arch/powerpc/kvm/book3s_hv.c:142:9: warning: constant 0x164520C62609AECA is so big it is long
arch/powerpc/kvm/book3s_hv.c:143:9: warning: constant 0x7FFF2908450D8DA9 is so big it is long
arch/powerpc/kvm/book3s_hv.c:144:9: warning: constant 0x164520C62609AECA is so big it is long
arch/powerpc/kvm/book3s_hv.c:145:9: warning: constant 0x199A421245058DA9 is so big it is long
arch/powerpc/kvm/book3s_hv.c:146:9: warning: constant 0x164520C62609AECA is so big it is long
arch/powerpc/kvm/book3s_hv.c:147:9: warning: constant 0x164520C62609AECA is so big it is long
arch/powerpc/kvm/book3s_hv.c:148:9: warning: constant 0x164520C62609AECA is so big it is long
arch/powerpc/kvm/book3s_hv.c:149:9: warning: constant 0x164520C62609AECA is so big it is long
arch/powerpc/kvm/book3s_hv.c:1667:60: warning: constant 0xf0000000000003ff is so big it is unsigned long
arch/powerpc/kvm/book3s_hv.c:2896:42: warning: constant 0x164520C62609AECA is so big it is long

most of the warnings are local constants in book3s_hv.c, PSSCR_GUEST_VIS
from reg.h is not. Although PSSCR_GUEST_VIS is not #ifdef CONFIG_PPC_BOOK3S
it seems only to be used in that particular platform (check with cscope).

The constants fit in the target variables so the size declaration 
discrepancy does not actually cause a problem here.

Patch was compiletested with: ppc64_defconfig (implies
CONFIG_KVM=y, CONFIG_KVM_BOOK3S_64_HV=m)

Patch is against 4.18-rc3 (localversion-next is next-20180705)

 arch/powerpc/include/asm/reg.h |  2 +-
 arch/powerpc/kvm/book3s_hv.c   | 16 ++++++++--------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 5625684..858aa79 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -161,7 +161,7 @@
 #define PSSCR_ESL		0x00200000 /* Enable State Loss */
 #define PSSCR_SD		0x00400000 /* Status Disable */
 #define PSSCR_PLS	0xf000000000000000 /* Power-saving Level Status */
-#define PSSCR_GUEST_VIS	0xf0000000000003ff /* Guest-visible PSSCR fields */
+#define PSSCR_GUEST_VIS	0xf0000000000003ffUL /* Guest-visible PSSCR fields */
 #define PSSCR_FAKE_SUSPEND	0x00000400 /* Fake-suspend bit (P9 DD2.2) */
 #define PSSCR_FAKE_SUSPEND_LG	10	   /* Fake-suspend bit position */
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index ee4a885..c517859 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -128,14 +128,14 @@ static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu);
  * and SPURR count and should be set according to the number of
  * online threads in the vcore being run.
  */
-#define RWMR_RPA_P8_1THREAD	0x164520C62609AECA
-#define RWMR_RPA_P8_2THREAD	0x7FFF2908450D8DA9
-#define RWMR_RPA_P8_3THREAD	0x164520C62609AECA
-#define RWMR_RPA_P8_4THREAD	0x199A421245058DA9
-#define RWMR_RPA_P8_5THREAD	0x164520C62609AECA
-#define RWMR_RPA_P8_6THREAD	0x164520C62609AECA
-#define RWMR_RPA_P8_7THREAD	0x164520C62609AECA
-#define RWMR_RPA_P8_8THREAD	0x164520C62609AECA
+#define RWMR_RPA_P8_1THREAD	0x164520C62609AECAUL
+#define RWMR_RPA_P8_2THREAD	0x7FFF2908450D8DA9UL
+#define RWMR_RPA_P8_3THREAD	0x164520C62609AECAUL
+#define RWMR_RPA_P8_4THREAD	0x199A421245058DA9UL
+#define RWMR_RPA_P8_5THREAD	0x164520C62609AECAUL
+#define RWMR_RPA_P8_6THREAD	0x164520C62609AECAUL
+#define RWMR_RPA_P8_7THREAD	0x164520C62609AECAUL
+#define RWMR_RPA_P8_8THREAD	0x164520C62609AECAUL
 
 static unsigned long p8_rwmr_values[MAX_SMT_THREADS + 1] = {
 	RWMR_RPA_P8_1THREAD,
-- 
2.1.4

^ permalink raw reply related

* [PATCH kernel v5 0/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: Alexey Kardashevskiy @ 2018-07-07 10:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Alex Williamson,
	Michael Ellerman, Paul Mackerras, Balbir Singh, Nicholas Piggin

This is to improve page boundaries checking and should probably
be cc:stable. I came accross this while debugging nvlink2 passthrough
but the lack of checking might be exploited by the existing userspace.

The get_user_pages() comment says it should be "phased out" but the only
alternative seems to be get_user_pages_longterm(), should that be used
instead (this is longterm reference elevation, however it is not DAX,
whatever this implies)? get_user_pages_remote() seems unnecessarily
complicated because of @locked.

Changes:
v5:
* 2/2: changed compound pages handling

v4:
* 2/2: implemented less strict but still safe max pageshift as David suggested

v3:
* enforced huge pages not to cross preregistered chunk boundaries

v2:
* 2/2: explicitly check for compound pages before calling compound_order()

This is based on sha1
021c917 Linus Torvalds "Linux 4.18-rc3".

Please comment. Thanks.

Alexey Kardashevskiy (2):
  vfio/spapr: Use IOMMU pageshift rather than pagesize
  KVM: PPC: Check if IOMMU page is contained in the pinned physical page

 arch/powerpc/include/asm/mmu_context.h |  4 ++--
 arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++++--
 arch/powerpc/mm/mmu_context_iommu.c    | 33 +++++++++++++++++++++++++++------
 drivers/vfio/vfio_iommu_spapr_tce.c    | 10 +++++-----
 5 files changed, 39 insertions(+), 16 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH kernel v5 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: Alexey Kardashevskiy @ 2018-07-07 10:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Alex Williamson,
	Michael Ellerman, Paul Mackerras, Balbir Singh, Nicholas Piggin
In-Reply-To: <20180707104410.46589-1-aik@ozlabs.ru>

A VM which has:
 - a DMA capable device passed through to it (eg. network card);
 - running a malicious kernel that ignores H_PUT_TCE failure;
 - capability of using IOMMU pages bigger that physical pages
can create an IOMMU mapping that exposes (for example) 16MB of
the host physical memory to the device when only 64K was allocated to the VM.

The remaining 16MB - 64K will be some other content of host memory, possibly
including pages of the VM, but also pages of host kernel memory, host
programs or other VMs.

The attacking VM does not control the location of the page it can map,
and is only allowed to map as many pages as it has pages of RAM.

We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
an IOMMU page is contained in the physical page so the PCI hardware won't
get access to unassigned host memory; however this check is missing in
the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
did not hit this yet as the very first time when the mapping happens
we do not have tbl::it_userspace allocated yet and fall back to
the userspace which in turn calls VFIO IOMMU driver, this fails and
the guest does not retry,

This stores the smallest preregistered page size in the preregistered
region descriptor and changes the mm_iommu_xxx API to check this against
the IOMMU page size. This calculates maximum page size as a minimum of
the natural region alignment and hugepagetlb's compound page size.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v5:
* only consider compound pages from hugetlbfs

v4:
* reimplemented max pageshift calculation

v3:
* fixed upper limit for the page size
* added checks that we don't register parts of a huge page

v2:
* explicitely check for compound pages before calling compound_order()

---
The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
advertise 16MB pages to the guest; a typical pseries guest will use 16MB
for IOMMU pages without checking the mmu pagesize and this will fail
at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256

With the change, mapping will fail in KVM and the guest will print:

mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0
mlx5_core 0000:00:00.0: failed to map direct window for /pci@800000020000000/ethernet@0: -1
---
 arch/powerpc/include/asm/mmu_context.h |  4 ++--
 arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
 arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++++--
 arch/powerpc/mm/mmu_context_iommu.c    | 33 +++++++++++++++++++++++++++------
 drivers/vfio/vfio_iommu_spapr_tce.c    |  2 +-
 5 files changed, 35 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 896efa5..79d570c 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -35,9 +35,9 @@ extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
 extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 		unsigned long ua, unsigned long entries);
 extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa);
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
 extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa);
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
 extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
 extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
 #endif
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index d066e37..8c456fa 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -449,7 +449,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
 		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
 		return H_TOO_HARD;
 
-	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
+	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, tbl->it_page_shift, &hpa)))
 		return H_HARDWARE;
 
 	if (mm_iommu_mapped_inc(mem))
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 925fc31..5b298f5 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -279,7 +279,8 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
 	if (!mem)
 		return H_TOO_HARD;
 
-	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
+	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, tbl->it_page_shift,
+			&hpa)))
 		return H_HARDWARE;
 
 	pua = (void *) vmalloc_to_phys(pua);
@@ -469,7 +470,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 
 		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
 		if (mem)
-			prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
+			prereg = mm_iommu_ua_to_hpa_rm(mem, ua,
+					IOMMU_PAGE_SHIFT_4K, &tces) == 0;
 	}
 
 	if (!prereg) {
diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index abb4364..0aebed6 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -18,6 +18,7 @@
 #include <linux/migrate.h>
 #include <linux/hugetlb.h>
 #include <linux/swap.h>
+#include <linux/hugetlb.h>
 #include <asm/mmu_context.h>
 
 static DEFINE_MUTEX(mem_list_mutex);
@@ -27,6 +28,7 @@ struct mm_iommu_table_group_mem_t {
 	struct rcu_head rcu;
 	unsigned long used;
 	atomic64_t mapped;
+	unsigned int pageshift;
 	u64 ua;			/* userspace address */
 	u64 entries;		/* number of entries in hpas[] */
 	u64 *hpas;		/* vmalloc'ed */
@@ -125,6 +127,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 {
 	struct mm_iommu_table_group_mem_t *mem;
 	long i, j, ret = 0, locked_entries = 0;
+	unsigned int pageshift;
 	struct page *page = NULL;
 
 	mutex_lock(&mem_list_mutex);
@@ -159,6 +162,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		goto unlock_exit;
 	}
 
+	/*
+	 * Use @ua and @entries alignment as a starting point for a maximum
+	 * page size calculation below.
+	 */
+	mem->pageshift = __ffs(ua | (entries << PAGE_SHIFT));
 	mem->hpas = vzalloc(array_size(entries, sizeof(mem->hpas[0])));
 	if (!mem->hpas) {
 		kfree(mem);
@@ -167,8 +175,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 	}
 
 	for (i = 0; i < entries; ++i) {
-		if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
-					1/* pages */, 1/* iswrite */, &page)) {
+		struct vm_area_struct *vma = NULL;
+
+		if (1 != get_user_pages(ua + (i << PAGE_SHIFT),
+					1/* pages */, 1/* iswrite */, &page,
+					&vma)) {
 			ret = -EFAULT;
 			for (j = 0; j < i; ++j)
 				put_page(pfn_to_page(mem->hpas[j] >>
@@ -186,9 +197,9 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 		if (is_migrate_cma_page(page)) {
 			if (mm_iommu_move_page_from_cma(page))
 				goto populate;
-			if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
+			if (1 != get_user_pages(ua + (i << PAGE_SHIFT),
 						1/* pages */, 1/* iswrite */,
-						&page)) {
+						&page, &vma)) {
 				ret = -EFAULT;
 				for (j = 0; j < i; ++j)
 					put_page(pfn_to_page(mem->hpas[j] >>
@@ -199,6 +210,10 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
 			}
 		}
 populate:
+		pageshift = PAGE_SHIFT;
+		if (vma && vma->vm_file && is_file_hugepages(vma->vm_file))
+			pageshift += hstate_vma(vma)->order;
+		mem->pageshift = min(mem->pageshift, pageshift);
 		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
 	}
 
@@ -349,7 +364,7 @@ struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
 EXPORT_SYMBOL_GPL(mm_iommu_find);
 
 long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa)
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
 	u64 *va = &mem->hpas[entry];
@@ -357,6 +372,9 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 	if (entry >= mem->entries)
 		return -EFAULT;
 
+	if (pageshift > mem->pageshift)
+		return -EFAULT;
+
 	*hpa = *va | (ua & ~PAGE_MASK);
 
 	return 0;
@@ -364,7 +382,7 @@ long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
 EXPORT_SYMBOL_GPL(mm_iommu_ua_to_hpa);
 
 long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
-		unsigned long ua, unsigned long *hpa)
+		unsigned long ua, unsigned int pageshift, unsigned long *hpa)
 {
 	const long entry = (ua - mem->ua) >> PAGE_SHIFT;
 	void *va = &mem->hpas[entry];
@@ -373,6 +391,9 @@ long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
 	if (entry >= mem->entries)
 		return -EFAULT;
 
+	if (pageshift > mem->pageshift)
+		return -EFAULT;
+
 	pa = (void *) vmalloc_to_phys(va);
 	if (!pa)
 		return -EFAULT;
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 2da5f05..7cd63b0 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -467,7 +467,7 @@ static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
 	if (!mem)
 		return -EINVAL;
 
-	ret = mm_iommu_ua_to_hpa(mem, tce, phpa);
+	ret = mm_iommu_ua_to_hpa(mem, tce, shift, phpa);
 	if (ret)
 		return -EINVAL;
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH kernel v5 1/2] vfio/spapr: Use IOMMU pageshift rather than pagesize
From: Alexey Kardashevskiy @ 2018-07-07 10:44 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc, Alex Williamson,
	Michael Ellerman, Paul Mackerras, Balbir Singh, Nicholas Piggin
In-Reply-To: <20180707104410.46589-1-aik@ozlabs.ru>

The size is always equal to 1 page so let's use this. Later on this will
be used for other checks which use page shifts to check the granularity
of access.

This should cause no behavioral change.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

As Alex suggested, this should go via the ppc tree which the next patch
is going to (which is ppc-kvm).
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 759a5bd..2da5f05 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -457,13 +457,13 @@ static void tce_iommu_unuse_page(struct tce_container *container,
 }
 
 static int tce_iommu_prereg_ua_to_hpa(struct tce_container *container,
-		unsigned long tce, unsigned long size,
+		unsigned long tce, unsigned long shift,
 		unsigned long *phpa, struct mm_iommu_table_group_mem_t **pmem)
 {
 	long ret = 0;
 	struct mm_iommu_table_group_mem_t *mem;
 
-	mem = mm_iommu_lookup(container->mm, tce, size);
+	mem = mm_iommu_lookup(container->mm, tce, 1ULL << shift);
 	if (!mem)
 		return -EINVAL;
 
@@ -487,7 +487,7 @@ static void tce_iommu_unuse_page_v2(struct tce_container *container,
 	if (!pua)
 		return;
 
-	ret = tce_iommu_prereg_ua_to_hpa(container, *pua, IOMMU_PAGE_SIZE(tbl),
+	ret = tce_iommu_prereg_ua_to_hpa(container, *pua, tbl->it_page_shift,
 			&hpa, &mem);
 	if (ret)
 		pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
@@ -611,7 +611,7 @@ static long tce_iommu_build_v2(struct tce_container *container,
 				entry + i);
 
 		ret = tce_iommu_prereg_ua_to_hpa(container,
-				tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
+				tce, tbl->it_page_shift, &hpa, &mem);
 		if (ret)
 			break;
 
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH kernel v5 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: Nicholas Piggin @ 2018-07-07 11:43 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Alex Williamson,
	Michael Ellerman, Paul Mackerras, Balbir Singh
In-Reply-To: <20180707104410.46589-3-aik@ozlabs.ru>

On Sat,  7 Jul 2018 20:44:10 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> A VM which has:
>  - a DMA capable device passed through to it (eg. network card);
>  - running a malicious kernel that ignores H_PUT_TCE failure;
>  - capability of using IOMMU pages bigger that physical pages
> can create an IOMMU mapping that exposes (for example) 16MB of
> the host physical memory to the device when only 64K was allocated to the VM.
> 
> The remaining 16MB - 64K will be some other content of host memory, possibly
> including pages of the VM, but also pages of host kernel memory, host
> programs or other VMs.
> 
> The attacking VM does not control the location of the page it can map,
> and is only allowed to map as many pages as it has pages of RAM.
> 
> We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> an IOMMU page is contained in the physical page so the PCI hardware won't
> get access to unassigned host memory; however this check is missing in
> the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
> did not hit this yet as the very first time when the mapping happens
> we do not have tbl::it_userspace allocated yet and fall back to
> the userspace which in turn calls VFIO IOMMU driver, this fails and
> the guest does not retry,
> 
> This stores the smallest preregistered page size in the preregistered
> region descriptor and changes the mm_iommu_xxx API to check this against
> the IOMMU page size. This calculates maximum page size as a minimum of
> the natural region alignment and hugepagetlb's compound page size.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v5:
> * only consider compound pages from hugetlbfs
> 
> v4:
> * reimplemented max pageshift calculation
> 
> v3:
> * fixed upper limit for the page size
> * added checks that we don't register parts of a huge page
> 
> v2:
> * explicitely check for compound pages before calling compound_order()
> 
> ---
> The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
> advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> for IOMMU pages without checking the mmu pagesize and this will fail
> at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> 
> With the change, mapping will fail in KVM and the guest will print:
> 
> mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
> mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0
> mlx5_core 0000:00:00.0: failed to map direct window for /pci@800000020000000/ethernet@0: -1
> ---
>  arch/powerpc/include/asm/mmu_context.h |  4 ++--
>  arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
>  arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++++--
>  arch/powerpc/mm/mmu_context_iommu.c    | 33 +++++++++++++++++++++++++++------
>  drivers/vfio/vfio_iommu_spapr_tce.c    |  2 +-
>  5 files changed, 35 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> index 896efa5..79d570c 100644
> --- a/arch/powerpc/include/asm/mmu_context.h
> +++ b/arch/powerpc/include/asm/mmu_context.h
> @@ -35,9 +35,9 @@ extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
>  extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
>  		unsigned long ua, unsigned long entries);
>  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> -		unsigned long ua, unsigned long *hpa);
> +		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
>  extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
> -		unsigned long ua, unsigned long *hpa);
> +		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
>  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
>  extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
>  #endif
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> index d066e37..8c456fa 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -449,7 +449,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
>  		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
>  		return H_TOO_HARD;
>  
> -	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
> +	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, tbl->it_page_shift, &hpa)))
>  		return H_HARDWARE;
>  
>  	if (mm_iommu_mapped_inc(mem))
> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> index 925fc31..5b298f5 100644
> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> @@ -279,7 +279,8 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
>  	if (!mem)
>  		return H_TOO_HARD;
>  
> -	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
> +	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, tbl->it_page_shift,
> +			&hpa)))
>  		return H_HARDWARE;
>  
>  	pua = (void *) vmalloc_to_phys(pua);
> @@ -469,7 +470,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
>  
>  		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
>  		if (mem)
> -			prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
> +			prereg = mm_iommu_ua_to_hpa_rm(mem, ua,
> +					IOMMU_PAGE_SHIFT_4K, &tces) == 0;
>  	}
>  
>  	if (!prereg) {
> diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> index abb4364..0aebed6 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -18,6 +18,7 @@
>  #include <linux/migrate.h>
>  #include <linux/hugetlb.h>
>  #include <linux/swap.h>
> +#include <linux/hugetlb.h>
>  #include <asm/mmu_context.h>
>  
>  static DEFINE_MUTEX(mem_list_mutex);
> @@ -27,6 +28,7 @@ struct mm_iommu_table_group_mem_t {
>  	struct rcu_head rcu;
>  	unsigned long used;
>  	atomic64_t mapped;
> +	unsigned int pageshift;
>  	u64 ua;			/* userspace address */
>  	u64 entries;		/* number of entries in hpas[] */
>  	u64 *hpas;		/* vmalloc'ed */
> @@ -125,6 +127,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  {
>  	struct mm_iommu_table_group_mem_t *mem;
>  	long i, j, ret = 0, locked_entries = 0;
> +	unsigned int pageshift;
>  	struct page *page = NULL;
>  
>  	mutex_lock(&mem_list_mutex);
> @@ -159,6 +162,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  		goto unlock_exit;
>  	}
>  
> +	/*
> +	 * Use @ua and @entries alignment as a starting point for a maximum
> +	 * page size calculation below.
> +	 */
> +	mem->pageshift = __ffs(ua | (entries << PAGE_SHIFT));
>  	mem->hpas = vzalloc(array_size(entries, sizeof(mem->hpas[0])));
>  	if (!mem->hpas) {
>  		kfree(mem);

I think it would be better to start with UINT_MAX or something to make
it clear that it's not a valid "guess" but rather just a starting point
for the min().

> @@ -167,8 +175,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  	}
>  
>  	for (i = 0; i < entries; ++i) {
> -		if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
> -					1/* pages */, 1/* iswrite */, &page)) {
> +		struct vm_area_struct *vma = NULL;
> +
> +		if (1 != get_user_pages(ua + (i << PAGE_SHIFT),
> +					1/* pages */, 1/* iswrite */, &page,
> +					&vma)) {
>  			ret = -EFAULT;
>  			for (j = 0; j < i; ++j)
>  				put_page(pfn_to_page(mem->hpas[j] >>

You need mmap_sem for read here, and held while you inspect the vma.
I would say don't hold it over the error handling and the
mm_iommu_move_page_from_cma() call though.

> @@ -186,9 +197,9 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  		if (is_migrate_cma_page(page)) {
>  			if (mm_iommu_move_page_from_cma(page))
>  				goto populate;
> -			if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
> +			if (1 != get_user_pages(ua + (i << PAGE_SHIFT),
>  						1/* pages */, 1/* iswrite */,
> -						&page)) {
> +						&page, &vma)) {
>  				ret = -EFAULT;
>  				for (j = 0; j < i; ++j)
>  					put_page(pfn_to_page(mem->hpas[j] >>
> @@ -199,6 +210,10 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
>  			}
>  		}
>  populate:
> +		pageshift = PAGE_SHIFT;
> +		if (vma && vma->vm_file && is_file_hugepages(vma->vm_file))
> +			pageshift += hstate_vma(vma)->order;

I would just set to huge_page_shift() here so you don't expose this
hstate quirk that adds order to PAGE_SIZE.

> +		mem->pageshift = min(mem->pageshift, pageshift);
>  		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
>  	}
>  

Otherwise the mm side of things looks okay, I think.

Thanks,
Nick

^ permalink raw reply

* Re: powerpc: 32BIT vs. 64BIT (PPC32 vs. PPC64)
From: Nicholas Piggin @ 2018-07-07 12:13 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Benjamin Herrenschmidt, linux-kbuild, Paul Mackerras,
	linuxppc-dev, Stephen Rothwell
In-Reply-To: <4f3a5969-88d5-134d-508d-7617947c588c@infradead.org>

On Fri, 6 Jul 2018 21:58:29 -0700
Randy Dunlap <rdunlap@infradead.org> wrote:

> On 07/06/2018 06:45 PM, Benjamin Herrenschmidt wrote:
> > On Thu, 2018-07-05 at 14:30 -0700, Randy Dunlap wrote:  
> >> Hi,
> >>
> >> Is there a good way (or a shortcut) to do something like:
> >>
> >> $ make ARCH=powerpc O=PPC32 [other_options] allmodconfig
> >>   to get a PPC32/32BIT allmodconfig
> >>
> >> and also be able to do:
> >>
> >> $make ARCH=powerpc O=PPC64 [other_options] allmodconfig
> >>   to get a PPC64/64BIT allmodconfig?  
> > 
> > Hrm... O= is for the separate build dir, so there much be something
> > else.
> > 
> > You mean having ARCH= aliases like ppc/ppc32 and ppc64 ?  
> 
> Yes.
> 
> > That would be a matter of overriding some .config defaults I suppose, I
> > don't know how this is done on other archs.
> > 
> > I see the aliasing trick in the Makefile but that's about it.
> >   
> >> Note that arch/x86, arch/sh, and arch/sparc have ways to do
> >> some flavor(s) of this (from Documentation/kbuild/kbuild.txt;
> >> sh and sparc based on a recent "fix" patch from me):  
> > 
> > I fail to see what you are actually talking about here ... sorry. Do
> > you have concrete examples on x86 or sparc ? From what I can tell the
> > "i386" or "sparc32/sparc64" aliases just change SRCARCH in Makefile and
> > 32 vs 64-bit is just a Kconfig option...  
> 
> Yes, your summary is mostly correct.
> 
> I'm just looking for a way to do cross-compile builds that are close to
> ppc32 allmodconfig and ppc64 allmodconfig.

Would there a problem with adding ARCH=ppc32 / ppc64 matching? This
seems to work...

Thanks,
Nick

---
 Makefile                               | 8 ++++++++
 arch/powerpc/Kconfig                   | 9 +++++++++
 arch/powerpc/platforms/Kconfig.cputype | 8 --------
 3 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/Makefile b/Makefile
index c5ce55cbc543..f97204aed17a 100644
--- a/Makefile
+++ b/Makefile
@@ -345,6 +345,14 @@ ifeq ($(ARCH),sh64)
        SRCARCH := sh
 endif
 
+# Additional ARCH settings for powerpc
+ifeq ($(ARCH),ppc32)
+       SRCARCH := powerpc
+endif
+ifeq ($(ARCH),ppc64)
+       SRCARCH := powerpc
+endif
+
 KCONFIG_CONFIG	?= .config
 export KCONFIG_CONFIG
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 9f2b75fe2c2d..3405b1b122be 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -1,4 +1,13 @@
 # SPDX-License-Identifier: GPL-2.0
+
+config PPC64
+	bool "64-bit kernel" if "$(ARCH)" = "powerpc"
+	default "$(ARCH)" != "ppc32"
+	select ZLIB_DEFLATE
+	help
+	  This option selects whether a 32-bit or a 64-bit kernel
+	  will be built.
+
 source "arch/powerpc/platforms/Kconfig.cputype"
 
 config PPC32
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index e6a1de521319..f6e5d6ef9782 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -1,12 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
-config PPC64
-	bool "64-bit kernel"
-	default n
-	select ZLIB_DEFLATE
-	help
-	  This option selects whether a 32-bit or a 64-bit kernel
-	  will be built.
-
 menu "Processor support"
 choice
 	prompt "Processor Type"
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH kernel v5 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: Alexey Kardashevskiy @ 2018-07-07 12:44 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Alex Williamson,
	Michael Ellerman, Paul Mackerras, Balbir Singh
In-Reply-To: <20180707214303.1daafccc@roar.ozlabs.ibm.com>

On Sat, 7 Jul 2018 21:43:03 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:

> On Sat,  7 Jul 2018 20:44:10 +1000
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > A VM which has:
> >  - a DMA capable device passed through to it (eg. network card);
> >  - running a malicious kernel that ignores H_PUT_TCE failure;
> >  - capability of using IOMMU pages bigger that physical pages
> > can create an IOMMU mapping that exposes (for example) 16MB of
> > the host physical memory to the device when only 64K was allocated to the VM.
> > 
> > The remaining 16MB - 64K will be some other content of host memory, possibly
> > including pages of the VM, but also pages of host kernel memory, host
> > programs or other VMs.
> > 
> > The attacking VM does not control the location of the page it can map,
> > and is only allowed to map as many pages as it has pages of RAM.
> > 
> > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> > an IOMMU page is contained in the physical page so the PCI hardware won't
> > get access to unassigned host memory; however this check is missing in
> > the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
> > did not hit this yet as the very first time when the mapping happens
> > we do not have tbl::it_userspace allocated yet and fall back to
> > the userspace which in turn calls VFIO IOMMU driver, this fails and
> > the guest does not retry,
> > 
> > This stores the smallest preregistered page size in the preregistered
> > region descriptor and changes the mm_iommu_xxx API to check this against
> > the IOMMU page size. This calculates maximum page size as a minimum of
> > the natural region alignment and hugepagetlb's compound page size.
> > 
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> > Changes:
> > v5:
> > * only consider compound pages from hugetlbfs
> > 
> > v4:
> > * reimplemented max pageshift calculation
> > 
> > v3:
> > * fixed upper limit for the page size
> > * added checks that we don't register parts of a huge page
> > 
> > v2:
> > * explicitely check for compound pages before calling compound_order()
> > 
> > ---
> > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
> > advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> > for IOMMU pages without checking the mmu pagesize and this will fail
> > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> > 
> > With the change, mapping will fail in KVM and the guest will print:
> > 
> > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
> > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0
> > mlx5_core 0000:00:00.0: failed to map direct window for /pci@800000020000000/ethernet@0: -1
> > ---
> >  arch/powerpc/include/asm/mmu_context.h |  4 ++--
> >  arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
> >  arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++++--
> >  arch/powerpc/mm/mmu_context_iommu.c    | 33 +++++++++++++++++++++++++++------
> >  drivers/vfio/vfio_iommu_spapr_tce.c    |  2 +-
> >  5 files changed, 35 insertions(+), 12 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> > index 896efa5..79d570c 100644
> > --- a/arch/powerpc/include/asm/mmu_context.h
> > +++ b/arch/powerpc/include/asm/mmu_context.h
> > @@ -35,9 +35,9 @@ extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
> >  extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> >  		unsigned long ua, unsigned long entries);
> >  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> > -		unsigned long ua, unsigned long *hpa);
> > +		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
> >  extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
> > -		unsigned long ua, unsigned long *hpa);
> > +		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
> >  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
> >  extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
> >  #endif
> > diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> > index d066e37..8c456fa 100644
> > --- a/arch/powerpc/kvm/book3s_64_vio.c
> > +++ b/arch/powerpc/kvm/book3s_64_vio.c
> > @@ -449,7 +449,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
> >  		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
> >  		return H_TOO_HARD;
> >  
> > -	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
> > +	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, tbl->it_page_shift, &hpa)))
> >  		return H_HARDWARE;
> >  
> >  	if (mm_iommu_mapped_inc(mem))
> > diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> > index 925fc31..5b298f5 100644
> > --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> > +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> > @@ -279,7 +279,8 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
> >  	if (!mem)
> >  		return H_TOO_HARD;
> >  
> > -	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
> > +	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, tbl->it_page_shift,
> > +			&hpa)))
> >  		return H_HARDWARE;
> >  
> >  	pua = (void *) vmalloc_to_phys(pua);
> > @@ -469,7 +470,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >  
> >  		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
> >  		if (mem)
> > -			prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
> > +			prereg = mm_iommu_ua_to_hpa_rm(mem, ua,
> > +					IOMMU_PAGE_SHIFT_4K, &tces) == 0;
> >  	}
> >  
> >  	if (!prereg) {
> > diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> > index abb4364..0aebed6 100644
> > --- a/arch/powerpc/mm/mmu_context_iommu.c
> > +++ b/arch/powerpc/mm/mmu_context_iommu.c
> > @@ -18,6 +18,7 @@
> >  #include <linux/migrate.h>
> >  #include <linux/hugetlb.h>
> >  #include <linux/swap.h>
> > +#include <linux/hugetlb.h>
> >  #include <asm/mmu_context.h>
> >  
> >  static DEFINE_MUTEX(mem_list_mutex);
> > @@ -27,6 +28,7 @@ struct mm_iommu_table_group_mem_t {
> >  	struct rcu_head rcu;
> >  	unsigned long used;
> >  	atomic64_t mapped;
> > +	unsigned int pageshift;
> >  	u64 ua;			/* userspace address */
> >  	u64 entries;		/* number of entries in hpas[] */
> >  	u64 *hpas;		/* vmalloc'ed */
> > @@ -125,6 +127,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> >  {
> >  	struct mm_iommu_table_group_mem_t *mem;
> >  	long i, j, ret = 0, locked_entries = 0;
> > +	unsigned int pageshift;
> >  	struct page *page = NULL;
> >  
> >  	mutex_lock(&mem_list_mutex);
> > @@ -159,6 +162,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> >  		goto unlock_exit;
> >  	}
> >  
> > +	/*
> > +	 * Use @ua and @entries alignment as a starting point for a maximum
> > +	 * page size calculation below.
> > +	 */
> > +	mem->pageshift = __ffs(ua | (entries << PAGE_SHIFT));
> >  	mem->hpas = vzalloc(array_size(entries, sizeof(mem->hpas[0])));
> >  	if (!mem->hpas) {
> >  		kfree(mem);  
> 
> I think it would be better to start with UINT_MAX or something to make
> it clear that it's not a valid "guess" but rather just a starting point
> for the min().

This is to handle unaligned @ua/@entries backed by a huge page. For
example, 1GB page and we are preregistering only a second half of it -
the maximum pageshift should be 29 instead of 30 which it will be if we
start from 64.


> 
> > @@ -167,8 +175,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> >  	}
> >  
> >  	for (i = 0; i < entries; ++i) {
> > -		if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
> > -					1/* pages */, 1/* iswrite */, &page)) {
> > +		struct vm_area_struct *vma = NULL;
> > +
> > +		if (1 != get_user_pages(ua + (i << PAGE_SHIFT),
> > +					1/* pages */, 1/* iswrite */, &page,
> > +					&vma)) {
> >  			ret = -EFAULT;
> >  			for (j = 0; j < i; ++j)
> >  				put_page(pfn_to_page(mem->hpas[j] >>  
> 
> You need mmap_sem for read here, and held while you inspect the vma.
> I would say don't hold it over the error handling and the
> mm_iommu_move_page_from_cma() call though.


Ah, this is why it deadlocked when I tried this.

But do I really need mmap_sem here? get_user_pages_longterm() inspects
vma with no mmap_sem held in case of VFIO IOMMU Type1:

vfio_fops_unl_ioctl
  vfio_iommu_type1_ioctl
    vfio_dma_do_map
      vfio_pin_map_dma
        vfio_pin_pages_remote
          vaddr_get_pfn
            get_user_pages_longterm

Is that a bug? Other cases seem to hold it though.

> 
> > @@ -186,9 +197,9 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> >  		if (is_migrate_cma_page(page)) {
> >  			if (mm_iommu_move_page_from_cma(page))
> >  				goto populate;
> > -			if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
> > +			if (1 != get_user_pages(ua + (i << PAGE_SHIFT),
> >  						1/* pages */, 1/* iswrite */,
> > -						&page)) {
> > +						&page, &vma)) {
> >  				ret = -EFAULT;
> >  				for (j = 0; j < i; ++j)
> >  					put_page(pfn_to_page(mem->hpas[j] >>
> > @@ -199,6 +210,10 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> >  			}
> >  		}
> >  populate:
> > +		pageshift = PAGE_SHIFT;
> > +		if (vma && vma->vm_file && is_file_hugepages(vma->vm_file))
> > +			pageshift += hstate_vma(vma)->order;  
> 
> I would just set to huge_page_shift() here so you don't expose this
> hstate quirk that adds order to PAGE_SIZE.

Ah, right, there is a million helpers for this :)

 
> > +		mem->pageshift = min(mem->pageshift, pageshift);
> >  		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
> >  	}
> >    
> 
> Otherwise the mm side of things looks okay, I think.

Thanks for the review.

--
Alexey

^ permalink raw reply

* Re: [PATCH kernel v5 2/2] KVM: PPC: Check if IOMMU page is contained in the pinned physical page
From: Nicholas Piggin @ 2018-07-07 13:47 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: linuxppc-dev, David Gibson, kvm-ppc, Alex Williamson,
	Michael Ellerman, Paul Mackerras, Balbir Singh
In-Reply-To: <20180707224445.24664324@aikyoga2>

On Sat, 7 Jul 2018 22:44:45 +1000
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On Sat, 7 Jul 2018 21:43:03 +1000
> Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> > On Sat,  7 Jul 2018 20:44:10 +1000
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> > > A VM which has:
> > >  - a DMA capable device passed through to it (eg. network card);
> > >  - running a malicious kernel that ignores H_PUT_TCE failure;
> > >  - capability of using IOMMU pages bigger that physical pages
> > > can create an IOMMU mapping that exposes (for example) 16MB of
> > > the host physical memory to the device when only 64K was allocated to the VM.
> > > 
> > > The remaining 16MB - 64K will be some other content of host memory, possibly
> > > including pages of the VM, but also pages of host kernel memory, host
> > > programs or other VMs.
> > > 
> > > The attacking VM does not control the location of the page it can map,
> > > and is only allowed to map as many pages as it has pages of RAM.
> > > 
> > > We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
> > > an IOMMU page is contained in the physical page so the PCI hardware won't
> > > get access to unassigned host memory; however this check is missing in
> > > the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
> > > did not hit this yet as the very first time when the mapping happens
> > > we do not have tbl::it_userspace allocated yet and fall back to
> > > the userspace which in turn calls VFIO IOMMU driver, this fails and
> > > the guest does not retry,
> > > 
> > > This stores the smallest preregistered page size in the preregistered
> > > region descriptor and changes the mm_iommu_xxx API to check this against
> > > the IOMMU page size. This calculates maximum page size as a minimum of
> > > the natural region alignment and hugepagetlb's compound page size.
> > > 
> > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > ---
> > > Changes:
> > > v5:
> > > * only consider compound pages from hugetlbfs
> > > 
> > > v4:
> > > * reimplemented max pageshift calculation
> > > 
> > > v3:
> > > * fixed upper limit for the page size
> > > * added checks that we don't register parts of a huge page
> > > 
> > > v2:
> > > * explicitely check for compound pages before calling compound_order()
> > > 
> > > ---
> > > The bug is: run QEMU _without_ hugepages (no -mempath) and tell it to
> > > advertise 16MB pages to the guest; a typical pseries guest will use 16MB
> > > for IOMMU pages without checking the mmu pagesize and this will fail
> > > at https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=fb396cf00ac40eb35967a04c9cc798ca896eed57;hb=refs/heads/master#l256
> > > 
> > > With the change, mapping will fail in KVM and the guest will print:
> > > 
> > > mlx5_core 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000 18 1f returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
> > > mlx5_core 0000:00:00.0: created tce table LIOBN 0x80000001 for /pci@800000020000000/ethernet@0
> > > mlx5_core 0000:00:00.0: failed to map direct window for /pci@800000020000000/ethernet@0: -1
> > > ---
> > >  arch/powerpc/include/asm/mmu_context.h |  4 ++--
> > >  arch/powerpc/kvm/book3s_64_vio.c       |  2 +-
> > >  arch/powerpc/kvm/book3s_64_vio_hv.c    |  6 ++++--
> > >  arch/powerpc/mm/mmu_context_iommu.c    | 33 +++++++++++++++++++++++++++------
> > >  drivers/vfio/vfio_iommu_spapr_tce.c    |  2 +-
> > >  5 files changed, 35 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
> > > index 896efa5..79d570c 100644
> > > --- a/arch/powerpc/include/asm/mmu_context.h
> > > +++ b/arch/powerpc/include/asm/mmu_context.h
> > > @@ -35,9 +35,9 @@ extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup_rm(
> > >  extern struct mm_iommu_table_group_mem_t *mm_iommu_find(struct mm_struct *mm,
> > >  		unsigned long ua, unsigned long entries);
> > >  extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
> > > -		unsigned long ua, unsigned long *hpa);
> > > +		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
> > >  extern long mm_iommu_ua_to_hpa_rm(struct mm_iommu_table_group_mem_t *mem,
> > > -		unsigned long ua, unsigned long *hpa);
> > > +		unsigned long ua, unsigned int pageshift, unsigned long *hpa);
> > >  extern long mm_iommu_mapped_inc(struct mm_iommu_table_group_mem_t *mem);
> > >  extern void mm_iommu_mapped_dec(struct mm_iommu_table_group_mem_t *mem);
> > >  #endif
> > > diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> > > index d066e37..8c456fa 100644
> > > --- a/arch/powerpc/kvm/book3s_64_vio.c
> > > +++ b/arch/powerpc/kvm/book3s_64_vio.c
> > > @@ -449,7 +449,7 @@ long kvmppc_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
> > >  		/* This only handles v2 IOMMU type, v1 is handled via ioctl() */
> > >  		return H_TOO_HARD;
> > >  
> > > -	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, &hpa)))
> > > +	if (WARN_ON_ONCE(mm_iommu_ua_to_hpa(mem, ua, tbl->it_page_shift, &hpa)))
> > >  		return H_HARDWARE;
> > >  
> > >  	if (mm_iommu_mapped_inc(mem))
> > > diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> > > index 925fc31..5b298f5 100644
> > > --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> > > +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> > > @@ -279,7 +279,8 @@ static long kvmppc_rm_tce_iommu_do_map(struct kvm *kvm, struct iommu_table *tbl,
> > >  	if (!mem)
> > >  		return H_TOO_HARD;
> > >  
> > > -	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, &hpa)))
> > > +	if (WARN_ON_ONCE_RM(mm_iommu_ua_to_hpa_rm(mem, ua, tbl->it_page_shift,
> > > +			&hpa)))
> > >  		return H_HARDWARE;
> > >  
> > >  	pua = (void *) vmalloc_to_phys(pua);
> > > @@ -469,7 +470,8 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> > >  
> > >  		mem = mm_iommu_lookup_rm(vcpu->kvm->mm, ua, IOMMU_PAGE_SIZE_4K);
> > >  		if (mem)
> > > -			prereg = mm_iommu_ua_to_hpa_rm(mem, ua, &tces) == 0;
> > > +			prereg = mm_iommu_ua_to_hpa_rm(mem, ua,
> > > +					IOMMU_PAGE_SHIFT_4K, &tces) == 0;
> > >  	}
> > >  
> > >  	if (!prereg) {
> > > diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
> > > index abb4364..0aebed6 100644
> > > --- a/arch/powerpc/mm/mmu_context_iommu.c
> > > +++ b/arch/powerpc/mm/mmu_context_iommu.c
> > > @@ -18,6 +18,7 @@
> > >  #include <linux/migrate.h>
> > >  #include <linux/hugetlb.h>
> > >  #include <linux/swap.h>
> > > +#include <linux/hugetlb.h>
> > >  #include <asm/mmu_context.h>
> > >  
> > >  static DEFINE_MUTEX(mem_list_mutex);
> > > @@ -27,6 +28,7 @@ struct mm_iommu_table_group_mem_t {
> > >  	struct rcu_head rcu;
> > >  	unsigned long used;
> > >  	atomic64_t mapped;
> > > +	unsigned int pageshift;
> > >  	u64 ua;			/* userspace address */
> > >  	u64 entries;		/* number of entries in hpas[] */
> > >  	u64 *hpas;		/* vmalloc'ed */
> > > @@ -125,6 +127,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > >  {
> > >  	struct mm_iommu_table_group_mem_t *mem;
> > >  	long i, j, ret = 0, locked_entries = 0;
> > > +	unsigned int pageshift;
> > >  	struct page *page = NULL;
> > >  
> > >  	mutex_lock(&mem_list_mutex);
> > > @@ -159,6 +162,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > >  		goto unlock_exit;
> > >  	}
> > >  
> > > +	/*
> > > +	 * Use @ua and @entries alignment as a starting point for a maximum
> > > +	 * page size calculation below.
> > > +	 */
> > > +	mem->pageshift = __ffs(ua | (entries << PAGE_SHIFT));
> > >  	mem->hpas = vzalloc(array_size(entries, sizeof(mem->hpas[0])));
> > >  	if (!mem->hpas) {
> > >  		kfree(mem);    
> > 
> > I think it would be better to start with UINT_MAX or something to make
> > it clear that it's not a valid "guess" but rather just a starting point
> > for the min().  
> 
> This is to handle unaligned @ua/@entries backed by a huge page. For
> example, 1GB page and we are preregistering only a second half of it -
> the maximum pageshift should be 29 instead of 30 which it will be if we
> start from 64.

Ah okay that's what I'm missing. Maybe a small addition to the comment?

> >   
> > > @@ -167,8 +175,11 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, unsigned long entries,
> > >  	}
> > >  
> > >  	for (i = 0; i < entries; ++i) {
> > > -		if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
> > > -					1/* pages */, 1/* iswrite */, &page)) {
> > > +		struct vm_area_struct *vma = NULL;
> > > +
> > > +		if (1 != get_user_pages(ua + (i << PAGE_SHIFT),
> > > +					1/* pages */, 1/* iswrite */, &page,
> > > +					&vma)) {
> > >  			ret = -EFAULT;
> > >  			for (j = 0; j < i; ++j)
> > >  				put_page(pfn_to_page(mem->hpas[j] >>    
> > 
> > You need mmap_sem for read here, and held while you inspect the vma.
> > I would say don't hold it over the error handling and the
> > mm_iommu_move_page_from_cma() call though.  
> 
> 
> Ah, this is why it deadlocked when I tried this.
> 
> But do I really need mmap_sem here? get_user_pages_longterm() inspects
> vma with no mmap_sem held in case of VFIO IOMMU Type1:
> 
> vfio_fops_unl_ioctl
>   vfio_iommu_type1_ioctl
>     vfio_dma_do_map
>       vfio_pin_map_dma
>         vfio_pin_pages_remote
>           vaddr_get_pfn
>             get_user_pages_longterm
> 
> Is that a bug? Other cases seem to hold it though.

Yes you need mmap_sem. I think that would be a bug too.

Thanks,
Nick

^ permalink raw reply

* Re: powerpc: 32BIT vs. 64BIT (PPC32 vs. PPC64)
From: Randy Dunlap @ 2018-07-07 14:59 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: Benjamin Herrenschmidt, linux-kbuild, Paul Mackerras,
	linuxppc-dev, Stephen Rothwell
In-Reply-To: <20180707221316.5b75e075@roar.ozlabs.ibm.com>

On 07/07/2018 05:13 AM, Nicholas Piggin wrote:
> On Fri, 6 Jul 2018 21:58:29 -0700
> Randy Dunlap <rdunlap@infradead.org> wrote:
> 
>> On 07/06/2018 06:45 PM, Benjamin Herrenschmidt wrote:
>>> On Thu, 2018-07-05 at 14:30 -0700, Randy Dunlap wrote:  
>>>> Hi,
>>>>
>>>> Is there a good way (or a shortcut) to do something like:
>>>>
>>>> $ make ARCH=powerpc O=PPC32 [other_options] allmodconfig
>>>>   to get a PPC32/32BIT allmodconfig
>>>>
>>>> and also be able to do:
>>>>
>>>> $make ARCH=powerpc O=PPC64 [other_options] allmodconfig
>>>>   to get a PPC64/64BIT allmodconfig?  
>>>
>>> Hrm... O= is for the separate build dir, so there much be something
>>> else.
>>>
>>> You mean having ARCH= aliases like ppc/ppc32 and ppc64 ?  
>>
>> Yes.
>>
>>> That would be a matter of overriding some .config defaults I suppose, I
>>> don't know how this is done on other archs.
>>>
>>> I see the aliasing trick in the Makefile but that's about it.
>>>   
>>>> Note that arch/x86, arch/sh, and arch/sparc have ways to do
>>>> some flavor(s) of this (from Documentation/kbuild/kbuild.txt;
>>>> sh and sparc based on a recent "fix" patch from me):  
>>>
>>> I fail to see what you are actually talking about here ... sorry. Do
>>> you have concrete examples on x86 or sparc ? From what I can tell the
>>> "i386" or "sparc32/sparc64" aliases just change SRCARCH in Makefile and
>>> 32 vs 64-bit is just a Kconfig option...  
>>
>> Yes, your summary is mostly correct.
>>
>> I'm just looking for a way to do cross-compile builds that are close to
>> ppc32 allmodconfig and ppc64 allmodconfig.
> 
> Would there a problem with adding ARCH=ppc32 / ppc64 matching? This
> seems to work...
> 
> Thanks,
> Nick

Yes, this mostly works and is similar to a patch (my patch) on my test machine.
And they both work for allmodconfig, which is my primary build target.

And they both have one little quirk that is confusing when the build target
is defconfig:

When ARCH=ppc32, the terminal output (stdout) is: (using O=PPC32)

make[1]: Entering directory '/home/rdunlap/lnx/lnx-418-rc3/PPC32'
  GEN     ./Makefile
*** Default configuration is based on 'ppc64_defconfig'   <<<<< NOTE <<<<<
#
# configuration written to .config
#
make[1]: Leaving directory '/home/rdunlap/lnx/lnx-418-rc3/PPC32'


I expect that can be fixed also.  :)

And the written .config file is indeed for 32BIT, not 64BIT.

Thanks, Nick.

> ---
>  Makefile                               | 8 ++++++++
>  arch/powerpc/Kconfig                   | 9 +++++++++
>  arch/powerpc/platforms/Kconfig.cputype | 8 --------
>  3 files changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/Makefile b/Makefile
> index c5ce55cbc543..f97204aed17a 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -345,6 +345,14 @@ ifeq ($(ARCH),sh64)
>         SRCARCH := sh
>  endif
>  
> +# Additional ARCH settings for powerpc
> +ifeq ($(ARCH),ppc32)
> +       SRCARCH := powerpc
> +endif
> +ifeq ($(ARCH),ppc64)
> +       SRCARCH := powerpc
> +endif
> +
>  KCONFIG_CONFIG	?= .config
>  export KCONFIG_CONFIG
>  
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 9f2b75fe2c2d..3405b1b122be 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -1,4 +1,13 @@
>  # SPDX-License-Identifier: GPL-2.0
> +
> +config PPC64
> +	bool "64-bit kernel" if "$(ARCH)" = "powerpc"
> +	default "$(ARCH)" != "ppc32"
> +	select ZLIB_DEFLATE
> +	help
> +	  This option selects whether a 32-bit or a 64-bit kernel
> +	  will be built.
> +
>  source "arch/powerpc/platforms/Kconfig.cputype"
>  
>  config PPC32
> diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
> index e6a1de521319..f6e5d6ef9782 100644
> --- a/arch/powerpc/platforms/Kconfig.cputype
> +++ b/arch/powerpc/platforms/Kconfig.cputype
> @@ -1,12 +1,4 @@
>  # SPDX-License-Identifier: GPL-2.0
> -config PPC64
> -	bool "64-bit kernel"
> -	default n
> -	select ZLIB_DEFLATE
> -	help
> -	  This option selects whether a 32-bit or a 64-bit kernel
> -	  will be built.
> -
>  menu "Processor support"
>  choice
>  	prompt "Processor Type"
> 


-- 
~Randy

^ permalink raw reply

* Re: [PATCH v6 2/4] resource: Use list_head to link sibling resource
From: Baoquan He @ 2018-07-08  2:59 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, linux-kernel, akpm, robh+dt, dan.j.williams,
	nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko,
	brijesh.singh, devicetree, airlied, linux-pci, richard.weiyang,
	keith.busch, jcmvbkbc, baiyaowei, frowand.list, lorenzo.pieralisi,
	sthemmin, linux-nvdimm, patrik.r.jakobsson, linux-input, gustavo,
	dyoung, vgoyal, thomas.lendacky, haiyangz, maarten.lankhorst,
	jglisse, seanpaul, bhelgaas, tglx, yinghai, jonathan.derrick,
	chris, monstr, linux-parisc, gregkh, dmitry.torokhov, kexec,
	ebiederm, devel, linuxppc-dev, davem
In-Reply-To: <201807042347.hfDcKxmG%fengguang.wu@intel.com>

Hi,

On 07/05/18 at 01:00am, kbuild test robot wrote:
> Hi Baoquan,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.18-rc3 next-20180704]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

Thanks for telling. 

I cloned 0day-ci/linut to my local pc.
https://github.com/0day-ci/linux.git

However, I didn't find below branch. And tried to open it in web
broswer, also failed.


> url:    https://github.com/0day-ci/linux/commits/Baoquan-He/resource-Use-list_head-to-link-sibling-resource/20180704-121402
> config: mips-rb532_defconfig (attached as .config)
> compiler: mipsel-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         GCC_VERSION=7.2.0 make.cross ARCH=mips 

I did find a old one which is for the old version 5 post.

[bhe@linux]$ git remote -v
0day-ci	https://github.com/0day-ci/linux.git (fetch)
0day-ci	https://github.com/0day-ci/linux.git (push)
[bhe@dhcp-128-28 linux]$ git branch -a| grep Baoquan| grep resource
  remotes/0day-ci/Baoquan-He/resource-Use-list_head-to-link-sibling-resource/20180612-113600

Could you help have a look at this?

Thanks
Baoquan

> 
> All error/warnings (new ones prefixed by >>):
> 
> >> arch/mips/pci/pci-rc32434.c:57:11: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
>      .child = &rc32434_res_pci_mem2
>               ^
>    arch/mips/pci/pci-rc32434.c:57:11: note: (near initialization for 'rc32434_res_pci_mem1.child.next')
> >> arch/mips/pci/pci-rc32434.c:51:47: warning: missing braces around initializer [-Wmissing-braces]
>     static struct resource rc32434_res_pci_mem1 = {
>                                                   ^
>    arch/mips/pci/pci-rc32434.c:60:47: warning: missing braces around initializer [-Wmissing-braces]
>     static struct resource rc32434_res_pci_mem2 = {
>                                                   ^
>    cc1: some warnings being treated as errors
> 
> vim +57 arch/mips/pci/pci-rc32434.c
> 
> 73b4390f Ralf Baechle 2008-07-16  50  
> 73b4390f Ralf Baechle 2008-07-16 @51  static struct resource rc32434_res_pci_mem1 = {
> 73b4390f Ralf Baechle 2008-07-16  52  	.name = "PCI MEM1",
> 73b4390f Ralf Baechle 2008-07-16  53  	.start = 0x50000000,
> 73b4390f Ralf Baechle 2008-07-16  54  	.end = 0x5FFFFFFF,
> 73b4390f Ralf Baechle 2008-07-16  55  	.flags = IORESOURCE_MEM,
> 73b4390f Ralf Baechle 2008-07-16  56  	.sibling = NULL,
> 73b4390f Ralf Baechle 2008-07-16 @57  	.child = &rc32434_res_pci_mem2
> 73b4390f Ralf Baechle 2008-07-16  58  };
> 73b4390f Ralf Baechle 2008-07-16  59  
> 
> :::::: The code at line 57 was first introduced by commit
> :::::: 73b4390fb23456964201abda79f1210fe337d01a [MIPS] Routerboard 532: Support for base system
> 
> :::::: TO: Ralf Baechle <ralf@linux-mips.org>
> :::::: CC: Ralf Baechle <ralf@linux-mips.org>
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* Re: [PATCHv3 0/4] drivers/base: bugfix for supplier<-consumer ordering in device_kset
From: Rafael J. Wysocki @ 2018-07-08  8:25 UTC (permalink / raw)
  To: Pingfan Liu
  Cc: Rafael J. Wysocki, Lukas Wunner, Linux Kernel Mailing List,
	Greg Kroah-Hartman, Grygorii Strashko, Christoph Hellwig,
	Bjorn Helgaas, Dave Young, Linux PCI, linuxppc-dev, Linux PM,
	Kishon Vijay Abraham I
In-Reply-To: <CAFgQCTsCjkDWJsNog_sTnS-ZaZ87fMFpKq1ekVSUG4MDs6K_jw@mail.gmail.com>

On Sat, Jul 7, 2018 at 6:24 AM, Pingfan Liu <kernelfans@gmail.com> wrote:
> On Fri, Jul 6, 2018 at 9:55 PM Pingfan Liu <kernelfans@gmail.com> wrote:
>>
>> On Fri, Jul 6, 2018 at 4:47 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>> >
>> > On Fri, Jul 6, 2018 at 10:36 AM, Lukas Wunner <lukas@wunner.de> wrote:
>> > > [cc += Kishon Vijay Abraham]
>> > >
>> > > On Thu, Jul 05, 2018 at 11:18:28AM +0200, Rafael J. Wysocki wrote:
>> > >> OK, so calling devices_kset_move_last() from really_probe() clearly is
>> > >> a mistake.
>> > >>
>> > >> I'm not really sure what the intention of it was as the changelog of
>> > >> commit 52cdbdd49853d doesn't really explain that (why would it be
>> > >> insufficient without that change?)
>> > >
>> > > It seems 52cdbdd49853d fixed an issue with boards which have an MMC
>> > > whose reset pin needs to be driven high on shutdown, lest the MMC
>> > > won't be found on the next boot.
>> > >
>> > > The boards' devicetrees use a kludge wherein the reset pin is modelled
>> > > as a regulator.  The regulator is enabled when the MMC probes and
>> > > disabled on driver unbind and shutdown.  As a result, the pin is driven
>> > > low on shutdown and the MMC is not found on the next boot.
>> > >
>> > > To fix this, another kludge was invented wherein the GPIO expander
>> > > driving the reset pin unconditionally drives all its pins high on
>> > > shutdown, see pcf857x_shutdown() in drivers/gpio/gpio-pcf857x.c
>> > > (commit adc284755055, "gpio: pcf857x: restore the initial line state
>> > > of all pcf lines").
>> > >
>> > > For this kludge to work, the GPIO expander's ->shutdown hook needs to
>> > > be executed after the MMC expander's ->shutdown hook.
>> > >
>> > > Commit 52cdbdd49853d achieved that by reordering devices_kset according
>> > > to the probe order.  Apparently the MMC probes after the GPIO expander,
>> > > possibly because it returns -EPROBE_DEFER if the vmmc regulator isn't
>> > > available yet, see mmc_regulator_get_supply().
>> > >
>> > > Note, I'm just piecing the information together from git history,
>> > > I'm not responsible for these kludges.  (I'm innocent!)
>> >
>> > Sure enough. :-)
>> >
>> > In any case, calling devices_kset_move_last() in really_probe() is
>> > plain broken and if its only purpose was to address a single, arguably
>> > kludgy, use case, let's just get rid of it in the first place IMO.
>> >
>> Yes, if it is only used for a single use case.
>>
> Think it again, I saw other potential issue with the current code.
> device_link_add->device_reorder_to_tail() can break the
> "supplier<-consumer" order. During moving children after parent's
> supplier, it ignores the order of child's consumer.

What do you mean?

> Beside this, essentially both devices_kset_move_after/_before() and
> device_pm_move_after/_before() expose  the shutdown order to the
> indirect caller,  and we can not expect that the caller can not handle
> it correctly. It should be a job of drivers core.

Arguably so, but that's how those functions were designed and the
callers should be aware of the limitation.

If they aren't, there is a bug in the caller.

> It is hard to extract high dimension info and pack them into one dimension
> linked-list.

Well, yes and no.

We know it for a fact that there is a linear ordering that will work.
It is inefficient to figure it out every time during system suspend
and resume, for one and that's why we have dpm_list.

Now, if we have it for suspend and resume, it can also be used for shutdown.

> And in theory, it is warranted that the shutdown seq is
> correct by using device tree info. More important, it is cheap with
> the data structure in hand. So I think it is time to resolve the issue
> once for all.

Not the way you want to do that, though.

Thanks,
Rafael

^ permalink raw reply

* Re: [PATCH 2/3] [v2] m68k: mac: use time64_t in RTC handling
From: Geert Uytterhoeven @ 2018-07-08 10:49 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Finn Thain, Paul Mackerras, Michael Ellerman, Joshua Thompson,
	Mathieu Malaterre, Benjamin Herrenschmidt, Greg Ungerer,
	linux-m68k, linuxppc-dev, Linux Kernel Mailing List,
	y2038 Mailman List, Meelis Roos, Andreas Schwab
In-Reply-To: <CAK8P3a2chXVhz+UJ+ujg8QMqyMmJ4zGj9XKuFZ+MsJ=dGqb3eg@mail.gmail.com>

Hi Arnd, Finn,

On Fri, Jun 22, 2018 at 10:55 AM Arnd Bergmann <arnd@arndb.de> wrote:
> On Fri, Jun 22, 2018 at 7:26 AM, Finn Thain <fthain@telegraphics.com.au> wrote:
> > On Tue, 19 Jun 2018, Arnd Bergmann wrote:
> >
> >> The real-time clock on m68k (and powerpc) mac systems uses an unsigned
> >> 32-bit value starting in 1904, which overflows in 2040, about two years
> >> later than everyone else, but this gets wrapped around in the Linux code
> >> in 2038 already because of the deprecated usage of time_t and/or long in
> >> the conversion.
> >>
> >> Getting rid of the deprecated interfaces makes it work until 2040 as
> >> documented, and it could be easily extended by reinterpreting the
> >> resulting time64_t as a positive number. For the moment, I'm adding a
> >> WARN_ON() that triggers if we encounter a time before 1970 or after 2040
> >> (the two are indistinguishable).
> >>
> >
> > I really don't like the WARN_ON(), but I'd prefer to address that in a
> > separate patch rather than impede the progress of this patch (or of this
> > series, since 3/3 seems to be unrelated).
> >
> > BTW, have you considered using the same wrap-around test (i.e. YY < 70)
> > that we use for the year register in the other RTC chips?
>
> That wrap-around test would have the same effect as the my original
> version (aside from the two bugs I now fixed), doing rougly
>
> -        return time - RTC_OFFSET;
> +        return (u32)(time - RTC_OFFSET);
>
> or some other variation of that will give us an RTC that supports all dates
> between 1970 and 2106. I don't think anyone so far had a strong
> preference here, so I went with what Mathieu suggested and kept the
> original Mac behavior, but added the WARN_ON().

So, is this safe to apply?
Especially in light of the warnings seen by Meelis with the PPC version.

Thanks!


Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [PATCH 2/3] [v2] m68k: mac: use time64_t in RTC handling
From: Finn Thain @ 2018-07-08 11:45 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Arnd Bergmann, Paul Mackerras, Michael Ellerman, Joshua Thompson,
	Mathieu Malaterre, Benjamin Herrenschmidt, Greg Ungerer,
	linux-m68k, linuxppc-dev, Linux Kernel Mailing List,
	y2038 Mailman List, Meelis Roos, Andreas Schwab
In-Reply-To: <CAMuHMdWzO1jzd=8SN5G_eoFObEAJR4-5u7JDExMefR__30QauA@mail.gmail.com>

On Sun, 8 Jul 2018, Geert Uytterhoeven wrote:

> On Fri, Jun 22, 2018 at 10:55 AM Arnd Bergmann <arnd@arndb.de> wrote:

> > I don't think anyone so far had a strong preference here, so I went 
> > with what Mathieu suggested and kept the original Mac behavior, but 
> > added the WARN_ON().
> 
> So, is this safe to apply?
> Especially in light of the warnings seen by Meelis with the PPC version.
> 

You mean, "can we apply this and avoid warning splats?"

Meelis's result says, "no".

I forget what date the RTC gets set to when the PMU/Cuda is reset but I 
suspect that timezone arithmetic in either MacOS or Linux could cause it 
to end up in 1969.

So I'd prefer to see the WARN_ON() removed.

-- 

^ permalink raw reply

* Re: powerpc: 32BIT vs. 64BIT (PPC32 vs. PPC64)
From: Michael Ellerman @ 2018-07-08 11:51 UTC (permalink / raw)
  To: Randy Dunlap, linux-kbuild
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Stephen Rothwell,
	linuxppc-dev
In-Reply-To: <3f906b03-98ff-c081-4e19-b490f0b35c51@infradead.org>

Randy Dunlap <rdunlap@infradead.org> writes:
> Hi,
>
> Is there a good way (or a shortcut) to do something like:

The best I know of is:

> $ make ARCH=powerpc O=PPC32 [other_options] allmodconfig
>   to get a PPC32/32BIT allmodconfig

$ echo CONFIG_PPC64=n > allmod.config
$ KCONFIG_ALLCONFIG=1 make allmodconfig
$ grep PPC32 .config
CONFIG_PPC32=y

Which is still a bit clunky.

I looked at this a while back and the problem we have is that the 32-bit
kernel is not a single thing. There are multiple 32-bit platforms which
are mutually exclusive.

eg, from menuconfig:

 - 512x/52xx/6xx/7xx/74xx/82xx/83xx/86xx
 - Freescale 85xx
 - Freescale 8xx
 - AMCC 40x
 - AMCC 44x, 46x or 47x
 - Freescale e200

So we could have a 32-bit allmodconfig, but we'd need to chose one of
the above, and we'd still only be testing some of the code.

Having said that you're the 2nd person to ask about this, so we should
clearly do something to make a 32-bit allmodconfig easier, even if it's
not perfect.

cheers

^ permalink raw reply

* [RFC PATCH] powerpc/64s: Move ISAv3.0 / POWER9 idle code to powernv C code
From: Nicholas Piggin @ 2018-07-08 14:24 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin, Gautham R . Shenoy, Vaidyanathan Srinivasan

Reimplement POWER9 idle code in C, in the powernv platform code.
Assembly stubs are used to save and restore the stack frame and
non-volatile GPRs before going to idle, but these are small and
mostly agnostic to microarchitecture implementation details.

POWER7/8 code is not converted (yet), but that's not a moving
target, and it doesn't make you want to claw your eyes out so
much with the POWER9 code untangled from it.

The optimisation where EC=ESL=0 idle modes did not have to save
GPRs or mtmsrd L=0 is restored, because it's simple to do.

Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
but saves and restores them all explicitly.

Moving the HMI, SPR, OPAL, locking, etc. to C is the only real
way this stuff will cope with non-trivial new CPU implementation
details, firmware changes, etc., without becoming unmaintainable.
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |   1 +
 arch/powerpc/include/asm/cpuidle.h            |  14 +-
 arch/powerpc/include/asm/paca.h               |  38 +-
 arch/powerpc/include/asm/processor.h          |   3 +-
 arch/powerpc/include/asm/reg.h                |   7 +-
 arch/powerpc/kernel/Makefile                  |   2 +-
 arch/powerpc/kernel/asm-offsets.c             |  11 +-
 arch/powerpc/kernel/exceptions-64s.S          |  10 +-
 arch/powerpc/kernel/idle_book3s.S             | 348 ++-------------
 arch/powerpc/kernel/idle_isa3.S               |  73 ++++
 arch/powerpc/kernel/setup-common.c            |   4 +-
 arch/powerpc/mm/slb.c                         |   7 +-
 arch/powerpc/platforms/powernv/idle.c         | 402 +++++++++++++++---
 arch/powerpc/xmon/xmon.c                      |  25 +-
 14 files changed, 496 insertions(+), 449 deletions(-)
 create mode 100644 arch/powerpc/kernel/idle_isa3.S

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 50ed64fba4ae..c626319a962d 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -486,6 +486,7 @@ static inline void hpte_init_pseries(void) { }
 extern void hpte_init_native(void);
 
 extern void slb_initialize(void);
+extern void __slb_flush_and_rebolt(void);
 extern void slb_flush_and_rebolt(void);
 
 extern void slb_vmalloc_update(void);
diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index e210a83eb196..b668f030d531 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -28,6 +28,7 @@
  * yet woken from the winkle state.
  */
 #define PNV_CORE_IDLE_LOCK_BIT			0x10000000
+#define NR_PNV_CORE_IDLE_LOCK_BIT		28
 
 #define PNV_CORE_IDLE_WINKLE_COUNT		0x00010000
 #define PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT	0x00080000
@@ -68,22 +69,9 @@
 #define ERR_DEEP_STATE_ESL_MISMATCH	-2
 
 #ifndef __ASSEMBLY__
-/* Additional SPRs that need to be saved/restored during stop */
-struct stop_sprs {
-	u64 pid;
-	u64 ldbar;
-	u64 fscr;
-	u64 hfscr;
-	u64 mmcr1;
-	u64 mmcr2;
-	u64 mmcra;
-};
-
 extern u32 pnv_fastsleep_workaround_at_entry[];
 extern u32 pnv_fastsleep_workaround_at_exit[];
 
-extern u64 pnv_first_deep_stop_state;
-
 unsigned long pnv_cpu_offline(unsigned int cpu);
 int validate_psscr_val_mask(u64 *psscr_val, u64 *psscr_mask, u32 flags);
 static inline void report_invalid_psscr_val(u64 psscr_val, int err)
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 4e9cede5a7e7..a7a4892d39c0 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -178,23 +178,29 @@ struct paca_struct {
 #endif
 
 #ifdef CONFIG_PPC_POWERNV
-	/* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
-	u32 *core_idle_state_ptr;
-	u8 thread_idle_state;		/* PNV_THREAD_RUNNING/NAP/SLEEP	*/
-	/* Mask to indicate thread id in core */
-	u8 thread_mask;
-	/* Mask to denote subcore sibling threads */
-	u8 subcore_sibling_mask;
-	/* Flag to request this thread not to stop */
-	atomic_t dont_stop;
-	/* The PSSCR value that the kernel requested before going to stop */
-	u64 requested_psscr;
+	union {
+		/* P7/P8 specific fields */
+		struct {
+			/* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
+			unsigned long *core_idle_state_ptr;
+			u8 thread_idle_state;	/* PNV_THREAD_RUNNING/NAP/SLEEP	*/
+			/* Mask to indicate thread id in core */
+			u8 thread_mask;
+			/* Mask to denote subcore sibling threads */
+			u8 subcore_sibling_mask;
+		};
 
-	/*
-	 * Save area for additional SPRs that need to be
-	 * saved/restored during cpuidle stop.
-	 */
-	struct stop_sprs stop_sprs;
+		/* P9 specific fields */
+		struct {
+			/* The PSSCR value that the kernel requested before going to stop */
+			u64 requested_psscr;
+			unsigned long idle_state;
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+			/* Flag to request this thread not to stop */
+			atomic_t dont_stop;
+#endif
+		};
+	};
 #endif
 
 #ifdef CONFIG_PPC_BOOK3S_64
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 5debe337ea9d..4774ec7603ee 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -513,7 +513,8 @@ enum idle_boot_override {IDLE_NO_OVERRIDE = 0, IDLE_POWERSAVE_OFF};
 extern int powersave_nap;	/* set if nap mode can be used in idle loop */
 extern unsigned long power7_idle_insn(unsigned long type); /* PNV_THREAD_NAP/etc*/
 extern void power7_idle_type(unsigned long type);
-extern unsigned long power9_idle_stop(unsigned long psscr_val);
+extern unsigned long isa3_idle_stop_noloss(unsigned long psscr_val);
+extern unsigned long isa3_idle_stop_mayloss(unsigned long psscr_val);
 extern unsigned long power9_offline_stop(unsigned long psscr_val);
 extern void power9_idle_type(unsigned long stop_psscr_val,
 			      unsigned long stop_psscr_mask);
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 562568414cf4..3c7d97b2abb0 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -750,10 +750,9 @@
 #define	  SRR1_WAKERESET	0x00100000 /* System reset */
 #define   SRR1_WAKEHDBELL	0x000c0000 /* Hypervisor doorbell on P8 */
 #define	  SRR1_WAKESTATE	0x00030000 /* Powersave exit mask [46:47] */
-#define	  SRR1_WS_DEEPEST	0x00030000 /* Some resources not maintained,
-					  * may not be recoverable */
-#define	  SRR1_WS_DEEPER	0x00020000 /* Some resources not maintained */
-#define	  SRR1_WS_DEEP		0x00010000 /* All resources maintained */
+#define	  SRR1_WS_HVLOSS	0x00030000 /* HV resources not maintained */
+#define	  SRR1_WS_GPRLOSS	0x00020000 /* GPRs not maintained */
+#define	  SRR1_WS_NOLOSS	0x00010000 /* All resources maintained */
 #define   SRR1_PROGTM		0x00200000 /* TM Bad Thing */
 #define   SRR1_PROGFPE		0x00100000 /* Floating Point Enabled */
 #define   SRR1_PROGILL		0x00080000 /* Illegal instruction */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 2b4c40b255e4..6914fab16e2c 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -48,7 +48,7 @@ obj-$(CONFIG_PPC_BOOK3E_64)	+= exceptions-64e.o idle_book3e.o
 obj-$(CONFIG_PPC64)		+= vdso64/
 obj-$(CONFIG_ALTIVEC)		+= vecemu.o
 obj-$(CONFIG_PPC_970_NAP)	+= idle_power4.o
-obj-$(CONFIG_PPC_P7_NAP)	+= idle_book3s.o
+obj-$(CONFIG_PPC_P7_NAP)	+= idle_book3s.o idle_isa3.o
 procfs-y			:= proc_powerpc.o
 obj-$(CONFIG_PROC_FS)		+= $(procfs-y)
 rtaspci-$(CONFIG_PPC64)-$(CONFIG_PCI)	:= rtas_pci.o
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 89cf15566c4e..b8162bb80ddb 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -762,20 +762,11 @@ int main(void)
 #endif
 
 #ifdef CONFIG_PPC_POWERNV
+	/* POWER7/8 specific idle fields (kernel/idle_book3s.S) */
 	OFFSET(PACA_CORE_IDLE_STATE_PTR, paca_struct, core_idle_state_ptr);
 	OFFSET(PACA_THREAD_IDLE_STATE, paca_struct, thread_idle_state);
 	OFFSET(PACA_THREAD_MASK, paca_struct, thread_mask);
 	OFFSET(PACA_SUBCORE_SIBLING_MASK, paca_struct, subcore_sibling_mask);
-	OFFSET(PACA_REQ_PSSCR, paca_struct, requested_psscr);
-	OFFSET(PACA_DONT_STOP, paca_struct, dont_stop);
-#define STOP_SPR(x, f)	OFFSET(x, paca_struct, stop_sprs.f)
-	STOP_SPR(STOP_PID, pid);
-	STOP_SPR(STOP_LDBAR, ldbar);
-	STOP_SPR(STOP_FSCR, fscr);
-	STOP_SPR(STOP_HFSCR, hfscr);
-	STOP_SPR(STOP_MMCR1, mmcr1);
-	STOP_SPR(STOP_MMCR2, mmcr2);
-	STOP_SPR(STOP_MMCRA, mmcra);
 #endif
 
 	DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 76a14702cb9c..36b5f0e18c0c 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -135,8 +135,14 @@ TRAMP_KVM(PACA_EXNMI, 0x100)
 
 #ifdef CONFIG_PPC_P7_NAP
 EXC_COMMON_BEGIN(system_reset_idle_common)
+BEGIN_FTR_SECTION
+	mfspr	r3,SPRN_SRR1
+	bltlr	cr3	/* no state loss, return to idle caller */
+	b	isa3_idle_wake_gpr_loss
+FTR_SECTION_ELSE
 	mfspr	r12,SPRN_SRR1
 	b	pnv_powersave_wakeup
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
 #endif
 
 /*
@@ -425,7 +431,9 @@ EXC_COMMON_BEGIN(machine_check_idle_common)
 	li	r11,0
 	mtmsrd	r11,1
 
-	b	pnv_powersave_wakeup_mce
+	/* XXX fixup
+	b	pnv_powersave_wakeup_mce */
+	b	.
 #endif
 	/*
 	 * Handle machine check early in real mode. We come here with
diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index d85d5515a091..506b88768767 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -1,6 +1,6 @@
 /*
- *  This file contains idle entry/exit functions for POWER7,
- *  POWER8 and POWER9 CPUs.
+ *  This file contains idle entry/exit functions for POWER7 and
+ *  POWER8 CPUs.
  *
  *  This program is free software; you can redistribute it and/or
  *  modify it under the terms of the GNU General Public License
@@ -56,19 +56,8 @@ save_sprs_to_stack:
 	 * Note all register i.e per-core, per-subcore or per-thread is saved
 	 * here since any thread in the core might wake up first
 	 */
-BEGIN_FTR_SECTION
-	/*
-	 * Note - SDR1 is dropped in Power ISA v3. Hence not restoring
-	 * SDR1 here
-	 */
-	mfspr	r3,SPRN_PTCR
-	std	r3,_PTCR(r1)
-	mfspr	r3,SPRN_LPCR
-	std	r3,_LPCR(r1)
-FTR_SECTION_ELSE
 	mfspr	r3,SPRN_SDR1
 	std	r3,_SDR1(r1)
-ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
 	mfspr	r3,SPRN_RPR
 	std	r3,_RPR(r1)
 	mfspr	r3,SPRN_SPURR
@@ -85,66 +74,6 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
 	std	r3,_WORT(r1)
 	mfspr	r3,SPRN_WORC
 	std	r3,_WORC(r1)
-/*
- * On POWER9, there are idle states such as stop4, invoked via cpuidle,
- * that lose hypervisor resources. In such cases, we need to save
- * additional SPRs before entering those idle states so that they can
- * be restored to their older values on wakeup from the idle state.
- *
- * On POWER8, the only such deep idle state is winkle which is used
- * only in the context of CPU-Hotplug, where these additional SPRs are
- * reinitiazed to a sane value. Hence there is no need to save/restore
- * these SPRs.
- */
-BEGIN_FTR_SECTION
-	blr
-END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
-
-power9_save_additional_sprs:
-	mfspr	r3, SPRN_PID
-	mfspr	r4, SPRN_LDBAR
-	std	r3, STOP_PID(r13)
-	std	r4, STOP_LDBAR(r13)
-
-	mfspr	r3, SPRN_FSCR
-	mfspr	r4, SPRN_HFSCR
-	std	r3, STOP_FSCR(r13)
-	std	r4, STOP_HFSCR(r13)
-
-	mfspr	r3, SPRN_MMCRA
-	mfspr	r4, SPRN_MMCR0
-	std	r3, STOP_MMCRA(r13)
-	std	r4, _MMCR0(r1)
-
-	mfspr	r3, SPRN_MMCR1
-	mfspr	r4, SPRN_MMCR2
-	std	r3, STOP_MMCR1(r13)
-	std	r4, STOP_MMCR2(r13)
-	blr
-
-power9_restore_additional_sprs:
-	ld	r3,_LPCR(r1)
-	ld	r4, STOP_PID(r13)
-	mtspr	SPRN_LPCR,r3
-	mtspr	SPRN_PID, r4
-
-	ld	r3, STOP_LDBAR(r13)
-	ld	r4, STOP_FSCR(r13)
-	mtspr	SPRN_LDBAR, r3
-	mtspr	SPRN_FSCR, r4
-
-	ld	r3, STOP_HFSCR(r13)
-	ld	r4, STOP_MMCRA(r13)
-	mtspr	SPRN_HFSCR, r3
-	mtspr	SPRN_MMCRA, r4
-
-	ld	r3, _MMCR0(r1)
-	ld	r4, STOP_MMCR1(r13)
-	mtspr	SPRN_MMCR0, r3
-	mtspr	SPRN_MMCR1, r4
-
-	ld	r3, STOP_MMCR2(r13)
-	mtspr	SPRN_MMCR2, r3
 	blr
 
 /*
@@ -167,13 +96,23 @@ core_idle_lock_held:
 	blr
 
 /*
- * Pass requested state in r3:
- *	r3 - PNV_THREAD_NAP/SLEEP/WINKLE in POWER8
- *	   - Requested PSSCR value in POWER9
- *
- * Address of idle handler to branch to in realmode in r4
+ * This is the sequence required to execute idle instructions, as
+ * specified in ISA v2.07 (and earlier). MSR[IR] and MSR[DR] must be 0.
+ */
+#define IDLE_STATE_ENTER_SEQ_NORET(IDLE_INST)			\
+	/* Magic NAP/SLEEP/WINKLE mode enter sequence */	\
+	std	r0,0(r1);					\
+	ptesync;						\
+	ld	r0,0(r1);					\
+236:	cmpd	cr0,r0,r0;					\
+	bne	236b;						\
+	IDLE_INST;
+
+/*
+ * Entered with MSR[EE]=0 and no soft-masked interrupts pending.
+ * r3 contains desired idle state (PNV_THREAD_NAP/SLEEP/WINKLE).
  */
-pnv_powersave_common:
+_GLOBAL(power7_idle_insn)
 	/* Use r3 to pass state nap/sleep/winkle */
 	/* NAP is a state loss, we create a regs frame on the
 	 * stack, fill it up with the state we care about and
@@ -181,8 +120,6 @@ pnv_powersave_common:
 	 * need to save PC, some CR bits and the NV GPRs,
 	 * but for now an interrupt frame will do.
 	 */
-	mtctr	r4
-
 	mflr	r0
 	std	r0,16(r1)
 	stdu	r1,-INT_FRAME_SIZE(r1)
@@ -200,16 +137,7 @@ pnv_powersave_common:
 	std	r5,_CCR(r1)
 	std	r1,PACAR1(r13)
 
-BEGIN_FTR_SECTION
-	/*
-	 * POWER9 does not require real mode to stop, and presently does not
-	 * set hwthread_state for KVM (threads don't share MMU context), so
-	 * we can remain in virtual mode for this.
-	 */
-	bctr
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 	/*
-	 * POWER8
 	 * Go to real mode to do the nap, as required by the architecture.
 	 * Also, we need to be in real mode before setting hwthread_state,
 	 * because as soon as we do that, another thread can switch
@@ -217,24 +145,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 	 */
 	LOAD_REG_IMMEDIATE(r7, MSR_IDLE)
 	mtmsrd	r7,0
-	bctr
 
-/*
- * This is the sequence required to execute idle instructions, as
- * specified in ISA v2.07 (and earlier). MSR[IR] and MSR[DR] must be 0.
- */
-#define IDLE_STATE_ENTER_SEQ_NORET(IDLE_INST)			\
-	/* Magic NAP/SLEEP/WINKLE mode enter sequence */	\
-	std	r0,0(r1);					\
-	ptesync;						\
-	ld	r0,0(r1);					\
-236:	cmpd	cr0,r0,r0;					\
-	bne	236b;						\
-	IDLE_INST;
-
-
-	.globl pnv_enter_arch207_idle_mode
-pnv_enter_arch207_idle_mode:
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 	/* Tell KVM we're entering idle */
 	li	r4,KVM_HWTHREAD_IN_IDLE
@@ -321,86 +232,6 @@ enter_winkle:
 
 	IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
 
-/*
- * r3 - PSSCR value corresponding to the requested stop state.
- */
-power_enter_stop:
-/*
- * Check if we are executing the lite variant with ESL=EC=0
- */
-	andis.   r4,r3,PSSCR_EC_ESL_MASK_SHIFTED
-	clrldi   r3,r3,60 /* r3 = Bits[60:63] = Requested Level (RL) */
-	bne	 .Lhandle_esl_ec_set
-	PPC_STOP
-	li	r3,0  /* Since we didn't lose state, return 0 */
-	std	r3, PACA_REQ_PSSCR(r13)
-
-	/*
-	 * pnv_wakeup_noloss() expects r12 to contain the SRR1 value so
-	 * it can determine if the wakeup reason is an HMI in
-	 * CHECK_HMI_INTERRUPT.
-	 *
-	 * However, when we wakeup with ESL=0, SRR1 will not contain the wakeup
-	 * reason, so there is no point setting r12 to SRR1.
-	 *
-	 * Further, we clear r12 here, so that we don't accidentally enter the
-	 * HMI in pnv_wakeup_noloss() if the value of r12[42:45] == WAKE_HMI.
-	 */
-	li	r12, 0
-	b 	pnv_wakeup_noloss
-
-.Lhandle_esl_ec_set:
-BEGIN_FTR_SECTION
-	/*
-	 * POWER9 DD2.0 or earlier can incorrectly set PMAO when waking up after
-	 * a state-loss idle. Saving and restoring MMCR0 over idle is a
-	 * workaround.
-	 */
-	mfspr	r4,SPRN_MMCR0
-	std	r4,_MMCR0(r1)
-END_FTR_SECTION_IFCLR(CPU_FTR_POWER9_DD2_1)
-
-/*
- * Check if the requested state is a deep idle state.
- */
-	LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state)
-	ld	r4,ADDROFF(pnv_first_deep_stop_state)(r5)
-	cmpd	r3,r4
-	bge	.Lhandle_deep_stop
-	PPC_STOP	/* Does not return (system reset interrupt) */
-
-.Lhandle_deep_stop:
-/*
- * Entering deep idle state.
- * Clear thread bit in PACA_CORE_IDLE_STATE, save SPRs to
- * stack and enter stop
- */
-	lbz     r7,PACA_THREAD_MASK(r13)
-	ld      r14,PACA_CORE_IDLE_STATE_PTR(r13)
-
-lwarx_loop_stop:
-	lwarx   r15,0,r14
-	andis.	r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
-	bnel-	core_idle_lock_held
-	andc    r15,r15,r7                      /* Clear thread bit */
-
-	stwcx.  r15,0,r14
-	bne-    lwarx_loop_stop
-	isync
-
-	bl	save_sprs_to_stack
-
-	PPC_STOP	/* Does not return (system reset interrupt) */
-
-/*
- * Entered with MSR[EE]=0 and no soft-masked interrupts pending.
- * r3 contains desired idle state (PNV_THREAD_NAP/SLEEP/WINKLE).
- */
-_GLOBAL(power7_idle_insn)
-	/* Now check if user or arch enabled NAP mode */
-	LOAD_REG_ADDR(r4, pnv_enter_arch207_idle_mode)
-	b	pnv_powersave_common
-
 #define CHECK_HMI_INTERRUPT						\
 BEGIN_FTR_SECTION_NESTED(66);						\
 	rlwinm	r0,r12,45-31,0xf;  /* extract wake reason field (P8) */	\
@@ -419,53 +250,6 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66);		\
 	ld	r3,ORIG_GPR3(r1);	/* Restore original r3 */	\
 20:	nop;
 
-/*
- * Entered with MSR[EE]=0 and no soft-masked interrupts pending.
- * r3 contains desired PSSCR register value.
- *
- * Offline (CPU unplug) case also must notify KVM that the CPU is
- * idle.
- */
-_GLOBAL(power9_offline_stop)
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-	/*
-	 * Tell KVM we're entering idle.
-	 * This does not have to be done in real mode because the P9 MMU
-	 * is independent per-thread. Some steppings share radix/hash mode
-	 * between threads, but in that case KVM has a barrier sync in real
-	 * mode before and after switching between radix and hash.
-	 */
-	li	r4,KVM_HWTHREAD_IN_IDLE
-	stb	r4,HSTATE_HWTHREAD_STATE(r13)
-#endif
-	/* fall through */
-
-_GLOBAL(power9_idle_stop)
-	std	r3, PACA_REQ_PSSCR(r13)
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-BEGIN_FTR_SECTION
-	sync
-	lwz	r5, PACA_DONT_STOP(r13)
-	cmpwi	r5, 0
-	bne	1f
-END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_XER_SO_BUG)
-#endif
-	mtspr 	SPRN_PSSCR,r3
-	LOAD_REG_ADDR(r4,power_enter_stop)
-	b	pnv_powersave_common
-	/* No return */
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-1:
-	/*
-	 * We get here when TM / thread reconfiguration bug workaround
-	 * code wants to get the CPU into SMT4 mode, and therefore
-	 * we are being asked not to stop.
-	 */
-	li	r3, 0
-	std	r3, PACA_REQ_PSSCR(r13)
-	blr		/* return 0 for wakeup cause / SRR1 value */
-#endif
-
 /*
  * Called from machine check handler for powersave wakeups.
  * Low level machine check processing has already been done. Now just
@@ -499,11 +283,7 @@ pnv_powersave_wakeup_mce:
 pnv_powersave_wakeup:
 	ld	r2, PACATOC(r13)
 
-BEGIN_FTR_SECTION
-	bl	pnv_restore_hyp_resource_arch300
-FTR_SECTION_ELSE
-	bl	pnv_restore_hyp_resource_arch207
-ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
+	bl	pnv_restore_hyp_resource
 
 	li	r0,PNV_THREAD_RUNNING
 	stb	r0,PACA_THREAD_IDLE_STATE(r13)	/* Clear thread state */
@@ -535,50 +315,8 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
  *
  * cr3 - set to gt if waking up with partial/complete hypervisor state loss
  */
-pnv_restore_hyp_resource_arch300:
-	/*
-	 * Workaround for POWER9, if we lost resources, the ERAT
-	 * might have been mixed up and needs flushing. We also need
-	 * to reload MMCR0 (see comment above). We also need to set
-	 * then clear bit 60 in MMCRA to ensure the PMU starts running.
-	 */
-	blt	cr3,1f
-BEGIN_FTR_SECTION
-	PPC_INVALIDATE_ERAT
-	ld	r1,PACAR1(r13)
-	ld	r4,_MMCR0(r1)
-	mtspr	SPRN_MMCR0,r4
-END_FTR_SECTION_IFCLR(CPU_FTR_POWER9_DD2_1)
-	mfspr	r4,SPRN_MMCRA
-	ori	r4,r4,(1 << (63-60))
-	mtspr	SPRN_MMCRA,r4
-	xori	r4,r4,(1 << (63-60))
-	mtspr	SPRN_MMCRA,r4
-1:
-	/*
-	 * POWER ISA 3. Use PSSCR to determine if we
-	 * are waking up from deep idle state
-	 */
-	LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state)
-	ld	r4,ADDROFF(pnv_first_deep_stop_state)(r5)
-
+pnv_restore_hyp_resource:
 	/*
-	 * 0-3 bits correspond to Power-Saving Level Status
-	 * which indicates the idle state we are waking up from
-	 */
-	mfspr	r5, SPRN_PSSCR
-	rldicl  r5,r5,4,60
-	li	r0, 0		/* clear requested_psscr to say we're awake */
-	std	r0, PACA_REQ_PSSCR(r13)
-	cmpd	cr4,r5,r4
-	bge	cr4,pnv_wakeup_tb_loss /* returns to caller */
-
-	blr	/* Waking up without hypervisor state loss. */
-
-/* Same calling convention as arch300 */
-pnv_restore_hyp_resource_arch207:
-	/*
-	 * POWER ISA 2.07 or less.
 	 * Check if we slept with sleep or winkle.
 	 */
 	lbz	r4,PACA_THREAD_IDLE_STATE(r13)
@@ -598,15 +336,9 @@ pnv_restore_hyp_resource_arch207:
  * Called if waking up from idle state which can cause either partial or
  * complete hyp state loss.
  * In POWER8, called if waking up from fastsleep or winkle
- * In POWER9, called if waking up from stop state >= pnv_first_deep_stop_state
  *
  * r13 - PACA
  * cr3 - gt if waking up with partial/complete hypervisor state loss
- *
- * If ISA300:
- * cr4 - gt or eq if waking up from complete hypervisor state loss.
- *
- * If ISA207:
  * r4 - PACA_THREAD_IDLE_STATE
  */
 pnv_wakeup_tb_loss:
@@ -621,9 +353,7 @@ pnv_wakeup_tb_loss:
 	 * and SRR1 test for restoring NVGPRs.
 	 *
 	 * We are about to clobber NVGPRs now, so set NAPSTATELOST to
-	 * guarantee they will always be restored. This might be tightened
-	 * with careful reading of specs (particularly for ISA300) but this
-	 * is already a slow wakeup path and it's simpler to be safe.
+	 * guarantee they will always be restored.
 	 */
 	li	r0,1
 	stb	r0,PACA_NAPSTATELOST(r13)
@@ -672,19 +402,16 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
 	 * At this stage
 	 * cr2 - eq if first thread to wakeup in core
 	 * cr3-  gt if waking up with partial/complete hypervisor state loss
-	 * ISA300:
-	 * cr4 - gt or eq if waking up from complete hypervisor state loss.
 	 */
 
-BEGIN_FTR_SECTION
 	/*
 	 * Were we in winkle?
 	 * If yes, check if all threads were in winkle, decrement our
 	 * winkle count, set all thread winkle bits if all were in winkle.
-	 * Check if our thread has a winkle bit set, and set cr4 accordingly
-	 * (to match ISA300, above). Pseudo-code for core idle state
-	 * transitions for ISA207 is as follows (everything happens atomically
-	 * due to store conditional and/or lock bit):
+	 * Check if our thread has a winkle bit set, and set cr4 accordingly.
+	 * Pseudo-code for core idle state transitions for ISA207 is as follows
+	 * (everything happens atomically due to store conditional and/or lock
+	 * bit):
 	 *
 	 * nap_idle() { }
 	 * nap_wake() { }
@@ -749,7 +476,6 @@ BEGIN_FTR_SECTION
 
 	or	r15,r15,r7		/* Set thread bit */
 	beq	first_thread_in_subcore
-END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
 
 	or	r15,r15,r7		/* Set thread bit */
 	beq	cr2,first_thread_in_core
@@ -815,15 +541,6 @@ timebase_resync:
 	 * complete hypervisor state loss. Restore per core hypervisor
 	 * state.
 	 */
-BEGIN_FTR_SECTION
-	ld	r4,_PTCR(r1)
-	mtspr	SPRN_PTCR,r4
-	ld	r4,_RPR(r1)
-	mtspr	SPRN_RPR,r4
-	ld	r4,_AMOR(r1)
-	mtspr	SPRN_AMOR,r4
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
-
 	ld	r4,_TSCR(r1)
 	mtspr	SPRN_TSCR,r4
 	ld	r4,_WORC(r1)
@@ -845,9 +562,6 @@ common_exit:
 
 	/* Waking up from winkle */
 
-BEGIN_MMU_FTR_SECTION
-	b	no_segments
-END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)
 	/* Restore SLB  from PACA */
 	ld	r8,PACA_SLBSHADOWPTR(r13)
 
@@ -861,7 +575,6 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)
 	slbmte	r6,r5
 1:	addi	r8,r8,16
 	.endr
-no_segments:
 
 	/* Restore per thread state */
 
@@ -884,17 +597,6 @@ no_segments:
 	mtctr	r12
 	bctrl
 
-/*
- * On POWER9, we can come here on wakeup from a cpuidle stop state.
- * Hence restore the additional SPRs to the saved value.
- *
- * On POWER8, we come here only on winkle. Since winkle is used
- * only in the case of CPU-Hotplug, we don't need to restore
- * the additional SPRs.
- */
-BEGIN_FTR_SECTION
-	bl 	power9_restore_additional_sprs
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 hypervisor_state_restored:
 
 	mr	r12,r19
diff --git a/arch/powerpc/kernel/idle_isa3.S b/arch/powerpc/kernel/idle_isa3.S
new file mode 100644
index 000000000000..c869512b716a
--- /dev/null
+++ b/arch/powerpc/kernel/idle_isa3.S
@@ -0,0 +1,73 @@
+/*
+ *  This file contains general idle entry/exit functions. The platform / CPU
+ *  must call the correct save/restore functions and ensure SPRs are saved
+ *  and restored correctly, handle KVM, interrupts, etc.
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ */
+
+#include <asm/ppc_asm.h>
+#include <asm/asm-offsets.h>
+#include <asm/ppc-opcode.h>
+
+/*
+ * Desired PSSCR in r3
+ *
+ * No state will be lost regardless of wakeup mechanism (interrupt or NIA).
+ * Interrupt driven wakeup may clobber volatiles, and should blr (with LR
+ * unchanged) to return to caller with r3 set according to caller's expected
+ * return code (for Book3S/64 that is SRR1).
+ *
+ * Caller is responsible for restoring SPRs, MSR, etc.
+ */
+_GLOBAL(isa3_idle_stop_noloss)
+	mtspr 	SPRN_PSSCR,r3
+	PPC_STOP
+	li	r3,0
+	blr
+
+/*
+ * Desired PSSCR in r3
+ *
+ * GPRs may be lost, so they are saved here. Wakeup is by interrupt only.
+ * Wakeup can return to caller by calling pnv_powersave_wakeup_gpr_loss
+ * with r3 set to return value.
+ *
+ * A wakeup without GPR loss may alterateively be handled as in
+ * isa3_idle_stop_noloss as an optimisation.
+ *
+ * Caller is responsible for restoring SPRs, MSR, etc.
+ */
+_GLOBAL(isa3_idle_stop_mayloss)
+	std	r1,PACAR1(r13)
+	mflr	r4
+	mfcr	r5
+	/* use stack red zone rather than a new frame */
+	addi	r6,r1,-INT_FRAME_SIZE
+	SAVE_GPR(2, r6)
+	SAVE_NVGPRS(r6)
+	std	r4,_LINK(r6)
+	std	r5,_CCR(r6)
+	mtspr 	SPRN_PSSCR,r3
+	PPC_STOP
+	b	.
+
+/*
+ * Desired return value in r3
+ *
+ * Idle wakeup can call this after calling isa3_idle_stop_loss to
+ * return to caller with r3 as return code.
+ */
+_GLOBAL(isa3_idle_wake_gpr_loss)
+	ld	r1,PACAR1(r13)
+	addi	r6,r1,-INT_FRAME_SIZE
+	ld	r4,_LINK(r6)
+	ld	r5,_CCR(r6)
+	REST_NVGPRS(r6)
+	REST_GPR(2, r6)
+	mtlr	r4
+	mtcr	r5
+	blr
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 40b44bb53a4e..e089da156ef3 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -401,8 +401,8 @@ void __init check_for_initrd(void)
 
 #ifdef CONFIG_SMP
 
-int threads_per_core, threads_per_subcore, threads_shift;
-cpumask_t threads_core_mask;
+int threads_per_core, threads_per_subcore, threads_shift __read_mostly;
+cpumask_t threads_core_mask __read_mostly;
 EXPORT_SYMBOL_GPL(threads_per_core);
 EXPORT_SYMBOL_GPL(threads_per_subcore);
 EXPORT_SYMBOL_GPL(threads_shift);
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index cb796724a6fc..2d5db5e0132e 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -90,7 +90,7 @@ static inline void create_shadowed_slbe(unsigned long ea, int ssize,
 		     : "memory" );
 }
 
-static void __slb_flush_and_rebolt(void)
+void __slb_flush_and_rebolt(void)
 {
 	/* If you change this make sure you change SLB_NUM_BOLTED
 	 * and PR KVM appropriately too. */
@@ -128,6 +128,8 @@ static void __slb_flush_and_rebolt(void)
 		        "r"(ksp_vsid_data),
 		        "r"(ksp_esid_data)
 		     : "memory");
+
+	get_paca()->slb_cache_ptr = 0;
 }
 
 void slb_flush_and_rebolt(void)
@@ -142,7 +144,6 @@ void slb_flush_and_rebolt(void)
 	hard_irq_disable();
 
 	__slb_flush_and_rebolt();
-	get_paca()->slb_cache_ptr = 0;
 }
 
 void slb_vmalloc_update(void)
@@ -213,6 +214,7 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 			asm volatile("slbie %0" : : "r" (slbie_data));
 		}
 		asm volatile("isync" : : : "memory");
+		get_paca()->slb_cache_ptr = 0;
 	} else {
 		__slb_flush_and_rebolt();
 	}
@@ -221,7 +223,6 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 	if (offset == 1 || offset > SLB_CACHE_ENTRIES)
 		asm volatile("slbie %0" : : "r" (slbie_data));
 
-	get_paca()->slb_cache_ptr = 0;
 	copy_mm_to_paca(mm);
 
 	/*
diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 12f13acee1f6..2e129b882727 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -16,6 +16,7 @@
 #include <linux/device.h>
 #include <linux/cpu.h>
 
+#include <asm/asm-prototypes.h>
 #include <asm/firmware.h>
 #include <asm/machdep.h>
 #include <asm/opal.h>
@@ -46,10 +47,10 @@ static u64 pnv_default_stop_mask;
 static bool default_stop_found;
 
 /*
- * First deep stop state. Used to figure out when to save/restore
- * hypervisor context.
+ * First stop state levels when HV and TB loss can occur.
  */
-u64 pnv_first_deep_stop_state = MAX_STOP_STATE;
+static u64 pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
+static u64 pnv_first_hv_loss_level = MAX_STOP_STATE + 1;
 
 /*
  * psscr value and mask of the deepest stop idle state.
@@ -135,11 +136,11 @@ static int pnv_save_sprs_for_deep_states(void)
 	return 0;
 }
 
-static void pnv_alloc_idle_core_states(void)
+static void pnv_alloc_idle_core_states_p8(void)
 {
 	int i, j;
 	int nr_cores = cpu_nr_cores();
-	u32 *core_idle_state;
+	unsigned long *core_idle_state;
 
 	/*
 	 * core_idle_state - The lower 8 bits track the idle state of
@@ -166,7 +167,7 @@ static void pnv_alloc_idle_core_states(void)
 		int node = cpu_to_node(first_cpu);
 		size_t paca_ptr_array_size;
 
-		core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL, node);
+		core_idle_state = kmalloc_node(sizeof(unsigned long), GFP_KERNEL, node);
 		*core_idle_state = (1 << threads_per_core) - 1;
 		paca_ptr_array_size = (threads_per_core *
 				       sizeof(struct paca_struct *));
@@ -181,41 +182,6 @@ static void pnv_alloc_idle_core_states(void)
 	}
 
 	update_subcore_sibling_mask();
-
-	if (supported_cpuidle_states & OPAL_PM_LOSE_FULL_CONTEXT) {
-		int rc = pnv_save_sprs_for_deep_states();
-
-		if (likely(!rc))
-			return;
-
-		/*
-		 * The stop-api is unable to restore hypervisor
-		 * resources on wakeup from platform idle states which
-		 * lose full context. So disable such states.
-		 */
-		supported_cpuidle_states &= ~OPAL_PM_LOSE_FULL_CONTEXT;
-		pr_warn("cpuidle-powernv: Disabling idle states that lose full context\n");
-		pr_warn("cpuidle-powernv: Idle power-savings, CPU-Hotplug affected\n");
-
-		if (cpu_has_feature(CPU_FTR_ARCH_300) &&
-		    (pnv_deepest_stop_flag & OPAL_PM_LOSE_FULL_CONTEXT)) {
-			/*
-			 * Use the default stop state for CPU-Hotplug
-			 * if available.
-			 */
-			if (default_stop_found) {
-				pnv_deepest_stop_psscr_val =
-					pnv_default_stop_val;
-				pnv_deepest_stop_psscr_mask =
-					pnv_default_stop_mask;
-				pr_warn("cpuidle-powernv: Offlined CPUs will stop with psscr = 0x%016llx\n",
-					pnv_deepest_stop_psscr_val);
-			} else { /* Fallback to snooze loop for CPU-Hotplug */
-				deepest_stop_found = false;
-				pr_warn("cpuidle-powernv: Offlined CPUs will busy wait\n");
-			}
-		}
-	}
 }
 
 u32 pnv_get_supported_cpuidle_states(void)
@@ -345,6 +311,263 @@ void power7_idle(void)
 	power7_idle_type(PNV_THREAD_NAP);
 }
 
+struct p9_sprs {
+	/* per core */
+	u64 ptcr;
+	u64 rpr;
+	u64 tscr;
+	u64 ldbar;
+
+	/* per thread */
+	u64 lpcr;
+	u64 hfscr;
+	u64 fscr;
+	u64 pid;
+	u64 purr;
+	u64 spurr;
+	u64 dscr;
+
+	u64 mmcra;
+	u32 mmcr0;
+	u32 mmcr1;
+	u64 mmcr2;
+};
+
+static inline void atomic_start_thread_idle(void)
+{
+	int cpu = raw_smp_processor_id();
+	int first = cpu_first_thread_sibling(cpu);
+	int thread = cpu_thread_in_core(cpu);
+	unsigned long *state = &paca_ptrs[first]->idle_state;
+	u64 s = READ_ONCE(*state);
+	u64 new, tmp;
+
+again:
+	if (unlikely(s & PNV_CORE_IDLE_LOCK_BIT)) {
+		spin_begin();
+		do {
+			spin_cpu_relax();
+			s = READ_ONCE(*state);
+		} while (s & PNV_CORE_IDLE_LOCK_BIT);
+		spin_end();
+	}
+
+	BUG_ON(!(s & thread));
+
+	new = s & ~thread;
+	tmp = cmpxchg(state, s, new);
+	if (unlikely(tmp != s)) {
+		s = tmp;
+		goto again;
+	}
+}
+
+static inline void atomic_lock_thread_idle(void)
+{
+	int cpu = raw_smp_processor_id();
+	int first = cpu_first_thread_sibling(cpu);
+	unsigned long *state = &paca_ptrs[first]->idle_state;
+
+	while (unlikely(test_and_set_bit_lock(NR_PNV_CORE_IDLE_LOCK_BIT, state)))
+		barrier();
+}
+
+static inline void atomic_unlock_thread_idle(void)
+{
+	int cpu = raw_smp_processor_id();
+	int first = cpu_first_thread_sibling(cpu);
+	unsigned long *state = &paca_ptrs[first]->idle_state;
+
+	clear_bit_unlock(NR_PNV_CORE_IDLE_LOCK_BIT, state);
+}
+
+static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
+{
+	int cpu = raw_smp_processor_id();
+	int first = cpu_first_thread_sibling(cpu);
+	int thread = cpu_thread_in_core(cpu);
+	unsigned long *state = &paca_ptrs[first]->idle_state;
+	unsigned long srr1;
+	unsigned long mmcr0 = 0;
+	struct p9_sprs sprs;
+
+	/* XXX: this gets rid of the uninitialized warning. Should use attributes because this is expensive */
+	memset(&sprs, 0, sizeof(sprs));
+
+	if (!(psscr & (PSSCR_EC|PSSCR_ESL))) {
+		/*
+		 * Wake synchronously. SRESET via xscom may still cause
+		 * a 0x100 powersave wakeup with SRR1 reason!
+		 */
+		srr1 = isa3_idle_stop_noloss(psscr);
+		if (likely(!srr1))
+			return 0;
+
+	} else {
+		if (!cpu_has_feature(CPU_FTR_POWER9_DD2_1)) {
+			 /*
+			  * POWER9 DD2 can incorrectly set PMAO when waking up
+			  * after a state-loss idle. Saving and restoring MMCR0
+			  * over idle is a workaround.
+			  */
+			mmcr0 = mfspr(SPRN_MMCR0);
+		}
+		if ((psscr & PSSCR_RL_MASK) >= pnv_first_hv_loss_level) {
+			atomic_start_thread_idle();
+
+			sprs.ptcr = mfspr(SPRN_PTCR);
+			sprs.rpr = mfspr(SPRN_RPR);
+			sprs.tscr = mfspr(SPRN_TSCR);
+			sprs.ldbar = mfspr(SPRN_LDBAR);
+
+			sprs.lpcr = mfspr(SPRN_LPCR);
+			sprs.hfscr = mfspr(SPRN_HFSCR);
+			sprs.fscr = mfspr(SPRN_FSCR);
+			sprs.pid = mfspr(SPRN_PID);
+			sprs.purr = mfspr(SPRN_PURR);
+			sprs.spurr = mfspr(SPRN_SPURR);
+			sprs.dscr = mfspr(SPRN_DSCR);
+
+			sprs.mmcra = mfspr(SPRN_MMCRA);
+			sprs.mmcr0 = mfspr(SPRN_MMCR0);
+			sprs.mmcr1 = mfspr(SPRN_MMCR1);
+			sprs.mmcr2 = mfspr(SPRN_MMCR2);
+		}
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+		if (cpu_has_feature(CPU_FTR_P9_TM_XER_SO_BUG)) {
+			local_paca->requested_psscr = psscr;
+			/* order setting requested_psscr vs testing dont_stop */
+			smp_mb();
+			if (atomic_read(&local_paca->dont_stop)) {
+				local_paca->requested_psscr = 0;
+				return 0;
+			}
+
+			srr1 = isa3_idle_stop_mayloss(psscr);
+			local_paca->requested_psscr = 0;
+		} else
+#endif
+			srr1 = isa3_idle_stop_mayloss(psscr);
+	}
+
+	WARN_ON_ONCE(!srr1);
+	WARN_ON_ONCE(mfmsr() & (MSR_IR|MSR_DR));
+
+	if ((srr1 & SRR1_WAKESTATE) != SRR1_WS_NOLOSS) {
+		unsigned long mmcra;
+
+		/*
+		 * Workaround for POWER9 DD2, if we lost resources, the ERAT
+		 * might have been mixed up and needs flushing. We also need
+		 * to reload MMCR0 (see mmcr0 comment above).
+		 */
+		if (!cpu_has_feature(CPU_FTR_POWER9_DD2_1)) {
+			asm volatile(PPC_INVALIDATE_ERAT);
+			mtspr(SPRN_MMCR0, mmcr0);
+		}
+
+		/*
+		 * DD2.2 and earlier need to set then clear bit 60 in MMCRA
+		 * to ensure the PMU starts running.
+		 */
+		mmcra = mfspr(SPRN_MMCRA);
+		mmcra |= PPC_BIT(60);
+		mtspr(SPRN_MMCRA, mmcra);
+		mmcra &= ~PPC_BIT(60);
+		mtspr(SPRN_MMCRA, mmcra);
+	}
+
+	if (unlikely((srr1 & SRR1_WAKEMASK_P8) == SRR1_WAKEHMI))
+		hmi_exception_realmode(NULL);
+
+	if (likely((srr1 & SRR1_WAKESTATE) != SRR1_WS_HVLOSS)) {
+		mtmsr(MSR_KERNEL);
+		return srr1;
+	}
+
+	/* HV state loss */
+	WARN_ON((psscr & PSSCR_RL_MASK) < pnv_first_hv_loss_level);
+
+	atomic_lock_thread_idle();
+
+	WARN_ON(*state & thread);
+
+	if ((*state & ((1 << threads_per_core) - 1)) != 0)
+		goto core_woken;
+
+	/* Per-core SPRs */
+	mtspr(SPRN_PTCR, sprs.ptcr);
+	mtspr(SPRN_RPR, sprs.rpr);
+	mtspr(SPRN_TSCR, sprs.tscr);
+	mtspr(SPRN_LDBAR, sprs.ldbar);
+
+	if ((psscr & PSSCR_RL_MASK) >= pnv_first_tb_loss_level) {
+		unsigned long level = mfspr(SPRN_PSSCR) & PSSCR_RL_MASK;
+		if (level >= pnv_first_tb_loss_level) {
+			/* TB loss */
+			if (opal_resync_timebase() != OPAL_SUCCESS)
+				BUG();
+		}
+	}
+
+core_woken:
+	*state |= thread;
+	atomic_unlock_thread_idle();
+
+	/* Per-thread SPRs */
+	mtspr(SPRN_LPCR, sprs.lpcr);
+	mtspr(SPRN_HFSCR, sprs.hfscr);
+	mtspr(SPRN_FSCR, sprs.fscr);
+	mtspr(SPRN_PID, sprs.pid);
+	mtspr(SPRN_PURR, sprs.purr);
+	mtspr(SPRN_SPURR, sprs.spurr);
+	mtspr(SPRN_DSCR, sprs.dscr);
+
+	mtspr(SPRN_MMCRA, sprs.mmcra);
+	mtspr(SPRN_MMCR0, sprs.mmcr0);
+	mtspr(SPRN_MMCR1, sprs.mmcr1);
+	mtspr(SPRN_MMCR2, sprs.mmcr2);
+
+	if (!radix_enabled())
+		__slb_flush_and_rebolt();
+
+	mtmsr(MSR_KERNEL);
+
+	return srr1;
+}
+
+unsigned long power9_offline_stop(unsigned long psscr)
+{
+#ifndef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+	return power9_idle_stop(psscr, true);
+#else
+	unsigned long srr1;
+
+	/*
+	 * Tell KVM we're entering idle.
+	 * This does not have to be done in real mode because the P9 MMU
+	 * is independent per-thread. Some steppings share radix/hash mode
+	 * between threads, but in that case KVM has a barrier sync in real
+	 * mode before and after switching between radix and hash.
+	 */
+	local_paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_IDLE;
+	srr1 = power9_idle_stop(psscr, false);
+
+	if (local_paca->kvm_hstate.hwthread_state == KVM_HWTHREAD_IN_KERNEL) {
+		local_paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_KERNEL;
+		/* Order setting hwthread_state vs. testing hwthread_req */
+		smp_mb();
+	}
+	if (local_paca->kvm_hstate.hwthread_req) {
+		/* XXX: fix this so it's not garbage */
+		asm volatile("b	kvm_start_guest" ::: "memory");
+	}
+	mtmsr(MSR_KERNEL);
+
+	return srr1;
+#endif
+}
+
 static unsigned long __power9_idle_type(unsigned long stop_psscr_val,
 				      unsigned long stop_psscr_mask)
 {
@@ -358,7 +581,7 @@ static unsigned long __power9_idle_type(unsigned long stop_psscr_val,
 	psscr = (psscr & ~stop_psscr_mask) | stop_psscr_val;
 
 	__ppc64_runlatch_off();
-	srr1 = power9_idle_stop(psscr);
+	srr1 = power9_idle_stop(psscr, true);
 	__ppc64_runlatch_on();
 
 	fini_irq_for_idle_irqsoff();
@@ -407,7 +630,7 @@ void pnv_power9_force_smt4_catch(void)
 			atomic_inc(&paca_ptrs[cpu0+thr]->dont_stop);
 	}
 	/* order setting dont_stop vs testing requested_psscr */
-	mb();
+	smp_mb();
 	for (thr = 0; thr < threads_per_core; ++thr) {
 		if (!paca_ptrs[cpu0+thr]->requested_psscr)
 			++awake_threads;
@@ -623,7 +846,8 @@ static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
 	u64 *psscr_val = NULL;
 	u64 *psscr_mask = NULL;
 	u32 *residency_ns = NULL;
-	u64 max_residency_ns = 0;
+	u64 max_deep_residency_ns = 0;
+	u64 max_default_residency_ns = 0;
 	int rc = 0, i;
 
 	psscr_val = kcalloc(dt_idle_states, sizeof(*psscr_val), GFP_KERNEL);
@@ -661,26 +885,25 @@ static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
 	}
 
 	/*
-	 * Set pnv_first_deep_stop_state, pnv_deepest_stop_psscr_{val,mask},
-	 * and the pnv_default_stop_{val,mask}.
-	 *
-	 * pnv_first_deep_stop_state should be set to the first stop
-	 * level to cause hypervisor state loss.
-	 *
 	 * pnv_deepest_stop_{val,mask} should be set to values corresponding to
 	 * the deepest stop state.
 	 *
 	 * pnv_default_stop_{val,mask} should be set to values corresponding to
-	 * the shallowest (OPAL_PM_STOP_INST_FAST) loss-less stop state.
+	 * the deepest loss-less (OPAL_PM_STOP_INST_FAST) stop state.
 	 */
-	pnv_first_deep_stop_state = MAX_STOP_STATE;
+	pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
+	pnv_first_hv_loss_level = MAX_STOP_STATE + 1;
 	for (i = 0; i < dt_idle_states; i++) {
 		int err;
 		u64 psscr_rl = psscr_val[i] & PSSCR_RL_MASK;
 
+		if ((flags[i] & OPAL_PM_TIMEBASE_STOP) &&
+		     (pnv_first_tb_loss_level > psscr_rl))
+			pnv_first_tb_loss_level = psscr_rl;
+
 		if ((flags[i] & OPAL_PM_LOSE_FULL_CONTEXT) &&
-		     (pnv_first_deep_stop_state > psscr_rl))
-			pnv_first_deep_stop_state = psscr_rl;
+		     (pnv_first_hv_loss_level > psscr_rl))
+			pnv_first_hv_loss_level = psscr_rl;
 
 		err = validate_psscr_val_mask(&psscr_val[i], &psscr_mask[i],
 					      flags[i]);
@@ -689,19 +912,21 @@ static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
 			continue;
 		}
 
-		if (max_residency_ns < residency_ns[i]) {
-			max_residency_ns = residency_ns[i];
+		if (max_deep_residency_ns < residency_ns[i]) {
+			max_deep_residency_ns = residency_ns[i];
 			pnv_deepest_stop_psscr_val = psscr_val[i];
 			pnv_deepest_stop_psscr_mask = psscr_mask[i];
 			pnv_deepest_stop_flag = flags[i];
 			deepest_stop_found = true;
 		}
 
-		if (!default_stop_found &&
+		if (max_default_residency_ns < residency_ns[i] &&
 		    (flags[i] & OPAL_PM_STOP_INST_FAST)) {
+			max_default_residency_ns = residency_ns[i];
 			pnv_default_stop_val = psscr_val[i];
 			pnv_default_stop_mask = psscr_mask[i];
 			default_stop_found = true;
+			WARN_ON(flags[i] & OPAL_PM_LOSE_FULL_CONTEXT);
 		}
 	}
 
@@ -721,15 +946,48 @@ static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
 			pnv_deepest_stop_psscr_mask);
 	}
 
-	pr_info("cpuidle-powernv: Requested Level (RL) value of first deep stop = 0x%llx\n",
-		pnv_first_deep_stop_state);
+	pr_info("cpuidle-powernv: First stop level that may lose SPRs = 0x%lld\n",
+		pnv_first_hv_loss_level);
+
+	pr_info("cpuidle-powernv: First stop level that may lose timebase = 0x%lld\n",
+		pnv_first_tb_loss_level);
 out:
 	kfree(psscr_val);
 	kfree(psscr_mask);
 	kfree(residency_ns);
+
 	return rc;
 }
 
+static void __init pnv_disable_deep_states(void)
+{
+	/*
+	 * The stop-api is unable to restore hypervisor
+	 * resources on wakeup from platform idle states which
+	 * lose full context. So disable such states.
+	 */
+	supported_cpuidle_states &= ~OPAL_PM_LOSE_FULL_CONTEXT;
+	pr_warn("cpuidle-powernv: Disabling idle states that lose full context\n");
+	pr_warn("cpuidle-powernv: Idle power-savings, CPU-Hotplug affected\n");
+
+	if (cpu_has_feature(CPU_FTR_ARCH_300) &&
+	    (pnv_deepest_stop_flag & OPAL_PM_LOSE_FULL_CONTEXT)) {
+		/*
+		 * Use the default stop state for CPU-Hotplug
+		 * if available.
+		 */
+		if (default_stop_found) {
+			pnv_deepest_stop_psscr_val = pnv_default_stop_val;
+			pnv_deepest_stop_psscr_mask = pnv_default_stop_mask;
+			pr_warn("cpuidle-powernv: Offlined CPUs will stop with psscr = 0x%016llx\n",
+				pnv_deepest_stop_psscr_val);
+		} else { /* Fallback to snooze loop for CPU-Hotplug */
+			deepest_stop_found = false;
+			pr_warn("cpuidle-powernv: Offlined CPUs will busy wait\n");
+		}
+	}
+}
+
 /*
  * Probe device tree for supported idle states
  */
@@ -771,6 +1029,7 @@ static void __init pnv_probe_idle_states(void)
 out:
 	kfree(flags);
 }
+
 static int __init pnv_init_idle_states(void)
 {
 
@@ -798,10 +1057,29 @@ static int __init pnv_init_idle_states(void)
 				&dev_attr_fastsleep_workaround_applyonce);
 	}
 
-	pnv_alloc_idle_core_states();
+	if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+		pnv_alloc_idle_core_states_p8();
+		if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED)
+			ppc_md.power_save = power7_idle;
+	} else {
+		int cpu;
+
+		for_each_present_cpu(cpu) {
+			paca_ptrs[cpu]->requested_psscr = 0;
+			paca_ptrs[cpu]->idle_state = 0;
+			if (cpu == cpu_first_thread_sibling(cpu))
+				paca_ptrs[cpu]->idle_state =
+					(1 << threads_per_core) - 1;
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+			atomic_set(&paca_ptrs[cpu]->dont_stop, 0);
+#endif
+		}
+	}
 
-	if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED)
-		ppc_md.power_save = power7_idle;
+	if (supported_cpuidle_states & OPAL_PM_LOSE_FULL_CONTEXT) {
+		if (pnv_save_sprs_for_deep_states())
+			pnv_disable_deep_states();
+	}
 
 out:
 	return 0;
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 21119cfe8474..09120d4ec12b 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2425,19 +2425,18 @@ static void dump_one_paca(int cpu)
 #endif
 
 #ifdef CONFIG_PPC_POWERNV
-	DUMP(p, core_idle_state_ptr, "%-*px");
-	DUMP(p, thread_idle_state, "%#-*x");
-	DUMP(p, thread_mask, "%#-*x");
-	DUMP(p, subcore_sibling_mask, "%#-*x");
-	DUMP(p, requested_psscr, "%#-*llx");
-	DUMP(p, stop_sprs.pid, "%#-*llx");
-	DUMP(p, stop_sprs.ldbar, "%#-*llx");
-	DUMP(p, stop_sprs.fscr, "%#-*llx");
-	DUMP(p, stop_sprs.hfscr, "%#-*llx");
-	DUMP(p, stop_sprs.mmcr1, "%#-*llx");
-	DUMP(p, stop_sprs.mmcr2, "%#-*llx");
-	DUMP(p, stop_sprs.mmcra, "%#-*llx");
-	DUMP(p, dont_stop.counter, "%#-*x");
+	if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+		DUMP(p, core_idle_state_ptr, "%-*px");
+		DUMP(p, thread_idle_state, "%#-*x");
+		DUMP(p, thread_mask, "%#-*x");
+		DUMP(p, subcore_sibling_mask, "%#-*x");
+	} else {
+		DUMP(p, idle_state, "%#-*lx");
+		DUMP(p, requested_psscr, "%#-*llx");
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+		DUMP(p, dont_stop.counter, "%#-*x");
+#endif
+	}
 #endif
 
 	DUMP(p, accounting.utime, "%#-*lx");
-- 
2.17.0

^ permalink raw reply related

* Re: [RFC PATCH 1/2] dma-mapping: Clean up dma_set_*mask() hooks
From: Christoph Hellwig @ 2018-07-08 15:07 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Christoph Hellwig, m.szyprowski, iommu, linux-arm-kernel,
	linux-ia64, linuxppc-dev
In-Reply-To: <33175bb1-d198-a9fd-ff37-e8c4132f47f8@arm.com>

On Fri, Jul 06, 2018 at 03:20:34PM +0100, Robin Murphy wrote:
>> What are you trying to do?  I really don't want to see more users of
>> the hooks as they are are a horribly bad idea.
>
> I really need to fix the ongoing problem we have where, due to funky 
> integrations, devices suffer some downstream addressing limit (described by 
> DT dma-ranges or ACPI IORT/_DMA) which we carefully set up in 
> dma_configure(), but then just gets lost when the driver probes and 
> innocently calls dma_set_mask() with something wider. I think it's 
> effectively the generalised case of the VIA 32-bit quirk, if I understand 
> that one correctly.

I'd much rather fix this in generic code.  How funky are your limitations?
In fact when I did the 32-bit quirk (which will also be used by a Xiling
PCIe root port usable on a lot of architectures) I did initially consider
adding a bus_dma_mask or similar to struct device, but opted for the
most simple implementation for now.  I'd be happy to chanfe this.

Especially these days where busses and IP blocks are generally not tied
to a specific cpu instruction set I really believe that having any
more architecture code than absolutely required is a bad idea.

> The approach that seemed to me to be safest is largely based on the one 
> proposed in a thread from ages ago[1]; namely to make dma_configure() 
> better at distinguishing firmware-specified masks from bus defaults, 
> capture the firmware mask in dev->archdata during arch_setup_dma_ops(), 
> then use the custom set_mask routines to ensure any subsequent updates 
> never exceed that. It doesn't seem possible to make this work robustly 
> without storing *some* additional per-device data, and for that archdata is 
> a lesser evil than struct device itself. Plus even though it's not actually 
> an arch-specific issue it feels like there's such a risk of breaking other 
> platforms that I'm reticent to even try handling it entirely in generic 
> code.

My plan for a few merge windows from now is that dma_mask and
coherent_mask are 100% in device control and dma_set_mask will never
fail.  It will be up to the dma ops to make sure things are addressible.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox