LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-02 20:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christoph Hellwig, Will Deacon, Anshuman Khandual, virtualization,
	linux-kernel, linuxppc-dev, aik, robh, joe, elfring, david,
	jasowang, mpe, linuxram, haren, paulus, srikar, robin.murphy,
	jean-philippe.brucker, marc.zyngier
In-Reply-To: <7779442d7889ee943b3e4ff6c63ec90b4a58b79d.camel@kernel.crashing.org>

On Thu, Aug 02, 2018 at 10:33:05AM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 00:56 +0300, Michael S. Tsirkin wrote:
> > > but it's not, VMs are
> > > created in "legacy" mode all the times and we don't know at VM creation
> > > time whether it will become a secure VM or not. The way our secure VMs
> > > work is that they start as a normal VM, load a secure "payload" and
> > > call the Ultravisor to "become" secure.
> > > 
> > > So we're in a bit of a bind here. We need that one-liner optional arch
> > > hook to make virtio use swiotlb in that "IOMMU bypass" case.
> > > 
> > > Ben.
> > 
> > And just to make sure I understand, on your platform DMA APIs do include
> > some of the cache flushing tricks and this is why you don't want to
> > declare iommu support in the hypervisor?
> 
> I'm not sure I parse what you mean.
> 
> We don't need cache flushing tricks.

You don't but do real devices on same platform need them?

> The problem we have with our
> "secure" VMs is that:
> 
>  - At VM creation time we have no idea it's going to become a secure
> VM, qemu doesn't know anything about it, and thus qemu (or other
> management tools, libvirt etc...) are going to create "legacy" (ie
> iommu bypass) virtio devices.
> 
>  - Once the VM goes secure (early during boot but too late for qemu),
> it will need to make virtio do bounce-buffering via swiotlb because
> qemu cannot physically access most VM pages (blocked by HW security
> features), we need to bounce buffer using some unsecure pages that are
> accessible to qemu.
> 
> That said, I wouldn't object for us to more generally switch long run
> to changing qemu so that virtio on powerpc starts using the IOMMU as a
> default provided we fix our guest firmware to understand it (it
> currently doesn't), and provided we verify that the performance impact
> on things like vhost is negligible.
> 
> Cheers,
> Ben.
> 

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-02 20:55 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: virtualization, linux-kernel, linuxppc-dev, aik, robh, joe,
	elfring, david, jasowang, benh, mpe, hch, linuxram, haren, paulus,
	srikar
In-Reply-To: <20180720035941.6844-1-khandual@linux.vnet.ibm.com>

On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote:
> This patch series is the follow up on the discussions we had before about
> the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
> for virito devices (https://patchwork.kernel.org/patch/10417371/). There
> were suggestions about doing away with two different paths of transactions
> with the host/QEMU, first being the direct GPA and the other being the DMA
> API based translations.
> 
> First patch attempts to create a direct GPA mapping based DMA operations
> structure called 'virtio_direct_dma_ops' with exact same implementation
> of the direct GPA path which virtio core currently has but just wrapped in
> a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
> the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
> existing semantics. The second patch does exactly that inside the function
> virtio_finalize_features(). The third patch removes the default direct GPA
> path from virtio core forcing it to use DMA API callbacks for all devices.
> Now with that change, every device must have a DMA operations structure
> associated with it. The fourth patch adds an additional hook which gives
> the platform an opportunity to do yet another override if required. This
> platform hook can be used on POWER Ultravisor based protected guests to
> load up SWIOTLB DMA callbacks to do the required (as discussed previously
> in the above mentioned thread how host is allowed to access only parts of
> the guest GPA range) bounce buffering into the shared memory for all I/O
> scatter gather buffers to be consumed on the host side.
> 
> Please go through these patches and review whether this approach broadly
> makes sense. I will appreciate suggestions, inputs, comments regarding
> the patches or the approach in general. Thank you.

Jason did some work on profiling this. Unfortunately he reports
about 4% extra overhead from this switch on x86 with no vIOMMU.

I expect he's writing up the data in more detail, but
just wanted to let you know this would be one more
thing to debug before we can just switch to DMA APIs.


> Anshuman Khandual (4):
>   virtio: Define virtio_direct_dma_ops structure
>   virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
>   virtio: Force virtio core to use DMA API callbacks for all virtio devices
>   virtio: Add platform specific DMA API translation for virito devices
> 
>  arch/powerpc/include/asm/dma-mapping.h |  6 +++
>  arch/powerpc/platforms/pseries/iommu.c |  6 +++
>  drivers/virtio/virtio.c                | 72 ++++++++++++++++++++++++++++++++++
>  drivers/virtio/virtio_pci_common.h     |  3 ++
>  drivers/virtio/virtio_ring.c           | 65 +-----------------------------
>  5 files changed, 89 insertions(+), 63 deletions(-)
> 
> -- 
> 2.9.3

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-02 21:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Christoph Hellwig, Will Deacon, Anshuman Khandual, virtualization,
	linux-kernel, linuxppc-dev, aik, robh, joe, elfring, david,
	jasowang, mpe, linuxram, haren, paulus, srikar, robin.murphy,
	jean-philippe.brucker, marc.zyngier
In-Reply-To: <20180802225738-mutt-send-email-mst@kernel.org>

On Thu, 2018-08-02 at 23:52 +0300, Michael S. Tsirkin wrote:
> > Yes, this is the purpose of Anshuman original patch (I haven't looked
> > at the details of the patch in a while but that's what I told him to
> > implement ;-) :
> > 
> >   - Make virtio always use DMA ops to simplify the code path (with a set
> > of "transparent" ops for legacy)
> > 
> >   and
> > 
> >   -  Provide an arch hook allowing us to "override" those "transparent"
> > DMA ops with some custom ones that do the appropriate swiotlb gunk.
> > 
> > Cheers,
> > Ben.
> > 
> 
> Right but as I tried to say doing that brings us to a bunch of issues
> with using DMA APIs in virtio. Put simply DMA APIs weren't designed for
> guest to hypervisor communication.

I'm not sure I see the problem, see below

> When we do (as is the case with PLATFORM_IOMMU right now) this adds a
> bunch of overhead which we need to get rid of if we are to switch to
> PLATFORM_IOMMU by default.  We need to fix that.

So let's differenciate the two problems of having an IOMMU (real or
emulated) which indeeds adds overhead etc... and using the DMA API.

At the moment, virtio does this all over the place:

	if (use_dma_api)
		dma_map/alloc_something(...)
	else
		use_pa

The idea of the patch set is to do two, somewhat orthogonal, changes
that together achieve what we want. Let me know where you think there
is "a bunch of issues" because I'm missing it:

 1- Replace the above if/else constructs with just calling the DMA API,
and have virtio, at initialization, hookup its own dma_ops that just
"return pa" (roughly) when the IOMMU stuff isn't used.

This adds an indirect function call to the path that previously didn't
have one (the else case above). Is that a significant/measurable
overhead ?

This change stands alone, and imho "cleans" up virtio by avoiding all
that if/else "2 path" and unless it adds a measurable overhead, should
probably be done.

 2- Make virtio use the DMA API with our custom platform-provided
swiotlb callbacks when needed, that is when not using IOMMU *and*
running on a secure VM in our case.

This benefits from -1- by making us just plumb in a different set of
DMA ops we would have cooked up specifically for virtio in our arch
code (or in virtio itself but build arch-conditionally in a separate
file). But it doesn't strictly need it -1-:

Now, -2- doesn't strictly needs -1-. We could have just done another
xen-like hack that forces the DMA API "ON" for virtio when running in a
secure VM.

The problem if we do that however is that we also then need the arch
PCI code to make sure it hooks up the virtio PCI devices with the
special "magic" DMA ops that avoid the iommu but still do swiotlb, ie,
not the same as other PCI devices. So it will have to play games such
as checking vendor/device IDs for virtio, checking the IOMMU flag,
etc... from the arch code which really bloody sucks when assigning PCI
DMA ops.

However, if we do it the way we plan here, on top of -1-, with a hook
called from virtio into the arch to "override" the virtio DMA ops, then
we avoid the problem completely: The arch hook would only be called by
virtio if the IOMMU flag is *not* set. IE only when using that special
"hypervisor" iommu bypass. If the IOMMU flag is set, virtio uses normal
PCI dma ops as usual.

That way, we have a very clear semantic: This hook is purely about
replacing those "null" DMA ops that just return PA introduced in -1-
with some arch provided specially cooked up DMA ops for non-IOMMU
virtio that know about the arch special requirements. For us bounce
buffering.

Is there something I'm missing ?

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-02 21:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christoph Hellwig, Will Deacon, Anshuman Khandual, virtualization,
	linux-kernel, linuxppc-dev, aik, robh, joe, elfring, david,
	jasowang, mpe, linuxram, haren, paulus, srikar, robin.murphy,
	jean-philippe.brucker, marc.zyngier
In-Reply-To: <de4888b6457e220776e16a9c8958ff0886ffc66c.camel@kernel.crashing.org>

On Thu, Aug 02, 2018 at 04:13:09PM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 23:52 +0300, Michael S. Tsirkin wrote:
> > > Yes, this is the purpose of Anshuman original patch (I haven't looked
> > > at the details of the patch in a while but that's what I told him to
> > > implement ;-) :
> > > 
> > >   - Make virtio always use DMA ops to simplify the code path (with a set
> > > of "transparent" ops for legacy)
> > > 
> > >   and
> > > 
> > >   -  Provide an arch hook allowing us to "override" those "transparent"
> > > DMA ops with some custom ones that do the appropriate swiotlb gunk.
> > > 
> > > Cheers,
> > > Ben.
> > > 
> > 
> > Right but as I tried to say doing that brings us to a bunch of issues
> > with using DMA APIs in virtio. Put simply DMA APIs weren't designed for
> > guest to hypervisor communication.
> 
> I'm not sure I see the problem, see below
> 
> > When we do (as is the case with PLATFORM_IOMMU right now) this adds a
> > bunch of overhead which we need to get rid of if we are to switch to
> > PLATFORM_IOMMU by default.  We need to fix that.
> 
> So let's differenciate the two problems of having an IOMMU (real or
> emulated) which indeeds adds overhead etc... and using the DMA API.

Well actually it's the other way around. An iommu in theory doesn't need
to bring overhead if you set it in bypass mode.  Which does imply the
iommu supports bypass mode. Is that universally the case?  DMA API does
see Christoph's list of things it does some of which add overhead.

> At the moment, virtio does this all over the place:
> 
> 	if (use_dma_api)
> 		dma_map/alloc_something(...)
> 	else
> 		use_pa
> 
> The idea of the patch set is to do two, somewhat orthogonal, changes
> that together achieve what we want. Let me know where you think there
> is "a bunch of issues" because I'm missing it:
> 
>  1- Replace the above if/else constructs with just calling the DMA API,
> and have virtio, at initialization, hookup its own dma_ops that just
> "return pa" (roughly) when the IOMMU stuff isn't used.
> 
> This adds an indirect function call to the path that previously didn't
> have one (the else case above). Is that a significant/measurable
> overhead ?

Seems to be :( Jason reports about 4%. I wonder whether we can support
map_sg and friends being NULL, then use that when mapping is an
identity. A conditional branch there is likely very cheap.

Would this cover all platforms with kvm (which is where we care
most about performance)?

> This change stands alone, and imho "cleans" up virtio by avoiding all
> that if/else "2 path" and unless it adds a measurable overhead, should
> probably be done.
> 
>  2- Make virtio use the DMA API with our custom platform-provided
> swiotlb callbacks when needed, that is when not using IOMMU *and*
> running on a secure VM in our case.
> 
> This benefits from -1- by making us just plumb in a different set of
> DMA ops we would have cooked up specifically for virtio in our arch
> code (or in virtio itself but build arch-conditionally in a separate
> file). But it doesn't strictly need it -1-:
> 
> Now, -2- doesn't strictly needs -1-. We could have just done another
> xen-like hack that forces the DMA API "ON" for virtio when running in a
> secure VM.
> 
> The problem if we do that however is that we also then need the arch
> PCI code to make sure it hooks up the virtio PCI devices with the
> special "magic" DMA ops that avoid the iommu but still do swiotlb, ie,
> not the same as other PCI devices. So it will have to play games such
> as checking vendor/device IDs for virtio, checking the IOMMU flag,
> etc... from the arch code which really bloody sucks when assigning PCI
> DMA ops.
> 
> However, if we do it the way we plan here, on top of -1-, with a hook
> called from virtio into the arch to "override" the virtio DMA ops, then
> we avoid the problem completely: The arch hook would only be called by
> virtio if the IOMMU flag is *not* set. IE only when using that special
> "hypervisor" iommu bypass. If the IOMMU flag is set, virtio uses normal
> PCI dma ops as usual.
> 
> That way, we have a very clear semantic: This hook is purely about
> replacing those "null" DMA ops that just return PA introduced in -1-
> with some arch provided specially cooked up DMA ops for non-IOMMU
> virtio that know about the arch special requirements. For us bounce
> buffering.
> 
> Is there something I'm missing ?
> 
> Cheers,
> Ben.

Right so I was trying to write it up in a systematic way, but just to
give you one example, if there is a system where DMA API handles
coherency issues, or flushing of some buffers, then our PLATFORM_IOMMU
flag causes that to happen.

And we kinda worked around this without the IOMMU by basically saying
"ok we do not really need DMA API so let's just bypass it" and
it was kind of ok except now everyone is switching to vIOMMU
just in case. So now people do want some parts of what DMA API does,
such as the bounce buffer use, or IOMMU mappings.

And maybe in the end the solution is going to be to do something similar
to virt_Xmb except for DMA APIs: add APIs that handle just the
addressing bits but without the overhead. See
commit 6a65d26385bf487926a0616650927303058551e3
    asm-generic: implement virt_xxx memory barriers
for reference, it's a similar set of issues.

So it's not a problem with your patches as such, it's just that
they don't solve that harder problem.

-- 
MST

^ permalink raw reply

* Re: [PATCH] QE GPIO: Add qe_gpio_set_multiple
From: Joakim Tjernlund @ 2018-08-02 22:36 UTC (permalink / raw)
  To: leoyang.li@nxp.com, york.sun@nxp.com
  Cc: linuxppc-dev@lists.ozlabs.org, mpe@ellerman.id.au,
	qiang.zhao@nxp.com
In-Reply-To: <AM4PR0401MB1699AFF424BDFAE1BC65E1FA8F420@AM4PR0401MB1699.eurprd04.prod.outlook.com>

TGVvLCBkaWQgdGhpcyBnbyBhbnl3aGVyZSA/DQoNCiAgICBKb2NrZQ0KDQpPbiBUdWUsIDIwMTgt
MDctMDMgYXQgMTY6MzUgKzAwMDAsIExlbyBMaSB3cm90ZToNCj4gDQo+ID4gLS0tLS1PcmlnaW5h
bCBNZXNzYWdlLS0tLS0NCj4gPiBGcm9tOiBZb3JrIFN1bg0KPiA+IFNlbnQ6IFR1ZXNkYXksIEp1
bHkgMywgMjAxOCAxMDoyNyBBTQ0KPiA+IFRvOiBqb2NrZUBpbmZpbmVyYS5jb20gPGpvYWtpbS50
amVybmx1bmRAaW5maW5lcmEuY29tPjsgTGVvIExpDQo+ID4gPGxlb3lhbmcubGlAbnhwLmNvbT4N
Cj4gPiBDYzogbGludXhwcGMtZGV2QGxpc3RzLm96bGFicy5vcmc7IG1wZUBlbGxlcm1hbi5pZC5h
dTsgUWlhbmcgWmhhbw0KPiA+IDxxaWFuZy56aGFvQG54cC5jb20+DQo+ID4gU3ViamVjdDogUmU6
IFtQQVRDSF0gUUUgR1BJTzogQWRkIHFlX2dwaW9fc2V0X211bHRpcGxlDQo+ID4gDQo+ID4gK0xl
bw0KPiA+IA0KPiA+IE9uIDA3LzAzLzIwMTggMDM6MzAgQU0sIEpvYWtpbSBUamVybmx1bmQgd3Jv
dGU6DQo+ID4gPiBPbiBUdWUsIDIwMTgtMDYtMjYgYXQgMjM6NDEgKzEwMDAsIE1pY2hhZWwgRWxs
ZXJtYW4gd3JvdGU6DQo+ID4gPiA+IA0KPiA+ID4gPiBKb2FraW0gVGplcm5sdW5kIDxKb2FraW0u
VGplcm5sdW5kQGluZmluZXJhLmNvbT4gd3JpdGVzOg0KPiA+ID4gPiA+IE9uIFRodSwgMjAxOC0w
Ni0yMSBhdCAwMjozOCArMDAwMCwgUWlhbmcgWmhhbyB3cm90ZToNCj4gPiA+ID4gPiA+IE9uIDA2
LzE5LzIwMTggMDk6MjIgQU0sIEpvYWtpbSBUamVybmx1bmQgd3JvdGU6DQo+ID4gPiA+ID4gPiAt
LS0tLU9yaWdpbmFsIE1lc3NhZ2UtLS0tLQ0KPiA+ID4gPiA+ID4gRnJvbTogTGludXhwcGMtZGV2
IFttYWlsdG86bGludXhwcGMtZGV2LQ0KPiA+IA0KPiA+IGJvdW5jZXMrcWlhbmcuemhhbz1ueHAu
Y29tQGxpc3RzLm96bGFicy5vcmddIE9uIEJlaGFsZiBPZiBKb2FraW0NCj4gPiBUamVybmx1bmQN
Cj4gPiA+ID4gPiA+IFNlbnQ6IDIwMTjlubQ25pyIMjDml6UgMDoyMg0KPiA+ID4gPiA+ID4gVG86
IFlvcmsgU3VuIDx5b3JrLnN1bkBueHAuY29tPjsgbGludXhwcGMtZGV2IDxsaW51eHBwYy0NCj4g
PiANCj4gPiBkZXZAbGlzdHMub3psYWJzLm9yZz4NCj4gPiA+ID4gPiA+IFN1YmplY3Q6IFtQQVRD
SF0gUUUgR1BJTzogQWRkIHFlX2dwaW9fc2V0X211bHRpcGxlDQo+ID4gPiA+ID4gPiANCj4gPiA+
ID4gPiA+IFRoaXMgY291c2luIHRvIGdwaW8tbXBjOHh4eCB3YXMgbGFja2luZyBhIG11bHRpcGxl
IHBpbnMgbWV0aG9kLCBhZGQNCj4gPiANCj4gPiBvbmUuDQo+ID4gPiA+ID4gPiANCj4gPiA+ID4g
PiA+IFNpZ25lZC1vZmYtYnk6IEpvYWtpbSBUamVybmx1bmQgPGpvYWtpbS50amVybmx1bmRAaW5m
aW5lcmEuY29tPg0KPiA+ID4gPiA+ID4gLS0tDQo+ID4gPiA+ID4gPiAgZHJpdmVycy9zb2MvZnNs
L3FlL2dwaW8uYyB8IDI4ICsrKysrKysrKysrKysrKysrKysrKysrKysrKysNCj4gPiA+ID4gPiA+
ICAxIGZpbGUgY2hhbmdlZCwgMjggaW5zZXJ0aW9ucygrKQ0KPiA+ID4gPiANCj4gPiA+ID4gLi4u
DQo+ID4gPiA+ID4gPiAgc3RhdGljIGludCBxZV9ncGlvX2Rpcl9pbihzdHJ1Y3QgZ3Bpb19jaGlw
ICpnYywgdW5zaWduZWQgaW50IGdwaW8pICB7DQo+ID4gPiA+ID4gPiAgICAgICAgIHN0cnVjdCBv
Zl9tbV9ncGlvX2NoaXAgKm1tX2djID0gdG9fb2ZfbW1fZ3Bpb19jaGlwKGdjKTsgQEAgLQ0KPiA+
IA0KPiA+IDI5OCw2ICszMjUsNyBAQCBzdGF0aWMgaW50IF9faW5pdCBxZV9hZGRfZ3Bpb2NoaXBz
KHZvaWQpDQo+ID4gPiA+ID4gPiAgICAgICAgICAgICAgICAgZ2MtPmRpcmVjdGlvbl9vdXRwdXQg
PSBxZV9ncGlvX2Rpcl9vdXQ7DQo+ID4gPiA+ID4gPiAgICAgICAgICAgICAgICAgZ2MtPmdldCA9
IHFlX2dwaW9fZ2V0Ow0KPiA+ID4gPiA+ID4gICAgICAgICAgICAgICAgIGdjLT5zZXQgPSBxZV9n
cGlvX3NldDsNCj4gPiA+ID4gPiA+ICsgICAgICAgICAgICAgICBnYy0+c2V0X211bHRpcGxlID0g
cWVfZ3Bpb19zZXRfbXVsdGlwbGU7DQo+ID4gPiA+ID4gPiANCj4gPiA+ID4gPiA+ICAgICAgICAg
ICAgICAgICByZXQgPSBvZl9tbV9ncGlvY2hpcF9hZGRfZGF0YShucCwgbW1fZ2MsIHFlX2djKTsN
Cj4gPiA+ID4gPiA+ICAgICAgICAgICAgICAgICBpZiAocmV0KQ0KPiA+ID4gPiA+ID4gDQo+ID4g
PiA+ID4gPiBSZXZpZXdlZC1ieTogUWlhbmcgWmhhbyA8cWlhbmcuemhhb0BueHAuY29tPg0KPiA+
ID4gPiA+ID4gDQo+ID4gPiA+ID4gDQo+ID4gPiA+ID4gV2hvIHBpY2tzIHVwIHRoaXMgcGF0Y2gg
PyBJcyBpdCBxdWV1ZWQgc29tZXdoZXJlIGFscmVhZHk/DQo+ID4gPiA+IA0KPiA+ID4gPiBOb3Qg
bWUuDQo+ID4gPiANCj4gPiA+IFlvcms/IFlvdSBzZWVtIHRvIGJlIHRoZSBvbmx5IG9uZSBsZWZ0
Lg0KPiA+ID4gDQo+ID4gDQo+ID4gSSBhbSBub3QgYSBMaW51eCBtYWludGFpbmVyLiBFdmVuIEkg
d2FudCB0bywgSSBjYW4ndCBtZXJnZSB0aGlzIHBhdGNoLg0KPiA+IA0KPiA+IExlbywgd2hvIGNh
biBtZXJnZSB0aGlzIHBhdGNoIGFuZCByZXF1ZXN0IGEgcHVsbD8NCj4gDQo+IFNpbmNlIGl0IGZh
bGxzIHVuZGVyIHRoZSBkcml2ZXIvc29jL2ZzbC8gZm9sZGVyLiAgSSBjYW4gdGFrZSBpdC4NCj4g
DQo+IFJlZ2FyZHMsDQo+IExlbw0K

^ permalink raw reply

* RE: [PATCH] QE GPIO: Add qe_gpio_set_multiple
From: Leo Li @ 2018-08-02 22:44 UTC (permalink / raw)
  To: jocke@infinera.com, York Sun
  Cc: linuxppc-dev@lists.ozlabs.org, mpe@ellerman.id.au, Qiang Zhao
In-Reply-To: <bace5db4ede3d1e8cbe314f1306b90bc2893218e.camel@infinera.com>

DQoNCj4gLS0tLS1PcmlnaW5hbCBNZXNzYWdlLS0tLS0NCj4gRnJvbTogSm9ha2ltIFRqZXJubHVu
ZCBbbWFpbHRvOkpvYWtpbS5UamVybmx1bmRAaW5maW5lcmEuY29tXQ0KPiBTZW50OiBUaHVyc2Rh
eSwgQXVndXN0IDIsIDIwMTggNTozNyBQTQ0KPiBUbzogTGVvIExpIDxsZW95YW5nLmxpQG54cC5j
b20+OyBZb3JrIFN1biA8eW9yay5zdW5AbnhwLmNvbT4NCj4gQ2M6IGxpbnV4cHBjLWRldkBsaXN0
cy5vemxhYnMub3JnOyBtcGVAZWxsZXJtYW4uaWQuYXU7IFFpYW5nIFpoYW8NCj4gPHFpYW5nLnpo
YW9AbnhwLmNvbT4NCj4gU3ViamVjdDogUmU6IFtQQVRDSF0gUUUgR1BJTzogQWRkIHFlX2dwaW9f
c2V0X211bHRpcGxlDQo+IA0KPiBMZW8sIGRpZCB0aGlzIGdvIGFueXdoZXJlID8NCg0KSXQgaGFz
IGJlZW4gaW5jbHVkZWQgaW4gbXkgc29jIHB1bGwgcmVxdWVzdCBmb3IgNC4xOS4gIGh0dHA6Ly9s
a21sLml1LmVkdS9oeXBlcm1haWwvbGludXgva2VybmVsLzE4MDcuMy8wMjI5NS5odG1sDQoNCkxl
bw0KDQo+IA0KPiAgICAgSm9ja2UNCj4gDQo+IE9uIFR1ZSwgMjAxOC0wNy0wMyBhdCAxNjozNSAr
MDAwMCwgTGVvIExpIHdyb3RlOg0KPiA+DQo+ID4gPiAtLS0tLU9yaWdpbmFsIE1lc3NhZ2UtLS0t
LQ0KPiA+ID4gRnJvbTogWW9yayBTdW4NCj4gPiA+IFNlbnQ6IFR1ZXNkYXksIEp1bHkgMywgMjAx
OCAxMDoyNyBBTQ0KPiA+ID4gVG86IGpvY2tlQGluZmluZXJhLmNvbSA8am9ha2ltLnRqZXJubHVu
ZEBpbmZpbmVyYS5jb20+OyBMZW8gTGkNCj4gPiA+IDxsZW95YW5nLmxpQG54cC5jb20+DQo+ID4g
PiBDYzogbGludXhwcGMtZGV2QGxpc3RzLm96bGFicy5vcmc7IG1wZUBlbGxlcm1hbi5pZC5hdTsg
UWlhbmcgWmhhbw0KPiA+ID4gPHFpYW5nLnpoYW9AbnhwLmNvbT4NCj4gPiA+IFN1YmplY3Q6IFJl
OiBbUEFUQ0hdIFFFIEdQSU86IEFkZCBxZV9ncGlvX3NldF9tdWx0aXBsZQ0KPiA+ID4NCj4gPiA+
ICtMZW8NCj4gPiA+DQo+ID4gPiBPbiAwNy8wMy8yMDE4IDAzOjMwIEFNLCBKb2FraW0gVGplcm5s
dW5kIHdyb3RlOg0KPiA+ID4gPiBPbiBUdWUsIDIwMTgtMDYtMjYgYXQgMjM6NDEgKzEwMDAsIE1p
Y2hhZWwgRWxsZXJtYW4gd3JvdGU6DQo+ID4gPiA+ID4NCj4gPiA+ID4gPiBKb2FraW0gVGplcm5s
dW5kIDxKb2FraW0uVGplcm5sdW5kQGluZmluZXJhLmNvbT4gd3JpdGVzOg0KPiA+ID4gPiA+ID4g
T24gVGh1LCAyMDE4LTA2LTIxIGF0IDAyOjM4ICswMDAwLCBRaWFuZyBaaGFvIHdyb3RlOg0KPiA+
ID4gPiA+ID4gPiBPbiAwNi8xOS8yMDE4IDA5OjIyIEFNLCBKb2FraW0gVGplcm5sdW5kIHdyb3Rl
Og0KPiA+ID4gPiA+ID4gPiAtLS0tLU9yaWdpbmFsIE1lc3NhZ2UtLS0tLQ0KPiA+ID4gPiA+ID4g
PiBGcm9tOiBMaW51eHBwYy1kZXYgW21haWx0bzpsaW51eHBwYy1kZXYtDQo+ID4gPg0KPiA+ID4g
Ym91bmNlcytxaWFuZy56aGFvPW54cC5jb21AbGlzdHMub3psYWJzLm9yZ10gT24gQmVoYWxmIE9m
IEpvYWtpbQ0KPiA+ID4gVGplcm5sdW5kDQo+ID4gPiA+ID4gPiA+IFNlbnQ6IDIwMTjlubQ25pyI
MjDml6UgMDoyMg0KPiA+ID4gPiA+ID4gPiBUbzogWW9yayBTdW4gPHlvcmsuc3VuQG54cC5jb20+
OyBsaW51eHBwYy1kZXYgPGxpbnV4cHBjLQ0KPiA+ID4NCj4gPiA+IGRldkBsaXN0cy5vemxhYnMu
b3JnPg0KPiA+ID4gPiA+ID4gPiBTdWJqZWN0OiBbUEFUQ0hdIFFFIEdQSU86IEFkZCBxZV9ncGlv
X3NldF9tdWx0aXBsZQ0KPiA+ID4gPiA+ID4gPg0KPiA+ID4gPiA+ID4gPiBUaGlzIGNvdXNpbiB0
byBncGlvLW1wYzh4eHggd2FzIGxhY2tpbmcgYSBtdWx0aXBsZSBwaW5zDQo+ID4gPiA+ID4gPiA+
IG1ldGhvZCwgYWRkDQo+ID4gPg0KPiA+ID4gb25lLg0KPiA+ID4gPiA+ID4gPg0KPiA+ID4gPiA+
ID4gPiBTaWduZWQtb2ZmLWJ5OiBKb2FraW0gVGplcm5sdW5kDQo+ID4gPiA+ID4gPiA+IDxqb2Fr
aW0udGplcm5sdW5kQGluZmluZXJhLmNvbT4NCj4gPiA+ID4gPiA+ID4gLS0tDQo+ID4gPiA+ID4g
PiA+ICBkcml2ZXJzL3NvYy9mc2wvcWUvZ3Bpby5jIHwgMjggKysrKysrKysrKysrKysrKysrKysr
KysrKysrKw0KPiA+ID4gPiA+ID4gPiAgMSBmaWxlIGNoYW5nZWQsIDI4IGluc2VydGlvbnMoKykN
Cj4gPiA+ID4gPg0KPiA+ID4gPiA+IC4uLg0KPiA+ID4gPiA+ID4gPiAgc3RhdGljIGludCBxZV9n
cGlvX2Rpcl9pbihzdHJ1Y3QgZ3Bpb19jaGlwICpnYywgdW5zaWduZWQgaW50IGdwaW8pICB7DQo+
ID4gPiA+ID4gPiA+ICAgICAgICAgc3RydWN0IG9mX21tX2dwaW9fY2hpcCAqbW1fZ2MgPQ0KPiA+
ID4gPiA+ID4gPiB0b19vZl9tbV9ncGlvX2NoaXAoZ2MpOyBAQCAtDQo+ID4gPg0KPiA+ID4gMjk4
LDYgKzMyNSw3IEBAIHN0YXRpYyBpbnQgX19pbml0IHFlX2FkZF9ncGlvY2hpcHModm9pZCkNCj4g
PiA+ID4gPiA+ID4gICAgICAgICAgICAgICAgIGdjLT5kaXJlY3Rpb25fb3V0cHV0ID0gcWVfZ3Bp
b19kaXJfb3V0Ow0KPiA+ID4gPiA+ID4gPiAgICAgICAgICAgICAgICAgZ2MtPmdldCA9IHFlX2dw
aW9fZ2V0Ow0KPiA+ID4gPiA+ID4gPiAgICAgICAgICAgICAgICAgZ2MtPnNldCA9IHFlX2dwaW9f
c2V0Ow0KPiA+ID4gPiA+ID4gPiArICAgICAgICAgICAgICAgZ2MtPnNldF9tdWx0aXBsZSA9IHFl
X2dwaW9fc2V0X211bHRpcGxlOw0KPiA+ID4gPiA+ID4gPg0KPiA+ID4gPiA+ID4gPiAgICAgICAg
ICAgICAgICAgcmV0ID0gb2ZfbW1fZ3Bpb2NoaXBfYWRkX2RhdGEobnAsIG1tX2djLCBxZV9nYyk7
DQo+ID4gPiA+ID4gPiA+ICAgICAgICAgICAgICAgICBpZiAocmV0KQ0KPiA+ID4gPiA+ID4gPg0K
PiA+ID4gPiA+ID4gPiBSZXZpZXdlZC1ieTogUWlhbmcgWmhhbyA8cWlhbmcuemhhb0BueHAuY29t
Pg0KPiA+ID4gPiA+ID4gPg0KPiA+ID4gPiA+ID4NCj4gPiA+ID4gPiA+IFdobyBwaWNrcyB1cCB0
aGlzIHBhdGNoID8gSXMgaXQgcXVldWVkIHNvbWV3aGVyZSBhbHJlYWR5Pw0KPiA+ID4gPiA+DQo+
ID4gPiA+ID4gTm90IG1lLg0KPiA+ID4gPg0KPiA+ID4gPiBZb3JrPyBZb3Ugc2VlbSB0byBiZSB0
aGUgb25seSBvbmUgbGVmdC4NCj4gPiA+ID4NCj4gPiA+DQo+ID4gPiBJIGFtIG5vdCBhIExpbnV4
IG1haW50YWluZXIuIEV2ZW4gSSB3YW50IHRvLCBJIGNhbid0IG1lcmdlIHRoaXMgcGF0Y2guDQo+
ID4gPg0KPiA+ID4gTGVvLCB3aG8gY2FuIG1lcmdlIHRoaXMgcGF0Y2ggYW5kIHJlcXVlc3QgYSBw
dWxsPw0KPiA+DQo+ID4gU2luY2UgaXQgZmFsbHMgdW5kZXIgdGhlIGRyaXZlci9zb2MvZnNsLyBm
b2xkZXIuICBJIGNhbiB0YWtlIGl0Lg0KPiA+DQo+ID4gUmVnYXJkcywNCj4gPiBMZW8NCg==

^ permalink raw reply

* Re: [resend] [PATCH 0/3] powerpc/pseries: use H_BLOCK_REMOVE
From: Michael Ellerman @ 2018-08-03  2:29 UTC (permalink / raw)
  To: Nicholas Piggin, Laurent Dufour
  Cc: linuxppc-dev, linux-kernel, aneesh.kumar, benh, paulus
In-Reply-To: <20180801115514.441eecc8@roar.ozlabs.ibm.com>

Nicholas Piggin <npiggin@gmail.com> writes:
> On Fri, 27 Jul 2018 15:51:30 +0200
> Laurent Dufour <ldufour@linux.vnet.ibm.com> wrote:
>> [Resending so everyone is getting the cover letter]
>> On very large system we could see soft lockup fired when a process is exiting
>> 
>> watchdog: BUG: soft lockup - CPU#851 stuck for 21s! [forkoff:215523]
>> Modules linked in: pseries_rng rng_core xfs raid10 vmx_crypto btrfs libcrc32c xor zstd_decompress zstd_compress xxhash lzo_compress raid6_pq crc32c_vpmsum lpfc crc_t10dif crct10dif_generic crct10dif_common dm_multipath scsi_dh_rdac scsi_dh_alua autofs4
>> CPU: 851 PID: 215523 Comm: forkoff Not tainted 4.17.0 #1
>> NIP:  c0000000000b995c LR: c0000000000b8f64 CTR: 000000000000aa18
>> REGS: c00006b0645b7610 TRAP: 0901   Not tainted  (4.17.0)
>> MSR:  800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 22042082  XER: 00000000
>> CFAR: 00000000006cf8f0 SOFTE: 0 
>> GPR00: 0010000000000000 c00006b0645b7890 c000000000f99200 0000000000000000 
>> GPR04: 8e000001a5a4de58 400249cf1bfd5480 8e000001a5a4de50 400249cf1bfd5480 
>> GPR08: 8e000001a5a4de48 400249cf1bfd5480 8e000001a5a4de40 400249cf1bfd5480 
>> GPR12: ffffffffffffffff c00000001e690800 
>> NIP [c0000000000b995c] plpar_hcall9+0x44/0x7c
>> LR [c0000000000b8f64] pSeries_lpar_flush_hash_range+0x324/0x3d0
>> Call Trace:
>> [c00006b0645b7890] [8e000001a5a4dd20] 0x8e000001a5a4dd20 (unreliable)
>> [c00006b0645b7a00] [c00000000006d5b0] flush_hash_range+0x60/0x110
>> [c00006b0645b7a50] [c000000000072a2c] __flush_tlb_pending+0x4c/0xd0
>> [c00006b0645b7a80] [c0000000002eaf44] unmap_page_range+0x984/0xbd0
>> [c00006b0645b7bc0] [c0000000002eb594] unmap_vmas+0x84/0x100
>> [c00006b0645b7c10] [c0000000002f8afc] exit_mmap+0xac/0x1f0
>> [c00006b0645b7cd0] [c0000000000f2638] mmput+0x98/0x1b0
>> [c00006b0645b7d00] [c0000000000fc9d0] do_exit+0x330/0xc00
>> [c00006b0645b7dc0] [c0000000000fd384] do_group_exit+0x64/0x100
>> [c00006b0645b7e00] [c0000000000fd44c] sys_exit_group+0x2c/0x30
>> [c00006b0645b7e30] [c00000000000b960] system_call+0x58/0x6c
>> Instruction dump:
>> 60000000 f8810028 7ca42b78 7cc53378 7ce63b78 7d074378 7d284b78 7d495378 
>> e9410060 e9610068 e9810070 44000022 <7d806378> e9810028 f88c0000 f8ac0008
>> 
>> This happens when removing the PTE by calling the hypervisor using the
>> H_BULK_REMOVE call. This call is processing up to 4 PTEs but is doing a
>> tlbie for each PTE it is processing. This could lead to long time spent in
>> the hypervisor (sometimes up to 4s) and soft lockup being raised because
>> the scheduler is not called in zap_pte_range().
>> 
>> Since the Power7's time, the hypervisor is providing a new hcall
>> H_BLOCK_REMOVE allowing processing up to 8 PTEs with one call to
>> tlbie. By limiting the amount of tlbie generated, this reduces the time
>> spent invalidating the PTEs.
>
> Oh that's a nice feature. I must have an ancient PAPR because I don't
> have it.

The only public document is LoPAPR, which is a stripped down version of
the 2012 PAPR.

cheers

^ permalink raw reply

* Re: [PATCH v4 5/6] powerpc: Add show_user_instructions()
From: Murilo Opsfelder Araujo @ 2018-08-03  0:42 UTC (permalink / raw)
  To: Christophe LEROY
  Cc: linux-kernel, Alastair D'Silva, Andrew Donnellan,
	Balbir Singh, Benjamin Herrenschmidt, Cyril Bur,
	Eric W . Biederman, Joe Perches, Michael Ellerman,
	Michael Neuling, Nicholas Piggin, Paul Mackerras, Simon Guo,
	Sukadev Bhattiprolu, Tobin C . Harding, linuxppc-dev,
	Segher Boessenkool
In-Reply-To: <7209fa95-8d40-14ca-f27a-ce3edb64191e@c-s.fr>

Hi, Christophe.

On Thu, Aug 02, 2018 at 07:26:20AM +0200, Christophe LEROY wrote:
>
>
> Le 01/08/2018 à 23:33, Murilo Opsfelder Araujo a écrit :
> > show_user_instructions() is a slightly modified version of
> > show_instructions() that allows userspace instruction dump.
> >
> > This will be useful within show_signal_msg() to dump userspace
> > instructions of the faulty location.
> >
> > Here is a sample of what show_user_instructions() outputs:
> >
> >    pandafault[10850]: code: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
> >    pandafault[10850]: code: 392988d0 f93f0020 e93f0020 39400048 <99490000> 39200000 7d234b78 383f0040
> >
> > The current->comm and current->pid printed can serve as a glue that
> > links the instructions dump to its originator, allowing messages to be
> > interleaved in the logs.
> >
> > Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> > ---
> >   arch/powerpc/include/asm/stacktrace.h | 13 +++++++++
> >   arch/powerpc/kernel/process.c         | 40 +++++++++++++++++++++++++++
> >   2 files changed, 53 insertions(+)
> >   create mode 100644 arch/powerpc/include/asm/stacktrace.h
> >
> > diff --git a/arch/powerpc/include/asm/stacktrace.h b/arch/powerpc/include/asm/stacktrace.h
> > new file mode 100644
> > index 000000000000..6149b53b3bc8
> > --- /dev/null
> > +++ b/arch/powerpc/include/asm/stacktrace.h
> > @@ -0,0 +1,13 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Stack trace functions.
> > + *
> > + * Copyright 2018, Murilo Opsfelder Araujo, IBM Corporation.
> > + */
> > +
> > +#ifndef _ASM_POWERPC_STACKTRACE_H
> > +#define _ASM_POWERPC_STACKTRACE_H
> > +
> > +void show_user_instructions(struct pt_regs *regs);
> > +
> > +#endif /* _ASM_POWERPC_STACKTRACE_H */
> > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> > index e9533b4d2f08..364645ac732c 100644
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -1299,6 +1299,46 @@ static void show_instructions(struct pt_regs *regs)
> >   	pr_cont("\n");
> >   }
> > +void show_user_instructions(struct pt_regs *regs)
> > +{
> > +	int i;
> > +	const char *prefix = KERN_INFO "%s[%d]: code: ";
> > +	unsigned long pc = regs->nip - (instructions_to_print * 3 / 4 *
> > +					sizeof(int));
> > +
> > +	printk(prefix, current->comm, current->pid);
>
> Why not use pr_info() and remove KERN_INFO from *prefix ?

Because it doesn't compile:

  arch/powerpc/kernel/process.c:1317:10: error: expected ‘)’ before ‘prefix’
    pr_info(prefix, current->comm, current->pid);
            ^
  ./include/linux/printk.h:288:21: note: in definition of macro ‘pr_fmt’
   #define pr_fmt(fmt) fmt
                     ^

`pr_info(prefix, ...)` expands to `printk("\001" "6" prefix, ...)`,
which is an invalid string concatenation.

`pr_info("%s", ...)` expands to `printk("\001" "6" "%s", ...)`, which is
valid.

> > +
> > +	for (i = 0; i < instructions_to_print; i++) {
> > +		int instr;
> > +
> > +		if (!(i % 8) && (i > 0)) {
> > +			pr_cont("\n");
> > +			printk(prefix, current->comm, current->pid);
> > +		}
> > +
> > +#if !defined(CONFIG_BOOKE)
> > +		/* If executing with the IMMU off, adjust pc rather
> > +		 * than print XXXXXXXX.
> > +		 */
> > +		if (!(regs->msr & MSR_IR))
> > +			pc = (unsigned long)phys_to_virt(pc);
>
> Shouldn't this be done outside of the loop, only once ?

I don't think so.

pc gets incremented at the bottom of the loop:

  pc += sizeof(int);

Adjusting pc is necessary at each iteration.  Leaving this block inside
the loop seems correct.

Cheers
Murilo

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Jason Wang @ 2018-08-03  2:41 UTC (permalink / raw)
  To: Michael S. Tsirkin, Anshuman Khandual
  Cc: virtualization, linux-kernel, linuxppc-dev, aik, robh, joe,
	elfring, david, benh, mpe, hch, linuxram, haren, paulus, srikar
In-Reply-To: <20180802235332-mutt-send-email-mst@kernel.org>



On 2018年08月03日 04:55, Michael S. Tsirkin wrote:
> On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote:
>> This patch series is the follow up on the discussions we had before about
>> the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
>> for virito devices (https://patchwork.kernel.org/patch/10417371/). There
>> were suggestions about doing away with two different paths of transactions
>> with the host/QEMU, first being the direct GPA and the other being the DMA
>> API based translations.
>>
>> First patch attempts to create a direct GPA mapping based DMA operations
>> structure called 'virtio_direct_dma_ops' with exact same implementation
>> of the direct GPA path which virtio core currently has but just wrapped in
>> a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
>> the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
>> existing semantics. The second patch does exactly that inside the function
>> virtio_finalize_features(). The third patch removes the default direct GPA
>> path from virtio core forcing it to use DMA API callbacks for all devices.
>> Now with that change, every device must have a DMA operations structure
>> associated with it. The fourth patch adds an additional hook which gives
>> the platform an opportunity to do yet another override if required. This
>> platform hook can be used on POWER Ultravisor based protected guests to
>> load up SWIOTLB DMA callbacks to do the required (as discussed previously
>> in the above mentioned thread how host is allowed to access only parts of
>> the guest GPA range) bounce buffering into the shared memory for all I/O
>> scatter gather buffers to be consumed on the host side.
>>
>> Please go through these patches and review whether this approach broadly
>> makes sense. I will appreciate suggestions, inputs, comments regarding
>> the patches or the approach in general. Thank you.
> Jason did some work on profiling this. Unfortunately he reports
> about 4% extra overhead from this switch on x86 with no vIOMMU.

The test is rather simple, just run pktgen (pktgen_sample01_simple.sh) 
in guest and measure PPS on tap on host.

Thanks

>
> I expect he's writing up the data in more detail, but
> just wanted to let you know this would be one more
> thing to debug before we can just switch to DMA APIs.
>
>
>> Anshuman Khandual (4):
>>    virtio: Define virtio_direct_dma_ops structure
>>    virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
>>    virtio: Force virtio core to use DMA API callbacks for all virtio devices
>>    virtio: Add platform specific DMA API translation for virito devices
>>
>>   arch/powerpc/include/asm/dma-mapping.h |  6 +++
>>   arch/powerpc/platforms/pseries/iommu.c |  6 +++
>>   drivers/virtio/virtio.c                | 72 ++++++++++++++++++++++++++++++++++
>>   drivers/virtio/virtio_pci_common.h     |  3 ++
>>   drivers/virtio/virtio_ring.c           | 65 +-----------------------------
>>   5 files changed, 89 insertions(+), 63 deletions(-)
>>
>> -- 
>> 2.9.3

^ permalink raw reply

* Re: [PATCH v4 5/6] powerpc: Add show_user_instructions()
From: Joe Perches @ 2018-08-03  1:22 UTC (permalink / raw)
  To: Murilo Opsfelder Araujo, Christophe LEROY
  Cc: linux-kernel, Alastair D'Silva, Andrew Donnellan,
	Balbir Singh, Benjamin Herrenschmidt, Cyril Bur,
	Eric W . Biederman, Michael Ellerman, Michael Neuling,
	Nicholas Piggin, Paul Mackerras, Simon Guo, Sukadev Bhattiprolu,
	Tobin C . Harding, linuxppc-dev, Segher Boessenkool
In-Reply-To: <20180803004201.GA5891@kermit-br-ibm-com>

On Thu, 2018-08-02 at 21:42 -0300, Murilo Opsfelder Araujo wrote:
> > > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
[]
> > > @@ -1299,6 +1299,46 @@ static void show_instructions(struct pt_regs *regs)
> > >   	pr_cont("\n");
> > >   }
> > > +void show_user_instructions(struct pt_regs *regs)
> > > +{
> > > +	int i;
> > > +	const char *prefix = KERN_INFO "%s[%d]: code: ";
> > > +	unsigned long pc = regs->nip - (instructions_to_print * 3 / 4 *
> > > +					sizeof(int));
> > > +
> > > +	printk(prefix, current->comm, current->pid);
> > 
> > Why not use pr_info() and remove KERN_INFO from *prefix ?
> 
> Because it doesn't compile:
> 
>   arch/powerpc/kernel/process.c:1317:10: error: expected ‘)’ before ‘prefix’
>     pr_info(prefix, current->comm, current->pid);
>             ^
>   ./include/linux/printk.h:288:21: note: in definition of macro ‘pr_fmt’
>    #define pr_fmt(fmt) fmt
>                      ^

What being suggested is using:

	pr_info("%s[%d]: code: ", current->comm, current->pid);

^ permalink raw reply

* [PATCH] powerpc/powernv: Fix concurrency issue with npu->mmio_atsd_usage
From: Reza Arbab @ 2018-08-03  4:03 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Alistair Popple

We've encountered a performance issue when multiple processors stress
{get,put}_mmio_atsd_reg(). These functions contend for mmio_atsd_usage,
an unsigned long used as a bitmask.

The accesses to mmio_atsd_usage are done using test_and_set_bit_lock()
and clear_bit_unlock(). As implemented, both of these will require a
(successful) stwcx to that same cache line.

What we end up with is thread A, attempting to unlock, being slowed by
other threads repeatedly attempting to lock. A's stwcx instructions fail
and retry because the memory reservation is lost every time a different
thread beats it to the punch.

There may be a long-term way to fix this at a larger scale, but for now
resolve the immediate problem by gating our call to
test_and_set_bit_lock() with one to test_bit(), which is obviously
implemented without using a store.

Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
---
 arch/powerpc/platforms/powernv/npu-dma.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 8cdf91f..c773465 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -437,8 +437,9 @@ static int get_mmio_atsd_reg(struct npu *npu)
 	int i;

 	for (i = 0; i < npu->mmio_atsd_count; i++) {
-		if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
-			return i;
+		if (!test_bit(i, &npu->mmio_atsd_usage))
+			if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
+				return i;
 	}

 	return -ENOSPC;
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH 1/2] powerpc/64s: move machine check SLB flushing to mm/slb.c
From: Nicholas Piggin @ 2018-08-03  4:13 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Nicholas Piggin, kvm-ppc, Gautham R . Shenoy,
	Mahesh Jagannath Salgaonkar, Aneesh Kumar K.V, Akshay Adiga

The machine check code that flushes and restores bolted segments in
real mode belongs in mm/slb.c. This will be used by pseries machine
check and idle code.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  3 ++
 arch/powerpc/kernel/mce_power.c               | 21 ++--------
 arch/powerpc/mm/slb.c                         | 38 +++++++++++++++++++
 3 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 2f74bdc805e0..d4e398185b3a 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -497,6 +497,9 @@ extern void hpte_init_native(void);
 
 extern void slb_initialize(void);
 extern void slb_flush_and_rebolt(void);
+extern void slb_flush_all_realmode(void);
+extern void __slb_restore_bolted_realmode(void);
+extern void slb_restore_bolted_realmode(void);
 
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index d6756af6ec78..50f7b9817246 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -62,11 +62,8 @@ static unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
 #ifdef CONFIG_PPC_BOOK3S_64
 static void flush_and_reload_slb(void)
 {
-	struct slb_shadow *slb;
-	unsigned long i, n;
-
 	/* Invalidate all SLBs */
-	asm volatile("slbmte %0,%0; slbia" : : "r" (0));
+	slb_flush_all_realmode();
 
 #ifdef CONFIG_KVM_BOOK3S_HANDLER
 	/*
@@ -76,22 +73,10 @@ static void flush_and_reload_slb(void)
 	if (get_paca()->kvm_hstate.in_guest)
 		return;
 #endif
-
-	/* For host kernel, reload the SLBs from shadow SLB buffer. */
-	slb = get_slb_shadow();
-	if (!slb)
+	if (early_radix_enabled())
 		return;
 
-	n = min_t(u32, be32_to_cpu(slb->persistent), SLB_MIN_SIZE);
-
-	/* Load up the SLB entries from shadow SLB */
-	for (i = 0; i < n; i++) {
-		unsigned long rb = be64_to_cpu(slb->save_area[i].esid);
-		unsigned long rs = be64_to_cpu(slb->save_area[i].vsid);
-
-		rb = (rb & ~0xFFFul) | i;
-		asm volatile("slbmte %0,%1" : : "r" (rs), "r" (rb));
-	}
+	slb_restore_bolted_realmode();
 }
 #endif
 
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index cb796724a6fc..136db8652577 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -90,6 +90,44 @@ static inline void create_shadowed_slbe(unsigned long ea, int ssize,
 		     : "memory" );
 }
 
+/*
+ * Insert bolted entries into SLB (which may not be empty).
+ */
+void __slb_restore_bolted_realmode(void)
+{
+	struct slb_shadow *p = get_slb_shadow();
+	enum slb_index index;
+
+	 /* No isync needed because realmode. */
+	for (index = 0; index < SLB_NUM_BOLTED; index++) {
+		asm volatile("slbmte  %0,%1" :
+		     : "r" (be64_to_cpu(p->save_area[index].vsid)),
+		       "r" (be64_to_cpu(p->save_area[index].esid)));
+	}
+}
+
+/*
+ * Insert bolted entries into an empty SLB.
+ * This is not the same as rebolt because the bolted segments
+ * (e.g., kstack) are not changed (rebolted).
+ */
+void slb_restore_bolted_realmode(void)
+{
+	__slb_restore_bolted_realmode();
+	get_paca()->slb_cache_ptr = 0;
+}
+
+/*
+ * This flushes all SLB entries including 0, so it must be realmode.
+ */
+void slb_flush_all_realmode(void)
+{
+	/*
+	 * This flushes all SLB entries including 0, so it must be realmode.
+	 */
+	asm volatile("slbmte %0,%0; slbia" : : "r" (0));
+}
+
 static void __slb_flush_and_rebolt(void)
 {
 	/* If you change this make sure you change SLB_NUM_BOLTED
-- 
2.17.0

^ permalink raw reply related

* [PATCH 2/2] powerpc/64s: reimplement book3s idle code in C
From: Nicholas Piggin @ 2018-08-03  4:13 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Nicholas Piggin, kvm-ppc, Gautham R . Shenoy,
	Mahesh Jagannath Salgaonkar, Aneesh Kumar K.V, Akshay Adiga
In-Reply-To: <20180803041350.25493-1-npiggin@gmail.com>

Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
speific HV idle code to the powernv platform code. Generic book3s
assembly stubs are kept in common code and used only to save and
restore the stack frame and non-volatile GPRs just before going to
idle and just after waking up, which can return into C code.

Moving the HMI, SPR, OPAL, locking, etc. to C is the only real
way this stuff will cope with non-trivial new CPU implementation
details, firmware changes, etc., without becoming unmaintainable.

This is not a strict translation to C code, there are some
differences.

- Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
  but saves and restores them all explicitly.

- The optimisation where EC=ESL=0 idle modes did not have to save
  GPRs or change MSR is restored, because it's now simple to do.
  State loss modes that did not actually lose GPRs can use this
  optimization too.

- KVM secondary entry code is now more of a call/return style
  rather than spaghetti. nap_state_lost is not required beccause
  KVM always returns via NVGPR restorig path.

This seems pretty solid, so needs more review and testing now. The
KVM changes are pretty significant and complicated. POWER7 needs to
be tested too.

Open question:
- Why does P9 restore some of the PMU SPRs (but not others), and
  P8 only zeroes them?

Since RFC v1:
- Now tested and working with POWER9 hash and radix.
- KVM support added. This took a bit of work to untangle and might
  still have some issues, but POWER9 seems to work including hash on
  radix with dependent threads mode.
- This snowballed a bit because of KVM and other details making it
  not feasible to leave POWER7/8 code alone. That's only half done
  at the moment.
- So far this trades about 800 lines of asm for 500 of C. With POWER7/8
  support done it might be another hundred or so lines of C.

Since RFC v2:
- Fixed deep state SLB reloading
- Now tested and working with POWER8.
- Accounted for most feedback.

Since RFC v3:
- Rebased to powerpc merge + idle state bugfix
- Split SLB flush/restore code out and shared with MCE code (pseries
  MCE patches can also use).
- More testing on POWER8 including KVM with secondaries.
- Performance testing looks good. EC=ESL=0 is about 5% faster, other
  stop states look a little faster too.
- Adjusted SPR saving to handler POWER7, haven't tested it.
---
 arch/powerpc/include/asm/cpuidle.h       |  19 +-
 arch/powerpc/include/asm/paca.h          |  40 +-
 arch/powerpc/include/asm/processor.h     |   9 +-
 arch/powerpc/include/asm/reg.h           |   7 +-
 arch/powerpc/kernel/asm-offsets.c        |  18 -
 arch/powerpc/kernel/dt_cpu_ftrs.c        |  21 +-
 arch/powerpc/kernel/exceptions-64s.S     |  17 +-
 arch/powerpc/kernel/idle_book3s.S        | 998 ++---------------------
 arch/powerpc/kernel/setup-common.c       |   4 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |  90 +-
 arch/powerpc/platforms/powernv/idle.c    | 816 ++++++++++++++----
 arch/powerpc/platforms/powernv/subcore.c |   2 +-
 arch/powerpc/xmon/xmon.c                 |  25 +-
 13 files changed, 859 insertions(+), 1207 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index 43e5f31fe64d..9844b3ded187 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -27,10 +27,11 @@
  * the THREAD_WINKLE_BITS are set, which indicate which threads have not
  * yet woken from the winkle state.
  */
-#define PNV_CORE_IDLE_LOCK_BIT			0x10000000
+#define NR_PNV_CORE_IDLE_LOCK_BIT		28
+#define PNV_CORE_IDLE_LOCK_BIT			(1ULL << NR_PNV_CORE_IDLE_LOCK_BIT)
 
+#define PNV_CORE_IDLE_WINKLE_COUNT_SHIFT	16
 #define PNV_CORE_IDLE_WINKLE_COUNT		0x00010000
-#define PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT	0x00080000
 #define PNV_CORE_IDLE_WINKLE_COUNT_BITS		0x000F0000
 #define PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT	8
 #define PNV_CORE_IDLE_THREAD_WINKLE_BITS	0x0000FF00
@@ -68,16 +69,6 @@
 #define ERR_DEEP_STATE_ESL_MISMATCH	-2
 
 #ifndef __ASSEMBLY__
-/* Additional SPRs that need to be saved/restored during stop */
-struct stop_sprs {
-	u64 pid;
-	u64 ldbar;
-	u64 fscr;
-	u64 hfscr;
-	u64 mmcr1;
-	u64 mmcr2;
-	u64 mmcra;
-};
 
 #define PNV_IDLE_NAME_LEN    16
 struct pnv_idle_states_t {
@@ -92,10 +83,6 @@ struct pnv_idle_states_t {
 
 extern struct pnv_idle_states_t *pnv_idle_states;
 extern int nr_pnv_idle_states;
-extern u32 pnv_fastsleep_workaround_at_entry[];
-extern u32 pnv_fastsleep_workaround_at_exit[];
-
-extern u64 pnv_first_deep_stop_state;
 
 unsigned long pnv_cpu_offline(unsigned int cpu);
 int validate_psscr_val_mask(u64 *psscr_val, u64 *psscr_mask, u32 flags);
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 4e9cede5a7e7..d2cee5ebaaa1 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -168,7 +168,6 @@ struct paca_struct {
 	u8 irq_happened;		/* irq happened while soft-disabled */
 	u8 io_sync;			/* writel() needs spin_unlock sync */
 	u8 irq_work_pending;		/* IRQ_WORK interrupt while soft-disable */
-	u8 nap_state_lost;		/* NV GPR values lost in power7_idle */
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 	u8 pmcregs_in_use;		/* pseries puts this in lppaca */
 #endif
@@ -178,23 +177,30 @@ struct paca_struct {
 #endif
 
 #ifdef CONFIG_PPC_POWERNV
-	/* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
-	u32 *core_idle_state_ptr;
-	u8 thread_idle_state;		/* PNV_THREAD_RUNNING/NAP/SLEEP	*/
-	/* Mask to indicate thread id in core */
-	u8 thread_mask;
-	/* Mask to denote subcore sibling threads */
-	u8 subcore_sibling_mask;
-	/* Flag to request this thread not to stop */
-	atomic_t dont_stop;
-	/* The PSSCR value that the kernel requested before going to stop */
-	u64 requested_psscr;
+	/* PowerNV idle fields */
+	/* PNV_CORE_IDLE_* bits, all siblings work on thread 0 paca */
+	unsigned long idle_state;
+	union {
+		/* P7/P8 specific fields */
+		struct {
+			/* PNV_THREAD_RUNNING/NAP/SLEEP	*/
+			u8 thread_idle_state;
+			/* Mask to indicate thread id in core */
+			u8 thread_mask;
+			/* Mask to denote subcore sibling threads */
+			u8 subcore_sibling_mask;
+		};
 
-	/*
-	 * Save area for additional SPRs that need to be
-	 * saved/restored during cpuidle stop.
-	 */
-	struct stop_sprs stop_sprs;
+		/* P9 specific fields */
+		struct {
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+			/* The PSSCR value that the kernel requested before going to stop */
+			u64 requested_psscr;
+			/* Flag to request this thread not to stop */
+			atomic_t dont_stop;
+#endif
+		};
+	};
 #endif
 
 #ifdef CONFIG_PPC_BOOK3S_64
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 52fadded5c1e..3e00d959f628 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -506,14 +506,17 @@ static inline unsigned long get_clean_sp(unsigned long sp, int is_32)
 }
 #endif
 
+/* asm stubs */
+extern unsigned long isa3_idle_stop_noloss(unsigned long psscr_val);
+extern unsigned long isa3_idle_stop_mayloss(unsigned long psscr_val);
+extern unsigned long isa206_idle_insn_mayloss(unsigned long type);
+
 extern unsigned long cpuidle_disable;
 enum idle_boot_override {IDLE_NO_OVERRIDE = 0, IDLE_POWERSAVE_OFF};
 
 extern int powersave_nap;	/* set if nap mode can be used in idle loop */
-extern unsigned long power7_idle_insn(unsigned long type); /* PNV_THREAD_NAP/etc*/
+
 extern void power7_idle_type(unsigned long type);
-extern unsigned long power9_idle_stop(unsigned long psscr_val);
-extern unsigned long power9_offline_stop(unsigned long psscr_val);
 extern void power9_idle_type(unsigned long stop_psscr_val,
 			      unsigned long stop_psscr_mask);
 
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 486b7c83b8c5..6dd294b1b216 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -752,10 +752,9 @@
 #define	  SRR1_WAKERESET	0x00100000 /* System reset */
 #define   SRR1_WAKEHDBELL	0x000c0000 /* Hypervisor doorbell on P8 */
 #define	  SRR1_WAKESTATE	0x00030000 /* Powersave exit mask [46:47] */
-#define	  SRR1_WS_DEEPEST	0x00030000 /* Some resources not maintained,
-					  * may not be recoverable */
-#define	  SRR1_WS_DEEPER	0x00020000 /* Some resources not maintained */
-#define	  SRR1_WS_DEEP		0x00010000 /* All resources maintained */
+#define	  SRR1_WS_HVLOSS	0x00030000 /* HV resources not maintained */
+#define	  SRR1_WS_GPRLOSS	0x00020000 /* GPRs not maintained */
+#define	  SRR1_WS_NOLOSS	0x00010000 /* All resources maintained */
 #define   SRR1_PROGTM		0x00200000 /* TM Bad Thing */
 #define   SRR1_PROGFPE		0x00100000 /* Floating Point Enabled */
 #define   SRR1_PROGILL		0x00080000 /* Illegal instruction */
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 89cf15566c4e..7834256585f1 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -256,7 +256,6 @@ int main(void)
 	OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime);
 	OFFSET(ACCOUNT_SYSTEM_TIME, paca_struct, accounting.stime);
 	OFFSET(PACA_TRAP_SAVE, paca_struct, trap_save);
-	OFFSET(PACA_NAPSTATELOST, paca_struct, nap_state_lost);
 	OFFSET(PACA_SPRG_VDSO, paca_struct, sprg_vdso);
 #else /* CONFIG_PPC64 */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
@@ -761,23 +760,6 @@ int main(void)
 	OFFSET(VCPU_TIMING_LAST_ENTER_TBL, kvm_vcpu, arch.timing_last_enter.tv32.tbl);
 #endif
 
-#ifdef CONFIG_PPC_POWERNV
-	OFFSET(PACA_CORE_IDLE_STATE_PTR, paca_struct, core_idle_state_ptr);
-	OFFSET(PACA_THREAD_IDLE_STATE, paca_struct, thread_idle_state);
-	OFFSET(PACA_THREAD_MASK, paca_struct, thread_mask);
-	OFFSET(PACA_SUBCORE_SIBLING_MASK, paca_struct, subcore_sibling_mask);
-	OFFSET(PACA_REQ_PSSCR, paca_struct, requested_psscr);
-	OFFSET(PACA_DONT_STOP, paca_struct, dont_stop);
-#define STOP_SPR(x, f)	OFFSET(x, paca_struct, stop_sprs.f)
-	STOP_SPR(STOP_PID, pid);
-	STOP_SPR(STOP_LDBAR, ldbar);
-	STOP_SPR(STOP_FSCR, fscr);
-	STOP_SPR(STOP_HFSCR, hfscr);
-	STOP_SPR(STOP_MMCR1, mmcr1);
-	STOP_SPR(STOP_MMCR2, mmcr2);
-	STOP_SPR(STOP_MMCRA, mmcra);
-#endif
-
 	DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
 	DEFINE(PPC_DBELL_MSGTYPE, PPC_DBELL_MSGTYPE);
 
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index f432054234a4..d635d78facdc 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -71,7 +71,6 @@ static int hv_mode;
 
 static struct {
 	u64	lpcr;
-	u64	lpcr_clear;
 	u64	hfscr;
 	u64	fscr;
 } system_registers;
@@ -80,24 +79,7 @@ static void (*init_pmu_registers)(void);
 
 static void __restore_cpu_cpufeatures(void)
 {
-	u64 lpcr;
-
-	/*
-	 * LPCR is restored by the power on engine already. It can be changed
-	 * after early init e.g., by radix enable, and we have no unified API
-	 * for saving and restoring such SPRs.
-	 *
-	 * This ->restore hook should really be removed from idle and register
-	 * restore moved directly into the idle restore code, because this code
-	 * doesn't know how idle is implemented or what it needs restored here.
-	 *
-	 * The best we can do to accommodate secondary boot and idle restore
-	 * for now is "or" LPCR with existing.
-	 */
-	lpcr = mfspr(SPRN_LPCR);
-	lpcr |= system_registers.lpcr;
-	lpcr &= ~system_registers.lpcr_clear;
-	mtspr(SPRN_LPCR, lpcr);
+	mtspr(SPRN_LPCR, system_registers.lpcr);
 	if (hv_mode) {
 		mtspr(SPRN_LPID, 0);
 		mtspr(SPRN_HFSCR, system_registers.hfscr);
@@ -318,7 +300,6 @@ static int __init feat_enable_mmu_hash_v3(struct dt_cpu_feature *f)
 {
 	u64 lpcr;
 
-	system_registers.lpcr_clear |= (LPCR_ISL | LPCR_UPRT | LPCR_HR);
 	lpcr = mfspr(SPRN_LPCR);
 	lpcr &= ~(LPCR_ISL | LPCR_UPRT | LPCR_HR);
 	mtspr(SPRN_LPCR, lpcr);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 7a672dafd94f..fe1d9b091ac7 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -136,8 +136,9 @@ TRAMP_KVM(PACA_EXNMI, 0x100)
 
 #ifdef CONFIG_PPC_P7_NAP
 EXC_COMMON_BEGIN(system_reset_idle_common)
-	mfspr	r12,SPRN_SRR1
-	b	pnv_powersave_wakeup
+	mfspr	r3,SPRN_SRR1
+	bltlr	cr3	/* no state loss, return to idle caller */
+	b	idle_return_gpr_loss
 #endif
 
 /*
@@ -416,17 +417,17 @@ EXC_COMMON_BEGIN(machine_check_idle_common)
 	 * Then decrement MCE nesting after finishing with the stack.
 	 */
 	ld	r3,_MSR(r1)
+	ld	r4,_LINK(r1)
 
 	lhz	r11,PACA_IN_MCE(r13)
 	subi	r11,r11,1
 	sth	r11,PACA_IN_MCE(r13)
 
-	/* Turn off the RI bit because SRR1 is used by idle wakeup code. */
-	/* Recoverability could be improved by reducing the use of SRR1. */
-	li	r11,0
-	mtmsrd	r11,1
-
-	b	pnv_powersave_wakeup_mce
+	mtlr	r4
+	rlwinm	r10,r3,47-31,30,31
+	cmpwi	cr3,r10,2
+	bltlr	cr3	/* no state loss, return to idle caller */
+	b	idle_return_gpr_loss
 #endif
 	/*
 	 * Handle machine check early in real mode. We come here with
diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index 7f5ac2e8581b..6375b29d1c12 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -1,6 +1,7 @@
 /*
- *  This file contains idle entry/exit functions for POWER7,
- *  POWER8 and POWER9 CPUs.
+ *  This file contains general idle entry/exit functions. The platform / CPU
+ *  must call the correct save/restore functions and ensure SPRs are saved
+ *  and restored correctly, handle KVM, interrupts, etc.
  *
  *  This program is free software; you can redistribute it and/or
  *  modify it under the terms of the GNU General Public License
@@ -8,220 +9,67 @@
  *  2 of the License, or (at your option) any later version.
  */
 
-#include <linux/threads.h>
-#include <asm/processor.h>
-#include <asm/page.h>
-#include <asm/cputable.h>
-#include <asm/thread_info.h>
 #include <asm/ppc_asm.h>
 #include <asm/asm-offsets.h>
 #include <asm/ppc-opcode.h>
-#include <asm/hw_irq.h>
-#include <asm/kvm_book3s_asm.h>
-#include <asm/opal.h>
 #include <asm/cpuidle.h>
-#include <asm/exception-64s.h>
-#include <asm/book3s/64/mmu-hash.h>
-#include <asm/mmu.h>
-#include <asm/asm-compat.h>
-#include <asm/feature-fixups.h>
-
-#undef DEBUG
-
-/*
- * Use unused space in the interrupt stack to save and restore
- * registers for winkle support.
- */
-#define _MMCR0	GPR0
-#define _SDR1	GPR3
-#define _PTCR	GPR3
-#define _RPR	GPR4
-#define _SPURR	GPR5
-#define _PURR	GPR6
-#define _TSCR	GPR7
-#define _DSCR	GPR8
-#define _AMOR	GPR9
-#define _WORT	GPR10
-#define _WORC	GPR11
-#define _LPCR	GPR12
-
-#define PSSCR_EC_ESL_MASK_SHIFTED          (PSSCR_EC | PSSCR_ESL) >> 16
-
-	.text
 
 /*
- * Used by threads before entering deep idle states. Saves SPRs
- * in interrupt stack frame
- */
-save_sprs_to_stack:
-	/*
-	 * Note all register i.e per-core, per-subcore or per-thread is saved
-	 * here since any thread in the core might wake up first
-	 */
-BEGIN_FTR_SECTION
-	/*
-	 * Note - SDR1 is dropped in Power ISA v3. Hence not restoring
-	 * SDR1 here
-	 */
-	mfspr	r3,SPRN_PTCR
-	std	r3,_PTCR(r1)
-	mfspr	r3,SPRN_LPCR
-	std	r3,_LPCR(r1)
-FTR_SECTION_ELSE
-	mfspr	r3,SPRN_SDR1
-	std	r3,_SDR1(r1)
-ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
-	mfspr	r3,SPRN_RPR
-	std	r3,_RPR(r1)
-	mfspr	r3,SPRN_SPURR
-	std	r3,_SPURR(r1)
-	mfspr	r3,SPRN_PURR
-	std	r3,_PURR(r1)
-	mfspr	r3,SPRN_TSCR
-	std	r3,_TSCR(r1)
-	mfspr	r3,SPRN_DSCR
-	std	r3,_DSCR(r1)
-	mfspr	r3,SPRN_AMOR
-	std	r3,_AMOR(r1)
-	mfspr	r3,SPRN_WORT
-	std	r3,_WORT(r1)
-	mfspr	r3,SPRN_WORC
-	std	r3,_WORC(r1)
-/*
- * On POWER9, there are idle states such as stop4, invoked via cpuidle,
- * that lose hypervisor resources. In such cases, we need to save
- * additional SPRs before entering those idle states so that they can
- * be restored to their older values on wakeup from the idle state.
+ * Desired PSSCR in r3
  *
- * On POWER8, the only such deep idle state is winkle which is used
- * only in the context of CPU-Hotplug, where these additional SPRs are
- * reinitiazed to a sane value. Hence there is no need to save/restore
- * these SPRs.
+ * No state will be lost regardless of wakeup mechanism (interrupt or NIA).
+ * Interrupt driven wakeup may clobber volatiles, and should blr (with LR
+ * unchanged) to return to caller with r3 set according to caller's expected
+ * return code (for Book3S/64 that is SRR1).
  */
-BEGIN_FTR_SECTION
-	blr
-END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
-
-power9_save_additional_sprs:
-	mfspr	r3, SPRN_PID
-	mfspr	r4, SPRN_LDBAR
-	std	r3, STOP_PID(r13)
-	std	r4, STOP_LDBAR(r13)
-
-	mfspr	r3, SPRN_FSCR
-	mfspr	r4, SPRN_HFSCR
-	std	r3, STOP_FSCR(r13)
-	std	r4, STOP_HFSCR(r13)
-
-	mfspr	r3, SPRN_MMCRA
-	mfspr	r4, SPRN_MMCR0
-	std	r3, STOP_MMCRA(r13)
-	std	r4, _MMCR0(r1)
-
-	mfspr	r3, SPRN_MMCR1
-	mfspr	r4, SPRN_MMCR2
-	std	r3, STOP_MMCR1(r13)
-	std	r4, STOP_MMCR2(r13)
-	blr
-
-power9_restore_additional_sprs:
-	ld	r3,_LPCR(r1)
-	ld	r4, STOP_PID(r13)
-	mtspr	SPRN_LPCR,r3
-	mtspr	SPRN_PID, r4
-
-	ld	r3, STOP_LDBAR(r13)
-	ld	r4, STOP_FSCR(r13)
-	mtspr	SPRN_LDBAR, r3
-	mtspr	SPRN_FSCR, r4
-
-	ld	r3, STOP_HFSCR(r13)
-	ld	r4, STOP_MMCRA(r13)
-	mtspr	SPRN_HFSCR, r3
-	mtspr	SPRN_MMCRA, r4
-
-	ld	r3, _MMCR0(r1)
-	ld	r4, STOP_MMCR1(r13)
-	mtspr	SPRN_MMCR0, r3
-	mtspr	SPRN_MMCR1, r4
-
-	ld	r3, STOP_MMCR2(r13)
-	ld	r4, PACA_SPRG_VDSO(r13)
-	mtspr	SPRN_MMCR2, r3
-	mtspr	SPRN_SPRG3, r4
+_GLOBAL(isa3_idle_stop_noloss)
+	mtspr 	SPRN_PSSCR,r3
+	PPC_STOP
+	li	r3,0
 	blr
 
 /*
- * Used by threads when the lock bit of core_idle_state is set.
- * Threads will spin in HMT_LOW until the lock bit is cleared.
- * r14 - pointer to core_idle_state
- * r15 - used to load contents of core_idle_state
- * r9  - used as a temporary variable
+ * Desired PSSCR in r3
+ *
+ * GPRs may be lost, so they are saved here. Wakeup is by interrupt only.
+ * Wakeup interrupt returns to the caller by calling isa3_idle_wake_gpr_loss
+ * with r3 set to desired return value.
+ *
+ * A wakeup without GPR loss may alteratively be handled as in
+ * isa3_idle_stop_noloss as an optimisation.
+ *
+ * Caller is responsible for restoring SPRs, MSR, etc.
  */
-
-core_idle_lock_held:
-	HMT_LOW
-3:	lwz	r15,0(r14)
-	andis.	r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
-	bne	3b
-	HMT_MEDIUM
-	lwarx	r15,0,r14
-	andis.	r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
-	bne-	core_idle_lock_held
-	blr
+_GLOBAL(isa3_idle_stop_mayloss)
+	mtspr 	SPRN_PSSCR,r3
+	std	r1,PACAR1(r13)
+	mflr	r4
+	mfcr	r5
+	/* use stack red zone rather than a new frame */
+	addi	r6,r1,-INT_FRAME_SIZE
+	SAVE_GPR(2, r6)
+	SAVE_NVGPRS(r6)
+	std	r4,_LINK(r6)
+	std	r5,_CCR(r6)
+	PPC_STOP
+	b	.	/* catch bugs */
 
 /*
- * Pass requested state in r3:
- *	r3 - PNV_THREAD_NAP/SLEEP/WINKLE in POWER8
- *	   - Requested PSSCR value in POWER9
+ * Desired return value in r3
  *
- * Address of idle handler to branch to in realmode in r4
+ * Idle wakeup can call this after calling isa3_idle_stop_loss to
+ * return to its caller with r3 as return code.
  */
-pnv_powersave_common:
-	/* Use r3 to pass state nap/sleep/winkle */
-	/* NAP is a state loss, we create a regs frame on the
-	 * stack, fill it up with the state we care about and
-	 * stick a pointer to it in PACAR1. We really only
-	 * need to save PC, some CR bits and the NV GPRs,
-	 * but for now an interrupt frame will do.
-	 */
-	mtctr	r4
-
-	mflr	r0
-	std	r0,16(r1)
-	stdu	r1,-INT_FRAME_SIZE(r1)
-	std	r0,_LINK(r1)
-	std	r0,_NIP(r1)
-
-	/* We haven't lost state ... yet */
-	li	r0,0
-	stb	r0,PACA_NAPSTATELOST(r13)
-
-	/* Continue saving state */
-	SAVE_GPR(2, r1)
-	SAVE_NVGPRS(r1)
-	mfcr	r5
-	std	r5,_CCR(r1)
-	std	r1,PACAR1(r13)
-
-BEGIN_FTR_SECTION
-	/*
-	 * POWER9 does not require real mode to stop, and presently does not
-	 * set hwthread_state for KVM (threads don't share MMU context), so
-	 * we can remain in virtual mode for this.
-	 */
-	bctr
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
-	/*
-	 * POWER8
-	 * Go to real mode to do the nap, as required by the architecture.
-	 * Also, we need to be in real mode before setting hwthread_state,
-	 * because as soon as we do that, another thread can switch
-	 * the MMU context to the guest.
-	 */
-	LOAD_REG_IMMEDIATE(r7, MSR_IDLE)
-	mtmsrd	r7,0
-	bctr
+_GLOBAL(idle_return_gpr_loss)
+	ld	r1,PACAR1(r13)
+	addi	r6,r1,-INT_FRAME_SIZE
+	ld	r4,_LINK(r6)
+	ld	r5,_CCR(r6)
+	REST_NVGPRS(r6)
+	REST_GPR(2, r6)
+	mtlr	r4
+	mtcr	r5
+	blr
 
 /*
  * This is the sequence required to execute idle instructions, as
@@ -234,723 +82,53 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 	ld	r0,0(r1);					\
 236:	cmpd	cr0,r0,r0;					\
 	bne	236b;						\
-	IDLE_INST;
-
-
-	.globl pnv_enter_arch207_idle_mode
-pnv_enter_arch207_idle_mode:
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-	/* Tell KVM we're entering idle */
-	li	r4,KVM_HWTHREAD_IN_IDLE
-	/******************************************************/
-	/*  N O T E   W E L L    ! ! !    N O T E   W E L L   */
-	/* The following store to HSTATE_HWTHREAD_STATE(r13)  */
-	/* MUST occur in real mode, i.e. with the MMU off,    */
-	/* and the MMU must stay off until we clear this flag */
-	/* and test HSTATE_HWTHREAD_REQ(r13) in               */
-	/* pnv_powersave_wakeup in this file.                 */
-	/* The reason is that another thread can switch the   */
-	/* MMU to a guest context whenever this flag is set   */
-	/* to KVM_HWTHREAD_IN_IDLE, and if the MMU was on,    */
-	/* that would potentially cause this thread to start  */
-	/* executing instructions from guest memory in        */
-	/* hypervisor mode, leading to a host crash or data   */
-	/* corruption, or worse.                              */
-	/******************************************************/
-	stb	r4,HSTATE_HWTHREAD_STATE(r13)
-#endif
-	stb	r3,PACA_THREAD_IDLE_STATE(r13)
-	cmpwi	cr3,r3,PNV_THREAD_SLEEP
-	bge	cr3,2f
-	IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)
-	/* No return */
-2:
-	/* Sleep or winkle */
-	lbz	r7,PACA_THREAD_MASK(r13)
-	ld	r14,PACA_CORE_IDLE_STATE_PTR(r13)
-	li	r5,0
-	beq	cr3,3f
-	lis	r5,PNV_CORE_IDLE_WINKLE_COUNT@h
-3:
-lwarx_loop1:
-	lwarx	r15,0,r14
-
-	andis.	r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
-	bnel-	core_idle_lock_held
-
-	add	r15,r15,r5			/* Add if winkle */
-	andc	r15,r15,r7			/* Clear thread bit */
-
-	andi.	r9,r15,PNV_CORE_IDLE_THREAD_BITS
-
-/*
- * If cr0 = 0, then current thread is the last thread of the core entering
- * sleep. Last thread needs to execute the hardware bug workaround code if
- * required by the platform.
- * Make the workaround call unconditionally here. The below branch call is
- * patched out when the idle states are discovered if the platform does not
- * require it.
- */
-.global pnv_fastsleep_workaround_at_entry
-pnv_fastsleep_workaround_at_entry:
-	beq	fastsleep_workaround_at_entry
-
-	stwcx.	r15,0,r14
-	bne-	lwarx_loop1
-	isync
-
-common_enter: /* common code for all the threads entering sleep or winkle */
-	bgt	cr3,enter_winkle
-	IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
-
-fastsleep_workaround_at_entry:
-	oris	r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
-	stwcx.	r15,0,r14
-	bne-	lwarx_loop1
-	isync
-
-	/* Fast sleep workaround */
-	li	r3,1
-	li	r4,1
-	bl	opal_config_cpu_idle_state
-
-	/* Unlock */
-	xoris	r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
-	lwsync
-	stw	r15,0(r14)
-	b	common_enter
-
-enter_winkle:
-	bl	save_sprs_to_stack
-
-	IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
-
-/*
- * r3 - PSSCR value corresponding to the requested stop state.
- */
-power_enter_stop:
-/*
- * Check if we are executing the lite variant with ESL=EC=0
- */
-	andis.   r4,r3,PSSCR_EC_ESL_MASK_SHIFTED
-	clrldi   r3,r3,60 /* r3 = Bits[60:63] = Requested Level (RL) */
-	bne	 .Lhandle_esl_ec_set
-	PPC_STOP
-	li	r3,0  /* Since we didn't lose state, return 0 */
-	std	r3, PACA_REQ_PSSCR(r13)
-
-	/*
-	 * pnv_wakeup_noloss() expects r12 to contain the SRR1 value so
-	 * it can determine if the wakeup reason is an HMI in
-	 * CHECK_HMI_INTERRUPT.
-	 *
-	 * However, when we wakeup with ESL=0, SRR1 will not contain the wakeup
-	 * reason, so there is no point setting r12 to SRR1.
-	 *
-	 * Further, we clear r12 here, so that we don't accidentally enter the
-	 * HMI in pnv_wakeup_noloss() if the value of r12[42:45] == WAKE_HMI.
-	 */
-	li	r12, 0
-	b 	pnv_wakeup_noloss
-
-.Lhandle_esl_ec_set:
-BEGIN_FTR_SECTION
-	/*
-	 * POWER9 DD2.0 or earlier can incorrectly set PMAO when waking up after
-	 * a state-loss idle. Saving and restoring MMCR0 over idle is a
-	 * workaround.
-	 */
-	mfspr	r4,SPRN_MMCR0
-	std	r4,_MMCR0(r1)
-END_FTR_SECTION_IFCLR(CPU_FTR_POWER9_DD2_1)
-
-/*
- * Check if the requested state is a deep idle state.
- */
-	LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state)
-	ld	r4,ADDROFF(pnv_first_deep_stop_state)(r5)
-	cmpd	r3,r4
-	bge	.Lhandle_deep_stop
-	PPC_STOP	/* Does not return (system reset interrupt) */
-
-.Lhandle_deep_stop:
-/*
- * Entering deep idle state.
- * Clear thread bit in PACA_CORE_IDLE_STATE, save SPRs to
- * stack and enter stop
- */
-	lbz     r7,PACA_THREAD_MASK(r13)
-	ld      r14,PACA_CORE_IDLE_STATE_PTR(r13)
-
-lwarx_loop_stop:
-	lwarx   r15,0,r14
-	andis.	r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
-	bnel-	core_idle_lock_held
-	andc    r15,r15,r7                      /* Clear thread bit */
-
-	stwcx.  r15,0,r14
-	bne-    lwarx_loop_stop
-	isync
-
-	bl	save_sprs_to_stack
-
-	PPC_STOP	/* Does not return (system reset interrupt) */
-
-/*
- * Entered with MSR[EE]=0 and no soft-masked interrupts pending.
- * r3 contains desired idle state (PNV_THREAD_NAP/SLEEP/WINKLE).
- */
-_GLOBAL(power7_idle_insn)
-	/* Now check if user or arch enabled NAP mode */
-	LOAD_REG_ADDR(r4, pnv_enter_arch207_idle_mode)
-	b	pnv_powersave_common
-
-#define CHECK_HMI_INTERRUPT						\
-BEGIN_FTR_SECTION_NESTED(66);						\
-	rlwinm	r0,r12,45-31,0xf;  /* extract wake reason field (P8) */	\
-FTR_SECTION_ELSE_NESTED(66);						\
-	rlwinm	r0,r12,45-31,0xe;  /* P7 wake reason field is 3 bits */	\
-ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66);		\
-	cmpwi	r0,0xa;			/* Hypervisor maintenance ? */	\
-	bne+	20f;							\
-	/* Invoke opal call to handle hmi */				\
-	ld	r2,PACATOC(r13);					\
-	ld	r1,PACAR1(r13);						\
-	std	r3,ORIG_GPR3(r1);	/* Save original r3 */		\
-	li	r3,0;			/* NULL argument */		\
-	bl	hmi_exception_realmode;					\
-	nop;								\
-	ld	r3,ORIG_GPR3(r1);	/* Restore original r3 */	\
-20:	nop;
-
-/*
- * Entered with MSR[EE]=0 and no soft-masked interrupts pending.
- * r3 contains desired PSSCR register value.
- *
- * Offline (CPU unplug) case also must notify KVM that the CPU is
- * idle.
- */
-_GLOBAL(power9_offline_stop)
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-	/*
-	 * Tell KVM we're entering idle.
-	 * This does not have to be done in real mode because the P9 MMU
-	 * is independent per-thread. Some steppings share radix/hash mode
-	 * between threads, but in that case KVM has a barrier sync in real
-	 * mode before and after switching between radix and hash.
-	 */
-	li	r4,KVM_HWTHREAD_IN_IDLE
-	stb	r4,HSTATE_HWTHREAD_STATE(r13)
-#endif
-	/* fall through */
-
-_GLOBAL(power9_idle_stop)
-	std	r3, PACA_REQ_PSSCR(r13)
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-BEGIN_FTR_SECTION
-	sync
-	lwz	r5, PACA_DONT_STOP(r13)
-	cmpwi	r5, 0
-	bne	1f
-END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_XER_SO_BUG)
-#endif
-	mtspr 	SPRN_PSSCR,r3
-	LOAD_REG_ADDR(r4,power_enter_stop)
-	b	pnv_powersave_common
-	/* No return */
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-1:
-	/*
-	 * We get here when TM / thread reconfiguration bug workaround
-	 * code wants to get the CPU into SMT4 mode, and therefore
-	 * we are being asked not to stop.
-	 */
-	li	r3, 0
-	std	r3, PACA_REQ_PSSCR(r13)
-	blr		/* return 0 for wakeup cause / SRR1 value */
-#endif
+	IDLE_INST;						\
+	b	.	/* catch bugs */
 
 /*
- * Called from machine check handler for powersave wakeups.
- * Low level machine check processing has already been done. Now just
- * go through the wake up path to get everything in order.
+ * Desired instruction type in r3
  *
- * r3 - The original SRR1 value.
- * Original SRR[01] have been clobbered.
- * MSR_RI is clear.
- */
-.global pnv_powersave_wakeup_mce
-pnv_powersave_wakeup_mce:
-	/* Set cr3 for pnv_powersave_wakeup */
-	rlwinm	r11,r3,47-31,30,31
-	cmpwi	cr3,r11,2
-
-	/*
-	 * Now put the original SRR1 with SRR1_WAKEMCE_RESVD as the wake
-	 * reason into r12, which allows reuse of the system reset wakeup
-	 * code without being mistaken for another type of wakeup.
-	 */
-	oris	r12,r3,SRR1_WAKEMCE_RESVD@h
-
-	b	pnv_powersave_wakeup
-
-/*
- * Called from reset vector for powersave wakeups.
- * cr3 - set to gt if waking up with partial/complete hypervisor state loss
- * r12 - SRR1
- */
-.global pnv_powersave_wakeup
-pnv_powersave_wakeup:
-	ld	r2, PACATOC(r13)
-
-BEGIN_FTR_SECTION
-	bl	pnv_restore_hyp_resource_arch300
-FTR_SECTION_ELSE
-	bl	pnv_restore_hyp_resource_arch207
-ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
-
-	li	r0,PNV_THREAD_RUNNING
-	stb	r0,PACA_THREAD_IDLE_STATE(r13)	/* Clear thread state */
-
-	mr	r3,r12
-
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-	lbz	r0,HSTATE_HWTHREAD_STATE(r13)
-	cmpwi	r0,KVM_HWTHREAD_IN_KERNEL
-	beq	0f
-	li	r0,KVM_HWTHREAD_IN_KERNEL
-	stb	r0,HSTATE_HWTHREAD_STATE(r13)
-	/* Order setting hwthread_state vs. testing hwthread_req */
-	sync
-0:	lbz	r0,HSTATE_HWTHREAD_REQ(r13)
-	cmpwi	r0,0
-	beq	1f
-	b	kvm_start_guest
-1:
-#endif
-
-	/* Return SRR1 from power7_nap() */
-	blt	cr3,pnv_wakeup_noloss
-	b	pnv_wakeup_loss
-
-/*
- * Check whether we have woken up with hypervisor state loss.
- * If yes, restore hypervisor state and return back to link.
- *
- * cr3 - set to gt if waking up with partial/complete hypervisor state loss
- */
-pnv_restore_hyp_resource_arch300:
-	/*
-	 * Workaround for POWER9, if we lost resources, the ERAT
-	 * might have been mixed up and needs flushing. We also need
-	 * to reload MMCR0 (see comment above). We also need to set
-	 * then clear bit 60 in MMCRA to ensure the PMU starts running.
-	 */
-	blt	cr3,1f
-BEGIN_FTR_SECTION
-	PPC_INVALIDATE_ERAT
-	ld	r1,PACAR1(r13)
-	ld	r4,_MMCR0(r1)
-	mtspr	SPRN_MMCR0,r4
-END_FTR_SECTION_IFCLR(CPU_FTR_POWER9_DD2_1)
-	mfspr	r4,SPRN_MMCRA
-	ori	r4,r4,(1 << (63-60))
-	mtspr	SPRN_MMCRA,r4
-	xori	r4,r4,(1 << (63-60))
-	mtspr	SPRN_MMCRA,r4
-1:
-	/*
-	 * POWER ISA 3. Use PSSCR to determine if we
-	 * are waking up from deep idle state
-	 */
-	LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state)
-	ld	r4,ADDROFF(pnv_first_deep_stop_state)(r5)
-
-	/*
-	 * 0-3 bits correspond to Power-Saving Level Status
-	 * which indicates the idle state we are waking up from
-	 */
-	mfspr	r5, SPRN_PSSCR
-	rldicl  r5,r5,4,60
-	li	r0, 0		/* clear requested_psscr to say we're awake */
-	std	r0, PACA_REQ_PSSCR(r13)
-	cmpd	cr4,r5,r4
-	bge	cr4,pnv_wakeup_tb_loss /* returns to caller */
-
-	blr	/* Waking up without hypervisor state loss. */
-
-/* Same calling convention as arch300 */
-pnv_restore_hyp_resource_arch207:
-	/*
-	 * POWER ISA 2.07 or less.
-	 * Check if we slept with sleep or winkle.
-	 */
-	lbz	r4,PACA_THREAD_IDLE_STATE(r13)
-	cmpwi	cr2,r4,PNV_THREAD_NAP
-	bgt	cr2,pnv_wakeup_tb_loss	/* Either sleep or Winkle */
-
-	/*
-	 * We fall through here if PACA_THREAD_IDLE_STATE shows we are waking
-	 * up from nap. At this stage CR3 shouldn't contains 'gt' since that
-	 * indicates we are waking with hypervisor state loss from nap.
-	 */
-	bgt	cr3,.
-
-	blr	/* Waking up without hypervisor state loss */
-
-/*
- * Called if waking up from idle state which can cause either partial or
- * complete hyp state loss.
- * In POWER8, called if waking up from fastsleep or winkle
- * In POWER9, called if waking up from stop state >= pnv_first_deep_stop_state
+ * GPRs may be lost, so they are saved here. Wakeup is by interrupt only.
+ * Wakeup interrupt returns to the caller by calling isa3_idle_wake_gpr_loss
+ * with r3 set to desired return value.
  *
- * r13 - PACA
- * cr3 - gt if waking up with partial/complete hypervisor state loss
+ * A wakeup without GPR loss may alteratively be handled as in
+ * isa3_idle_stop_noloss as an optimisation.
  *
- * If ISA300:
- * cr4 - gt or eq if waking up from complete hypervisor state loss.
+ * Caller is responsible for restoring SPRs, MSR, etc.
  *
- * If ISA207:
- * r4 - PACA_THREAD_IDLE_STATE
+ * This must be called in real-mode.
  */
-pnv_wakeup_tb_loss:
-	ld	r1,PACAR1(r13)
-	/*
-	 * Before entering any idle state, the NVGPRs are saved in the stack.
-	 * If there was a state loss, or PACA_NAPSTATELOST was set, then the
-	 * NVGPRs are restored. If we are here, it is likely that state is lost,
-	 * but not guaranteed -- neither ISA207 nor ISA300 tests to reach
-	 * here are the same as the test to restore NVGPRS:
-	 * PACA_THREAD_IDLE_STATE test for ISA207, PSSCR test for ISA300,
-	 * and SRR1 test for restoring NVGPRs.
-	 *
-	 * We are about to clobber NVGPRs now, so set NAPSTATELOST to
-	 * guarantee they will always be restored. This might be tightened
-	 * with careful reading of specs (particularly for ISA300) but this
-	 * is already a slow wakeup path and it's simpler to be safe.
-	 */
-	li	r0,1
-	stb	r0,PACA_NAPSTATELOST(r13)
-
-	/*
-	 *
-	 * Save SRR1 and LR in NVGPRs as they might be clobbered in
-	 * opal_call() (called in CHECK_HMI_INTERRUPT). SRR1 is required
-	 * to determine the wakeup reason if we branch to kvm_start_guest. LR
-	 * is required to return back to reset vector after hypervisor state
-	 * restore is complete.
-	 */
-	mr	r19,r12
-	mr	r18,r4
-	mflr	r17
-BEGIN_FTR_SECTION
-	CHECK_HMI_INTERRUPT
-END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
-
-	ld	r14,PACA_CORE_IDLE_STATE_PTR(r13)
-	lbz	r7,PACA_THREAD_MASK(r13)
-
-	/*
-	 * Take the core lock to synchronize against other threads.
-	 *
-	 * Lock bit is set in one of the 2 cases-
-	 * a. In the sleep/winkle enter path, the last thread is executing
-	 * fastsleep workaround code.
-	 * b. In the wake up path, another thread is executing fastsleep
-	 * workaround undo code or resyncing timebase or restoring context
-	 * In either case loop until the lock bit is cleared.
-	 */
-1:
-	lwarx	r15,0,r14
-	andis.	r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
-	bnel-	core_idle_lock_held
-	oris	r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
-	stwcx.	r15,0,r14
-	bne-	1b
-	isync
-
-	andi.	r9,r15,PNV_CORE_IDLE_THREAD_BITS
-	cmpwi	cr2,r9,0
-
-	/*
-	 * At this stage
-	 * cr2 - eq if first thread to wakeup in core
-	 * cr3-  gt if waking up with partial/complete hypervisor state loss
-	 * ISA300:
-	 * cr4 - gt or eq if waking up from complete hypervisor state loss.
-	 */
-
-BEGIN_FTR_SECTION
-	/*
-	 * Were we in winkle?
-	 * If yes, check if all threads were in winkle, decrement our
-	 * winkle count, set all thread winkle bits if all were in winkle.
-	 * Check if our thread has a winkle bit set, and set cr4 accordingly
-	 * (to match ISA300, above). Pseudo-code for core idle state
-	 * transitions for ISA207 is as follows (everything happens atomically
-	 * due to store conditional and/or lock bit):
-	 *
-	 * nap_idle() { }
-	 * nap_wake() { }
-	 *
-	 * sleep_idle()
-	 * {
-	 *	core_idle_state &= ~thread_in_core
-	 * }
-	 *
-	 * sleep_wake()
-	 * {
-	 *     bool first_in_core, first_in_subcore;
-	 *
-	 *     first_in_core = (core_idle_state & IDLE_THREAD_BITS) == 0;
-	 *     first_in_subcore = (core_idle_state & SUBCORE_SIBLING_MASK) == 0;
-	 *
-	 *     core_idle_state |= thread_in_core;
-	 * }
-	 *
-	 * winkle_idle()
-	 * {
-	 *	core_idle_state &= ~thread_in_core;
-	 *	core_idle_state += 1 << WINKLE_COUNT_SHIFT;
-	 * }
-	 *
-	 * winkle_wake()
-	 * {
-	 *     bool first_in_core, first_in_subcore, winkle_state_lost;
-	 *
-	 *     first_in_core = (core_idle_state & IDLE_THREAD_BITS) == 0;
-	 *     first_in_subcore = (core_idle_state & SUBCORE_SIBLING_MASK) == 0;
-	 *
-	 *     core_idle_state |= thread_in_core;
-	 *
-	 *     if ((core_idle_state & WINKLE_MASK) == (8 << WINKLE_COUNT_SIHFT))
-	 *         core_idle_state |= THREAD_WINKLE_BITS;
-	 *     core_idle_state -= 1 << WINKLE_COUNT_SHIFT;
-	 *
-	 *     winkle_state_lost = core_idle_state &
-	 *				(thread_in_core << WINKLE_THREAD_SHIFT);
-	 *     core_idle_state &= ~(thread_in_core << WINKLE_THREAD_SHIFT);
-	 * }
-	 *
-	 */
-	cmpwi	r18,PNV_THREAD_WINKLE
+_GLOBAL(isa206_idle_insn_mayloss)
+	std	r1,PACAR1(r13)
+	mflr	r4
+	mfcr	r5
+	/* use stack red zone rather than a new frame */
+	addi	r6,r1,-INT_FRAME_SIZE
+	SAVE_GPR(2, r6)
+	SAVE_NVGPRS(r6)
+	std	r4,_LINK(r6)
+	std	r5,_CCR(r6)
+	cmpwi	r3,PNV_THREAD_NAP
+	bne	1f
+	IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)
+1:	cmpwi	r3,PNV_THREAD_SLEEP
 	bne	2f
-	andis.	r9,r15,PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT@h
-	subis	r15,r15,PNV_CORE_IDLE_WINKLE_COUNT@h
-	beq	2f
-	ori	r15,r15,PNV_CORE_IDLE_THREAD_WINKLE_BITS /* all were winkle */
-2:
-	/* Shift thread bit to winkle mask, then test if this thread is set,
-	 * and remove it from the winkle bits */
-	slwi	r8,r7,8
-	and	r8,r8,r15
-	andc	r15,r15,r8
-	cmpwi	cr4,r8,1 /* cr4 will be gt if our bit is set, lt if not */
-
-	lbz	r4,PACA_SUBCORE_SIBLING_MASK(r13)
-	and	r4,r4,r15
-	cmpwi	r4,0	/* Check if first in subcore */
-
-	or	r15,r15,r7		/* Set thread bit */
-	beq	first_thread_in_subcore
-END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
-
-	or	r15,r15,r7		/* Set thread bit */
-	beq	cr2,first_thread_in_core
-
-	/* Not first thread in core or subcore to wake up */
-	b	clear_lock
-
-first_thread_in_subcore:
-	/*
-	 * If waking up from sleep, subcore state is not lost. Hence
-	 * skip subcore state restore
-	 */
-	blt	cr4,subcore_state_restored
-
-	/* Restore per-subcore state */
-	ld      r4,_SDR1(r1)
-	mtspr   SPRN_SDR1,r4
-
-	ld      r4,_RPR(r1)
-	mtspr   SPRN_RPR,r4
-	ld	r4,_AMOR(r1)
-	mtspr	SPRN_AMOR,r4
-
-subcore_state_restored:
-	/*
-	 * Check if the thread is also the first thread in the core. If not,
-	 * skip to clear_lock.
-	 */
-	bne	cr2,clear_lock
-
-first_thread_in_core:
-
-	/*
-	 * First thread in the core waking up from any state which can cause
-	 * partial or complete hypervisor state loss. It needs to
-	 * call the fastsleep workaround code if the platform requires it.
-	 * Call it unconditionally here. The below branch instruction will
-	 * be patched out if the platform does not have fastsleep or does not
-	 * require the workaround. Patching will be performed during the
-	 * discovery of idle-states.
-	 */
-.global pnv_fastsleep_workaround_at_exit
-pnv_fastsleep_workaround_at_exit:
-	b	fastsleep_workaround_at_exit
-
-timebase_resync:
-	/*
-	 * Use cr3 which indicates that we are waking up with atleast partial
-	 * hypervisor state loss to determine if TIMEBASE RESYNC is needed.
-	 */
-	ble	cr3,.Ltb_resynced
-	/* Time base re-sync */
-	bl	opal_resync_timebase;
-	/*
-	 * If waking up from sleep (POWER8), per core state
-	 * is not lost, skip to clear_lock.
-	 */
-.Ltb_resynced:
-	blt	cr4,clear_lock
-
-	/*
-	 * First thread in the core to wake up and its waking up with
-	 * complete hypervisor state loss. Restore per core hypervisor
-	 * state.
-	 */
-BEGIN_FTR_SECTION
-	ld	r4,_PTCR(r1)
-	mtspr	SPRN_PTCR,r4
-	ld	r4,_RPR(r1)
-	mtspr	SPRN_RPR,r4
-	ld	r4,_AMOR(r1)
-	mtspr	SPRN_AMOR,r4
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
-
-	ld	r4,_TSCR(r1)
-	mtspr	SPRN_TSCR,r4
-	ld	r4,_WORC(r1)
-	mtspr	SPRN_WORC,r4
-
-clear_lock:
-	xoris	r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
-	lwsync
-	stw	r15,0(r14)
-
-common_exit:
-	/*
-	 * Common to all threads.
-	 *
-	 * If waking up from sleep, hypervisor state is not lost. Hence
-	 * skip hypervisor state restore.
-	 */
-	blt	cr4,hypervisor_state_restored
-
-	/* Waking up from winkle */
-
-BEGIN_MMU_FTR_SECTION
-	b	no_segments
-END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)
-	/* Restore SLB  from PACA */
-	ld	r8,PACA_SLBSHADOWPTR(r13)
-
-	.rept	SLB_NUM_BOLTED
-	li	r3, SLBSHADOW_SAVEAREA
-	LDX_BE	r5, r8, r3
-	addi	r3, r3, 8
-	LDX_BE	r6, r8, r3
-	andis.	r7,r5,SLB_ESID_V@h
-	beq	1f
-	slbmte	r6,r5
-1:	addi	r8,r8,16
-	.endr
-no_segments:
-
-	/* Restore per thread state */
-
-	ld	r4,_SPURR(r1)
-	mtspr	SPRN_SPURR,r4
-	ld	r4,_PURR(r1)
-	mtspr	SPRN_PURR,r4
-	ld	r4,_DSCR(r1)
-	mtspr	SPRN_DSCR,r4
-	ld	r4,_WORT(r1)
-	mtspr	SPRN_WORT,r4
+	IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
+	b	.	/* catch bugs */
+2:	IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
+	b	.	/* catch bugs */
 
-	/* Call cur_cpu_spec->cpu_restore() */
-	LOAD_REG_ADDR(r4, cur_cpu_spec)
-	ld	r4,0(r4)
-	ld	r12,CPU_SPEC_RESTORE(r4)
-#ifdef PPC64_ELF_ABI_v1
-	ld	r12,0(r12)
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+_GLOBAL(idle_kvm_start_guest)
+	std	r1,PACAR1(r13)
+	mflr	r4
+	mfcr	r5
+	/* use stack red zone rather than a new frame */
+	addi	r6,r1,-INT_FRAME_SIZE
+	SAVE_GPR(2, r6)
+	SAVE_NVGPRS(r6)
+	std	r4,_LINK(r6)
+	std	r5,_CCR(r6)
+	b	kvm_start_guest
 #endif
-	mtctr	r12
-	bctrl
-
-/*
- * On POWER9, we can come here on wakeup from a cpuidle stop state.
- * Hence restore the additional SPRs to the saved value.
- *
- * On POWER8, we come here only on winkle. Since winkle is used
- * only in the case of CPU-Hotplug, we don't need to restore
- * the additional SPRs.
- */
-BEGIN_FTR_SECTION
-	bl 	power9_restore_additional_sprs
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
-hypervisor_state_restored:
-
-	mr	r12,r19
-	mtlr	r17
-	blr		/* return to pnv_powersave_wakeup */
-
-fastsleep_workaround_at_exit:
-	li	r3,1
-	li	r4,0
-	bl	opal_config_cpu_idle_state
-	b	timebase_resync
-
-/*
- * R3 here contains the value that will be returned to the caller
- * of power7_nap.
- * R12 contains SRR1 for CHECK_HMI_INTERRUPT.
- */
-.global pnv_wakeup_loss
-pnv_wakeup_loss:
-	ld	r1,PACAR1(r13)
-BEGIN_FTR_SECTION
-	CHECK_HMI_INTERRUPT
-END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
-	REST_NVGPRS(r1)
-	REST_GPR(2, r1)
-	ld	r4,PACAKMSR(r13)
-	ld	r5,_LINK(r1)
-	ld	r6,_CCR(r1)
-	addi	r1,r1,INT_FRAME_SIZE
-	mtlr	r5
-	mtcr	r6
-	mtmsrd	r4
-	blr
-
-/*
- * R3 here contains the value that will be returned to the caller
- * of power7_nap.
- * R12 contains SRR1 for CHECK_HMI_INTERRUPT.
- */
-pnv_wakeup_noloss:
-	lbz	r0,PACA_NAPSTATELOST(r13)
-	cmpwi	r0,0
-	bne	pnv_wakeup_loss
-	ld	r1,PACAR1(r13)
-BEGIN_FTR_SECTION
-	CHECK_HMI_INTERRUPT
-END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
-	ld	r4,PACAKMSR(r13)
-	ld	r5,_NIP(r1)
-	ld	r6,_CCR(r1)
-	addi	r1,r1,INT_FRAME_SIZE
-	mtlr	r5
-	mtcr	r6
-	mtmsrd	r4
-	blr
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 40b44bb53a4e..e089da156ef3 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -401,8 +401,8 @@ void __init check_for_initrd(void)
 
 #ifdef CONFIG_SMP
 
-int threads_per_core, threads_per_subcore, threads_shift;
-cpumask_t threads_core_mask;
+int threads_per_core, threads_per_subcore, threads_shift __read_mostly;
+cpumask_t threads_core_mask __read_mostly;
 EXPORT_SYMBOL_GPL(threads_per_core);
 EXPORT_SYMBOL_GPL(threads_per_subcore);
 EXPORT_SYMBOL_GPL(threads_shift);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 1d14046124a0..8bcbe5ac2754 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -34,6 +34,7 @@
 #include <asm/thread_info.h>
 #include <asm/asm-compat.h>
 #include <asm/feature-fixups.h>
+#include <asm/cpuidle.h>
 
 /* Sign-extend HDEC if not on POWER9 */
 #define EXTEND_HDEC(reg)			\
@@ -322,43 +323,32 @@ kvm_novcpu_exit:
 	b	kvmhv_switch_to_host
 
 /*
- * We come in here when wakened from nap mode.
- * Relocation is off and most register values are lost.
- * r13 points to the PACA.
+ * We come in here when wakened from Linux offline idle code.
+ * Relocation is off
  * r3 contains the SRR1 wakeup value, SRR1 is trashed.
  */
 	.globl	kvm_start_guest
 kvm_start_guest:
-	/* Set runlatch bit the minute you wake up from nap */
-	mfspr	r0, SPRN_CTRLF
-	ori 	r0, r0, 1
-	mtspr	SPRN_CTRLT, r0
-
 	/*
 	 * Could avoid this and pass it through in r3. For now,
 	 * code expects it to be in SRR1.
 	 */
 	mtspr	SPRN_SRR1,r3
 
-	ld	r2,PACATOC(r13)
-
 	li	r0,0
 	stb	r0,PACA_FTRACE_ENABLED(r13)
 
 	li	r0,KVM_HWTHREAD_IN_KVM
 	stb	r0,HSTATE_HWTHREAD_STATE(r13)
 
-	/* NV GPR values from power7_idle() will no longer be valid */
-	li	r0,1
-	stb	r0,PACA_NAPSTATELOST(r13)
-
-	/* were we napping due to cede? */
+	/* cede napping should not come through here */
 	lbz	r0,HSTATE_NAPPING(r13)
-	cmpwi	r0,NAPPING_CEDE
-	beq	kvm_end_cede
-	cmpwi	r0,NAPPING_NOVCPU
-	beq	kvm_novcpu_wakeup
+	twnei	r0,0
 
+/*
+ * We can come in at this point from KVM nap.
+ */
+do_start_guest:
 	ld	r1,PACAEMERGSP(r13)
 	subi	r1,r1,STACK_FRAME_OVERHEAD
 
@@ -469,19 +459,17 @@ kvm_no_guest:
 	lbz	r3, HSTATE_HWTHREAD_REQ(r13)
 	cmpwi	r3, 0
 	bne	54f
-/*
- * We jump to pnv_wakeup_loss, which will return to the caller
- * of power7_nap in the powernv cpu offline loop.  The value we
- * put in r3 becomes the return value for power7_nap. pnv_wakeup_loss
- * requires SRR1 in r12.
- */
+
+	/*
+	 * Jump to idle_return_gpr_loss, which returns to the
+	 * idle_kvm_start_guest caller.
+	 */
 	li	r3, LPCR_PECE0
 	mfspr	r4, SPRN_LPCR
 	rlwimi	r4, r3, 0, LPCR_PECE0 | LPCR_PECE1
 	mtspr	SPRN_LPCR, r4
 	li	r3, 0
-	mfspr	r12,SPRN_SRR1
-	b	pnv_wakeup_loss
+	b	idle_return_gpr_loss
 
 53:	HMT_LOW
 	ld	r5, HSTATE_KVM_VCORE(r13)
@@ -2760,21 +2748,47 @@ BEGIN_FTR_SECTION
 	li	r4, LPCR_PECE_HVEE@higher
 	sldi	r4, r4, 32
 	or	r5, r5, r4
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
+FTR_SECTION_ELSE
+	li	r3, PNV_THREAD_NAP
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
 	mtspr	SPRN_LPCR,r5
 	isync
-	li	r0, 0
-	std	r0, HSTATE_SCRATCH0(r13)
-	ptesync
-	ld	r0, HSTATE_SCRATCH0(r13)
-1:	cmpd	r0, r0
-	bne	1b
+
+	mr	r0, r1
+	ld	r1, PACAEMERGSP(r13)
+	subi	r1, r1, STACK_FRAME_OVERHEAD
+	std	r0, 0(r1)
+	ld	r0, PACAR1(r13)
+	std	r0, 8(r1)
+
 BEGIN_FTR_SECTION
-	nap
+	bl	isa3_idle_stop_mayloss
 FTR_SECTION_ELSE
-	PPC_STOP
-ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
-	b	.
+	bl	isa206_idle_insn_mayloss
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
+
+	mfspr	r0, SPRN_CTRLF
+	ori	r0, r0, 1
+	mtspr	SPRN_CTRLT, r0
+
+	ld	r0, 8(r1)
+	std	r0, PACAR1(r13)
+	ld	r1, 0(r1)
+
+	mtspr	SPRN_SRR1, r3
+
+	li	r0, 0
+	stb	r0, PACA_FTRACE_ENABLED(r13)
+
+	li	r0, KVM_HWTHREAD_IN_KVM
+	stb	r0, HSTATE_HWTHREAD_STATE(r13)
+
+	lbz	r0, HSTATE_NAPPING(r13)
+	cmpwi	r0, NAPPING_CEDE
+	beq	kvm_end_cede
+	cmpwi	r0, NAPPING_NOVCPU
+	beq	kvm_novcpu_wakeup
+	b	do_start_guest
 
 33:	mr	r4, r3
 	li	r3, 0
diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index ecb002c5db83..446f8cc6070f 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -16,6 +16,7 @@
 #include <linux/device.h>
 #include <linux/cpu.h>
 
+#include <asm/asm-prototypes.h>
 #include <asm/firmware.h>
 #include <asm/machdep.h>
 #include <asm/opal.h>
@@ -48,10 +49,10 @@ static u64 pnv_default_stop_mask;
 static bool default_stop_found;
 
 /*
- * First deep stop state. Used to figure out when to save/restore
- * hypervisor context.
+ * First stop state levels when HV and TB loss can occur.
  */
-u64 pnv_first_deep_stop_state = MAX_STOP_STATE;
+static u64 pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
+static u64 pnv_first_hv_loss_level = MAX_STOP_STATE + 1;
 
 /*
  * psscr value and mask of the deepest stop idle state.
@@ -62,6 +63,8 @@ static u64 pnv_deepest_stop_psscr_mask;
 static u64 pnv_deepest_stop_flag;
 static bool deepest_stop_found;
 
+static unsigned long power7_offline_type;
+
 static int pnv_save_sprs_for_deep_states(void)
 {
 	int cpu;
@@ -72,12 +75,12 @@ static int pnv_save_sprs_for_deep_states(void)
 	 * all cpus at boot. Get these reg values of current cpu and use the
 	 * same across all cpus.
 	 */
-	uint64_t lpcr_val = mfspr(SPRN_LPCR);
-	uint64_t hid0_val = mfspr(SPRN_HID0);
-	uint64_t hid1_val = mfspr(SPRN_HID1);
-	uint64_t hid4_val = mfspr(SPRN_HID4);
-	uint64_t hid5_val = mfspr(SPRN_HID5);
-	uint64_t hmeer_val = mfspr(SPRN_HMEER);
+	uint64_t lpcr_val	= mfspr(SPRN_LPCR);
+	uint64_t hid0_val	= mfspr(SPRN_HID0);
+	uint64_t hid1_val	= mfspr(SPRN_HID1);
+	uint64_t hid4_val	= mfspr(SPRN_HID4);
+	uint64_t hid5_val	= mfspr(SPRN_HID5);
+	uint64_t hmeer_val	= mfspr(SPRN_HMEER);
 	uint64_t msr_val = MSR_IDLE;
 	uint64_t psscr_val = pnv_deepest_stop_psscr_val;
 
@@ -137,89 +140,6 @@ static int pnv_save_sprs_for_deep_states(void)
 	return 0;
 }
 
-static void pnv_alloc_idle_core_states(void)
-{
-	int i, j;
-	int nr_cores = cpu_nr_cores();
-	u32 *core_idle_state;
-
-	/*
-	 * core_idle_state - The lower 8 bits track the idle state of
-	 * each thread of the core.
-	 *
-	 * The most significant bit is the lock bit.
-	 *
-	 * Initially all the bits corresponding to threads_per_core
-	 * are set. They are cleared when the thread enters deep idle
-	 * state like sleep and winkle/stop.
-	 *
-	 * Initially the lock bit is cleared.  The lock bit has 2
-	 * purposes:
-	 * 	a. While the first thread in the core waking up from
-	 * 	   idle is restoring core state, it prevents other
-	 * 	   threads in the core from switching to process
-	 * 	   context.
-	 * 	b. While the last thread in the core is saving the
-	 *	   core state, it prevents a different thread from
-	 *	   waking up.
-	 */
-	for (i = 0; i < nr_cores; i++) {
-		int first_cpu = i * threads_per_core;
-		int node = cpu_to_node(first_cpu);
-		size_t paca_ptr_array_size;
-
-		core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL, node);
-		*core_idle_state = (1 << threads_per_core) - 1;
-		paca_ptr_array_size = (threads_per_core *
-				       sizeof(struct paca_struct *));
-
-		for (j = 0; j < threads_per_core; j++) {
-			int cpu = first_cpu + j;
-
-			paca_ptrs[cpu]->core_idle_state_ptr = core_idle_state;
-			paca_ptrs[cpu]->thread_idle_state = PNV_THREAD_RUNNING;
-			paca_ptrs[cpu]->thread_mask = 1 << j;
-		}
-	}
-
-	update_subcore_sibling_mask();
-
-	if (supported_cpuidle_states & OPAL_PM_LOSE_FULL_CONTEXT) {
-		int rc = pnv_save_sprs_for_deep_states();
-
-		if (likely(!rc))
-			return;
-
-		/*
-		 * The stop-api is unable to restore hypervisor
-		 * resources on wakeup from platform idle states which
-		 * lose full context. So disable such states.
-		 */
-		supported_cpuidle_states &= ~OPAL_PM_LOSE_FULL_CONTEXT;
-		pr_warn("cpuidle-powernv: Disabling idle states that lose full context\n");
-		pr_warn("cpuidle-powernv: Idle power-savings, CPU-Hotplug affected\n");
-
-		if (cpu_has_feature(CPU_FTR_ARCH_300) &&
-		    (pnv_deepest_stop_flag & OPAL_PM_LOSE_FULL_CONTEXT)) {
-			/*
-			 * Use the default stop state for CPU-Hotplug
-			 * if available.
-			 */
-			if (default_stop_found) {
-				pnv_deepest_stop_psscr_val =
-					pnv_default_stop_val;
-				pnv_deepest_stop_psscr_mask =
-					pnv_default_stop_mask;
-				pr_warn("cpuidle-powernv: Offlined CPUs will stop with psscr = 0x%016llx\n",
-					pnv_deepest_stop_psscr_val);
-			} else { /* Fallback to snooze loop for CPU-Hotplug */
-				deepest_stop_found = false;
-				pr_warn("cpuidle-powernv: Offlined CPUs will busy wait\n");
-			}
-		}
-	}
-}
-
 u32 pnv_get_supported_cpuidle_states(void)
 {
 	return supported_cpuidle_states;
@@ -238,6 +158,9 @@ static void pnv_fastsleep_workaround_apply(void *info)
 		*err = 1;
 }
 
+static bool power7_fastsleep_workaround_entry = true;
+static bool power7_fastsleep_workaround_exit = true;
+
 /*
  * Used to store fastsleep workaround state
  * 0 - Workaround applied/undone at fastsleep entry/exit path (Default)
@@ -277,13 +200,7 @@ static ssize_t store_fastsleep_workaround_applyonce(struct device *dev,
 	 * offlined, as last thread of the core entering fastsleep or deeper
 	 * state would have applied workaround.
 	 */
-	err = patch_instruction(
-		(unsigned int *)pnv_fastsleep_workaround_at_exit,
-		PPC_INST_NOP);
-	if (err) {
-		pr_err("fastsleep_workaround_applyonce change failed while patching pnv_fastsleep_workaround_at_exit");
-		goto fail;
-	}
+	power7_fastsleep_workaround_exit = false;
 
 	get_online_cpus();
 	primary_thread_mask = cpu_online_cores_map();
@@ -296,13 +213,7 @@ static ssize_t store_fastsleep_workaround_applyonce(struct device *dev,
 		goto fail;
 	}
 
-	err = patch_instruction(
-		(unsigned int *)pnv_fastsleep_workaround_at_entry,
-		PPC_INST_NOP);
-	if (err) {
-		pr_err("fastsleep_workaround_applyonce change failed while patching pnv_fastsleep_workaround_at_entry");
-		goto fail;
-	}
+	power7_fastsleep_workaround_entry = false;
 
 	fastsleep_workaround_applyonce = 1;
 
@@ -315,6 +226,303 @@ static DEVICE_ATTR(fastsleep_workaround_applyonce, 0600,
 			show_fastsleep_workaround_applyonce,
 			store_fastsleep_workaround_applyonce);
 
+static inline void atomic_start_thread_idle(void)
+{
+	int cpu = raw_smp_processor_id();
+	int first = cpu_first_thread_sibling(cpu);
+	int thread_nr = cpu_thread_in_core(cpu);
+	unsigned long *state = &paca_ptrs[first]->idle_state;
+
+	clear_bit(thread_nr, state);
+}
+
+static inline void atomic_stop_thread_idle(void)
+{
+	int cpu = raw_smp_processor_id();
+	int first = cpu_first_thread_sibling(cpu);
+	int thread_nr = cpu_thread_in_core(cpu);
+	unsigned long *state = &paca_ptrs[first]->idle_state;
+
+	set_bit(thread_nr, state);
+}
+
+static inline void atomic_lock_thread_idle(void)
+{
+	int cpu = raw_smp_processor_id();
+	int first = cpu_first_thread_sibling(cpu);
+	unsigned long *state = &paca_ptrs[first]->idle_state;
+
+	while (unlikely(test_and_set_bit_lock(NR_PNV_CORE_IDLE_LOCK_BIT, state)))
+		barrier();
+	isync();
+}
+
+static inline void atomic_unlock_and_stop_thread_idle(void)
+{
+	int cpu = raw_smp_processor_id();
+	int first = cpu_first_thread_sibling(cpu);
+	unsigned long thread = 1UL << cpu_thread_in_core(cpu);
+	unsigned long *state = &paca_ptrs[first]->idle_state;
+	u64 s = READ_ONCE(*state);
+	u64 new, tmp;
+
+	isync();
+
+	BUG_ON(!(s & PNV_CORE_IDLE_LOCK_BIT));
+	BUG_ON(s & thread);
+
+again:
+	new = (s | thread) & ~PNV_CORE_IDLE_LOCK_BIT;
+	tmp = cmpxchg(state, s, new);
+	if (unlikely(tmp != s)) {
+		s = tmp;
+		goto again;
+	}
+}
+
+static inline void atomic_unlock_thread_idle(void)
+{
+	int cpu = raw_smp_processor_id();
+	int first = cpu_first_thread_sibling(cpu);
+	unsigned long *state = &paca_ptrs[first]->idle_state;
+
+	BUG_ON(!test_bit(NR_PNV_CORE_IDLE_LOCK_BIT, state));
+	clear_bit_unlock(NR_PNV_CORE_IDLE_LOCK_BIT, state);
+}
+
+/* P7 and P8 */
+struct p7_sprs {
+	/* per core */
+	u64 tscr;
+	u64 worc;
+
+	/* per subcore */
+	u64 sdr1;
+	u64 rpr;
+	u64 amor;
+
+	/* per thread */
+	u64 lpcr;
+	u64 hfscr;
+	u64 fscr;
+	u64 purr;
+	u64 spurr;
+	u64 dscr;
+	u64 wort;
+};
+
+static unsigned long power7_idle_insn(unsigned long type)
+{
+	int cpu = raw_smp_processor_id();
+	int first = cpu_first_thread_sibling(cpu);
+	unsigned long thread = 1UL << cpu_thread_in_core(cpu);
+	unsigned long *state = &paca_ptrs[first]->idle_state;
+	unsigned long srr1;
+	bool full_winkle;
+	struct p7_sprs sprs;
+	bool sprs_saved = false;
+	int rc;
+
+	memset(&sprs, 0, sizeof(sprs));
+
+	if (unlikely(type != PNV_THREAD_NAP)) {
+		atomic_lock_thread_idle();
+
+		BUG_ON(!(*state & thread));
+		*state &= ~thread;
+
+		if (power7_fastsleep_workaround_entry) {
+			if ((*state & ((1 << threads_per_core) - 1)) == 0) {
+				rc = opal_config_cpu_idle_state(
+						OPAL_CONFIG_IDLE_FASTSLEEP,
+						OPAL_CONFIG_IDLE_APPLY);
+				BUG_ON(rc);
+			}
+		}
+
+		if (type == PNV_THREAD_WINKLE) {
+			sprs.tscr	= mfspr(SPRN_TSCR);
+			sprs.worc	= mfspr(SPRN_WORC);
+
+			sprs.sdr1	= mfspr(SPRN_SDR1);
+			sprs.rpr	= mfspr(SPRN_RPR);
+			sprs.amor	= mfspr(SPRN_AMOR);
+
+			sprs.lpcr	= mfspr(SPRN_LPCR);
+			if (cpu_has_feature(CPU_FTR_ARCH_207S)) {
+				sprs.hfscr	= mfspr(SPRN_HFSCR);
+				sprs.fscr	= mfspr(SPRN_FSCR);
+			}
+			sprs.purr	= mfspr(SPRN_PURR);
+			sprs.spurr	= mfspr(SPRN_SPURR);
+			sprs.dscr	= mfspr(SPRN_DSCR);
+			sprs.wort	= mfspr(SPRN_WORT);
+
+			sprs_saved = true;
+
+			/*
+			 * Increment winkle counter and set all winkle bits if
+			 * all threads are winkling. This allows wakeup side to
+			 * distinguish between fast sleep and winkle state
+			 * loss. Fast sleep still has to resync the timebase so
+			 * this may not be a really big win.
+			 */
+			*state += 1 << PNV_CORE_IDLE_WINKLE_COUNT_SHIFT;
+			if ((*state & PNV_CORE_IDLE_WINKLE_COUNT_BITS) >> PNV_CORE_IDLE_WINKLE_COUNT_SHIFT == threads_per_core)
+				*state |= PNV_CORE_IDLE_THREAD_WINKLE_BITS;
+			WARN_ON((*state & PNV_CORE_IDLE_WINKLE_COUNT_BITS) == 0);
+		}
+
+		atomic_unlock_thread_idle();
+	}
+
+	srr1 = isa206_idle_insn_mayloss(type);
+
+	WARN_ON_ONCE(!srr1);
+	WARN_ON_ONCE(mfmsr() & (MSR_IR|MSR_DR));
+
+	if (unlikely((srr1 & SRR1_WAKEMASK_P8) == SRR1_WAKEHMI))
+		hmi_exception_realmode(NULL);
+
+	if (likely((srr1 & SRR1_WAKESTATE) != SRR1_WS_HVLOSS)) {
+		if (unlikely(type != PNV_THREAD_NAP)) {
+			atomic_lock_thread_idle();
+			if (type == PNV_THREAD_WINKLE) {
+				WARN_ON((*state & PNV_CORE_IDLE_WINKLE_COUNT_BITS) == 0);
+				*state -= 1 << PNV_CORE_IDLE_WINKLE_COUNT_SHIFT;
+				*state &= ~(thread << PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT);
+			}
+			atomic_unlock_and_stop_thread_idle();
+		}
+		return srr1;
+	}
+
+	/* HV state loss */
+	BUG_ON(type == PNV_THREAD_NAP);
+
+	atomic_lock_thread_idle();
+
+	full_winkle = false;
+	if (type == PNV_THREAD_WINKLE) {
+		WARN_ON((*state & PNV_CORE_IDLE_WINKLE_COUNT_BITS) == 0);
+		*state -= 1 << PNV_CORE_IDLE_WINKLE_COUNT_SHIFT;
+		if (*state & (thread << PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT)) {
+			*state &= ~(thread << PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT);
+			full_winkle = true;
+			BUG_ON(!sprs_saved);
+		}
+	}
+
+	WARN_ON(*state & thread);
+
+	if ((*state & ((1 << threads_per_core) - 1)) != 0)
+		goto core_woken;
+
+	/* Per-core SPRs */
+	if (full_winkle) {
+		mtspr(SPRN_TSCR,	sprs.tscr);
+		mtspr(SPRN_WORC,	sprs.worc);
+	}
+
+	if (power7_fastsleep_workaround_exit) {
+		rc = opal_config_cpu_idle_state(OPAL_CONFIG_IDLE_FASTSLEEP,
+						OPAL_CONFIG_IDLE_UNDO);
+		BUG_ON(rc);
+	}
+
+	/* TB */
+	if (opal_resync_timebase() != OPAL_SUCCESS)
+		BUG();
+
+core_woken:
+	if (!full_winkle)
+		goto subcore_woken;
+
+	if ((*state & local_paca->subcore_sibling_mask) != 0)
+		goto subcore_woken;
+
+	/* Per-subcore SPRs */
+	mtspr(SPRN_SDR1,	sprs.sdr1);
+	mtspr(SPRN_RPR,		sprs.rpr);
+	mtspr(SPRN_AMOR,	sprs.amor);
+
+subcore_woken:
+	atomic_unlock_and_stop_thread_idle();
+
+	/* Fast sleep does not lose SPRs */
+	if (!full_winkle)
+		return srr1;
+
+	/* Per-thread SPRs */
+	mtspr(SPRN_LPCR,	sprs.lpcr);
+	if (cpu_has_feature(CPU_FTR_ARCH_207S)) {
+		mtspr(SPRN_HFSCR,	sprs.hfscr);
+		mtspr(SPRN_FSCR,	sprs.fscr);
+	}
+	mtspr(SPRN_PURR,	sprs.purr);
+	mtspr(SPRN_SPURR,	sprs.spurr);
+	mtspr(SPRN_DSCR,	sprs.dscr);
+	mtspr(SPRN_WORT,	sprs.wort);
+
+	mtspr(SPRN_SPRG3,	local_paca->sprg_vdso);
+
+	/*
+	 * The SLB has to be restored here, but it sometimes still
+	 * contain entries, so the __ variant must be used to prevent
+	 * multi hits.
+	 */
+	__slb_restore_bolted_realmode();
+
+	return srr1;
+}
+
+extern unsigned long idle_kvm_start_guest(unsigned long srr1);
+
+static unsigned long power7_offline(void)
+{
+	unsigned long srr1;
+
+	mtmsr(MSR_IDLE);
+
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+	/* Tell KVM we're entering idle. */
+	/******************************************************/
+	/*  N O T E   W E L L    ! ! !    N O T E   W E L L   */
+	/* The following store to HSTATE_HWTHREAD_STATE(r13)  */
+	/* MUST occur in real mode, i.e. with the MMU off,    */
+	/* and the MMU must stay off until we clear this flag */
+	/* and test HSTATE_HWTHREAD_REQ(r13) in               */
+	/* pnv_powersave_wakeup in this file.                 */
+	/* The reason is that another thread can switch the   */
+	/* MMU to a guest context whenever this flag is set   */
+	/* to KVM_HWTHREAD_IN_IDLE, and if the MMU was on,    */
+	/* that would potentially cause this thread to start  */
+	/* executing instructions from guest memory in        */
+	/* hypervisor mode, leading to a host crash or data   */
+	/* corruption, or worse.                              */
+	/******************************************************/
+	local_paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_IDLE;
+#endif
+
+	__ppc64_runlatch_off();
+	srr1 = power7_idle_insn(power7_offline_type);
+	__ppc64_runlatch_on();
+
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+	if (local_paca->kvm_hstate.hwthread_state != KVM_HWTHREAD_IN_KERNEL) {
+		local_paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_KERNEL;
+		/* Order setting hwthread_state vs. testing hwthread_req */
+		smp_mb();
+	}
+	if (local_paca->kvm_hstate.hwthread_req)
+		srr1 = idle_kvm_start_guest(srr1);
+#endif
+
+	mtmsr(MSR_KERNEL);
+
+	return srr1;
+}
+
 static unsigned long __power7_idle_type(unsigned long type)
 {
 	unsigned long srr1;
@@ -322,9 +530,11 @@ static unsigned long __power7_idle_type(unsigned long type)
 	if (!prep_irq_for_idle_irqsoff())
 		return 0;
 
+	mtmsr(MSR_IDLE);
 	__ppc64_runlatch_off();
 	srr1 = power7_idle_insn(type);
 	__ppc64_runlatch_on();
+	mtmsr(MSR_KERNEL);
 
 	fini_irq_for_idle_irqsoff();
 
@@ -347,6 +557,244 @@ void power7_idle(void)
 	power7_idle_type(PNV_THREAD_NAP);
 }
 
+struct p9_sprs {
+	/* per core */
+	u64 ptcr;
+	u64 rpr;
+	u64 tscr;
+	u64 ldbar;
+	u64 amor;
+
+	/* per thread */
+	u64 lpcr;
+	u64 hfscr;
+	u64 fscr;
+	u64 pid;
+	u64 purr;
+	u64 spurr;
+	u64 dscr;
+	u64 wort;
+
+	u64 mmcra;
+	u32 mmcr0;
+	u32 mmcr1;
+	u64 mmcr2;
+};
+
+static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
+{
+	int cpu = raw_smp_processor_id();
+	int first = cpu_first_thread_sibling(cpu);
+	unsigned long *state = &paca_ptrs[first]->idle_state;
+	unsigned long srr1;
+	unsigned long mmcr0 = 0;
+	struct p9_sprs sprs;
+	bool sprs_saved = false;
+
+	memset(&sprs, 0, sizeof(sprs));
+
+	if (!(psscr & (PSSCR_EC|PSSCR_ESL))) {
+		BUG_ON(!mmu_on);
+
+		/*
+		 * Wake synchronously. SRESET via xscom may still cause
+		 * a 0x100 powersave wakeup with SRR1 reason!
+		 */
+		srr1 = isa3_idle_stop_noloss(psscr);
+		if (likely(!srr1))
+			return 0;
+
+		/*
+		 * Registers not saved, can't recover!
+		 * This would be a hardware bug
+		 */
+		BUG_ON((srr1 & SRR1_WAKESTATE) != SRR1_WS_NOLOSS);
+
+		goto out;
+	}
+
+	/* EC=ESL=1 case */
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+	if (cpu_has_feature(CPU_FTR_P9_TM_XER_SO_BUG)) {
+		local_paca->requested_psscr = psscr;
+		/* order setting requested_psscr vs testing dont_stop */
+		smp_mb();
+		if (atomic_read(&local_paca->dont_stop)) {
+			local_paca->requested_psscr = 0;
+			return 0;
+		}
+	}
+#endif
+
+	if (!cpu_has_feature(CPU_FTR_POWER9_DD2_1)) {
+		 /*
+		  * POWER9 DD2 can incorrectly set PMAO when waking up
+		  * after a state-loss idle. Saving and restoring MMCR0
+		  * over idle is a workaround.
+		  */
+		mmcr0		= mfspr(SPRN_MMCR0);
+	}
+	if ((psscr & PSSCR_RL_MASK) >= pnv_first_hv_loss_level) {
+		sprs.lpcr	= mfspr(SPRN_LPCR);
+		sprs.hfscr	= mfspr(SPRN_HFSCR);
+		sprs.fscr	= mfspr(SPRN_FSCR);
+		sprs.pid	= mfspr(SPRN_PID);
+		sprs.purr	= mfspr(SPRN_PURR);
+		sprs.spurr	= mfspr(SPRN_SPURR);
+		sprs.dscr	= mfspr(SPRN_DSCR);
+		sprs.wort	= mfspr(SPRN_WORT);
+
+		sprs.mmcra	= mfspr(SPRN_MMCRA);
+		sprs.mmcr0	= mfspr(SPRN_MMCR0);
+		sprs.mmcr1	= mfspr(SPRN_MMCR1);
+		sprs.mmcr2	= mfspr(SPRN_MMCR2);
+
+		sprs.ptcr	= mfspr(SPRN_PTCR);
+		sprs.rpr	= mfspr(SPRN_RPR);
+		sprs.tscr	= mfspr(SPRN_TSCR);
+		sprs.ldbar	= mfspr(SPRN_LDBAR);
+		sprs.amor	= mfspr(SPRN_AMOR);
+
+		sprs_saved = true;
+
+		atomic_start_thread_idle();
+	}
+
+	srr1 = isa3_idle_stop_mayloss(psscr);
+
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+	local_paca->requested_psscr = 0;
+#endif
+
+	psscr = mfspr(SPRN_PSSCR);
+
+	WARN_ON_ONCE(!srr1);
+	WARN_ON_ONCE(mfmsr() & (MSR_IR|MSR_DR));
+
+	if ((srr1 & SRR1_WAKESTATE) != SRR1_WS_NOLOSS) {
+		unsigned long mmcra;
+
+		/*
+		 * Workaround for POWER9 DD2.0, if we lost resources, the ERAT
+		 * might have been corrupted and needs flushing. We also need
+		 * to reload MMCR0 (see mmcr0 comment above).
+		 */
+		if (!cpu_has_feature(CPU_FTR_POWER9_DD2_1)) {
+			asm volatile(PPC_INVALIDATE_ERAT);
+			mtspr(SPRN_MMCR0, mmcr0);
+		}
+
+		/*
+		 * DD2.2 and earlier need to set then clear bit 60 in MMCRA
+		 * to ensure the PMU starts running.
+		 */
+		mmcra = mfspr(SPRN_MMCRA);
+		mmcra |= PPC_BIT(60);
+		mtspr(SPRN_MMCRA, mmcra);
+		mmcra &= ~PPC_BIT(60);
+		mtspr(SPRN_MMCRA, mmcra);
+	}
+
+	if (unlikely((srr1 & SRR1_WAKEMASK_P8) == SRR1_WAKEHMI))
+		hmi_exception_realmode(NULL);
+
+	/*
+	 * On POWER9, SRR1 bits do not match exactly as expected.
+	 * SRR1_WS_GPRLOSS (10b) can also result in SPR loss, so
+	 * always test PSSCR if there is any state loss.
+	 */
+	if (likely((psscr & PSSCR_RL_MASK) < pnv_first_hv_loss_level)) {
+		if (sprs_saved)
+			atomic_stop_thread_idle();
+		goto out;
+	}
+
+	/* HV state loss */
+	BUG_ON(!sprs_saved);
+
+	atomic_lock_thread_idle();
+
+	if ((*state & ((1 << threads_per_core) - 1)) != 0)
+		goto core_woken;
+
+	/* Per-core SPRs */
+	mtspr(SPRN_PTCR,	sprs.ptcr);
+	mtspr(SPRN_RPR,		sprs.rpr);
+	mtspr(SPRN_TSCR,	sprs.tscr);
+	mtspr(SPRN_LDBAR,	sprs.ldbar);
+	mtspr(SPRN_AMOR,	sprs.amor);
+
+	if ((psscr & PSSCR_RL_MASK) >= pnv_first_tb_loss_level) {
+		/* TB loss */
+		if (opal_resync_timebase() != OPAL_SUCCESS)
+			BUG();
+	}
+
+core_woken:
+	atomic_unlock_and_stop_thread_idle();
+
+	/* Per-thread SPRs */
+	mtspr(SPRN_LPCR,	sprs.lpcr);
+	mtspr(SPRN_HFSCR,	sprs.hfscr);
+	mtspr(SPRN_FSCR,	sprs.fscr);
+	mtspr(SPRN_PID,		sprs.pid);
+	mtspr(SPRN_PURR,	sprs.purr);
+	mtspr(SPRN_SPURR,	sprs.spurr);
+	mtspr(SPRN_DSCR,	sprs.dscr);
+	mtspr(SPRN_WORT,	sprs.wort);
+
+	mtspr(SPRN_MMCRA,	sprs.mmcra);
+	mtspr(SPRN_MMCR0,	sprs.mmcr0);
+	mtspr(SPRN_MMCR1,	sprs.mmcr1);
+	mtspr(SPRN_MMCR2,	sprs.mmcr2);
+
+	mtspr(SPRN_SPRG3,	local_paca->sprg_vdso);
+
+	if (!radix_enabled())
+		__slb_restore_bolted_realmode();
+
+out:
+	if (mmu_on)
+		mtmsr(MSR_KERNEL);
+
+	return srr1;
+}
+
+static unsigned long power9_offline_stop(unsigned long psscr)
+{
+	unsigned long srr1;
+
+#ifndef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+	__ppc64_runlatch_off();
+	srr1 = power9_idle_stop(psscr, true);
+	__ppc64_runlatch_on();
+#else
+	/*
+	 * Tell KVM we're entering idle.
+	 * This does not have to be done in real mode because the P9 MMU
+	 * is independent per-thread. Some steppings share radix/hash mode
+	 * between threads, but in that case KVM has a barrier sync in real
+	 * mode before and after switching between radix and hash.
+	 */
+	local_paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_IDLE;
+
+	__ppc64_runlatch_off();
+	srr1 = power9_idle_stop(psscr, false);
+	__ppc64_runlatch_on();
+
+	if (local_paca->kvm_hstate.hwthread_state != KVM_HWTHREAD_IN_KERNEL) {
+		local_paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_KERNEL;
+		/* Order setting hwthread_state vs. testing hwthread_req */
+		smp_mb();
+	}
+	if (local_paca->kvm_hstate.hwthread_req)
+		srr1 = idle_kvm_start_guest(srr1);
+	mtmsr(MSR_KERNEL);
+#endif
+
+	return srr1;
+}
+
 static unsigned long __power9_idle_type(unsigned long stop_psscr_val,
 				      unsigned long stop_psscr_mask)
 {
@@ -360,7 +808,7 @@ static unsigned long __power9_idle_type(unsigned long stop_psscr_val,
 	psscr = (psscr & ~stop_psscr_mask) | stop_psscr_val;
 
 	__ppc64_runlatch_off();
-	srr1 = power9_idle_stop(psscr);
+	srr1 = power9_idle_stop(psscr, true);
 	__ppc64_runlatch_on();
 
 	fini_irq_for_idle_irqsoff();
@@ -409,7 +857,7 @@ void pnv_power9_force_smt4_catch(void)
 			atomic_inc(&paca_ptrs[cpu0+thr]->dont_stop);
 	}
 	/* order setting dont_stop vs testing requested_psscr */
-	mb();
+	smp_mb();
 	for (thr = 0; thr < threads_per_core; ++thr) {
 		if (!paca_ptrs[cpu0+thr]->requested_psscr)
 			++awake_threads;
@@ -480,7 +928,6 @@ static void pnv_program_cpu_hotplug_lpcr(unsigned int cpu, u64 lpcr_val)
 unsigned long pnv_cpu_offline(unsigned int cpu)
 {
 	unsigned long srr1;
-	u32 idle_states = pnv_get_supported_cpuidle_states();
 	u64 lpcr_val;
 
 	/*
@@ -505,15 +952,8 @@ unsigned long pnv_cpu_offline(unsigned int cpu)
 		psscr = (psscr & ~pnv_deepest_stop_psscr_mask) |
 						pnv_deepest_stop_psscr_val;
 		srr1 = power9_offline_stop(psscr);
-
-	} else if ((idle_states & OPAL_PM_WINKLE_ENABLED) &&
-		   (idle_states & OPAL_PM_LOSE_FULL_CONTEXT)) {
-		srr1 = power7_idle_insn(PNV_THREAD_WINKLE);
-	} else if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
-		   (idle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
-		srr1 = power7_idle_insn(PNV_THREAD_SLEEP);
-	} else if (idle_states & OPAL_PM_NAP_ENABLED) {
-		srr1 = power7_idle_insn(PNV_THREAD_NAP);
+	} else if (cpu_has_feature(CPU_FTR_ARCH_206) && power7_offline_type) {
+		srr1 = power7_offline();
 	} else {
 		/* This is the fallback method. We emulate snooze */
 		while (!generic_check_cpu_restart(cpu)) {
@@ -619,33 +1059,32 @@ int validate_psscr_val_mask(u64 *psscr_val, u64 *psscr_mask, u32 flags)
  * @dt_idle_states: Number of idle state entries
  * Returns 0 on success
  */
-static int __init pnv_power9_idle_init(void)
+static void __init pnv_power9_idle_init(void)
 {
 	u64 max_residency_ns = 0;
 	int i;
 
 	/*
-	 * Set pnv_first_deep_stop_state, pnv_deepest_stop_psscr_{val,mask},
-	 * and the pnv_default_stop_{val,mask}.
-	 *
-	 * pnv_first_deep_stop_state should be set to the first stop
-	 * level to cause hypervisor state loss.
-	 *
 	 * pnv_deepest_stop_{val,mask} should be set to values corresponding to
 	 * the deepest stop state.
 	 *
 	 * pnv_default_stop_{val,mask} should be set to values corresponding to
-	 * the shallowest (OPAL_PM_STOP_INST_FAST) loss-less stop state.
+	 * the deepest loss-less (OPAL_PM_STOP_INST_FAST) stop state.
 	 */
-	pnv_first_deep_stop_state = MAX_STOP_STATE;
+	pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
+	pnv_first_hv_loss_level = MAX_STOP_STATE + 1;
 	for (i = 0; i < nr_pnv_idle_states; i++) {
 		int err;
 		struct pnv_idle_states_t *state = &pnv_idle_states[i];
 		u64 psscr_rl = state->psscr_val & PSSCR_RL_MASK;
 
+		if ((state->flags & OPAL_PM_TIMEBASE_STOP) &&
+		     (pnv_first_tb_loss_level > psscr_rl))
+			pnv_first_tb_loss_level = psscr_rl;
+
 		if ((state->flags & OPAL_PM_LOSE_FULL_CONTEXT) &&
-		    pnv_first_deep_stop_state > psscr_rl)
-			pnv_first_deep_stop_state = psscr_rl;
+		     (pnv_first_hv_loss_level > psscr_rl))
+			pnv_first_hv_loss_level = psscr_rl;
 
 		err = validate_psscr_val_mask(&state->psscr_val,
 					      &state->psscr_mask,
@@ -670,6 +1109,7 @@ static int __init pnv_power9_idle_init(void)
 			pnv_default_stop_val = state->psscr_val;
 			pnv_default_stop_mask = state->psscr_mask;
 			default_stop_found = true;
+			WARN_ON(state->flags & OPAL_PM_LOSE_FULL_CONTEXT);
 		}
 	}
 
@@ -689,10 +1129,40 @@ static int __init pnv_power9_idle_init(void)
 			pnv_deepest_stop_psscr_mask);
 	}
 
-	pr_info("cpuidle-powernv: Requested Level (RL) value of first deep stop = 0x%llx\n",
-		pnv_first_deep_stop_state);
+	pr_info("cpuidle-powernv: First stop level that may lose SPRs = 0x%lld\n",
+		pnv_first_hv_loss_level);
 
-	return 0;
+	pr_info("cpuidle-powernv: First stop level that may lose timebase = 0x%lld\n",
+		pnv_first_tb_loss_level);
+}
+
+static void __init pnv_disable_deep_states(void)
+{
+	/*
+	 * The stop-api is unable to restore hypervisor
+	 * resources on wakeup from platform idle states which
+	 * lose full context. So disable such states.
+	 */
+	supported_cpuidle_states &= ~OPAL_PM_LOSE_FULL_CONTEXT;
+	pr_warn("cpuidle-powernv: Disabling idle states that lose full context\n");
+	pr_warn("cpuidle-powernv: Idle power-savings, CPU-Hotplug affected\n");
+
+	if (cpu_has_feature(CPU_FTR_ARCH_300) &&
+	    (pnv_deepest_stop_flag & OPAL_PM_LOSE_FULL_CONTEXT)) {
+		/*
+		 * Use the default stop state for CPU-Hotplug
+		 * if available.
+		 */
+		if (default_stop_found) {
+			pnv_deepest_stop_psscr_val = pnv_default_stop_val;
+			pnv_deepest_stop_psscr_mask = pnv_default_stop_mask;
+			pr_warn("cpuidle-powernv: Offlined CPUs will stop with psscr = 0x%016llx\n",
+				pnv_deepest_stop_psscr_val);
+		} else { /* Fallback to snooze loop for CPU-Hotplug */
+			deepest_stop_found = false;
+			pr_warn("cpuidle-powernv: Offlined CPUs will busy wait\n");
+		}
+	}
 }
 
 /*
@@ -707,10 +1177,8 @@ static void __init pnv_probe_idle_states(void)
 		return;
 	}
 
-	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
-		if (pnv_power9_idle_init())
-			return;
-	}
+	if (cpu_has_feature(CPU_FTR_ARCH_300))
+		pnv_power9_idle_init();
 
 	for (i = 0; i < nr_pnv_idle_states; i++)
 		supported_cpuidle_states |= pnv_idle_states[i].flags;
@@ -818,7 +1286,7 @@ static int pnv_parse_cpuidle_dt(void)
 	}
 	for (i = 0; i < nr_idle_states; i++)
 		strncpy(pnv_idle_states[i].name, temp_string[i],
-			PNV_IDLE_NAME_LEN);
+			PNV_IDLE_NAME_LEN - 1);
 	nr_pnv_idle_states = nr_idle_states;
 	rc = 0;
 out:
@@ -830,11 +1298,34 @@ static int pnv_parse_cpuidle_dt(void)
 
 static int __init pnv_init_idle_states(void)
 {
+	int cpu;
 	int rc = 0;
-	supported_cpuidle_states = 0;
+
+	/* Set up PACA fields */
+	for_each_present_cpu(cpu) {
+		struct paca_struct *p = paca_ptrs[cpu];
+
+		p->idle_state = 0;
+		if (cpu == cpu_first_thread_sibling(cpu))
+			p->idle_state = (1 << threads_per_core) - 1;
+
+		if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+			/* P7/P8 nap */
+			p->thread_idle_state = PNV_THREAD_RUNNING;
+			p->thread_mask = 1 << cpu_thread_in_core(cpu);
+		} else {
+			/* P9 stop */
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+			p->requested_psscr = 0;
+			atomic_set(&p->dont_stop, 0);
+#endif
+		}
+	}
 
 	/* In case we error out nr_pnv_idle_states will be zero */
 	nr_pnv_idle_states = 0;
+	supported_cpuidle_states = 0;
+
 	if (cpuidle_disable != IDLE_NO_OVERRIDE)
 		goto out;
 	rc = pnv_parse_cpuidle_dt();
@@ -842,27 +1333,40 @@ static int __init pnv_init_idle_states(void)
 		return rc;
 	pnv_probe_idle_states();
 
-	if (!(supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
-		patch_instruction(
-			(unsigned int *)pnv_fastsleep_workaround_at_entry,
-			PPC_INST_NOP);
-		patch_instruction(
-			(unsigned int *)pnv_fastsleep_workaround_at_exit,
-			PPC_INST_NOP);
-	} else {
-		/*
-		 * OPAL_PM_SLEEP_ENABLED_ER1 is set. It indicates that
-		 * workaround is needed to use fastsleep. Provide sysfs
-		 * control to choose how this workaround has to be applied.
-		 */
-		device_create_file(cpu_subsys.dev_root,
+	if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+		if (!(supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
+			power7_fastsleep_workaround_entry = false;
+			power7_fastsleep_workaround_exit = false;
+		} else {
+			/*
+			 * OPAL_PM_SLEEP_ENABLED_ER1 is set. It indicates that
+			 * workaround is needed to use fastsleep. Provide sysfs
+			 * control to choose how this workaround has to be
+			 * applied.
+			 */
+			device_create_file(cpu_subsys.dev_root,
 				&dev_attr_fastsleep_workaround_applyonce);
-	}
+		}
+
+		update_subcore_sibling_mask();
 
-	pnv_alloc_idle_core_states();
+		if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED) {
+			ppc_md.power_save = power7_idle;
+			power7_offline_type = PNV_THREAD_NAP;
+		}
 
-	if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED)
-		ppc_md.power_save = power7_idle;
+		if ((supported_cpuidle_states & OPAL_PM_WINKLE_ENABLED) &&
+			   (supported_cpuidle_states & OPAL_PM_LOSE_FULL_CONTEXT))
+			power7_offline_type = PNV_THREAD_WINKLE;
+		else if ((supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED) ||
+			   (supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED_ER1))
+			power7_offline_type = PNV_THREAD_SLEEP;
+	}
+
+	if (supported_cpuidle_states & OPAL_PM_LOSE_FULL_CONTEXT) {
+		if (pnv_save_sprs_for_deep_states())
+			pnv_disable_deep_states();
+	}
 
 out:
 	return 0;
diff --git a/arch/powerpc/platforms/powernv/subcore.c b/arch/powerpc/platforms/powernv/subcore.c
index 45563004feda..1d7a9fd30dd1 100644
--- a/arch/powerpc/platforms/powernv/subcore.c
+++ b/arch/powerpc/platforms/powernv/subcore.c
@@ -183,7 +183,7 @@ static void unsplit_core(void)
 	cpu = smp_processor_id();
 	if (cpu_thread_in_core(cpu) != 0) {
 		while (mfspr(SPRN_HID0) & mask)
-			power7_idle_insn(PNV_THREAD_NAP);
+			power7_idle_type(PNV_THREAD_NAP);
 
 		per_cpu(split_state, cpu).step = SYNC_STEP_UNSPLIT;
 		return;
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index c7324e8469ee..b8167f4d53ef 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2417,7 +2417,6 @@ static void dump_one_paca(int cpu)
 	DUMP(p, irq_happened, "%#-*x");
 	DUMP(p, io_sync, "%#-*x");
 	DUMP(p, irq_work_pending, "%#-*x");
-	DUMP(p, nap_state_lost, "%#-*x");
 	DUMP(p, sprg_vdso, "%#-*llx");
 
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
@@ -2425,19 +2424,17 @@ static void dump_one_paca(int cpu)
 #endif
 
 #ifdef CONFIG_PPC_POWERNV
-	DUMP(p, core_idle_state_ptr, "%-*px");
-	DUMP(p, thread_idle_state, "%#-*x");
-	DUMP(p, thread_mask, "%#-*x");
-	DUMP(p, subcore_sibling_mask, "%#-*x");
-	DUMP(p, requested_psscr, "%#-*llx");
-	DUMP(p, stop_sprs.pid, "%#-*llx");
-	DUMP(p, stop_sprs.ldbar, "%#-*llx");
-	DUMP(p, stop_sprs.fscr, "%#-*llx");
-	DUMP(p, stop_sprs.hfscr, "%#-*llx");
-	DUMP(p, stop_sprs.mmcr1, "%#-*llx");
-	DUMP(p, stop_sprs.mmcr2, "%#-*llx");
-	DUMP(p, stop_sprs.mmcra, "%#-*llx");
-	DUMP(p, dont_stop.counter, "%#-*x");
+	DUMP(p, idle_state, "%#-*lx");
+	if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+		DUMP(p, thread_idle_state, "%#-*x");
+		DUMP(p, thread_mask, "%#-*x");
+		DUMP(p, subcore_sibling_mask, "%#-*x");
+	} else {
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+		DUMP(p, requested_psscr, "%#-*llx");
+		DUMP(p, dont_stop.counter, "%#-*x");
+#endif
+	}
 #endif
 
 	DUMP(p, accounting.utime, "%#-*lx");
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH] powerpc/powernv: Fix concurrency issue with npu->mmio_atsd_usage
From: Alistair Popple @ 2018-08-03  4:29 UTC (permalink / raw)
  To: Reza Arbab; +Cc: linuxppc-dev
In-Reply-To: <1533269016-16238-1-git-send-email-arbab@linux.ibm.com>

> There may be a long-term way to fix this at a larger scale, but for now
> resolve the immediate problem by gating our call to
> test_and_set_bit_lock() with one to test_bit(), which is obviously
> implemented without using a store.

I am less sure of this now but am continuing to investigate. However this patch
looks good.

Acked-by: Alistair Popple <alistair@popple.id.au>

> Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
> ---
>  arch/powerpc/platforms/powernv/npu-dma.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index 8cdf91f..c773465 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -437,8 +437,9 @@ static int get_mmio_atsd_reg(struct npu *npu)
>  	int i;
>  
>  	for (i = 0; i < npu->mmio_atsd_count; i++) {
> -		if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
> -			return i;
> +		if (!test_bit(i, &npu->mmio_atsd_usage))
> +			if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
> +				return i;
>  	}
>  
>  	return -ENOSPC;
> 

^ permalink raw reply

* Re: [PATCH v5 09/11] hugetlb: Introduce generic version of huge_ptep_set_wrprotect
From: Alex Ghiti @ 2018-08-03  5:24 UTC (permalink / raw)
  To: Michael Ellerman, linux-mm, mike.kravetz, linux, catalin.marinas,
	will.deacon, tony.luck, fenghua.yu, ralf, paul.burton, jhogan,
	jejb, deller, benh, ysato, dalias, davem, tglx, mingo, hpa, x86,
	arnd, linux-arm-kernel, linux-kernel, linux-ia64, linux-mips,
	linux-parisc, linuxppc-dev, linux-sh, sparclinux, linux-arch,
	aneesh.kumar@linux.ibm.com
In-Reply-To: <6acb1389-6998-bafb-cf69-174fd522c04c@ghiti.fr>

Ok, I tried every defconfig available:

- for the nohash/32, I found that I could use mpc885_ads_defconfig and I 
activated HUGETLBFS.
I removed the definition of huge_ptep_set_wrprotect from 
nohash/32/pgtable.h, add an #error in
include/asm-generic/hugetlb.h right before the generic definition of 
huge_ptep_set_wrprotect,
and fell onto it at compile-time:
=> I'm pretty confident then that removing the definition of 
huge_ptep_set_wrprotect does not
break anythingin this case.

- regardind book3s/32, I did not find any defconfig with 
CONFIG_PPC_BOOK3S_32, CONFIG_PPC32
allowing to enable huge page support (ie CONFIG_SYS_SUPPORTS_HUGETLBFS)
=> Do you have a defconfig that would allow me to try the same as above ?

Thanks,

Alex


On 07/31/2018 11:17 AM, Alexandre Ghiti wrote:
>
> On 07/31/2018 12:06 PM, Michael Ellerman wrote:
>> Alexandre Ghiti <alex@ghiti.fr> writes:
>>
>>> arm, ia64, mips, sh, x86 architectures use the same version
>>> of huge_ptep_set_wrprotect, so move this generic implementation into
>>> asm-generic/hugetlb.h.
>>> Note: powerpc uses twice for book3s/32 and nohash/32 the same 
>>> version as
>>> the above architectures, but the modification was not straightforward
>>> and hence has not been done.
>> Do you remember what the problem was there?
>>
>> It looks like you should just be able to drop them like the others. I
>> assume there's some header spaghetti that causes problems though?
>
> Yes, the header spaghetti frightened me a bit. Maybe I should have 
> tried harder: I can try to remove them and find the right defconfigs 
> to compile both to begin with. And to guarantee the functionality is 
> preserved, can I use the testsuite of libhugetlbfs with qemu ?
>
> Alex
>
>>
>> cheers
>>
>>
>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>>   arch/arm/include/asm/hugetlb-3level.h        | 6 ------
>>>   arch/arm64/include/asm/hugetlb.h             | 1 +
>>>   arch/ia64/include/asm/hugetlb.h              | 6 ------
>>>   arch/mips/include/asm/hugetlb.h              | 6 ------
>>>   arch/parisc/include/asm/hugetlb.h            | 1 +
>>>   arch/powerpc/include/asm/book3s/32/pgtable.h | 2 ++
>>>   arch/powerpc/include/asm/book3s/64/pgtable.h | 1 +
>>>   arch/powerpc/include/asm/nohash/32/pgtable.h | 2 ++
>>>   arch/powerpc/include/asm/nohash/64/pgtable.h | 1 +
>>>   arch/sh/include/asm/hugetlb.h                | 6 ------
>>>   arch/sparc/include/asm/hugetlb.h             | 1 +
>>>   arch/x86/include/asm/hugetlb.h               | 6 ------
>>>   include/asm-generic/hugetlb.h                | 8 ++++++++
>>>   13 files changed, 17 insertions(+), 30 deletions(-)
>>>
>>> diff --git a/arch/arm/include/asm/hugetlb-3level.h 
>>> b/arch/arm/include/asm/hugetlb-3level.h
>>> index b897541520ef..8247cd6a2ac6 100644
>>> --- a/arch/arm/include/asm/hugetlb-3level.h
>>> +++ b/arch/arm/include/asm/hugetlb-3level.h
>>> @@ -37,12 +37,6 @@ static inline pte_t huge_ptep_get(pte_t *ptep)
>>>       return retval;
>>>   }
>>>   -static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> -                       unsigned long addr, pte_t *ptep)
>>> -{
>>> -    ptep_set_wrprotect(mm, addr, ptep);
>>> -}
>>> -
>>>   static inline int huge_ptep_set_access_flags(struct vm_area_struct 
>>> *vma,
>>>                            unsigned long addr, pte_t *ptep,
>>>                            pte_t pte, int dirty)
>>> diff --git a/arch/arm64/include/asm/hugetlb.h 
>>> b/arch/arm64/include/asm/hugetlb.h
>>> index 3e7f6e69b28d..f4f69ae5466e 100644
>>> --- a/arch/arm64/include/asm/hugetlb.h
>>> +++ b/arch/arm64/include/asm/hugetlb.h
>>> @@ -48,6 +48,7 @@ extern int huge_ptep_set_access_flags(struct 
>>> vm_area_struct *vma,
>>>   #define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
>>>   extern pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
>>>                        unsigned long addr, pte_t *ptep);
>>> +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>>   extern void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>>                       unsigned long addr, pte_t *ptep);
>>>   #define __HAVE_ARCH_HUGE_PTEP_CLEAR_FLUSH
>>> diff --git a/arch/ia64/include/asm/hugetlb.h 
>>> b/arch/ia64/include/asm/hugetlb.h
>>> index cbe296271030..49d1f7949f3a 100644
>>> --- a/arch/ia64/include/asm/hugetlb.h
>>> +++ b/arch/ia64/include/asm/hugetlb.h
>>> @@ -27,12 +27,6 @@ static inline void huge_ptep_clear_flush(struct 
>>> vm_area_struct *vma,
>>>   {
>>>   }
>>>   -static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> -                       unsigned long addr, pte_t *ptep)
>>> -{
>>> -    ptep_set_wrprotect(mm, addr, ptep);
>>> -}
>>> -
>>>   static inline int huge_ptep_set_access_flags(struct vm_area_struct 
>>> *vma,
>>>                            unsigned long addr, pte_t *ptep,
>>>                            pte_t pte, int dirty)
>>> diff --git a/arch/mips/include/asm/hugetlb.h 
>>> b/arch/mips/include/asm/hugetlb.h
>>> index 6ff2531cfb1d..3dcf5debf8c4 100644
>>> --- a/arch/mips/include/asm/hugetlb.h
>>> +++ b/arch/mips/include/asm/hugetlb.h
>>> @@ -63,12 +63,6 @@ static inline int huge_pte_none(pte_t pte)
>>>       return !val || (val == (unsigned long)invalid_pte_table);
>>>   }
>>>   -static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> -                       unsigned long addr, pte_t *ptep)
>>> -{
>>> -    ptep_set_wrprotect(mm, addr, ptep);
>>> -}
>>> -
>>>   static inline int huge_ptep_set_access_flags(struct vm_area_struct 
>>> *vma,
>>>                            unsigned long addr,
>>>                            pte_t *ptep, pte_t pte,
>>> diff --git a/arch/parisc/include/asm/hugetlb.h 
>>> b/arch/parisc/include/asm/hugetlb.h
>>> index fb7e0fd858a3..9c3950ca2974 100644
>>> --- a/arch/parisc/include/asm/hugetlb.h
>>> +++ b/arch/parisc/include/asm/hugetlb.h
>>> @@ -39,6 +39,7 @@ static inline void huge_ptep_clear_flush(struct 
>>> vm_area_struct *vma,
>>>   {
>>>   }
>>>   +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>>   void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>>                          unsigned long addr, pte_t *ptep);
>>>   diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h 
>>> b/arch/powerpc/include/asm/book3s/32/pgtable.h
>>> index 02f5acd7ccc4..d2cd1d0226e9 100644
>>> --- a/arch/powerpc/include/asm/book3s/32/pgtable.h
>>> +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
>>> @@ -228,6 +228,8 @@ static inline void ptep_set_wrprotect(struct 
>>> mm_struct *mm, unsigned long addr,
>>>   {
>>>       pte_update(ptep, (_PAGE_RW | _PAGE_HWWRITE), _PAGE_RO);
>>>   }
>>> +
>>> +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>>   static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>>                          unsigned long addr, pte_t *ptep)
>>>   {
>>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
>>> b/arch/powerpc/include/asm/book3s/64/pgtable.h
>>> index 42aafba7a308..7d957f7c47cd 100644
>>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>>> @@ -451,6 +451,7 @@ static inline void ptep_set_wrprotect(struct 
>>> mm_struct *mm, unsigned long addr,
>>>           pte_update(mm, addr, ptep, 0, _PAGE_PRIVILEGED, 0);
>>>   }
>>>   +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>>   static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>>                          unsigned long addr, pte_t *ptep)
>>>   {
>>> diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
>>> b/arch/powerpc/include/asm/nohash/32/pgtable.h
>>> index 7c46a98cc7f4..f39e200d9591 100644
>>> --- a/arch/powerpc/include/asm/nohash/32/pgtable.h
>>> +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
>>> @@ -249,6 +249,8 @@ static inline void ptep_set_wrprotect(struct 
>>> mm_struct *mm, unsigned long addr,
>>>   {
>>>       pte_update(ptep, (_PAGE_RW | _PAGE_HWWRITE), _PAGE_RO);
>>>   }
>>> +
>>> +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>>   static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>>                          unsigned long addr, pte_t *ptep)
>>>   {
>>> diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h 
>>> b/arch/powerpc/include/asm/nohash/64/pgtable.h
>>> index dd0c7236208f..69fbf7e9b4db 100644
>>> --- a/arch/powerpc/include/asm/nohash/64/pgtable.h
>>> +++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
>>> @@ -238,6 +238,7 @@ static inline void ptep_set_wrprotect(struct 
>>> mm_struct *mm, unsigned long addr,
>>>       pte_update(mm, addr, ptep, _PAGE_RW, 0, 0);
>>>   }
>>>   +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>>   static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>>                          unsigned long addr, pte_t *ptep)
>>>   {
>>> diff --git a/arch/sh/include/asm/hugetlb.h 
>>> b/arch/sh/include/asm/hugetlb.h
>>> index f1bbd255ee43..8df4004977b9 100644
>>> --- a/arch/sh/include/asm/hugetlb.h
>>> +++ b/arch/sh/include/asm/hugetlb.h
>>> @@ -32,12 +32,6 @@ static inline void huge_ptep_clear_flush(struct 
>>> vm_area_struct *vma,
>>>   {
>>>   }
>>>   -static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> -                       unsigned long addr, pte_t *ptep)
>>> -{
>>> -    ptep_set_wrprotect(mm, addr, ptep);
>>> -}
>>> -
>>>   static inline int huge_ptep_set_access_flags(struct vm_area_struct 
>>> *vma,
>>>                            unsigned long addr, pte_t *ptep,
>>>                            pte_t pte, int dirty)
>>> diff --git a/arch/sparc/include/asm/hugetlb.h 
>>> b/arch/sparc/include/asm/hugetlb.h
>>> index 2101ea217f33..c41754a113f3 100644
>>> --- a/arch/sparc/include/asm/hugetlb.h
>>> +++ b/arch/sparc/include/asm/hugetlb.h
>>> @@ -32,6 +32,7 @@ static inline void huge_ptep_clear_flush(struct 
>>> vm_area_struct *vma,
>>>   {
>>>   }
>>>   +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>>   static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>>                          unsigned long addr, pte_t *ptep)
>>>   {
>>> diff --git a/arch/x86/include/asm/hugetlb.h 
>>> b/arch/x86/include/asm/hugetlb.h
>>> index 59c056adb3c9..a3f781f7a264 100644
>>> --- a/arch/x86/include/asm/hugetlb.h
>>> +++ b/arch/x86/include/asm/hugetlb.h
>>> @@ -13,12 +13,6 @@ static inline int is_hugepage_only_range(struct 
>>> mm_struct *mm,
>>>       return 0;
>>>   }
>>>   -static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> -                       unsigned long addr, pte_t *ptep)
>>> -{
>>> -    ptep_set_wrprotect(mm, addr, ptep);
>>> -}
>>> -
>>>   static inline int huge_ptep_set_access_flags(struct vm_area_struct 
>>> *vma,
>>>                            unsigned long addr, pte_t *ptep,
>>>                            pte_t pte, int dirty)
>>> diff --git a/include/asm-generic/hugetlb.h 
>>> b/include/asm-generic/hugetlb.h
>>> index 6c0c8b0c71e0..9b9039845278 100644
>>> --- a/include/asm-generic/hugetlb.h
>>> +++ b/include/asm-generic/hugetlb.h
>>> @@ -102,4 +102,12 @@ static inline int prepare_hugepage_range(struct 
>>> file *file,
>>>   }
>>>   #endif
>>>   +#ifndef __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>> +static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> +        unsigned long addr, pte_t *ptep)
>>> +{
>>> +    ptep_set_wrprotect(mm, addr, ptep);
>>> +}
>>> +#endif
>>> +
>>>   #endif /* _ASM_GENERIC_HUGETLB_H */
>>> -- 
>>> 2.16.2
>

^ permalink raw reply

* [PATCH 1/2] powerpc: Allow memory that has been hot-removed to be hot-added
From: Rashmica Gupta @ 2018-08-03  6:06 UTC (permalink / raw)
  To: linuxppc-dev, mpe, mikey, benh, paulus, bsingharora; +Cc: Rashmica Gupta

This patch allows the memory removed by memtrace to be readded to the
kernel. So now you don't have to reboot your system to add the memory
back to the kernel or to have a different amount of memory removed.

Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com>
---
To remove 1GB from each node:
echo 1073741824  >  /sys/kernel/debug/powerpc/memtrace/enable

To add this memory back and remove 2GB:
echo 2147483648 >  /sys/kernel/debug/powerpc/memtrace/enable 

To just re-add memory:
echo 0  >  /sys/kernel/debug/powerpc/memtrace/enable 


 arch/powerpc/platforms/powernv/memtrace.c | 93 ++++++++++++++++++++++++++++---
 1 file changed, 86 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index b99283df8584..51fe0862dcab 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -206,8 +206,11 @@ static int memtrace_init_debugfs(void)
 
 		snprintf(ent->name, 16, "%08x", ent->nid);
 		dir = debugfs_create_dir(ent->name, memtrace_debugfs_dir);
-		if (!dir)
+		if (!dir) {
+			pr_err("Failed to create debugfs directory for node %d\n",
+				ent->nid);
 			return -1;
+		}
 
 		ent->dir = dir;
 		debugfs_create_file("trace", 0400, dir, ent, &memtrace_fops);
@@ -218,18 +221,94 @@ static int memtrace_init_debugfs(void)
 	return ret;
 }
 
+static int online_mem_block(struct memory_block *mem, void *arg)
+{
+	return device_online(&mem->dev);
+}
+
+/*
+ * Iterate through the chunks of memory we have removed from the kernel
+ * and attempt to add them back to the kernel.
+ */
+static int memtrace_online(void)
+{
+	int i, ret = 0;
+	struct memtrace_entry *ent;
+
+	for (i = memtrace_array_nr - 1; i >= 0; i--) {
+		ent = &memtrace_array[i];
+
+		/* We have onlined this chunk previously */
+		if (ent->nid == -1)
+			continue;
+
+		/* Remove from io mappings */
+		if (ent->mem) {
+			iounmap(ent->mem);
+			ent->mem = 0;
+		}
+
+		if (add_memory(ent->nid, ent->start, ent->size)) {
+			pr_err("Failed to add trace memory to node %d\n",
+					 ent->nid);
+			ret += 1;
+			continue;
+		}
+
+		/*
+		 * If kernel isn't compiled with the auto online option
+		 * we need to online the memory ourselves.
+		 */
+		if (!memhp_auto_online) {
+			walk_memory_range(PFN_DOWN(ent->start),
+				PFN_UP(ent->start + ent->size - 1),
+				NULL, online_mem_block);
+		}
+
+		/*
+		 * Memory was added successfully so clean up references to it
+		 * so on reentry we can tell that this chunk was added.
+		 */
+		debugfs_remove_recursive(ent->dir);
+		pr_info("Added trace memory back to node %d\n", ent->nid);
+		ent->size = ent->start = ent->nid = -1;
+	}
+	if (ret)
+		return ret;
+
+	/* If all chunks of memory were added successfully, reset globals */
+	kfree(memtrace_array);
+	memtrace_array = NULL;
+	memtrace_size = 0;
+	memtrace_array_nr = 0;
+	return 0;
+
+}
+
 static int memtrace_enable_set(void *data, u64 val)
 {
-	if (memtrace_size)
+	uint64_t bytes;
+
+	/*
+	 * Don't attempt to do anything if size isn't aligned to a memory
+	 * block or equal to zero.
+	 */
+	bytes = memory_block_size_bytes();
+	if (val & (bytes - 1)) {
+		pr_err("Value must be aligned with 0x%llx\n", bytes);
 		return -EINVAL;
+	}
 
-	if (!val)
-		return -EINVAL;
+	/* Re-add/online previously removed/offlined memory */
+	if (memtrace_size) {
+		if (memtrace_online())
+			return -EAGAIN;
+	}
 
-	/* Make sure size is aligned to a memory block */
-	if (val & (memory_block_size_bytes() - 1))
-		return -EINVAL;
+	if (!val)
+		return 0;
 
+	/* Offline and remove memory */
 	if (memtrace_init_regions_runtime(val))
 		return -EINVAL;
 
-- 
2.14.4

^ permalink raw reply related

* [PATCH 2/2] Update documentation on ppc-memtrace
From: Rashmica Gupta @ 2018-08-03  6:06 UTC (permalink / raw)
  To: linuxppc-dev, mpe, mikey, benh, paulus, bsingharora; +Cc: Rashmica Gupta
In-Reply-To: <20180803060601.724-1-rashmica.g@gmail.com>

Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com>
---
 Documentation/ABI/testing/ppc-memtrace | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/Documentation/ABI/testing/ppc-memtrace b/Documentation/ABI/testing/ppc-memtrace
index 2e8b93741270..9606aed33137 100644
--- a/Documentation/ABI/testing/ppc-memtrace
+++ b/Documentation/ABI/testing/ppc-memtrace
@@ -13,10 +13,11 @@ Contact:	linuxppc-dev@lists.ozlabs.org
 Description:	Write an integer containing the size in bytes of the memory
 		you want removed from each NUMA node to this file - it must be
 		aligned to the memblock size. This amount of RAM will be removed
-		from the kernel mappings and the following debugfs files will be
-		created. This can only be successfully done once per boot. Once
-		memory is successfully removed from each node, the following
-		files are created.
+		from each NUMA node in the kernel mappings and the following
+		debugfs files will be created. Once memory is successfully
+		removed from each node, the following files are created. To
+		re-add memory to the kernel, echo 0 into this file (it will be
+		automatically onlined).
 
 What:		/sys/kernel/debug/powerpc/memtrace/<node-id>
 Date:		Aug 2017
-- 
2.14.4

^ permalink raw reply related

* Re: [PATCH] powernv/cpuidle: Fix idle states all being marked invalid
From: Akshay Adiga @ 2018-08-03  6:14 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linuxppc-dev, Gautham R . Shenoy
In-Reply-To: <20180802153951.3576-1-npiggin@gmail.com>

On Fri, Aug 03, 2018 at 01:39:51AM +1000, Nicholas Piggin wrote:
> Commit 9c7b185ab2 ("powernv/cpuidle: Parse dt idle properties into
> global structure") parses dt idle states into structs, but never
> marks them valid. This results in all idle states being lost.
>
My bad. Thanks nick for fixing this. We definatetely need this.


> Cc: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
> Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Acked-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>

> ---
>  arch/powerpc/platforms/powernv/idle.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
> index 3116bab10aa3..ecb002c5db83 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -651,11 +651,12 @@ static int __init pnv_power9_idle_init(void)
>  					      &state->psscr_mask,
>  					      state->flags);
>  		if (err) {
> -			state->valid = false;
>  			report_invalid_psscr_val(state->psscr_val, err);
>  			continue;
>  		}
> 
> +		state->valid = true;
> +
>  		if (max_residency_ns < state->residency_ns) {
>  			max_residency_ns = state->residency_ns;
>  			pnv_deepest_stop_psscr_val = state->psscr_val;
> -- 
> 2.17.0
> 

^ permalink raw reply

* Re: [PATCH v4 5/6] powerpc: Add show_user_instructions()
From: Christophe LEROY @ 2018-08-03  6:38 UTC (permalink / raw)
  To: Murilo Opsfelder Araujo
  Cc: linux-kernel, Alastair D'Silva, Andrew Donnellan,
	Balbir Singh, Benjamin Herrenschmidt, Cyril Bur,
	Eric W . Biederman, Joe Perches, Michael Ellerman,
	Michael Neuling, Nicholas Piggin, Paul Mackerras, Simon Guo,
	Sukadev Bhattiprolu, Tobin C . Harding, linuxppc-dev,
	Segher Boessenkool
In-Reply-To: <20180803004201.GA5891@kermit-br-ibm-com>

Hi Murilo,

Le 03/08/2018 à 02:42, Murilo Opsfelder Araujo a écrit :
> Hi, Christophe.
> 
> On Thu, Aug 02, 2018 at 07:26:20AM +0200, Christophe LEROY wrote:
>>
>>
>> Le 01/08/2018 à 23:33, Murilo Opsfelder Araujo a écrit :
>>> show_user_instructions() is a slightly modified version of
>>> show_instructions() that allows userspace instruction dump.
>>>
>>> This will be useful within show_signal_msg() to dump userspace
>>> instructions of the faulty location.
>>>
>>> Here is a sample of what show_user_instructions() outputs:
>>>
>>>     pandafault[10850]: code: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
>>>     pandafault[10850]: code: 392988d0 f93f0020 e93f0020 39400048 <99490000> 39200000 7d234b78 383f0040
>>>
>>> The current->comm and current->pid printed can serve as a glue that
>>> links the instructions dump to its originator, allowing messages to be
>>> interleaved in the logs.
>>>
>>> Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
>>> ---
>>>    arch/powerpc/include/asm/stacktrace.h | 13 +++++++++
>>>    arch/powerpc/kernel/process.c         | 40 +++++++++++++++++++++++++++
>>>    2 files changed, 53 insertions(+)
>>>    create mode 100644 arch/powerpc/include/asm/stacktrace.h
>>>
>>> diff --git a/arch/powerpc/include/asm/stacktrace.h b/arch/powerpc/include/asm/stacktrace.h
>>> new file mode 100644
>>> index 000000000000..6149b53b3bc8
>>> --- /dev/null
>>> +++ b/arch/powerpc/include/asm/stacktrace.h
>>> @@ -0,0 +1,13 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +/*
>>> + * Stack trace functions.
>>> + *
>>> + * Copyright 2018, Murilo Opsfelder Araujo, IBM Corporation.
>>> + */
>>> +
>>> +#ifndef _ASM_POWERPC_STACKTRACE_H
>>> +#define _ASM_POWERPC_STACKTRACE_H
>>> +
>>> +void show_user_instructions(struct pt_regs *regs);
>>> +
>>> +#endif /* _ASM_POWERPC_STACKTRACE_H */
>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>> index e9533b4d2f08..364645ac732c 100644
>>> --- a/arch/powerpc/kernel/process.c
>>> +++ b/arch/powerpc/kernel/process.c
>>> @@ -1299,6 +1299,46 @@ static void show_instructions(struct pt_regs *regs)
>>>    	pr_cont("\n");
>>>    }
>>> +void show_user_instructions(struct pt_regs *regs)
>>> +{
>>> +	int i;
>>> +	const char *prefix = KERN_INFO "%s[%d]: code: ";
>>> +	unsigned long pc = regs->nip - (instructions_to_print * 3 / 4 *
>>> +					sizeof(int));
>>> +
>>> +	printk(prefix, current->comm, current->pid);
>>
>> Why not use pr_info() and remove KERN_INFO from *prefix ?
> 
> Because it doesn't compile:
> 
>    arch/powerpc/kernel/process.c:1317:10: error: expected ‘)’ before ‘prefix’
>      pr_info(prefix, current->comm, current->pid);
>              ^
>    ./include/linux/printk.h:288:21: note: in definition of macro ‘pr_fmt’
>     #define pr_fmt(fmt) fmt
>                       ^
> 
> `pr_info(prefix, ...)` expands to `printk("\001" "6" prefix, ...)`,
> which is an invalid string concatenation.
> 
> `pr_info("%s", ...)` expands to `printk("\001" "6" "%s", ...)`, which is
> valid.

Then what about using directly:

pr_info("%s[%d]: code: ", ...);

> 
>>> +
>>> +	for (i = 0; i < instructions_to_print; i++) {
>>> +		int instr;
>>> +
>>> +		if (!(i % 8) && (i > 0)) {
>>> +			pr_cont("\n");
>>> +			printk(prefix, current->comm, current->pid);
>>> +		}
>>> +
>>> +#if !defined(CONFIG_BOOKE)
>>> +		/* If executing with the IMMU off, adjust pc rather
>>> +		 * than print XXXXXXXX.
>>> +		 */
>>> +		if (!(regs->msr & MSR_IR))
>>> +			pc = (unsigned long)phys_to_virt(pc);
>>
>> Shouldn't this be done outside of the loop, only once ?
> 
> I don't think so.
> 
> pc gets incremented at the bottom of the loop:
> 
>    pc += sizeof(int);
> 
> Adjusting pc is necessary at each iteration.  Leaving this block inside
> the loop seems correct.

This looks pretty strange.
The first time, pc is a physical address, that you change to a virtual 
address. Then when you increment it it is still a virtual address.
So when you call phys_to_virt(pc) for the second time, pc is already a 
virt address, so what happens indeed ?

Christophe

> 
> Cheers
> Murilo
> 

^ permalink raw reply

* RE: [PATCH v11 00/26] Speculative page faults
From: Song, HaiyanX @ 2018-08-03  6:36 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm@linux-foundation.org, mhocko@kernel.org,
	peterz@infradead.org, kirill@shutemov.name, ak@linux.intel.com,
	dave@stgolabs.net, jack@suse.cz, Matthew Wilcox,
	khandual@linux.vnet.ibm.com, aneesh.kumar@linux.vnet.ibm.com,
	benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org,
	Thomas Gleixner, Ingo Molnar, hpa@zytor.com, Will Deacon,
	Sergey Senozhatsky, sergey.senozhatsky.work@gmail.com,
	Andrea Arcangeli, Alexei Starovoitov, Wang, Kemi, Daniel Jordan,
	David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
	Punit Agrawal, vinayak menon, Yang Shi,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	haren@linux.vnet.ibm.com, npiggin@gmail.com,
	bsingharora@gmail.com, paulmck@linux.vnet.ibm.com, Tim Chen,
	linuxppc-dev@lists.ozlabs.org, x86@kernel.org
In-Reply-To: <166434ae-ecaf-05d8-3cc7-7aa75bc3737b@linux.vnet.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 42188 bytes --]

Hi Laurent,

Thanks for your analysis for the last perf results.
Your mentioned ," the major differences at the head of the perf report is the 92% testcase which is weirdly not reported
on the head side", which is a bug of 0-day,and it caused the item is not counted in perf. 

I've triggered the test page_fault2 and page_fault3 again only with thread mode of will-it-scale on 0-day (on the same test box,every case tested 3 times).
I checked the perf report have no above mentioned problem.

I have compared them, found some items have difference, such as below case:
       page_fault2-thp-always: handle_mm_fault, base: 45.22%    head: 29.41%
       page_fault3-thp-always: handle_mm_fault, base: 22.95%    head: 14.15%       

So i attached the perf result in mail again, could your have a look again for checking the difference between base and head commit.


Thanks,
Haiyan, Song
 
________________________________________
From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
Sent: Tuesday, July 17, 2018 5:36 PM
To: Song, HaiyanX
Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
Subject: Re: [PATCH v11 00/26] Speculative page faults

On 13/07/2018 05:56, Song, HaiyanX wrote:
> Hi Laurent,

Hi Haiyan,

Thanks a lot for sharing this perf reports.

I looked at them closely, and I've to admit that I was not able to found a
major difference between the base and the head report, except that
handle_pte_fault() is no more in-lined in the head one.

As expected, __handle_speculative_fault() is never traced since these tests are
dealing with file mapping, not handled in the speculative way.

When running these test did you seen a major differences in the test's result
between base and head ?

>From the number of cycles counted, the biggest difference is page_fault3 when
run with the THP enabled:
                                BASE            HEAD            Delta
page_fault2_base_thp_never      1142252426747   1065866197589   -6.69%
page_fault2_base_THP-Alwasys    1124844374523   1076312228927   -4.31%
page_fault3_base_thp_never      1099387298152   1134118402345   3.16%
page_fault3_base_THP-Always     1059370178101   853985561949    -19.39%


The very weird thing is the difference of the delta cycles reported between
thp never and thp always, because the speculative way is aborted when checking
for the vma->ops field, which is the same in both case, and the thp is never
checked. So there is no code covering differnce, on the speculative path,
between these 2 cases. This leads me to think that there are other interactions
interfering in the measure.

Looking at the perf-profile_page_fault3_*_THP-Always, the major differences at
the head of the perf report is the 92% testcase which is weirdly not reported
on the head side :
    92.02%    22.33%  page_fault3_processes  [.] testcase
92.02% testcase

Then the base reported 37.67% for __do_page_fault() where the head reported
48.41%, but the only difference in this function, between base and head, is the
call to handle_speculative_fault(). But this is a macro checking for the fault
flags, and mm->users and then calling __handle_speculative_fault() if needed.
So this can't explain this difference, except if __handle_speculative_fault()
is inlined in __do_page_fault().
Is this the case on your build ?

Haiyan, do you still have the output of the test to check those numbers too ?

Cheers,
Laurent

> I attached the perf-profile.gz file for case page_fault2 and page_fault3. These files were captured during test the related test case.
> Please help to check on these data if it can help you to find the higher change. Thanks.
>
> File name perf-profile_page_fault2_head_THP-Always.gz, means the perf-profile result get from page_fault2
>     tested for head commit (a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12) with THP_always configuration.
>
> Best regards,
> Haiyan Song
>
> ________________________________________
> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
> Sent: Thursday, July 12, 2018 1:05 AM
> To: Song, HaiyanX
> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
> Subject: Re: [PATCH v11 00/26] Speculative page faults
>
> Hi Haiyan,
>
> Do you get a chance to capture some performance cycles on your system ?
> I still can't get these numbers on my hardware.
>
> Thanks,
> Laurent.
>
> On 04/07/2018 09:51, Laurent Dufour wrote:
>> On 04/07/2018 05:23, Song, HaiyanX wrote:
>>> Hi Laurent,
>>>
>>>
>>> For the test result on Intel 4s skylake platform (192 CPUs, 768G Memory), the below test cases all were run 3 times.
>>> I check the test results, only page_fault3_thread/enable THP have 6% stddev for head commit, other tests have lower stddev.
>>
>> Repeating the test only 3 times seems a bit too low to me.
>>
>> I'll focus on the higher change for the moment, but I don't have access to such
>> a hardware.
>>
>> Is possible to provide a diff between base and SPF of the performance cycles
>> measured when running page_fault3 and page_fault2 when the 20% change is detected.
>>
>> Please stay focus on the test case process to see exactly where the series is
>> impacting.
>>
>> Thanks,
>> Laurent.
>>
>>>
>>> And I did not find other high variation on test case result.
>>>
>>> a). Enable THP
>>> testcase                          base     stddev       change      head     stddev         metric
>>> page_fault3/enable THP           10519      ± 3%        -20.5%      8368      ±6%          will-it-scale.per_thread_ops
>>> page_fault2/enalbe THP            8281      ± 2%        -18.8%      6728                   will-it-scale.per_thread_ops
>>> brk1/eanble THP                 998475                   -2.2%    976893                   will-it-scale.per_process_ops
>>> context_switch1/enable THP      223910                   -1.3%    220930                   will-it-scale.per_process_ops
>>> context_switch1/enable THP      233722                   -1.0%    231288                   will-it-scale.per_thread_ops
>>>
>>> b). Disable THP
>>> page_fault3/disable THP          10856                  -23.1%      8344                   will-it-scale.per_thread_ops
>>> page_fault2/disable THP           8147                  -18.8%      6613                   will-it-scale.per_thread_ops
>>> brk1/disable THP                   957                    -7.9%      881                   will-it-scale.per_thread_ops
>>> context_switch1/disable THP     237006                    -2.2%    231907                  will-it-scale.per_thread_ops
>>> brk1/disable THP                997317                    -2.0%    977778                  will-it-scale.per_process_ops
>>> page_fault3/disable THP         467454                    -1.8%    459251                  will-it-scale.per_process_ops
>>> context_switch1/disable THP     224431                    -1.3%    221567                  will-it-scale.per_process_ops
>>>
>>>
>>> Best regards,
>>> Haiyan Song
>>> ________________________________________
>>> From: Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>> Sent: Monday, July 02, 2018 4:59 PM
>>> To: Song, HaiyanX
>>> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>> Subject: Re: [PATCH v11 00/26] Speculative page faults
>>>
>>> On 11/06/2018 09:49, Song, HaiyanX wrote:
>>>> Hi Laurent,
>>>>
>>>> Regression test for v11 patch serials have been run, some regression is found by LKP-tools (linux kernel performance)
>>>> tested on Intel 4s skylake platform. This time only test the cases which have been run and found regressions on
>>>> V9 patch serials.
>>>>
>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>>>> branch: Laurent-Dufour/Speculative-page-faults/20180520-045126
>>>> commit id:
>>>>   head commit : a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12
>>>>   base commit : ba98a1cdad71d259a194461b3a61471b49b14df1
>>>> Benchmark: will-it-scale
>>>> Download link: https://github.com/antonblanchard/will-it-scale/tree/master
>>>>
>>>> Metrics:
>>>>   will-it-scale.per_process_ops=processes/nr_cpu
>>>>   will-it-scale.per_thread_ops=threads/nr_cpu
>>>>   test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>>>> THP: enable / disable
>>>> nr_task:100%
>>>>
>>>> 1. Regressions:
>>>>
>>>> a). Enable THP
>>>> testcase                          base           change      head           metric
>>>> page_fault3/enable THP           10519          -20.5%        836      will-it-scale.per_thread_ops
>>>> page_fault2/enalbe THP            8281          -18.8%       6728      will-it-scale.per_thread_ops
>>>> brk1/eanble THP                 998475           -2.2%     976893      will-it-scale.per_process_ops
>>>> context_switch1/enable THP      223910           -1.3%     220930      will-it-scale.per_process_ops
>>>> context_switch1/enable THP      233722           -1.0%     231288      will-it-scale.per_thread_ops
>>>>
>>>> b). Disable THP
>>>> page_fault3/disable THP          10856          -23.1%       8344      will-it-scale.per_thread_ops
>>>> page_fault2/disable THP           8147          -18.8%       6613      will-it-scale.per_thread_ops
>>>> brk1/disable THP                   957           -7.9%        881      will-it-scale.per_thread_ops
>>>> context_switch1/disable THP     237006           -2.2%     231907      will-it-scale.per_thread_ops
>>>> brk1/disable THP                997317           -2.0%     977778      will-it-scale.per_process_ops
>>>> page_fault3/disable THP         467454           -1.8%     459251      will-it-scale.per_process_ops
>>>> context_switch1/disable THP     224431           -1.3%     221567      will-it-scale.per_process_ops
>>>>
>>>> Notes: for the above  values of test result, the higher is better.
>>>
>>> I tried the same tests on my PowerPC victim VM (1024 CPUs, 11TB) and I can't
>>> get reproducible results. The results have huge variation, even on the vanilla
>>> kernel, and I can't state on any changes due to that.
>>>
>>> I tried on smaller node (80 CPUs, 32G), and the tests ran better, but I didn't
>>> measure any changes between the vanilla and the SPF patched ones:
>>>
>>> test THP enabled                4.17.0-rc4-mm1  spf             delta
>>> page_fault3_threads             2697.7          2683.5          -0.53%
>>> page_fault2_threads             170660.6        169574.1        -0.64%
>>> context_switch1_threads         6915269.2       6877507.3       -0.55%
>>> context_switch1_processes       6478076.2       6529493.5       0.79%
>>> brk1                            243391.2        238527.5        -2.00%
>>>
>>> Tests were run 10 times, no high variation detected.
>>>
>>> Did you see high variation on your side ? How many times the test were run to
>>> compute the average values ?
>>>
>>> Thanks,
>>> Laurent.
>>>
>>>
>>>>
>>>> 2. Improvement: not found improvement based on the selected test cases.
>>>>
>>>>
>>>> Best regards
>>>> Haiyan Song
>>>> ________________________________________
>>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>>> Sent: Monday, May 28, 2018 4:54 PM
>>>> To: Song, HaiyanX
>>>> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>>> Subject: Re: [PATCH v11 00/26] Speculative page faults
>>>>
>>>> On 28/05/2018 10:22, Haiyan Song wrote:
>>>>> Hi Laurent,
>>>>>
>>>>> Yes, these tests are done on V9 patch.
>>>>
>>>> Do you plan to give this V11 a run ?
>>>>
>>>>>
>>>>>
>>>>> Best regards,
>>>>> Haiyan Song
>>>>>
>>>>> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote:
>>>>>> On 28/05/2018 07:23, Song, HaiyanX wrote:
>>>>>>>
>>>>>>> Some regression and improvements is found by LKP-tools(linux kernel performance) on V9 patch series
>>>>>>> tested on Intel 4s Skylake platform.
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks for reporting this benchmark results, but you mentioned the "V9 patch
>>>>>> series" while responding to the v11 header series...
>>>>>> Were these tests done on v9 or v11 ?
>>>>>>
>>>>>> Cheers,
>>>>>> Laurent.
>>>>>>
>>>>>>>
>>>>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>>>>>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series)
>>>>>>> Commit id:
>>>>>>>     base commit: d55f34411b1b126429a823d06c3124c16283231f
>>>>>>>     head commit: 0355322b3577eeab7669066df42c550a56801110
>>>>>>> Benchmark suite: will-it-scale
>>>>>>> Download link:
>>>>>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests
>>>>>>> Metrics:
>>>>>>>     will-it-scale.per_process_ops=processes/nr_cpu
>>>>>>>     will-it-scale.per_thread_ops=threads/nr_cpu
>>>>>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>>>>>>> THP: enable / disable
>>>>>>> nr_task: 100%
>>>>>>>
>>>>>>> 1. Regressions:
>>>>>>> a) THP enabled:
>>>>>>> testcase                        base            change          head       metric
>>>>>>> page_fault3/ enable THP         10092           -17.5%          8323       will-it-scale.per_thread_ops
>>>>>>> page_fault2/ enable THP          8300           -17.2%          6869       will-it-scale.per_thread_ops
>>>>>>> brk1/ enable THP                  957.67         -7.6%           885       will-it-scale.per_thread_ops
>>>>>>> page_fault3/ enable THP        172821            -5.3%        163692       will-it-scale.per_process_ops
>>>>>>> signal1/ enable THP              9125            -3.2%          8834       will-it-scale.per_process_ops
>>>>>>>
>>>>>>> b) THP disabled:
>>>>>>> testcase                        base            change          head       metric
>>>>>>> page_fault3/ disable THP        10107           -19.1%          8180       will-it-scale.per_thread_ops
>>>>>>> page_fault2/ disable THP         8432           -17.8%          6931       will-it-scale.per_thread_ops
>>>>>>> context_switch1/ disable THP   215389            -6.8%        200776       will-it-scale.per_thread_ops
>>>>>>> brk1/ disable THP                 939.67         -6.6%           877.33    will-it-scale.per_thread_ops
>>>>>>> page_fault3/ disable THP       173145            -4.7%        165064       will-it-scale.per_process_ops
>>>>>>> signal1/ disable THP             9162            -3.9%          8802       will-it-scale.per_process_ops
>>>>>>>
>>>>>>> 2. Improvements:
>>>>>>> a) THP enabled:
>>>>>>> testcase                        base            change          head       metric
>>>>>>> malloc1/ enable THP               66.33        +469.8%           383.67    will-it-scale.per_thread_ops
>>>>>>> writeseek3/ enable THP          2531             +4.5%          2646       will-it-scale.per_thread_ops
>>>>>>> signal1/ enable THP              989.33          +2.8%          1016       will-it-scale.per_thread_ops
>>>>>>>
>>>>>>> b) THP disabled:
>>>>>>> testcase                        base            change          head       metric
>>>>>>> malloc1/ disable THP              90.33        +417.3%           467.33    will-it-scale.per_thread_ops
>>>>>>> read2/ disable THP             58934            +39.2%         82060       will-it-scale.per_thread_ops
>>>>>>> page_fault1/ disable THP        8607            +36.4%         11736       will-it-scale.per_thread_ops
>>>>>>> read1/ disable THP            314063            +12.7%        353934       will-it-scale.per_thread_ops
>>>>>>> writeseek3/ disable THP         2452            +12.5%          2759       will-it-scale.per_thread_ops
>>>>>>> signal1/ disable THP             971.33          +5.5%          1024       will-it-scale.per_thread_ops
>>>>>>>
>>>>>>> Notes: for above values in column "change", the higher value means that the related testcase result
>>>>>>> on head commit is better than that on base commit for this benchmark.
>>>>>>>
>>>>>>>
>>>>>>> Best regards
>>>>>>> Haiyan Song
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>>>>>> Sent: Thursday, May 17, 2018 7:06 PM
>>>>>>> To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi
>>>>>>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>>>>>> Subject: [PATCH v11 00/26] Speculative page faults
>>>>>>>
>>>>>>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle
>>>>>>> page fault without holding the mm semaphore [1].
>>>>>>>
>>>>>>> The idea is to try to handle user space page faults without holding the
>>>>>>> mmap_sem. This should allow better concurrency for massively threaded
>>>>>>> process since the page fault handler will not wait for other threads memory
>>>>>>> layout change to be done, assuming that this change is done in another part
>>>>>>> of the process's memory space. This type page fault is named speculative
>>>>>>> page fault. If the speculative page fault fails because of a concurrency is
>>>>>>> detected or because underlying PMD or PTE tables are not yet allocating, it
>>>>>>> is failing its processing and a classic page fault is then tried.
>>>>>>>
>>>>>>> The speculative page fault (SPF) has to look for the VMA matching the fault
>>>>>>> address without holding the mmap_sem, this is done by introducing a rwlock
>>>>>>> which protects the access to the mm_rb tree. Previously this was done using
>>>>>>> SRCU but it was introducing a lot of scheduling to process the VMA's
>>>>>>> freeing operation which was hitting the performance by 20% as reported by
>>>>>>> Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree is
>>>>>>> limiting the locking contention to these operations which are expected to
>>>>>>> be in a O(log n) order. In addition to ensure that the VMA is not freed in
>>>>>>> our back a reference count is added and 2 services (get_vma() and
>>>>>>> put_vma()) are introduced to handle the reference count. Once a VMA is
>>>>>>> fetched from the RB tree using get_vma(), it must be later freed using
>>>>>>> put_vma(). I can't see anymore the overhead I got while will-it-scale
>>>>>>> benchmark anymore.
>>>>>>>
>>>>>>> The VMA's attributes checked during the speculative page fault processing
>>>>>>> have to be protected against parallel changes. This is done by using a per
>>>>>>> VMA sequence lock. This sequence lock allows the speculative page fault
>>>>>>> handler to fast check for parallel changes in progress and to abort the
>>>>>>> speculative page fault in that case.
>>>>>>>
>>>>>>> Once the VMA has been found, the speculative page fault handler would check
>>>>>>> for the VMA's attributes to verify that the page fault has to be handled
>>>>>>> correctly or not. Thus, the VMA is protected through a sequence lock which
>>>>>>> allows fast detection of concurrent VMA changes. If such a change is
>>>>>>> detected, the speculative page fault is aborted and a *classic* page fault
>>>>>>> is tried.  VMA sequence lockings are added when VMA attributes which are
>>>>>>> checked during the page fault are modified.
>>>>>>>
>>>>>>> When the PTE is fetched, the VMA is checked to see if it has been changed,
>>>>>>> so once the page table is locked, the VMA is valid, so any other changes
>>>>>>> leading to touching this PTE will need to lock the page table, so no
>>>>>>> parallel change is possible at this time.
>>>>>>>
>>>>>>> The locking of the PTE is done with interrupts disabled, this allows
>>>>>>> checking for the PMD to ensure that there is not an ongoing collapsing
>>>>>>> operation. Since khugepaged is firstly set the PMD to pmd_none and then is
>>>>>>> waiting for the other CPU to have caught the IPI interrupt, if the pmd is
>>>>>>> valid at the time the PTE is locked, we have the guarantee that the
>>>>>>> collapsing operation will have to wait on the PTE lock to move forward.
>>>>>>> This allows the SPF handler to map the PTE safely. If the PMD value is
>>>>>>> different from the one recorded at the beginning of the SPF operation, the
>>>>>>> classic page fault handler will be called to handle the operation while
>>>>>>> holding the mmap_sem. As the PTE lock is done with the interrupts disabled,
>>>>>>> the lock is done using spin_trylock() to avoid dead lock when handling a
>>>>>>> page fault while a TLB invalidate is requested by another CPU holding the
>>>>>>> PTE.
>>>>>>>
>>>>>>> In pseudo code, this could be seen as:
>>>>>>>     speculative_page_fault()
>>>>>>>     {
>>>>>>>             vma = get_vma()
>>>>>>>             check vma sequence count
>>>>>>>             check vma's support
>>>>>>>             disable interrupt
>>>>>>>                   check pgd,p4d,...,pte
>>>>>>>                   save pmd and pte in vmf
>>>>>>>                   save vma sequence counter in vmf
>>>>>>>             enable interrupt
>>>>>>>             check vma sequence count
>>>>>>>             handle_pte_fault(vma)
>>>>>>>                     ..
>>>>>>>                     page = alloc_page()
>>>>>>>                     pte_map_lock()
>>>>>>>                             disable interrupt
>>>>>>>                                     abort if sequence counter has changed
>>>>>>>                                     abort if pmd or pte has changed
>>>>>>>                                     pte map and lock
>>>>>>>                             enable interrupt
>>>>>>>                     if abort
>>>>>>>                        free page
>>>>>>>                        abort
>>>>>>>                     ...
>>>>>>>     }
>>>>>>>
>>>>>>>     arch_fault_handler()
>>>>>>>     {
>>>>>>>             if (speculative_page_fault(&vma))
>>>>>>>                goto done
>>>>>>>     again:
>>>>>>>             lock(mmap_sem)
>>>>>>>             vma = find_vma();
>>>>>>>             handle_pte_fault(vma);
>>>>>>>             if retry
>>>>>>>                unlock(mmap_sem)
>>>>>>>                goto again;
>>>>>>>     done:
>>>>>>>             handle fault error
>>>>>>>     }
>>>>>>>
>>>>>>> Support for THP is not done because when checking for the PMD, we can be
>>>>>>> confused by an in progress collapsing operation done by khugepaged. The
>>>>>>> issue is that pmd_none() could be true either if the PMD is not already
>>>>>>> populated or if the underlying PTE are in the way to be collapsed. So we
>>>>>>> cannot safely allocate a PMD if pmd_none() is true.
>>>>>>>
>>>>>>> This series add a new software performance event named 'speculative-faults'
>>>>>>> or 'spf'. It counts the number of successful page fault event handled
>>>>>>> speculatively. When recording 'faults,spf' events, the faults one is
>>>>>>> counting the total number of page fault events while 'spf' is only counting
>>>>>>> the part of the faults processed speculatively.
>>>>>>>
>>>>>>> There are some trace events introduced by this series. They allow
>>>>>>> identifying why the page faults were not processed speculatively. This
>>>>>>> doesn't take in account the faults generated by a monothreaded process
>>>>>>> which directly processed while holding the mmap_sem. This trace events are
>>>>>>> grouped in a system named 'pagefault', they are:
>>>>>>>  - pagefault:spf_vma_changed : if the VMA has been changed in our back
>>>>>>>  - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set.
>>>>>>>  - pagefault:spf_vma_notsup : the VMA's type is not supported
>>>>>>>  - pagefault:spf_vma_access : the VMA's access right are not respected
>>>>>>>  - pagefault:spf_pmd_changed : the upper PMD pointer has changed in our
>>>>>>>    back.
>>>>>>>
>>>>>>> To record all the related events, the easier is to run perf with the
>>>>>>> following arguments :
>>>>>>> $ perf stat -e 'faults,spf,pagefault:*' <command>
>>>>>>>
>>>>>>> There is also a dedicated vmstat counter showing the number of successful
>>>>>>> page fault handled speculatively. I can be seen this way:
>>>>>>> $ grep speculative_pgfault /proc/vmstat
>>>>>>>
>>>>>>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is functional
>>>>>>> on x86, PowerPC and arm64.
>>>>>>>
>>>>>>> ---------------------
>>>>>>> Real Workload results
>>>>>>>
>>>>>>> As mentioned in previous email, we did non official runs using a "popular
>>>>>>> in memory multithreaded database product" on 176 cores SMT8 Power system
>>>>>>> which showed a 30% improvements in the number of transaction processed per
>>>>>>> second. This run has been done on the v6 series, but changes introduced in
>>>>>>> this new version should not impact the performance boost seen.
>>>>>>>
>>>>>>> Here are the perf data captured during 2 of these runs on top of the v8
>>>>>>> series:
>>>>>>>                 vanilla         spf
>>>>>>> faults          89.418          101.364         +13%
>>>>>>> spf                n/a           97.989
>>>>>>>
>>>>>>> With the SPF kernel, most of the page fault were processed in a speculative
>>>>>>> way.
>>>>>>>
>>>>>>> Ganesh Mahendran had backported the series on top of a 4.9 kernel and gave
>>>>>>> it a try on an android device. He reported that the application launch time
>>>>>>> was improved in average by 6%, and for large applications (~100 threads) by
>>>>>>> 20%.
>>>>>>>
>>>>>>> Here are the launch time Ganesh mesured on Android 8.0 on top of a Qcom
>>>>>>> MSM845 (8 cores) with 6GB (the less is better):
>>>>>>>
>>>>>>> Application                             4.9     4.9+spf delta
>>>>>>> com.tencent.mm                          416     389     -7%
>>>>>>> com.eg.android.AlipayGphone             1135    986     -13%
>>>>>>> com.tencent.mtt                         455     454     0%
>>>>>>> com.qqgame.hlddz                        1497    1409    -6%
>>>>>>> com.autonavi.minimap                    711     701     -1%
>>>>>>> com.tencent.tmgp.sgame                  788     748     -5%
>>>>>>> com.immomo.momo                         501     487     -3%
>>>>>>> com.tencent.peng                        2145    2112    -2%
>>>>>>> com.smile.gifmaker                      491     461     -6%
>>>>>>> com.baidu.BaiduMap                      479     366     -23%
>>>>>>> com.taobao.taobao                       1341    1198    -11%
>>>>>>> com.baidu.searchbox                     333     314     -6%
>>>>>>> com.tencent.mobileqq                    394     384     -3%
>>>>>>> com.sina.weibo                          907     906     0%
>>>>>>> com.youku.phone                         816     731     -11%
>>>>>>> com.happyelements.AndroidAnimal.qq      763     717     -6%
>>>>>>> com.UCMobile                            415     411     -1%
>>>>>>> com.tencent.tmgp.ak                     1464    1431    -2%
>>>>>>> com.tencent.qqmusic                     336     329     -2%
>>>>>>> com.sankuai.meituan                     1661    1302    -22%
>>>>>>> com.netease.cloudmusic                  1193    1200    1%
>>>>>>> air.tv.douyu.android                    4257    4152    -2%
>>>>>>>
>>>>>>> ------------------
>>>>>>> Benchmarks results
>>>>>>>
>>>>>>> Base kernel is v4.17.0-rc4-mm1
>>>>>>> SPF is BASE + this series
>>>>>>>
>>>>>>> Kernbench:
>>>>>>> ----------
>>>>>>> Here are the results on a 16 CPUs X86 guest using kernbench on a 4.15
>>>>>>> kernel (kernel is build 5 times):
>>>>>>>
>>>>>>> Average Half load -j 8
>>>>>>>                  Run    (std deviation)
>>>>>>>                  BASE                   SPF
>>>>>>> Elapsed Time     1448.65 (5.72312)      1455.84 (4.84951)       0.50%
>>>>>>> User    Time     10135.4 (30.3699)      10148.8 (31.1252)       0.13%
>>>>>>> System  Time     900.47  (2.81131)      923.28  (7.52779)       2.53%
>>>>>>> Percent CPU      761.4   (1.14018)      760.2   (0.447214)      -0.16%
>>>>>>> Context Switches 85380   (3419.52)      84748   (1904.44)       -0.74%
>>>>>>> Sleeps           105064  (1240.96)      105074  (337.612)       0.01%
>>>>>>>
>>>>>>> Average Optimal load -j 16
>>>>>>>                  Run    (std deviation)
>>>>>>>                  BASE                   SPF
>>>>>>> Elapsed Time     920.528 (10.1212)      927.404 (8.91789)       0.75%
>>>>>>> User    Time     11064.8 (981.142)      11085   (990.897)       0.18%
>>>>>>> System  Time     979.904 (84.0615)      1001.14 (82.5523)       2.17%
>>>>>>> Percent CPU      1089.5  (345.894)      1086.1  (343.545)       -0.31%
>>>>>>> Context Switches 159488  (78156.4)      158223  (77472.1)       -0.79%
>>>>>>> Sleeps           110566  (5877.49)      110388  (5617.75)       -0.16%
>>>>>>>
>>>>>>>
>>>>>>> During a run on the SPF, perf events were captured:
>>>>>>>  Performance counter stats for '../kernbench -M':
>>>>>>>          526743764      faults
>>>>>>>                210      spf
>>>>>>>                  3      pagefault:spf_vma_changed
>>>>>>>                  0      pagefault:spf_vma_noanon
>>>>>>>               2278      pagefault:spf_vma_notsup
>>>>>>>                  0      pagefault:spf_vma_access
>>>>>>>                  0      pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> Very few speculative page faults were recorded as most of the processes
>>>>>>> involved are monothreaded (sounds that on this architecture some threads
>>>>>>> were created during the kernel build processing).
>>>>>>>
>>>>>>> Here are the kerbench results on a 80 CPUs Power8 system:
>>>>>>>
>>>>>>> Average Half load -j 40
>>>>>>>                  Run    (std deviation)
>>>>>>>                  BASE                   SPF
>>>>>>> Elapsed Time     117.152 (0.774642)     117.166 (0.476057)      0.01%
>>>>>>> User    Time     4478.52 (24.7688)      4479.76 (9.08555)       0.03%
>>>>>>> System  Time     131.104 (0.720056)     134.04  (0.708414)      2.24%
>>>>>>> Percent CPU      3934    (19.7104)      3937.2  (19.0184)       0.08%
>>>>>>> Context Switches 92125.4 (576.787)      92581.6 (198.622)       0.50%
>>>>>>> Sleeps           317923  (652.499)      318469  (1255.59)       0.17%
>>>>>>>
>>>>>>> Average Optimal load -j 80
>>>>>>>                  Run    (std deviation)
>>>>>>>                  BASE                   SPF
>>>>>>> Elapsed Time     107.73  (0.632416)     107.31  (0.584936)      -0.39%
>>>>>>> User    Time     5869.86 (1466.72)      5871.71 (1467.27)       0.03%
>>>>>>> System  Time     153.728 (23.8573)      157.153 (24.3704)       2.23%
>>>>>>> Percent CPU      5418.6  (1565.17)      5436.7  (1580.91)       0.33%
>>>>>>> Context Switches 223861  (138865)       225032  (139632)        0.52%
>>>>>>> Sleeps           330529  (13495.1)      332001  (14746.2)       0.45%
>>>>>>>
>>>>>>> During a run on the SPF, perf events were captured:
>>>>>>>  Performance counter stats for '../kernbench -M':
>>>>>>>          116730856      faults
>>>>>>>                  0      spf
>>>>>>>                  3      pagefault:spf_vma_changed
>>>>>>>                  0      pagefault:spf_vma_noanon
>>>>>>>                476      pagefault:spf_vma_notsup
>>>>>>>                  0      pagefault:spf_vma_access
>>>>>>>                  0      pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> Most of the processes involved are monothreaded so SPF is not activated but
>>>>>>> there is no impact on the performance.
>>>>>>>
>>>>>>> Ebizzy:
>>>>>>> -------
>>>>>>> The test is counting the number of records per second it can manage, the
>>>>>>> higher is the best. I run it like this 'ebizzy -mTt <nrcpus>'. To get
>>>>>>> consistent result I repeated the test 100 times and measure the average
>>>>>>> result. The number is the record processes per second, the higher is the
>>>>>>> best.
>>>>>>>
>>>>>>>                 BASE            SPF             delta
>>>>>>> 16 CPUs x86 VM  742.57          1490.24         100.69%
>>>>>>> 80 CPUs P8 node 13105.4         24174.23        84.46%
>>>>>>>
>>>>>>> Here are the performance counter read during a run on a 16 CPUs x86 VM:
>>>>>>>  Performance counter stats for './ebizzy -mTt 16':
>>>>>>>            1706379      faults
>>>>>>>            1674599      spf
>>>>>>>              30588      pagefault:spf_vma_changed
>>>>>>>                  0      pagefault:spf_vma_noanon
>>>>>>>                363      pagefault:spf_vma_notsup
>>>>>>>                  0      pagefault:spf_vma_access
>>>>>>>                  0      pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> And the ones captured during a run on a 80 CPUs Power node:
>>>>>>>  Performance counter stats for './ebizzy -mTt 80':
>>>>>>>            1874773      faults
>>>>>>>            1461153      spf
>>>>>>>             413293      pagefault:spf_vma_changed
>>>>>>>                  0      pagefault:spf_vma_noanon
>>>>>>>                200      pagefault:spf_vma_notsup
>>>>>>>                  0      pagefault:spf_vma_access
>>>>>>>                  0      pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> In ebizzy's case most of the page fault were handled in a speculative way,
>>>>>>> leading the ebizzy performance boost.
>>>>>>>
>>>>>>> ------------------
>>>>>>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572):
>>>>>>>  - Accounted for all review feedbacks from Punit Agrawal, Ganesh Mahendran
>>>>>>>    and Minchan Kim, hopefully.
>>>>>>>  - Remove unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in
>>>>>>>    __do_page_fault().
>>>>>>>  - Loop in pte_spinlock() and pte_map_lock() when pte try lock fails
>>>>>>>    instead
>>>>>>>    of aborting the speculative page fault handling. Dropping the now
>>>>>>> useless
>>>>>>>    trace event pagefault:spf_pte_lock.
>>>>>>>  - No more try to reuse the fetched VMA during the speculative page fault
>>>>>>>    handling when retrying is needed. This adds a lot of complexity and
>>>>>>>    additional tests done didn't show a significant performance improvement.
>>>>>>>  - Convert IS_ENABLED(CONFIG_NUMA) back to #ifdef due to build error.
>>>>>>>
>>>>>>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
>>>>>>> [2] https://patchwork.kernel.org/patch/9999687/
>>>>>>>
>>>>>>>
>>>>>>> Laurent Dufour (20):
>>>>>>>   mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
>>>>>>>   x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>>   powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>>   mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
>>>>>>>   mm: make pte_unmap_same compatible with SPF
>>>>>>>   mm: introduce INIT_VMA()
>>>>>>>   mm: protect VMA modifications using VMA sequence count
>>>>>>>   mm: protect mremap() against SPF hanlder
>>>>>>>   mm: protect SPF handler against anon_vma changes
>>>>>>>   mm: cache some VMA fields in the vm_fault structure
>>>>>>>   mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
>>>>>>>   mm: introduce __lru_cache_add_active_or_unevictable
>>>>>>>   mm: introduce __vm_normal_page()
>>>>>>>   mm: introduce __page_add_new_anon_rmap()
>>>>>>>   mm: protect mm_rb tree with a rwlock
>>>>>>>   mm: adding speculative page fault failure trace events
>>>>>>>   perf: add a speculative page fault sw event
>>>>>>>   perf tools: add support for the SPF perf event
>>>>>>>   mm: add speculative page fault vmstats
>>>>>>>   powerpc/mm: add speculative page fault
>>>>>>>
>>>>>>> Mahendran Ganesh (2):
>>>>>>>   arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>>   arm64/mm: add speculative page fault
>>>>>>>
>>>>>>> Peter Zijlstra (4):
>>>>>>>   mm: prepare for FAULT_FLAG_SPECULATIVE
>>>>>>>   mm: VMA sequence count
>>>>>>>   mm: provide speculative fault infrastructure
>>>>>>>   x86/mm: add speculative pagefault handling
>>>>>>>
>>>>>>>  arch/arm64/Kconfig                    |   1 +
>>>>>>>  arch/arm64/mm/fault.c                 |  12 +
>>>>>>>  arch/powerpc/Kconfig                  |   1 +
>>>>>>>  arch/powerpc/mm/fault.c               |  16 +
>>>>>>>  arch/x86/Kconfig                      |   1 +
>>>>>>>  arch/x86/mm/fault.c                   |  27 +-
>>>>>>>  fs/exec.c                             |   2 +-
>>>>>>>  fs/proc/task_mmu.c                    |   5 +-
>>>>>>>  fs/userfaultfd.c                      |  17 +-
>>>>>>>  include/linux/hugetlb_inline.h        |   2 +-
>>>>>>>  include/linux/migrate.h               |   4 +-
>>>>>>>  include/linux/mm.h                    | 136 +++++++-
>>>>>>>  include/linux/mm_types.h              |   7 +
>>>>>>>  include/linux/pagemap.h               |   4 +-
>>>>>>>  include/linux/rmap.h                  |  12 +-
>>>>>>>  include/linux/swap.h                  |  10 +-
>>>>>>>  include/linux/vm_event_item.h         |   3 +
>>>>>>>  include/trace/events/pagefault.h      |  80 +++++
>>>>>>>  include/uapi/linux/perf_event.h       |   1 +
>>>>>>>  kernel/fork.c                         |   5 +-
>>>>>>>  mm/Kconfig                            |  22 ++
>>>>>>>  mm/huge_memory.c                      |   6 +-
>>>>>>>  mm/hugetlb.c                          |   2 +
>>>>>>>  mm/init-mm.c                          |   3 +
>>>>>>>  mm/internal.h                         |  20 ++
>>>>>>>  mm/khugepaged.c                       |   5 +
>>>>>>>  mm/madvise.c                          |   6 +-
>>>>>>>  mm/memory.c                           | 612 +++++++++++++++++++++++++++++-----
>>>>>>>  mm/mempolicy.c                        |  51 ++-
>>>>>>>  mm/migrate.c                          |   6 +-
>>>>>>>  mm/mlock.c                            |  13 +-
>>>>>>>  mm/mmap.c                             | 229 ++++++++++---
>>>>>>>  mm/mprotect.c                         |   4 +-
>>>>>>>  mm/mremap.c                           |  13 +
>>>>>>>  mm/nommu.c                            |   2 +-
>>>>>>>  mm/rmap.c                             |   5 +-
>>>>>>>  mm/swap.c                             |   6 +-
>>>>>>>  mm/swap_state.c                       |   8 +-
>>>>>>>  mm/vmstat.c                           |   5 +-
>>>>>>>  tools/include/uapi/linux/perf_event.h |   1 +
>>>>>>>  tools/perf/util/evsel.c               |   1 +
>>>>>>>  tools/perf/util/parse-events.c        |   4 +
>>>>>>>  tools/perf/util/parse-events.l        |   1 +
>>>>>>>  tools/perf/util/python.c              |   1 +
>>>>>>>  44 files changed, 1161 insertions(+), 211 deletions(-)
>>>>>>>  create mode 100644 include/trace/events/pagefault.h
>>>>>>>
>>>>>>> --
>>>>>>> 2.7.4
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>


[-- Attachment #2: perf-profile_page_fault2_base_thp_always.gz --]
[-- Type: application/gzip, Size: 12167 bytes --]

[-- Attachment #3: perf-profile_page_fault2_base_thp_never.gz --]
[-- Type: application/gzip, Size: 11543 bytes --]

[-- Attachment #4: perf-profile_page_fault2_head_thp_always.gz --]
[-- Type: application/gzip, Size: 12019 bytes --]

[-- Attachment #5: perf-profile_page_fault3_base_thp_always.gz --]
[-- Type: application/gzip, Size: 12701 bytes --]

[-- Attachment #6: perf-profile_page_fault3_base_thp_always.gz --]
[-- Type: application/gzip, Size: 12701 bytes --]

[-- Attachment #7: perf-profile_page_fault3_base_thp_never.gz --]
[-- Type: application/gzip, Size: 12699 bytes --]

^ permalink raw reply

* RE: [PATCH v11 00/26] Speculative page faults
From: Song, HaiyanX @ 2018-08-03  6:45 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm@linux-foundation.org, mhocko@kernel.org,
	peterz@infradead.org, kirill@shutemov.name, ak@linux.intel.com,
	dave@stgolabs.net, jack@suse.cz, Matthew Wilcox,
	khandual@linux.vnet.ibm.com, aneesh.kumar@linux.vnet.ibm.com,
	benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org,
	Thomas Gleixner, Ingo Molnar, hpa@zytor.com, Will Deacon,
	Sergey Senozhatsky, sergey.senozhatsky.work@gmail.com,
	Andrea Arcangeli, Alexei Starovoitov, Wang, Kemi, Daniel Jordan,
	David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
	Punit Agrawal, vinayak menon, Yang Shi,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	haren@linux.vnet.ibm.com, npiggin@gmail.com,
	bsingharora@gmail.com, paulmck@linux.vnet.ibm.com, Tim Chen,
	linuxppc-dev@lists.ozlabs.org, x86@kernel.org
In-Reply-To: <9FE19350E8A7EE45B64D8D63D368C8966B876B4B@SHSMSX101.ccr.corp.intel.com>

[-- Attachment #1: Type: text/plain, Size: 43157 bytes --]

Add another 3 perf file.
________________________________________
From: Song, HaiyanX
Sent: Friday, August 03, 2018 2:36 PM
To: Laurent Dufour
Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
Subject: RE: [PATCH v11 00/26] Speculative page faults

Hi Laurent,

Thanks for your analysis for the last perf results.
Your mentioned ," the major differences at the head of the perf report is the 92% testcase which is weirdly not reported
on the head side", which is a bug of 0-day,and it caused the item is not counted in perf.

I've triggered the test page_fault2 and page_fault3 again only with thread mode of will-it-scale on 0-day (on the same test box,every case tested 3 times).
I checked the perf report have no above mentioned problem.

I have compared them, found some items have difference, such as below case:
       page_fault2-thp-always: handle_mm_fault, base: 45.22%    head: 29.41%
       page_fault3-thp-always: handle_mm_fault, base: 22.95%    head: 14.15%

So i attached the perf result in mail again, could your have a look again for checking the difference between base and head commit.


Thanks,
Haiyan, Song

________________________________________
From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
Sent: Tuesday, July 17, 2018 5:36 PM
To: Song, HaiyanX
Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
Subject: Re: [PATCH v11 00/26] Speculative page faults

On 13/07/2018 05:56, Song, HaiyanX wrote:
> Hi Laurent,

Hi Haiyan,

Thanks a lot for sharing this perf reports.

I looked at them closely, and I've to admit that I was not able to found a
major difference between the base and the head report, except that
handle_pte_fault() is no more in-lined in the head one.

As expected, __handle_speculative_fault() is never traced since these tests are
dealing with file mapping, not handled in the speculative way.

When running these test did you seen a major differences in the test's result
between base and head ?

>From the number of cycles counted, the biggest difference is page_fault3 when
run with the THP enabled:
                                BASE            HEAD            Delta
page_fault2_base_thp_never      1142252426747   1065866197589   -6.69%
page_fault2_base_THP-Alwasys    1124844374523   1076312228927   -4.31%
page_fault3_base_thp_never      1099387298152   1134118402345   3.16%
page_fault3_base_THP-Always     1059370178101   853985561949    -19.39%


The very weird thing is the difference of the delta cycles reported between
thp never and thp always, because the speculative way is aborted when checking
for the vma->ops field, which is the same in both case, and the thp is never
checked. So there is no code covering differnce, on the speculative path,
between these 2 cases. This leads me to think that there are other interactions
interfering in the measure.

Looking at the perf-profile_page_fault3_*_THP-Always, the major differences at
the head of the perf report is the 92% testcase which is weirdly not reported
on the head side :
    92.02%    22.33%  page_fault3_processes  [.] testcase
92.02% testcase

Then the base reported 37.67% for __do_page_fault() where the head reported
48.41%, but the only difference in this function, between base and head, is the
call to handle_speculative_fault(). But this is a macro checking for the fault
flags, and mm->users and then calling __handle_speculative_fault() if needed.
So this can't explain this difference, except if __handle_speculative_fault()
is inlined in __do_page_fault().
Is this the case on your build ?

Haiyan, do you still have the output of the test to check those numbers too ?

Cheers,
Laurent

> I attached the perf-profile.gz file for case page_fault2 and page_fault3. These files were captured during test the related test case.
> Please help to check on these data if it can help you to find the higher change. Thanks.
>
> File name perf-profile_page_fault2_head_THP-Always.gz, means the perf-profile result get from page_fault2
>     tested for head commit (a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12) with THP_always configuration.
>
> Best regards,
> Haiyan Song
>
> ________________________________________
> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
> Sent: Thursday, July 12, 2018 1:05 AM
> To: Song, HaiyanX
> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
> Subject: Re: [PATCH v11 00/26] Speculative page faults
>
> Hi Haiyan,
>
> Do you get a chance to capture some performance cycles on your system ?
> I still can't get these numbers on my hardware.
>
> Thanks,
> Laurent.
>
> On 04/07/2018 09:51, Laurent Dufour wrote:
>> On 04/07/2018 05:23, Song, HaiyanX wrote:
>>> Hi Laurent,
>>>
>>>
>>> For the test result on Intel 4s skylake platform (192 CPUs, 768G Memory), the below test cases all were run 3 times.
>>> I check the test results, only page_fault3_thread/enable THP have 6% stddev for head commit, other tests have lower stddev.
>>
>> Repeating the test only 3 times seems a bit too low to me.
>>
>> I'll focus on the higher change for the moment, but I don't have access to such
>> a hardware.
>>
>> Is possible to provide a diff between base and SPF of the performance cycles
>> measured when running page_fault3 and page_fault2 when the 20% change is detected.
>>
>> Please stay focus on the test case process to see exactly where the series is
>> impacting.
>>
>> Thanks,
>> Laurent.
>>
>>>
>>> And I did not find other high variation on test case result.
>>>
>>> a). Enable THP
>>> testcase                          base     stddev       change      head     stddev         metric
>>> page_fault3/enable THP           10519      ± 3%        -20.5%      8368      ±6%          will-it-scale.per_thread_ops
>>> page_fault2/enalbe THP            8281      ± 2%        -18.8%      6728                   will-it-scale.per_thread_ops
>>> brk1/eanble THP                 998475                   -2.2%    976893                   will-it-scale.per_process_ops
>>> context_switch1/enable THP      223910                   -1.3%    220930                   will-it-scale.per_process_ops
>>> context_switch1/enable THP      233722                   -1.0%    231288                   will-it-scale.per_thread_ops
>>>
>>> b). Disable THP
>>> page_fault3/disable THP          10856                  -23.1%      8344                   will-it-scale.per_thread_ops
>>> page_fault2/disable THP           8147                  -18.8%      6613                   will-it-scale.per_thread_ops
>>> brk1/disable THP                   957                    -7.9%      881                   will-it-scale.per_thread_ops
>>> context_switch1/disable THP     237006                    -2.2%    231907                  will-it-scale.per_thread_ops
>>> brk1/disable THP                997317                    -2.0%    977778                  will-it-scale.per_process_ops
>>> page_fault3/disable THP         467454                    -1.8%    459251                  will-it-scale.per_process_ops
>>> context_switch1/disable THP     224431                    -1.3%    221567                  will-it-scale.per_process_ops
>>>
>>>
>>> Best regards,
>>> Haiyan Song
>>> ________________________________________
>>> From: Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>> Sent: Monday, July 02, 2018 4:59 PM
>>> To: Song, HaiyanX
>>> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>> Subject: Re: [PATCH v11 00/26] Speculative page faults
>>>
>>> On 11/06/2018 09:49, Song, HaiyanX wrote:
>>>> Hi Laurent,
>>>>
>>>> Regression test for v11 patch serials have been run, some regression is found by LKP-tools (linux kernel performance)
>>>> tested on Intel 4s skylake platform. This time only test the cases which have been run and found regressions on
>>>> V9 patch serials.
>>>>
>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>>>> branch: Laurent-Dufour/Speculative-page-faults/20180520-045126
>>>> commit id:
>>>>   head commit : a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12
>>>>   base commit : ba98a1cdad71d259a194461b3a61471b49b14df1
>>>> Benchmark: will-it-scale
>>>> Download link: https://github.com/antonblanchard/will-it-scale/tree/master
>>>>
>>>> Metrics:
>>>>   will-it-scale.per_process_ops=processes/nr_cpu
>>>>   will-it-scale.per_thread_ops=threads/nr_cpu
>>>>   test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>>>> THP: enable / disable
>>>> nr_task:100%
>>>>
>>>> 1. Regressions:
>>>>
>>>> a). Enable THP
>>>> testcase                          base           change      head           metric
>>>> page_fault3/enable THP           10519          -20.5%        836      will-it-scale.per_thread_ops
>>>> page_fault2/enalbe THP            8281          -18.8%       6728      will-it-scale.per_thread_ops
>>>> brk1/eanble THP                 998475           -2.2%     976893      will-it-scale.per_process_ops
>>>> context_switch1/enable THP      223910           -1.3%     220930      will-it-scale.per_process_ops
>>>> context_switch1/enable THP      233722           -1.0%     231288      will-it-scale.per_thread_ops
>>>>
>>>> b). Disable THP
>>>> page_fault3/disable THP          10856          -23.1%       8344      will-it-scale.per_thread_ops
>>>> page_fault2/disable THP           8147          -18.8%       6613      will-it-scale.per_thread_ops
>>>> brk1/disable THP                   957           -7.9%        881      will-it-scale.per_thread_ops
>>>> context_switch1/disable THP     237006           -2.2%     231907      will-it-scale.per_thread_ops
>>>> brk1/disable THP                997317           -2.0%     977778      will-it-scale.per_process_ops
>>>> page_fault3/disable THP         467454           -1.8%     459251      will-it-scale.per_process_ops
>>>> context_switch1/disable THP     224431           -1.3%     221567      will-it-scale.per_process_ops
>>>>
>>>> Notes: for the above  values of test result, the higher is better.
>>>
>>> I tried the same tests on my PowerPC victim VM (1024 CPUs, 11TB) and I can't
>>> get reproducible results. The results have huge variation, even on the vanilla
>>> kernel, and I can't state on any changes due to that.
>>>
>>> I tried on smaller node (80 CPUs, 32G), and the tests ran better, but I didn't
>>> measure any changes between the vanilla and the SPF patched ones:
>>>
>>> test THP enabled                4.17.0-rc4-mm1  spf             delta
>>> page_fault3_threads             2697.7          2683.5          -0.53%
>>> page_fault2_threads             170660.6        169574.1        -0.64%
>>> context_switch1_threads         6915269.2       6877507.3       -0.55%
>>> context_switch1_processes       6478076.2       6529493.5       0.79%
>>> brk1                            243391.2        238527.5        -2.00%
>>>
>>> Tests were run 10 times, no high variation detected.
>>>
>>> Did you see high variation on your side ? How many times the test were run to
>>> compute the average values ?
>>>
>>> Thanks,
>>> Laurent.
>>>
>>>
>>>>
>>>> 2. Improvement: not found improvement based on the selected test cases.
>>>>
>>>>
>>>> Best regards
>>>> Haiyan Song
>>>> ________________________________________
>>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>>> Sent: Monday, May 28, 2018 4:54 PM
>>>> To: Song, HaiyanX
>>>> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>>> Subject: Re: [PATCH v11 00/26] Speculative page faults
>>>>
>>>> On 28/05/2018 10:22, Haiyan Song wrote:
>>>>> Hi Laurent,
>>>>>
>>>>> Yes, these tests are done on V9 patch.
>>>>
>>>> Do you plan to give this V11 a run ?
>>>>
>>>>>
>>>>>
>>>>> Best regards,
>>>>> Haiyan Song
>>>>>
>>>>> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote:
>>>>>> On 28/05/2018 07:23, Song, HaiyanX wrote:
>>>>>>>
>>>>>>> Some regression and improvements is found by LKP-tools(linux kernel performance) on V9 patch series
>>>>>>> tested on Intel 4s Skylake platform.
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks for reporting this benchmark results, but you mentioned the "V9 patch
>>>>>> series" while responding to the v11 header series...
>>>>>> Were these tests done on v9 or v11 ?
>>>>>>
>>>>>> Cheers,
>>>>>> Laurent.
>>>>>>
>>>>>>>
>>>>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>>>>>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series)
>>>>>>> Commit id:
>>>>>>>     base commit: d55f34411b1b126429a823d06c3124c16283231f
>>>>>>>     head commit: 0355322b3577eeab7669066df42c550a56801110
>>>>>>> Benchmark suite: will-it-scale
>>>>>>> Download link:
>>>>>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests
>>>>>>> Metrics:
>>>>>>>     will-it-scale.per_process_ops=processes/nr_cpu
>>>>>>>     will-it-scale.per_thread_ops=threads/nr_cpu
>>>>>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>>>>>>> THP: enable / disable
>>>>>>> nr_task: 100%
>>>>>>>
>>>>>>> 1. Regressions:
>>>>>>> a) THP enabled:
>>>>>>> testcase                        base            change          head       metric
>>>>>>> page_fault3/ enable THP         10092           -17.5%          8323       will-it-scale.per_thread_ops
>>>>>>> page_fault2/ enable THP          8300           -17.2%          6869       will-it-scale.per_thread_ops
>>>>>>> brk1/ enable THP                  957.67         -7.6%           885       will-it-scale.per_thread_ops
>>>>>>> page_fault3/ enable THP        172821            -5.3%        163692       will-it-scale.per_process_ops
>>>>>>> signal1/ enable THP              9125            -3.2%          8834       will-it-scale.per_process_ops
>>>>>>>
>>>>>>> b) THP disabled:
>>>>>>> testcase                        base            change          head       metric
>>>>>>> page_fault3/ disable THP        10107           -19.1%          8180       will-it-scale.per_thread_ops
>>>>>>> page_fault2/ disable THP         8432           -17.8%          6931       will-it-scale.per_thread_ops
>>>>>>> context_switch1/ disable THP   215389            -6.8%        200776       will-it-scale.per_thread_ops
>>>>>>> brk1/ disable THP                 939.67         -6.6%           877.33    will-it-scale.per_thread_ops
>>>>>>> page_fault3/ disable THP       173145            -4.7%        165064       will-it-scale.per_process_ops
>>>>>>> signal1/ disable THP             9162            -3.9%          8802       will-it-scale.per_process_ops
>>>>>>>
>>>>>>> 2. Improvements:
>>>>>>> a) THP enabled:
>>>>>>> testcase                        base            change          head       metric
>>>>>>> malloc1/ enable THP               66.33        +469.8%           383.67    will-it-scale.per_thread_ops
>>>>>>> writeseek3/ enable THP          2531             +4.5%          2646       will-it-scale.per_thread_ops
>>>>>>> signal1/ enable THP              989.33          +2.8%          1016       will-it-scale.per_thread_ops
>>>>>>>
>>>>>>> b) THP disabled:
>>>>>>> testcase                        base            change          head       metric
>>>>>>> malloc1/ disable THP              90.33        +417.3%           467.33    will-it-scale.per_thread_ops
>>>>>>> read2/ disable THP             58934            +39.2%         82060       will-it-scale.per_thread_ops
>>>>>>> page_fault1/ disable THP        8607            +36.4%         11736       will-it-scale.per_thread_ops
>>>>>>> read1/ disable THP            314063            +12.7%        353934       will-it-scale.per_thread_ops
>>>>>>> writeseek3/ disable THP         2452            +12.5%          2759       will-it-scale.per_thread_ops
>>>>>>> signal1/ disable THP             971.33          +5.5%          1024       will-it-scale.per_thread_ops
>>>>>>>
>>>>>>> Notes: for above values in column "change", the higher value means that the related testcase result
>>>>>>> on head commit is better than that on base commit for this benchmark.
>>>>>>>
>>>>>>>
>>>>>>> Best regards
>>>>>>> Haiyan Song
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>>>>>> Sent: Thursday, May 17, 2018 7:06 PM
>>>>>>> To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi
>>>>>>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>>>>>> Subject: [PATCH v11 00/26] Speculative page faults
>>>>>>>
>>>>>>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle
>>>>>>> page fault without holding the mm semaphore [1].
>>>>>>>
>>>>>>> The idea is to try to handle user space page faults without holding the
>>>>>>> mmap_sem. This should allow better concurrency for massively threaded
>>>>>>> process since the page fault handler will not wait for other threads memory
>>>>>>> layout change to be done, assuming that this change is done in another part
>>>>>>> of the process's memory space. This type page fault is named speculative
>>>>>>> page fault. If the speculative page fault fails because of a concurrency is
>>>>>>> detected or because underlying PMD or PTE tables are not yet allocating, it
>>>>>>> is failing its processing and a classic page fault is then tried.
>>>>>>>
>>>>>>> The speculative page fault (SPF) has to look for the VMA matching the fault
>>>>>>> address without holding the mmap_sem, this is done by introducing a rwlock
>>>>>>> which protects the access to the mm_rb tree. Previously this was done using
>>>>>>> SRCU but it was introducing a lot of scheduling to process the VMA's
>>>>>>> freeing operation which was hitting the performance by 20% as reported by
>>>>>>> Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree is
>>>>>>> limiting the locking contention to these operations which are expected to
>>>>>>> be in a O(log n) order. In addition to ensure that the VMA is not freed in
>>>>>>> our back a reference count is added and 2 services (get_vma() and
>>>>>>> put_vma()) are introduced to handle the reference count. Once a VMA is
>>>>>>> fetched from the RB tree using get_vma(), it must be later freed using
>>>>>>> put_vma(). I can't see anymore the overhead I got while will-it-scale
>>>>>>> benchmark anymore.
>>>>>>>
>>>>>>> The VMA's attributes checked during the speculative page fault processing
>>>>>>> have to be protected against parallel changes. This is done by using a per
>>>>>>> VMA sequence lock. This sequence lock allows the speculative page fault
>>>>>>> handler to fast check for parallel changes in progress and to abort the
>>>>>>> speculative page fault in that case.
>>>>>>>
>>>>>>> Once the VMA has been found, the speculative page fault handler would check
>>>>>>> for the VMA's attributes to verify that the page fault has to be handled
>>>>>>> correctly or not. Thus, the VMA is protected through a sequence lock which
>>>>>>> allows fast detection of concurrent VMA changes. If such a change is
>>>>>>> detected, the speculative page fault is aborted and a *classic* page fault
>>>>>>> is tried.  VMA sequence lockings are added when VMA attributes which are
>>>>>>> checked during the page fault are modified.
>>>>>>>
>>>>>>> When the PTE is fetched, the VMA is checked to see if it has been changed,
>>>>>>> so once the page table is locked, the VMA is valid, so any other changes
>>>>>>> leading to touching this PTE will need to lock the page table, so no
>>>>>>> parallel change is possible at this time.
>>>>>>>
>>>>>>> The locking of the PTE is done with interrupts disabled, this allows
>>>>>>> checking for the PMD to ensure that there is not an ongoing collapsing
>>>>>>> operation. Since khugepaged is firstly set the PMD to pmd_none and then is
>>>>>>> waiting for the other CPU to have caught the IPI interrupt, if the pmd is
>>>>>>> valid at the time the PTE is locked, we have the guarantee that the
>>>>>>> collapsing operation will have to wait on the PTE lock to move forward.
>>>>>>> This allows the SPF handler to map the PTE safely. If the PMD value is
>>>>>>> different from the one recorded at the beginning of the SPF operation, the
>>>>>>> classic page fault handler will be called to handle the operation while
>>>>>>> holding the mmap_sem. As the PTE lock is done with the interrupts disabled,
>>>>>>> the lock is done using spin_trylock() to avoid dead lock when handling a
>>>>>>> page fault while a TLB invalidate is requested by another CPU holding the
>>>>>>> PTE.
>>>>>>>
>>>>>>> In pseudo code, this could be seen as:
>>>>>>>     speculative_page_fault()
>>>>>>>     {
>>>>>>>             vma = get_vma()
>>>>>>>             check vma sequence count
>>>>>>>             check vma's support
>>>>>>>             disable interrupt
>>>>>>>                   check pgd,p4d,...,pte
>>>>>>>                   save pmd and pte in vmf
>>>>>>>                   save vma sequence counter in vmf
>>>>>>>             enable interrupt
>>>>>>>             check vma sequence count
>>>>>>>             handle_pte_fault(vma)
>>>>>>>                     ..
>>>>>>>                     page = alloc_page()
>>>>>>>                     pte_map_lock()
>>>>>>>                             disable interrupt
>>>>>>>                                     abort if sequence counter has changed
>>>>>>>                                     abort if pmd or pte has changed
>>>>>>>                                     pte map and lock
>>>>>>>                             enable interrupt
>>>>>>>                     if abort
>>>>>>>                        free page
>>>>>>>                        abort
>>>>>>>                     ...
>>>>>>>     }
>>>>>>>
>>>>>>>     arch_fault_handler()
>>>>>>>     {
>>>>>>>             if (speculative_page_fault(&vma))
>>>>>>>                goto done
>>>>>>>     again:
>>>>>>>             lock(mmap_sem)
>>>>>>>             vma = find_vma();
>>>>>>>             handle_pte_fault(vma);
>>>>>>>             if retry
>>>>>>>                unlock(mmap_sem)
>>>>>>>                goto again;
>>>>>>>     done:
>>>>>>>             handle fault error
>>>>>>>     }
>>>>>>>
>>>>>>> Support for THP is not done because when checking for the PMD, we can be
>>>>>>> confused by an in progress collapsing operation done by khugepaged. The
>>>>>>> issue is that pmd_none() could be true either if the PMD is not already
>>>>>>> populated or if the underlying PTE are in the way to be collapsed. So we
>>>>>>> cannot safely allocate a PMD if pmd_none() is true.
>>>>>>>
>>>>>>> This series add a new software performance event named 'speculative-faults'
>>>>>>> or 'spf'. It counts the number of successful page fault event handled
>>>>>>> speculatively. When recording 'faults,spf' events, the faults one is
>>>>>>> counting the total number of page fault events while 'spf' is only counting
>>>>>>> the part of the faults processed speculatively.
>>>>>>>
>>>>>>> There are some trace events introduced by this series. They allow
>>>>>>> identifying why the page faults were not processed speculatively. This
>>>>>>> doesn't take in account the faults generated by a monothreaded process
>>>>>>> which directly processed while holding the mmap_sem. This trace events are
>>>>>>> grouped in a system named 'pagefault', they are:
>>>>>>>  - pagefault:spf_vma_changed : if the VMA has been changed in our back
>>>>>>>  - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set.
>>>>>>>  - pagefault:spf_vma_notsup : the VMA's type is not supported
>>>>>>>  - pagefault:spf_vma_access : the VMA's access right are not respected
>>>>>>>  - pagefault:spf_pmd_changed : the upper PMD pointer has changed in our
>>>>>>>    back.
>>>>>>>
>>>>>>> To record all the related events, the easier is to run perf with the
>>>>>>> following arguments :
>>>>>>> $ perf stat -e 'faults,spf,pagefault:*' <command>
>>>>>>>
>>>>>>> There is also a dedicated vmstat counter showing the number of successful
>>>>>>> page fault handled speculatively. I can be seen this way:
>>>>>>> $ grep speculative_pgfault /proc/vmstat
>>>>>>>
>>>>>>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is functional
>>>>>>> on x86, PowerPC and arm64.
>>>>>>>
>>>>>>> ---------------------
>>>>>>> Real Workload results
>>>>>>>
>>>>>>> As mentioned in previous email, we did non official runs using a "popular
>>>>>>> in memory multithreaded database product" on 176 cores SMT8 Power system
>>>>>>> which showed a 30% improvements in the number of transaction processed per
>>>>>>> second. This run has been done on the v6 series, but changes introduced in
>>>>>>> this new version should not impact the performance boost seen.
>>>>>>>
>>>>>>> Here are the perf data captured during 2 of these runs on top of the v8
>>>>>>> series:
>>>>>>>                 vanilla         spf
>>>>>>> faults          89.418          101.364         +13%
>>>>>>> spf                n/a           97.989
>>>>>>>
>>>>>>> With the SPF kernel, most of the page fault were processed in a speculative
>>>>>>> way.
>>>>>>>
>>>>>>> Ganesh Mahendran had backported the series on top of a 4.9 kernel and gave
>>>>>>> it a try on an android device. He reported that the application launch time
>>>>>>> was improved in average by 6%, and for large applications (~100 threads) by
>>>>>>> 20%.
>>>>>>>
>>>>>>> Here are the launch time Ganesh mesured on Android 8.0 on top of a Qcom
>>>>>>> MSM845 (8 cores) with 6GB (the less is better):
>>>>>>>
>>>>>>> Application                             4.9     4.9+spf delta
>>>>>>> com.tencent.mm                          416     389     -7%
>>>>>>> com.eg.android.AlipayGphone             1135    986     -13%
>>>>>>> com.tencent.mtt                         455     454     0%
>>>>>>> com.qqgame.hlddz                        1497    1409    -6%
>>>>>>> com.autonavi.minimap                    711     701     -1%
>>>>>>> com.tencent.tmgp.sgame                  788     748     -5%
>>>>>>> com.immomo.momo                         501     487     -3%
>>>>>>> com.tencent.peng                        2145    2112    -2%
>>>>>>> com.smile.gifmaker                      491     461     -6%
>>>>>>> com.baidu.BaiduMap                      479     366     -23%
>>>>>>> com.taobao.taobao                       1341    1198    -11%
>>>>>>> com.baidu.searchbox                     333     314     -6%
>>>>>>> com.tencent.mobileqq                    394     384     -3%
>>>>>>> com.sina.weibo                          907     906     0%
>>>>>>> com.youku.phone                         816     731     -11%
>>>>>>> com.happyelements.AndroidAnimal.qq      763     717     -6%
>>>>>>> com.UCMobile                            415     411     -1%
>>>>>>> com.tencent.tmgp.ak                     1464    1431    -2%
>>>>>>> com.tencent.qqmusic                     336     329     -2%
>>>>>>> com.sankuai.meituan                     1661    1302    -22%
>>>>>>> com.netease.cloudmusic                  1193    1200    1%
>>>>>>> air.tv.douyu.android                    4257    4152    -2%
>>>>>>>
>>>>>>> ------------------
>>>>>>> Benchmarks results
>>>>>>>
>>>>>>> Base kernel is v4.17.0-rc4-mm1
>>>>>>> SPF is BASE + this series
>>>>>>>
>>>>>>> Kernbench:
>>>>>>> ----------
>>>>>>> Here are the results on a 16 CPUs X86 guest using kernbench on a 4.15
>>>>>>> kernel (kernel is build 5 times):
>>>>>>>
>>>>>>> Average Half load -j 8
>>>>>>>                  Run    (std deviation)
>>>>>>>                  BASE                   SPF
>>>>>>> Elapsed Time     1448.65 (5.72312)      1455.84 (4.84951)       0.50%
>>>>>>> User    Time     10135.4 (30.3699)      10148.8 (31.1252)       0.13%
>>>>>>> System  Time     900.47  (2.81131)      923.28  (7.52779)       2.53%
>>>>>>> Percent CPU      761.4   (1.14018)      760.2   (0.447214)      -0.16%
>>>>>>> Context Switches 85380   (3419.52)      84748   (1904.44)       -0.74%
>>>>>>> Sleeps           105064  (1240.96)      105074  (337.612)       0.01%
>>>>>>>
>>>>>>> Average Optimal load -j 16
>>>>>>>                  Run    (std deviation)
>>>>>>>                  BASE                   SPF
>>>>>>> Elapsed Time     920.528 (10.1212)      927.404 (8.91789)       0.75%
>>>>>>> User    Time     11064.8 (981.142)      11085   (990.897)       0.18%
>>>>>>> System  Time     979.904 (84.0615)      1001.14 (82.5523)       2.17%
>>>>>>> Percent CPU      1089.5  (345.894)      1086.1  (343.545)       -0.31%
>>>>>>> Context Switches 159488  (78156.4)      158223  (77472.1)       -0.79%
>>>>>>> Sleeps           110566  (5877.49)      110388  (5617.75)       -0.16%
>>>>>>>
>>>>>>>
>>>>>>> During a run on the SPF, perf events were captured:
>>>>>>>  Performance counter stats for '../kernbench -M':
>>>>>>>          526743764      faults
>>>>>>>                210      spf
>>>>>>>                  3      pagefault:spf_vma_changed
>>>>>>>                  0      pagefault:spf_vma_noanon
>>>>>>>               2278      pagefault:spf_vma_notsup
>>>>>>>                  0      pagefault:spf_vma_access
>>>>>>>                  0      pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> Very few speculative page faults were recorded as most of the processes
>>>>>>> involved are monothreaded (sounds that on this architecture some threads
>>>>>>> were created during the kernel build processing).
>>>>>>>
>>>>>>> Here are the kerbench results on a 80 CPUs Power8 system:
>>>>>>>
>>>>>>> Average Half load -j 40
>>>>>>>                  Run    (std deviation)
>>>>>>>                  BASE                   SPF
>>>>>>> Elapsed Time     117.152 (0.774642)     117.166 (0.476057)      0.01%
>>>>>>> User    Time     4478.52 (24.7688)      4479.76 (9.08555)       0.03%
>>>>>>> System  Time     131.104 (0.720056)     134.04  (0.708414)      2.24%
>>>>>>> Percent CPU      3934    (19.7104)      3937.2  (19.0184)       0.08%
>>>>>>> Context Switches 92125.4 (576.787)      92581.6 (198.622)       0.50%
>>>>>>> Sleeps           317923  (652.499)      318469  (1255.59)       0.17%
>>>>>>>
>>>>>>> Average Optimal load -j 80
>>>>>>>                  Run    (std deviation)
>>>>>>>                  BASE                   SPF
>>>>>>> Elapsed Time     107.73  (0.632416)     107.31  (0.584936)      -0.39%
>>>>>>> User    Time     5869.86 (1466.72)      5871.71 (1467.27)       0.03%
>>>>>>> System  Time     153.728 (23.8573)      157.153 (24.3704)       2.23%
>>>>>>> Percent CPU      5418.6  (1565.17)      5436.7  (1580.91)       0.33%
>>>>>>> Context Switches 223861  (138865)       225032  (139632)        0.52%
>>>>>>> Sleeps           330529  (13495.1)      332001  (14746.2)       0.45%
>>>>>>>
>>>>>>> During a run on the SPF, perf events were captured:
>>>>>>>  Performance counter stats for '../kernbench -M':
>>>>>>>          116730856      faults
>>>>>>>                  0      spf
>>>>>>>                  3      pagefault:spf_vma_changed
>>>>>>>                  0      pagefault:spf_vma_noanon
>>>>>>>                476      pagefault:spf_vma_notsup
>>>>>>>                  0      pagefault:spf_vma_access
>>>>>>>                  0      pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> Most of the processes involved are monothreaded so SPF is not activated but
>>>>>>> there is no impact on the performance.
>>>>>>>
>>>>>>> Ebizzy:
>>>>>>> -------
>>>>>>> The test is counting the number of records per second it can manage, the
>>>>>>> higher is the best. I run it like this 'ebizzy -mTt <nrcpus>'. To get
>>>>>>> consistent result I repeated the test 100 times and measure the average
>>>>>>> result. The number is the record processes per second, the higher is the
>>>>>>> best.
>>>>>>>
>>>>>>>                 BASE            SPF             delta
>>>>>>> 16 CPUs x86 VM  742.57          1490.24         100.69%
>>>>>>> 80 CPUs P8 node 13105.4         24174.23        84.46%
>>>>>>>
>>>>>>> Here are the performance counter read during a run on a 16 CPUs x86 VM:
>>>>>>>  Performance counter stats for './ebizzy -mTt 16':
>>>>>>>            1706379      faults
>>>>>>>            1674599      spf
>>>>>>>              30588      pagefault:spf_vma_changed
>>>>>>>                  0      pagefault:spf_vma_noanon
>>>>>>>                363      pagefault:spf_vma_notsup
>>>>>>>                  0      pagefault:spf_vma_access
>>>>>>>                  0      pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> And the ones captured during a run on a 80 CPUs Power node:
>>>>>>>  Performance counter stats for './ebizzy -mTt 80':
>>>>>>>            1874773      faults
>>>>>>>            1461153      spf
>>>>>>>             413293      pagefault:spf_vma_changed
>>>>>>>                  0      pagefault:spf_vma_noanon
>>>>>>>                200      pagefault:spf_vma_notsup
>>>>>>>                  0      pagefault:spf_vma_access
>>>>>>>                  0      pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> In ebizzy's case most of the page fault were handled in a speculative way,
>>>>>>> leading the ebizzy performance boost.
>>>>>>>
>>>>>>> ------------------
>>>>>>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572):
>>>>>>>  - Accounted for all review feedbacks from Punit Agrawal, Ganesh Mahendran
>>>>>>>    and Minchan Kim, hopefully.
>>>>>>>  - Remove unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in
>>>>>>>    __do_page_fault().
>>>>>>>  - Loop in pte_spinlock() and pte_map_lock() when pte try lock fails
>>>>>>>    instead
>>>>>>>    of aborting the speculative page fault handling. Dropping the now
>>>>>>> useless
>>>>>>>    trace event pagefault:spf_pte_lock.
>>>>>>>  - No more try to reuse the fetched VMA during the speculative page fault
>>>>>>>    handling when retrying is needed. This adds a lot of complexity and
>>>>>>>    additional tests done didn't show a significant performance improvement.
>>>>>>>  - Convert IS_ENABLED(CONFIG_NUMA) back to #ifdef due to build error.
>>>>>>>
>>>>>>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
>>>>>>> [2] https://patchwork.kernel.org/patch/9999687/
>>>>>>>
>>>>>>>
>>>>>>> Laurent Dufour (20):
>>>>>>>   mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
>>>>>>>   x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>>   powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>>   mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
>>>>>>>   mm: make pte_unmap_same compatible with SPF
>>>>>>>   mm: introduce INIT_VMA()
>>>>>>>   mm: protect VMA modifications using VMA sequence count
>>>>>>>   mm: protect mremap() against SPF hanlder
>>>>>>>   mm: protect SPF handler against anon_vma changes
>>>>>>>   mm: cache some VMA fields in the vm_fault structure
>>>>>>>   mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
>>>>>>>   mm: introduce __lru_cache_add_active_or_unevictable
>>>>>>>   mm: introduce __vm_normal_page()
>>>>>>>   mm: introduce __page_add_new_anon_rmap()
>>>>>>>   mm: protect mm_rb tree with a rwlock
>>>>>>>   mm: adding speculative page fault failure trace events
>>>>>>>   perf: add a speculative page fault sw event
>>>>>>>   perf tools: add support for the SPF perf event
>>>>>>>   mm: add speculative page fault vmstats
>>>>>>>   powerpc/mm: add speculative page fault
>>>>>>>
>>>>>>> Mahendran Ganesh (2):
>>>>>>>   arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>>   arm64/mm: add speculative page fault
>>>>>>>
>>>>>>> Peter Zijlstra (4):
>>>>>>>   mm: prepare for FAULT_FLAG_SPECULATIVE
>>>>>>>   mm: VMA sequence count
>>>>>>>   mm: provide speculative fault infrastructure
>>>>>>>   x86/mm: add speculative pagefault handling
>>>>>>>
>>>>>>>  arch/arm64/Kconfig                    |   1 +
>>>>>>>  arch/arm64/mm/fault.c                 |  12 +
>>>>>>>  arch/powerpc/Kconfig                  |   1 +
>>>>>>>  arch/powerpc/mm/fault.c               |  16 +
>>>>>>>  arch/x86/Kconfig                      |   1 +
>>>>>>>  arch/x86/mm/fault.c                   |  27 +-
>>>>>>>  fs/exec.c                             |   2 +-
>>>>>>>  fs/proc/task_mmu.c                    |   5 +-
>>>>>>>  fs/userfaultfd.c                      |  17 +-
>>>>>>>  include/linux/hugetlb_inline.h        |   2 +-
>>>>>>>  include/linux/migrate.h               |   4 +-
>>>>>>>  include/linux/mm.h                    | 136 +++++++-
>>>>>>>  include/linux/mm_types.h              |   7 +
>>>>>>>  include/linux/pagemap.h               |   4 +-
>>>>>>>  include/linux/rmap.h                  |  12 +-
>>>>>>>  include/linux/swap.h                  |  10 +-
>>>>>>>  include/linux/vm_event_item.h         |   3 +
>>>>>>>  include/trace/events/pagefault.h      |  80 +++++
>>>>>>>  include/uapi/linux/perf_event.h       |   1 +
>>>>>>>  kernel/fork.c                         |   5 +-
>>>>>>>  mm/Kconfig                            |  22 ++
>>>>>>>  mm/huge_memory.c                      |   6 +-
>>>>>>>  mm/hugetlb.c                          |   2 +
>>>>>>>  mm/init-mm.c                          |   3 +
>>>>>>>  mm/internal.h                         |  20 ++
>>>>>>>  mm/khugepaged.c                       |   5 +
>>>>>>>  mm/madvise.c                          |   6 +-
>>>>>>>  mm/memory.c                           | 612 +++++++++++++++++++++++++++++-----
>>>>>>>  mm/mempolicy.c                        |  51 ++-
>>>>>>>  mm/migrate.c                          |   6 +-
>>>>>>>  mm/mlock.c                            |  13 +-
>>>>>>>  mm/mmap.c                             | 229 ++++++++++---
>>>>>>>  mm/mprotect.c                         |   4 +-
>>>>>>>  mm/mremap.c                           |  13 +
>>>>>>>  mm/nommu.c                            |   2 +-
>>>>>>>  mm/rmap.c                             |   5 +-
>>>>>>>  mm/swap.c                             |   6 +-
>>>>>>>  mm/swap_state.c                       |   8 +-
>>>>>>>  mm/vmstat.c                           |   5 +-
>>>>>>>  tools/include/uapi/linux/perf_event.h |   1 +
>>>>>>>  tools/perf/util/evsel.c               |   1 +
>>>>>>>  tools/perf/util/parse-events.c        |   4 +
>>>>>>>  tools/perf/util/parse-events.l        |   1 +
>>>>>>>  tools/perf/util/python.c              |   1 +
>>>>>>>  44 files changed, 1161 insertions(+), 211 deletions(-)
>>>>>>>  create mode 100644 include/trace/events/pagefault.h
>>>>>>>
>>>>>>> --
>>>>>>> 2.7.4
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>


[-- Attachment #2: perf-profile_page_fault3_head_thp_always.gz --]
[-- Type: application/gzip, Size: 12909 bytes --]

[-- Attachment #3: perf-profile_page_fault3_head_thp_never.gz --]
[-- Type: application/gzip, Size: 12535 bytes --]

[-- Attachment #4: perf-profile_page_fault2_head_thp_never.gz --]
[-- Type: application/gzip, Size: 11782 bytes --]

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-03  7:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Benjamin Herrenschmidt, Christoph Hellwig, Will Deacon,
	Anshuman Khandual, virtualization, linux-kernel, linuxppc-dev,
	aik, robh, joe, elfring, david, jasowang, mpe, linuxram, haren,
	paulus, srikar, robin.murphy, jean-philippe.brucker, marc.zyngier
In-Reply-To: <20180802235233-mutt-send-email-mst@kernel.org>

On Thu, Aug 02, 2018 at 11:53:08PM +0300, Michael S. Tsirkin wrote:
> > We don't need cache flushing tricks.
> 
> You don't but do real devices on same platform need them?

IBMs power plaforms are always cache coherent.  There are some powerpc
platforms have not cache coherent DMA, but I guess this scheme isn't
intended for them.

^ permalink raw reply

* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-03  7:05 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael S. Tsirkin, Christoph Hellwig, Will Deacon,
	Anshuman Khandual, virtualization, linux-kernel, linuxppc-dev,
	aik, robh, joe, elfring, david, jasowang, mpe, linuxram, haren,
	paulus, srikar, robin.murphy, jean-philippe.brucker, marc.zyngier
In-Reply-To: <de4888b6457e220776e16a9c8958ff0886ffc66c.camel@kernel.crashing.org>

On Thu, Aug 02, 2018 at 04:13:09PM -0500, Benjamin Herrenschmidt wrote:
> So let's differenciate the two problems of having an IOMMU (real or
> emulated) which indeeds adds overhead etc... and using the DMA API.
> 
> At the moment, virtio does this all over the place:
> 
> 	if (use_dma_api)
> 		dma_map/alloc_something(...)
> 	else
> 		use_pa
> 
> The idea of the patch set is to do two, somewhat orthogonal, changes
> that together achieve what we want. Let me know where you think there
> is "a bunch of issues" because I'm missing it:
> 
>  1- Replace the above if/else constructs with just calling the DMA API,
> and have virtio, at initialization, hookup its own dma_ops that just
> "return pa" (roughly) when the IOMMU stuff isn't used.
> 
> This adds an indirect function call to the path that previously didn't
> have one (the else case above). Is that a significant/measurable
> overhead ?

If you call it often enough it does:

https://www.spinics.net/lists/netdev/msg495413.html

>  2- Make virtio use the DMA API with our custom platform-provided
> swiotlb callbacks when needed, that is when not using IOMMU *and*
> running on a secure VM in our case.

And total NAK the customer platform-provided part of this.  We need
a flag passed in from the hypervisor that the device needs all bus
specific dma api treatment, and then just use the normal plaform
dma mapping setup.  To get swiotlb you'll need to then use the DT/ACPI
dma-range property to limit the addressable range, and a swiotlb
capable plaform will use swiotlb automatically.

^ permalink raw reply

* Re: [PATCH v4 5/6] powerpc: Add show_user_instructions()
From: Michael Ellerman @ 2018-08-03  8:44 UTC (permalink / raw)
  To: Christophe LEROY, Murilo Opsfelder Araujo
  Cc: linux-kernel, Alastair D'Silva, Andrew Donnellan,
	Balbir Singh, Benjamin Herrenschmidt, Cyril Bur,
	Eric W . Biederman, Joe Perches, Michael Neuling, Nicholas Piggin,
	Paul Mackerras, Simon Guo, Sukadev Bhattiprolu, Tobin C . Harding,
	linuxppc-dev, Segher Boessenkool
In-Reply-To: <69cf990b-d4aa-97e7-be3b-7936caa91688@c-s.fr>

Christophe LEROY <christophe.leroy@c-s.fr> writes:
> Le 03/08/2018 =C3=A0 02:42, Murilo Opsfelder Araujo a =C3=A9crit=C2=A0:
>> Hi, Christophe.
>> On Thu, Aug 02, 2018 at 07:26:20AM +0200, Christophe LEROY wrote:
>>> Le 01/08/2018 =C3=A0 23:33, Murilo Opsfelder Araujo a =C3=A9crit=C2=A0:
>>>> show_user_instructions() is a slightly modified version of
>>>> show_instructions() that allows userspace instruction dump.
>>>>
>>>> This will be useful within show_signal_msg() to dump userspace
>>>> instructions of the faulty location.
>>>>
>>>> Here is a sample of what show_user_instructions() outputs:
>>>>
>>>>     pandafault[10850]: code: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1f=
ff8 f821ffc1 7c3f0b78 3d22fffe
>>>>     pandafault[10850]: code: 392988d0 f93f0020 e93f0020 39400048 <9949=
0000> 39200000 7d234b78 383f0040
>>>>
>>>> The current->comm and current->pid printed can serve as a glue that
>>>> links the instructions dump to its originator, allowing messages to be
>>>> interleaved in the logs.
>>>>
>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/proce=
ss.c
>>>> index e9533b4d2f08..364645ac732c 100644
>>>> --- a/arch/powerpc/kernel/process.c
>>>> +++ b/arch/powerpc/kernel/process.c
>>>> @@ -1299,6 +1299,46 @@ static void show_instructions(struct pt_regs *r=
egs)
>>>>    	pr_cont("\n");
>>>>    }
>>>> +void show_user_instructions(struct pt_regs *regs)
>>>> +{
>>>> +	int i;
>>>> +	const char *prefix =3D KERN_INFO "%s[%d]: code: ";
>>>> +	unsigned long pc =3D regs->nip - (instructions_to_print * 3 / 4 *
>>>> +					sizeof(int));
>>>> +
>>>> +	printk(prefix, current->comm, current->pid);
>>>
>>> Why not use pr_info() and remove KERN_INFO from *prefix ?
>>=20
>> Because it doesn't compile:
>>=20
>>    arch/powerpc/kernel/process.c:1317:10: error: expected =E2=80=98)=E2=
=80=99 before =E2=80=98prefix=E2=80=99
>>      pr_info(prefix, current->comm, current->pid);
>>              ^
>>    ./include/linux/printk.h:288:21: note: in definition of macro =E2=80=
=98pr_fmt=E2=80=99
>>     #define pr_fmt(fmt) fmt
>>                       ^
>>=20
>> `pr_info(prefix, ...)` expands to `printk("\001" "6" prefix, ...)`,
>> which is an invalid string concatenation.
>>=20
>> `pr_info("%s", ...)` expands to `printk("\001" "6" "%s", ...)`, which is
>> valid.
>
> Then what about using directly:
>
> pr_info("%s[%d]: code: ", ...);

Yeah that's better, I'll fix it up when applying.

>>>> +#if !defined(CONFIG_BOOKE)
>>>> +		/* If executing with the IMMU off, adjust pc rather
>>>> +		 * than print XXXXXXXX.
>>>> +		 */
>>>> +		if (!(regs->msr & MSR_IR))
>>>> +			pc =3D (unsigned long)phys_to_virt(pc);
>>>
>>> Shouldn't this be done outside of the loop, only once ?
>>=20
>> I don't think so.
>>=20
>> pc gets incremented at the bottom of the loop:
>>=20
>>    pc +=3D sizeof(int);
>>=20
>> Adjusting pc is necessary at each iteration.  Leaving this block inside
>> the loop seems correct.
>
> This looks pretty strange.
> The first time, pc is a physical address, that you change to a virtual=20
> address. Then when you increment it it is still a virtual address.
> So when you call phys_to_virt(pc) for the second time, pc is already a=20
> virt address, so what happens indeed ?

Yeah that's a bit fishy.

On 64-bit it works because phys_to_virt() =3D=3D __va() which is:

  #define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) | PAGE_OFFSET))

ie. it uses bitwise or, so __va(__va(x)) =3D=3D __va(x).

But it looks like on 32-bit it's going to do the wrong thing. Do we ever
actually hit that case though, I'm not sure?


However for this patch I'll just remove the whole thing, because we
don't expect to be dumping user instructions in realmode.

cheers

^ permalink raw reply

* Re: [PATCH v5 09/11] hugetlb: Introduce generic version of huge_ptep_set_wrprotect
From: Michael Ellerman @ 2018-08-03  8:51 UTC (permalink / raw)
  To: Alex Ghiti, linux-mm, mike.kravetz, linux, catalin.marinas,
	will.deacon, tony.luck, fenghua.yu, ralf, paul.burton, jhogan,
	jejb, deller, benh, ysato, dalias, davem, tglx, mingo, hpa, x86,
	arnd, linux-arm-kernel, linux-kernel, linux-ia64, linux-mips,
	linux-parisc, linuxppc-dev, linux-sh, sparclinux, linux-arch,
	aneesh.kumar@linux.ibm.com
In-Reply-To: <90bf556f-144d-24b8-d2f6-70fee4a30559@ghiti.fr>

Hi Alex,

Sorry missed your previous mail.

Alex Ghiti <alex@ghiti.fr> writes:
> Ok, I tried every defconfig available:
>
> - for the nohash/32, I found that I could use mpc885_ads_defconfig and I 
> activated HUGETLBFS.
> I removed the definition of huge_ptep_set_wrprotect from 
> nohash/32/pgtable.h, add an #error in
> include/asm-generic/hugetlb.h right before the generic definition of 
> huge_ptep_set_wrprotect,
> and fell onto it at compile-time:
> => I'm pretty confident then that removing the definition of 
> huge_ptep_set_wrprotect does not
> break anythingin this case.

Thanks, that sounds good.

> - regardind book3s/32, I did not find any defconfig with 
> CONFIG_PPC_BOOK3S_32, CONFIG_PPC32
> allowing to enable huge page support (ie CONFIG_SYS_SUPPORTS_HUGETLBFS)
> => Do you have a defconfig that would allow me to try the same as above ?

I think you're right, it's dead code AFAICS.

We have:

config PPC_BOOK3S_64
        ...
	select SYS_SUPPORTS_HUGETLBFS

config PPC_FSL_BOOK3E
        ...
	select SYS_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64

config PPC_8xx
	...
	select SYS_SUPPORTS_HUGETLBFS


So we can't ever enable HUGETLBFS for Book3S 32.

Presumably the code got copied when we split the headers apart.

So I think you can just ignore that one, and we'll delete it.

cheers

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox