* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-02 20:53 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Christoph Hellwig, Will Deacon, Anshuman Khandual, virtualization,
linux-kernel, linuxppc-dev, aik, robh, joe, elfring, david,
jasowang, mpe, linuxram, haren, paulus, srikar, robin.murphy,
jean-philippe.brucker, marc.zyngier
In-Reply-To: <7779442d7889ee943b3e4ff6c63ec90b4a58b79d.camel@kernel.crashing.org>
On Thu, Aug 02, 2018 at 10:33:05AM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 00:56 +0300, Michael S. Tsirkin wrote:
> > > but it's not, VMs are
> > > created in "legacy" mode all the times and we don't know at VM creation
> > > time whether it will become a secure VM or not. The way our secure VMs
> > > work is that they start as a normal VM, load a secure "payload" and
> > > call the Ultravisor to "become" secure.
> > >
> > > So we're in a bit of a bind here. We need that one-liner optional arch
> > > hook to make virtio use swiotlb in that "IOMMU bypass" case.
> > >
> > > Ben.
> >
> > And just to make sure I understand, on your platform DMA APIs do include
> > some of the cache flushing tricks and this is why you don't want to
> > declare iommu support in the hypervisor?
>
> I'm not sure I parse what you mean.
>
> We don't need cache flushing tricks.
You don't but do real devices on same platform need them?
> The problem we have with our
> "secure" VMs is that:
>
> - At VM creation time we have no idea it's going to become a secure
> VM, qemu doesn't know anything about it, and thus qemu (or other
> management tools, libvirt etc...) are going to create "legacy" (ie
> iommu bypass) virtio devices.
>
> - Once the VM goes secure (early during boot but too late for qemu),
> it will need to make virtio do bounce-buffering via swiotlb because
> qemu cannot physically access most VM pages (blocked by HW security
> features), we need to bounce buffer using some unsecure pages that are
> accessible to qemu.
>
> That said, I wouldn't object for us to more generally switch long run
> to changing qemu so that virtio on powerpc starts using the IOMMU as a
> default provided we fix our guest firmware to understand it (it
> currently doesn't), and provided we verify that the performance impact
> on things like vhost is negligible.
>
> Cheers,
> Ben.
>
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-02 20:55 UTC (permalink / raw)
To: Anshuman Khandual
Cc: virtualization, linux-kernel, linuxppc-dev, aik, robh, joe,
elfring, david, jasowang, benh, mpe, hch, linuxram, haren, paulus,
srikar
In-Reply-To: <20180720035941.6844-1-khandual@linux.vnet.ibm.com>
On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote:
> This patch series is the follow up on the discussions we had before about
> the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
> for virito devices (https://patchwork.kernel.org/patch/10417371/). There
> were suggestions about doing away with two different paths of transactions
> with the host/QEMU, first being the direct GPA and the other being the DMA
> API based translations.
>
> First patch attempts to create a direct GPA mapping based DMA operations
> structure called 'virtio_direct_dma_ops' with exact same implementation
> of the direct GPA path which virtio core currently has but just wrapped in
> a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
> the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
> existing semantics. The second patch does exactly that inside the function
> virtio_finalize_features(). The third patch removes the default direct GPA
> path from virtio core forcing it to use DMA API callbacks for all devices.
> Now with that change, every device must have a DMA operations structure
> associated with it. The fourth patch adds an additional hook which gives
> the platform an opportunity to do yet another override if required. This
> platform hook can be used on POWER Ultravisor based protected guests to
> load up SWIOTLB DMA callbacks to do the required (as discussed previously
> in the above mentioned thread how host is allowed to access only parts of
> the guest GPA range) bounce buffering into the shared memory for all I/O
> scatter gather buffers to be consumed on the host side.
>
> Please go through these patches and review whether this approach broadly
> makes sense. I will appreciate suggestions, inputs, comments regarding
> the patches or the approach in general. Thank you.
Jason did some work on profiling this. Unfortunately he reports
about 4% extra overhead from this switch on x86 with no vIOMMU.
I expect he's writing up the data in more detail, but
just wanted to let you know this would be one more
thing to debug before we can just switch to DMA APIs.
> Anshuman Khandual (4):
> virtio: Define virtio_direct_dma_ops structure
> virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
> virtio: Force virtio core to use DMA API callbacks for all virtio devices
> virtio: Add platform specific DMA API translation for virito devices
>
> arch/powerpc/include/asm/dma-mapping.h | 6 +++
> arch/powerpc/platforms/pseries/iommu.c | 6 +++
> drivers/virtio/virtio.c | 72 ++++++++++++++++++++++++++++++++++
> drivers/virtio/virtio_pci_common.h | 3 ++
> drivers/virtio/virtio_ring.c | 65 +-----------------------------
> 5 files changed, 89 insertions(+), 63 deletions(-)
>
> --
> 2.9.3
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Benjamin Herrenschmidt @ 2018-08-02 21:13 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Christoph Hellwig, Will Deacon, Anshuman Khandual, virtualization,
linux-kernel, linuxppc-dev, aik, robh, joe, elfring, david,
jasowang, mpe, linuxram, haren, paulus, srikar, robin.murphy,
jean-philippe.brucker, marc.zyngier
In-Reply-To: <20180802225738-mutt-send-email-mst@kernel.org>
On Thu, 2018-08-02 at 23:52 +0300, Michael S. Tsirkin wrote:
> > Yes, this is the purpose of Anshuman original patch (I haven't looked
> > at the details of the patch in a while but that's what I told him to
> > implement ;-) :
> >
> > - Make virtio always use DMA ops to simplify the code path (with a set
> > of "transparent" ops for legacy)
> >
> > and
> >
> > - Provide an arch hook allowing us to "override" those "transparent"
> > DMA ops with some custom ones that do the appropriate swiotlb gunk.
> >
> > Cheers,
> > Ben.
> >
>
> Right but as I tried to say doing that brings us to a bunch of issues
> with using DMA APIs in virtio. Put simply DMA APIs weren't designed for
> guest to hypervisor communication.
I'm not sure I see the problem, see below
> When we do (as is the case with PLATFORM_IOMMU right now) this adds a
> bunch of overhead which we need to get rid of if we are to switch to
> PLATFORM_IOMMU by default. We need to fix that.
So let's differenciate the two problems of having an IOMMU (real or
emulated) which indeeds adds overhead etc... and using the DMA API.
At the moment, virtio does this all over the place:
if (use_dma_api)
dma_map/alloc_something(...)
else
use_pa
The idea of the patch set is to do two, somewhat orthogonal, changes
that together achieve what we want. Let me know where you think there
is "a bunch of issues" because I'm missing it:
1- Replace the above if/else constructs with just calling the DMA API,
and have virtio, at initialization, hookup its own dma_ops that just
"return pa" (roughly) when the IOMMU stuff isn't used.
This adds an indirect function call to the path that previously didn't
have one (the else case above). Is that a significant/measurable
overhead ?
This change stands alone, and imho "cleans" up virtio by avoiding all
that if/else "2 path" and unless it adds a measurable overhead, should
probably be done.
2- Make virtio use the DMA API with our custom platform-provided
swiotlb callbacks when needed, that is when not using IOMMU *and*
running on a secure VM in our case.
This benefits from -1- by making us just plumb in a different set of
DMA ops we would have cooked up specifically for virtio in our arch
code (or in virtio itself but build arch-conditionally in a separate
file). But it doesn't strictly need it -1-:
Now, -2- doesn't strictly needs -1-. We could have just done another
xen-like hack that forces the DMA API "ON" for virtio when running in a
secure VM.
The problem if we do that however is that we also then need the arch
PCI code to make sure it hooks up the virtio PCI devices with the
special "magic" DMA ops that avoid the iommu but still do swiotlb, ie,
not the same as other PCI devices. So it will have to play games such
as checking vendor/device IDs for virtio, checking the IOMMU flag,
etc... from the arch code which really bloody sucks when assigning PCI
DMA ops.
However, if we do it the way we plan here, on top of -1-, with a hook
called from virtio into the arch to "override" the virtio DMA ops, then
we avoid the problem completely: The arch hook would only be called by
virtio if the IOMMU flag is *not* set. IE only when using that special
"hypervisor" iommu bypass. If the IOMMU flag is set, virtio uses normal
PCI dma ops as usual.
That way, we have a very clear semantic: This hook is purely about
replacing those "null" DMA ops that just return PA introduced in -1-
with some arch provided specially cooked up DMA ops for non-IOMMU
virtio that know about the arch special requirements. For us bounce
buffering.
Is there something I'm missing ?
Cheers,
Ben.
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Michael S. Tsirkin @ 2018-08-02 21:51 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Christoph Hellwig, Will Deacon, Anshuman Khandual, virtualization,
linux-kernel, linuxppc-dev, aik, robh, joe, elfring, david,
jasowang, mpe, linuxram, haren, paulus, srikar, robin.murphy,
jean-philippe.brucker, marc.zyngier
In-Reply-To: <de4888b6457e220776e16a9c8958ff0886ffc66c.camel@kernel.crashing.org>
On Thu, Aug 02, 2018 at 04:13:09PM -0500, Benjamin Herrenschmidt wrote:
> On Thu, 2018-08-02 at 23:52 +0300, Michael S. Tsirkin wrote:
> > > Yes, this is the purpose of Anshuman original patch (I haven't looked
> > > at the details of the patch in a while but that's what I told him to
> > > implement ;-) :
> > >
> > > - Make virtio always use DMA ops to simplify the code path (with a set
> > > of "transparent" ops for legacy)
> > >
> > > and
> > >
> > > - Provide an arch hook allowing us to "override" those "transparent"
> > > DMA ops with some custom ones that do the appropriate swiotlb gunk.
> > >
> > > Cheers,
> > > Ben.
> > >
> >
> > Right but as I tried to say doing that brings us to a bunch of issues
> > with using DMA APIs in virtio. Put simply DMA APIs weren't designed for
> > guest to hypervisor communication.
>
> I'm not sure I see the problem, see below
>
> > When we do (as is the case with PLATFORM_IOMMU right now) this adds a
> > bunch of overhead which we need to get rid of if we are to switch to
> > PLATFORM_IOMMU by default. We need to fix that.
>
> So let's differenciate the two problems of having an IOMMU (real or
> emulated) which indeeds adds overhead etc... and using the DMA API.
Well actually it's the other way around. An iommu in theory doesn't need
to bring overhead if you set it in bypass mode. Which does imply the
iommu supports bypass mode. Is that universally the case? DMA API does
see Christoph's list of things it does some of which add overhead.
> At the moment, virtio does this all over the place:
>
> if (use_dma_api)
> dma_map/alloc_something(...)
> else
> use_pa
>
> The idea of the patch set is to do two, somewhat orthogonal, changes
> that together achieve what we want. Let me know where you think there
> is "a bunch of issues" because I'm missing it:
>
> 1- Replace the above if/else constructs with just calling the DMA API,
> and have virtio, at initialization, hookup its own dma_ops that just
> "return pa" (roughly) when the IOMMU stuff isn't used.
>
> This adds an indirect function call to the path that previously didn't
> have one (the else case above). Is that a significant/measurable
> overhead ?
Seems to be :( Jason reports about 4%. I wonder whether we can support
map_sg and friends being NULL, then use that when mapping is an
identity. A conditional branch there is likely very cheap.
Would this cover all platforms with kvm (which is where we care
most about performance)?
> This change stands alone, and imho "cleans" up virtio by avoiding all
> that if/else "2 path" and unless it adds a measurable overhead, should
> probably be done.
>
> 2- Make virtio use the DMA API with our custom platform-provided
> swiotlb callbacks when needed, that is when not using IOMMU *and*
> running on a secure VM in our case.
>
> This benefits from -1- by making us just plumb in a different set of
> DMA ops we would have cooked up specifically for virtio in our arch
> code (or in virtio itself but build arch-conditionally in a separate
> file). But it doesn't strictly need it -1-:
>
> Now, -2- doesn't strictly needs -1-. We could have just done another
> xen-like hack that forces the DMA API "ON" for virtio when running in a
> secure VM.
>
> The problem if we do that however is that we also then need the arch
> PCI code to make sure it hooks up the virtio PCI devices with the
> special "magic" DMA ops that avoid the iommu but still do swiotlb, ie,
> not the same as other PCI devices. So it will have to play games such
> as checking vendor/device IDs for virtio, checking the IOMMU flag,
> etc... from the arch code which really bloody sucks when assigning PCI
> DMA ops.
>
> However, if we do it the way we plan here, on top of -1-, with a hook
> called from virtio into the arch to "override" the virtio DMA ops, then
> we avoid the problem completely: The arch hook would only be called by
> virtio if the IOMMU flag is *not* set. IE only when using that special
> "hypervisor" iommu bypass. If the IOMMU flag is set, virtio uses normal
> PCI dma ops as usual.
>
> That way, we have a very clear semantic: This hook is purely about
> replacing those "null" DMA ops that just return PA introduced in -1-
> with some arch provided specially cooked up DMA ops for non-IOMMU
> virtio that know about the arch special requirements. For us bounce
> buffering.
>
> Is there something I'm missing ?
>
> Cheers,
> Ben.
Right so I was trying to write it up in a systematic way, but just to
give you one example, if there is a system where DMA API handles
coherency issues, or flushing of some buffers, then our PLATFORM_IOMMU
flag causes that to happen.
And we kinda worked around this without the IOMMU by basically saying
"ok we do not really need DMA API so let's just bypass it" and
it was kind of ok except now everyone is switching to vIOMMU
just in case. So now people do want some parts of what DMA API does,
such as the bounce buffer use, or IOMMU mappings.
And maybe in the end the solution is going to be to do something similar
to virt_Xmb except for DMA APIs: add APIs that handle just the
addressing bits but without the overhead. See
commit 6a65d26385bf487926a0616650927303058551e3
asm-generic: implement virt_xxx memory barriers
for reference, it's a similar set of issues.
So it's not a problem with your patches as such, it's just that
they don't solve that harder problem.
--
MST
^ permalink raw reply
* Re: [PATCH] QE GPIO: Add qe_gpio_set_multiple
From: Joakim Tjernlund @ 2018-08-02 22:36 UTC (permalink / raw)
To: leoyang.li@nxp.com, york.sun@nxp.com
Cc: linuxppc-dev@lists.ozlabs.org, mpe@ellerman.id.au,
qiang.zhao@nxp.com
In-Reply-To: <AM4PR0401MB1699AFF424BDFAE1BC65E1FA8F420@AM4PR0401MB1699.eurprd04.prod.outlook.com>
TGVvLCBkaWQgdGhpcyBnbyBhbnl3aGVyZSA/DQoNCiAgICBKb2NrZQ0KDQpPbiBUdWUsIDIwMTgt
MDctMDMgYXQgMTY6MzUgKzAwMDAsIExlbyBMaSB3cm90ZToNCj4gDQo+ID4gLS0tLS1PcmlnaW5h
bCBNZXNzYWdlLS0tLS0NCj4gPiBGcm9tOiBZb3JrIFN1bg0KPiA+IFNlbnQ6IFR1ZXNkYXksIEp1
bHkgMywgMjAxOCAxMDoyNyBBTQ0KPiA+IFRvOiBqb2NrZUBpbmZpbmVyYS5jb20gPGpvYWtpbS50
amVybmx1bmRAaW5maW5lcmEuY29tPjsgTGVvIExpDQo+ID4gPGxlb3lhbmcubGlAbnhwLmNvbT4N
Cj4gPiBDYzogbGludXhwcGMtZGV2QGxpc3RzLm96bGFicy5vcmc7IG1wZUBlbGxlcm1hbi5pZC5h
dTsgUWlhbmcgWmhhbw0KPiA+IDxxaWFuZy56aGFvQG54cC5jb20+DQo+ID4gU3ViamVjdDogUmU6
IFtQQVRDSF0gUUUgR1BJTzogQWRkIHFlX2dwaW9fc2V0X211bHRpcGxlDQo+ID4gDQo+ID4gK0xl
bw0KPiA+IA0KPiA+IE9uIDA3LzAzLzIwMTggMDM6MzAgQU0sIEpvYWtpbSBUamVybmx1bmQgd3Jv
dGU6DQo+ID4gPiBPbiBUdWUsIDIwMTgtMDYtMjYgYXQgMjM6NDEgKzEwMDAsIE1pY2hhZWwgRWxs
ZXJtYW4gd3JvdGU6DQo+ID4gPiA+IA0KPiA+ID4gPiBKb2FraW0gVGplcm5sdW5kIDxKb2FraW0u
VGplcm5sdW5kQGluZmluZXJhLmNvbT4gd3JpdGVzOg0KPiA+ID4gPiA+IE9uIFRodSwgMjAxOC0w
Ni0yMSBhdCAwMjozOCArMDAwMCwgUWlhbmcgWmhhbyB3cm90ZToNCj4gPiA+ID4gPiA+IE9uIDA2
LzE5LzIwMTggMDk6MjIgQU0sIEpvYWtpbSBUamVybmx1bmQgd3JvdGU6DQo+ID4gPiA+ID4gPiAt
LS0tLU9yaWdpbmFsIE1lc3NhZ2UtLS0tLQ0KPiA+ID4gPiA+ID4gRnJvbTogTGludXhwcGMtZGV2
IFttYWlsdG86bGludXhwcGMtZGV2LQ0KPiA+IA0KPiA+IGJvdW5jZXMrcWlhbmcuemhhbz1ueHAu
Y29tQGxpc3RzLm96bGFicy5vcmddIE9uIEJlaGFsZiBPZiBKb2FraW0NCj4gPiBUamVybmx1bmQN
Cj4gPiA+ID4gPiA+IFNlbnQ6IDIwMTjlubQ25pyIMjDml6UgMDoyMg0KPiA+ID4gPiA+ID4gVG86
IFlvcmsgU3VuIDx5b3JrLnN1bkBueHAuY29tPjsgbGludXhwcGMtZGV2IDxsaW51eHBwYy0NCj4g
PiANCj4gPiBkZXZAbGlzdHMub3psYWJzLm9yZz4NCj4gPiA+ID4gPiA+IFN1YmplY3Q6IFtQQVRD
SF0gUUUgR1BJTzogQWRkIHFlX2dwaW9fc2V0X211bHRpcGxlDQo+ID4gPiA+ID4gPiANCj4gPiA+
ID4gPiA+IFRoaXMgY291c2luIHRvIGdwaW8tbXBjOHh4eCB3YXMgbGFja2luZyBhIG11bHRpcGxl
IHBpbnMgbWV0aG9kLCBhZGQNCj4gPiANCj4gPiBvbmUuDQo+ID4gPiA+ID4gPiANCj4gPiA+ID4g
PiA+IFNpZ25lZC1vZmYtYnk6IEpvYWtpbSBUamVybmx1bmQgPGpvYWtpbS50amVybmx1bmRAaW5m
aW5lcmEuY29tPg0KPiA+ID4gPiA+ID4gLS0tDQo+ID4gPiA+ID4gPiAgZHJpdmVycy9zb2MvZnNs
L3FlL2dwaW8uYyB8IDI4ICsrKysrKysrKysrKysrKysrKysrKysrKysrKysNCj4gPiA+ID4gPiA+
ICAxIGZpbGUgY2hhbmdlZCwgMjggaW5zZXJ0aW9ucygrKQ0KPiA+ID4gPiANCj4gPiA+ID4gLi4u
DQo+ID4gPiA+ID4gPiAgc3RhdGljIGludCBxZV9ncGlvX2Rpcl9pbihzdHJ1Y3QgZ3Bpb19jaGlw
ICpnYywgdW5zaWduZWQgaW50IGdwaW8pICB7DQo+ID4gPiA+ID4gPiAgICAgICAgIHN0cnVjdCBv
Zl9tbV9ncGlvX2NoaXAgKm1tX2djID0gdG9fb2ZfbW1fZ3Bpb19jaGlwKGdjKTsgQEAgLQ0KPiA+
IA0KPiA+IDI5OCw2ICszMjUsNyBAQCBzdGF0aWMgaW50IF9faW5pdCBxZV9hZGRfZ3Bpb2NoaXBz
KHZvaWQpDQo+ID4gPiA+ID4gPiAgICAgICAgICAgICAgICAgZ2MtPmRpcmVjdGlvbl9vdXRwdXQg
PSBxZV9ncGlvX2Rpcl9vdXQ7DQo+ID4gPiA+ID4gPiAgICAgICAgICAgICAgICAgZ2MtPmdldCA9
IHFlX2dwaW9fZ2V0Ow0KPiA+ID4gPiA+ID4gICAgICAgICAgICAgICAgIGdjLT5zZXQgPSBxZV9n
cGlvX3NldDsNCj4gPiA+ID4gPiA+ICsgICAgICAgICAgICAgICBnYy0+c2V0X211bHRpcGxlID0g
cWVfZ3Bpb19zZXRfbXVsdGlwbGU7DQo+ID4gPiA+ID4gPiANCj4gPiA+ID4gPiA+ICAgICAgICAg
ICAgICAgICByZXQgPSBvZl9tbV9ncGlvY2hpcF9hZGRfZGF0YShucCwgbW1fZ2MsIHFlX2djKTsN
Cj4gPiA+ID4gPiA+ICAgICAgICAgICAgICAgICBpZiAocmV0KQ0KPiA+ID4gPiA+ID4gDQo+ID4g
PiA+ID4gPiBSZXZpZXdlZC1ieTogUWlhbmcgWmhhbyA8cWlhbmcuemhhb0BueHAuY29tPg0KPiA+
ID4gPiA+ID4gDQo+ID4gPiA+ID4gDQo+ID4gPiA+ID4gV2hvIHBpY2tzIHVwIHRoaXMgcGF0Y2gg
PyBJcyBpdCBxdWV1ZWQgc29tZXdoZXJlIGFscmVhZHk/DQo+ID4gPiA+IA0KPiA+ID4gPiBOb3Qg
bWUuDQo+ID4gPiANCj4gPiA+IFlvcms/IFlvdSBzZWVtIHRvIGJlIHRoZSBvbmx5IG9uZSBsZWZ0
Lg0KPiA+ID4gDQo+ID4gDQo+ID4gSSBhbSBub3QgYSBMaW51eCBtYWludGFpbmVyLiBFdmVuIEkg
d2FudCB0bywgSSBjYW4ndCBtZXJnZSB0aGlzIHBhdGNoLg0KPiA+IA0KPiA+IExlbywgd2hvIGNh
biBtZXJnZSB0aGlzIHBhdGNoIGFuZCByZXF1ZXN0IGEgcHVsbD8NCj4gDQo+IFNpbmNlIGl0IGZh
bGxzIHVuZGVyIHRoZSBkcml2ZXIvc29jL2ZzbC8gZm9sZGVyLiAgSSBjYW4gdGFrZSBpdC4NCj4g
DQo+IFJlZ2FyZHMsDQo+IExlbw0K
^ permalink raw reply
* RE: [PATCH] QE GPIO: Add qe_gpio_set_multiple
From: Leo Li @ 2018-08-02 22:44 UTC (permalink / raw)
To: jocke@infinera.com, York Sun
Cc: linuxppc-dev@lists.ozlabs.org, mpe@ellerman.id.au, Qiang Zhao
In-Reply-To: <bace5db4ede3d1e8cbe314f1306b90bc2893218e.camel@infinera.com>
DQoNCj4gLS0tLS1PcmlnaW5hbCBNZXNzYWdlLS0tLS0NCj4gRnJvbTogSm9ha2ltIFRqZXJubHVu
ZCBbbWFpbHRvOkpvYWtpbS5UamVybmx1bmRAaW5maW5lcmEuY29tXQ0KPiBTZW50OiBUaHVyc2Rh
eSwgQXVndXN0IDIsIDIwMTggNTozNyBQTQ0KPiBUbzogTGVvIExpIDxsZW95YW5nLmxpQG54cC5j
b20+OyBZb3JrIFN1biA8eW9yay5zdW5AbnhwLmNvbT4NCj4gQ2M6IGxpbnV4cHBjLWRldkBsaXN0
cy5vemxhYnMub3JnOyBtcGVAZWxsZXJtYW4uaWQuYXU7IFFpYW5nIFpoYW8NCj4gPHFpYW5nLnpo
YW9AbnhwLmNvbT4NCj4gU3ViamVjdDogUmU6IFtQQVRDSF0gUUUgR1BJTzogQWRkIHFlX2dwaW9f
c2V0X211bHRpcGxlDQo+IA0KPiBMZW8sIGRpZCB0aGlzIGdvIGFueXdoZXJlID8NCg0KSXQgaGFz
IGJlZW4gaW5jbHVkZWQgaW4gbXkgc29jIHB1bGwgcmVxdWVzdCBmb3IgNC4xOS4gIGh0dHA6Ly9s
a21sLml1LmVkdS9oeXBlcm1haWwvbGludXgva2VybmVsLzE4MDcuMy8wMjI5NS5odG1sDQoNCkxl
bw0KDQo+IA0KPiAgICAgSm9ja2UNCj4gDQo+IE9uIFR1ZSwgMjAxOC0wNy0wMyBhdCAxNjozNSAr
MDAwMCwgTGVvIExpIHdyb3RlOg0KPiA+DQo+ID4gPiAtLS0tLU9yaWdpbmFsIE1lc3NhZ2UtLS0t
LQ0KPiA+ID4gRnJvbTogWW9yayBTdW4NCj4gPiA+IFNlbnQ6IFR1ZXNkYXksIEp1bHkgMywgMjAx
OCAxMDoyNyBBTQ0KPiA+ID4gVG86IGpvY2tlQGluZmluZXJhLmNvbSA8am9ha2ltLnRqZXJubHVu
ZEBpbmZpbmVyYS5jb20+OyBMZW8gTGkNCj4gPiA+IDxsZW95YW5nLmxpQG54cC5jb20+DQo+ID4g
PiBDYzogbGludXhwcGMtZGV2QGxpc3RzLm96bGFicy5vcmc7IG1wZUBlbGxlcm1hbi5pZC5hdTsg
UWlhbmcgWmhhbw0KPiA+ID4gPHFpYW5nLnpoYW9AbnhwLmNvbT4NCj4gPiA+IFN1YmplY3Q6IFJl
OiBbUEFUQ0hdIFFFIEdQSU86IEFkZCBxZV9ncGlvX3NldF9tdWx0aXBsZQ0KPiA+ID4NCj4gPiA+
ICtMZW8NCj4gPiA+DQo+ID4gPiBPbiAwNy8wMy8yMDE4IDAzOjMwIEFNLCBKb2FraW0gVGplcm5s
dW5kIHdyb3RlOg0KPiA+ID4gPiBPbiBUdWUsIDIwMTgtMDYtMjYgYXQgMjM6NDEgKzEwMDAsIE1p
Y2hhZWwgRWxsZXJtYW4gd3JvdGU6DQo+ID4gPiA+ID4NCj4gPiA+ID4gPiBKb2FraW0gVGplcm5s
dW5kIDxKb2FraW0uVGplcm5sdW5kQGluZmluZXJhLmNvbT4gd3JpdGVzOg0KPiA+ID4gPiA+ID4g
T24gVGh1LCAyMDE4LTA2LTIxIGF0IDAyOjM4ICswMDAwLCBRaWFuZyBaaGFvIHdyb3RlOg0KPiA+
ID4gPiA+ID4gPiBPbiAwNi8xOS8yMDE4IDA5OjIyIEFNLCBKb2FraW0gVGplcm5sdW5kIHdyb3Rl
Og0KPiA+ID4gPiA+ID4gPiAtLS0tLU9yaWdpbmFsIE1lc3NhZ2UtLS0tLQ0KPiA+ID4gPiA+ID4g
PiBGcm9tOiBMaW51eHBwYy1kZXYgW21haWx0bzpsaW51eHBwYy1kZXYtDQo+ID4gPg0KPiA+ID4g
Ym91bmNlcytxaWFuZy56aGFvPW54cC5jb21AbGlzdHMub3psYWJzLm9yZ10gT24gQmVoYWxmIE9m
IEpvYWtpbQ0KPiA+ID4gVGplcm5sdW5kDQo+ID4gPiA+ID4gPiA+IFNlbnQ6IDIwMTjlubQ25pyI
MjDml6UgMDoyMg0KPiA+ID4gPiA+ID4gPiBUbzogWW9yayBTdW4gPHlvcmsuc3VuQG54cC5jb20+
OyBsaW51eHBwYy1kZXYgPGxpbnV4cHBjLQ0KPiA+ID4NCj4gPiA+IGRldkBsaXN0cy5vemxhYnMu
b3JnPg0KPiA+ID4gPiA+ID4gPiBTdWJqZWN0OiBbUEFUQ0hdIFFFIEdQSU86IEFkZCBxZV9ncGlv
X3NldF9tdWx0aXBsZQ0KPiA+ID4gPiA+ID4gPg0KPiA+ID4gPiA+ID4gPiBUaGlzIGNvdXNpbiB0
byBncGlvLW1wYzh4eHggd2FzIGxhY2tpbmcgYSBtdWx0aXBsZSBwaW5zDQo+ID4gPiA+ID4gPiA+
IG1ldGhvZCwgYWRkDQo+ID4gPg0KPiA+ID4gb25lLg0KPiA+ID4gPiA+ID4gPg0KPiA+ID4gPiA+
ID4gPiBTaWduZWQtb2ZmLWJ5OiBKb2FraW0gVGplcm5sdW5kDQo+ID4gPiA+ID4gPiA+IDxqb2Fr
aW0udGplcm5sdW5kQGluZmluZXJhLmNvbT4NCj4gPiA+ID4gPiA+ID4gLS0tDQo+ID4gPiA+ID4g
PiA+ICBkcml2ZXJzL3NvYy9mc2wvcWUvZ3Bpby5jIHwgMjggKysrKysrKysrKysrKysrKysrKysr
KysrKysrKw0KPiA+ID4gPiA+ID4gPiAgMSBmaWxlIGNoYW5nZWQsIDI4IGluc2VydGlvbnMoKykN
Cj4gPiA+ID4gPg0KPiA+ID4gPiA+IC4uLg0KPiA+ID4gPiA+ID4gPiAgc3RhdGljIGludCBxZV9n
cGlvX2Rpcl9pbihzdHJ1Y3QgZ3Bpb19jaGlwICpnYywgdW5zaWduZWQgaW50IGdwaW8pICB7DQo+
ID4gPiA+ID4gPiA+ICAgICAgICAgc3RydWN0IG9mX21tX2dwaW9fY2hpcCAqbW1fZ2MgPQ0KPiA+
ID4gPiA+ID4gPiB0b19vZl9tbV9ncGlvX2NoaXAoZ2MpOyBAQCAtDQo+ID4gPg0KPiA+ID4gMjk4
LDYgKzMyNSw3IEBAIHN0YXRpYyBpbnQgX19pbml0IHFlX2FkZF9ncGlvY2hpcHModm9pZCkNCj4g
PiA+ID4gPiA+ID4gICAgICAgICAgICAgICAgIGdjLT5kaXJlY3Rpb25fb3V0cHV0ID0gcWVfZ3Bp
b19kaXJfb3V0Ow0KPiA+ID4gPiA+ID4gPiAgICAgICAgICAgICAgICAgZ2MtPmdldCA9IHFlX2dw
aW9fZ2V0Ow0KPiA+ID4gPiA+ID4gPiAgICAgICAgICAgICAgICAgZ2MtPnNldCA9IHFlX2dwaW9f
c2V0Ow0KPiA+ID4gPiA+ID4gPiArICAgICAgICAgICAgICAgZ2MtPnNldF9tdWx0aXBsZSA9IHFl
X2dwaW9fc2V0X211bHRpcGxlOw0KPiA+ID4gPiA+ID4gPg0KPiA+ID4gPiA+ID4gPiAgICAgICAg
ICAgICAgICAgcmV0ID0gb2ZfbW1fZ3Bpb2NoaXBfYWRkX2RhdGEobnAsIG1tX2djLCBxZV9nYyk7
DQo+ID4gPiA+ID4gPiA+ICAgICAgICAgICAgICAgICBpZiAocmV0KQ0KPiA+ID4gPiA+ID4gPg0K
PiA+ID4gPiA+ID4gPiBSZXZpZXdlZC1ieTogUWlhbmcgWmhhbyA8cWlhbmcuemhhb0BueHAuY29t
Pg0KPiA+ID4gPiA+ID4gPg0KPiA+ID4gPiA+ID4NCj4gPiA+ID4gPiA+IFdobyBwaWNrcyB1cCB0
aGlzIHBhdGNoID8gSXMgaXQgcXVldWVkIHNvbWV3aGVyZSBhbHJlYWR5Pw0KPiA+ID4gPiA+DQo+
ID4gPiA+ID4gTm90IG1lLg0KPiA+ID4gPg0KPiA+ID4gPiBZb3JrPyBZb3Ugc2VlbSB0byBiZSB0
aGUgb25seSBvbmUgbGVmdC4NCj4gPiA+ID4NCj4gPiA+DQo+ID4gPiBJIGFtIG5vdCBhIExpbnV4
IG1haW50YWluZXIuIEV2ZW4gSSB3YW50IHRvLCBJIGNhbid0IG1lcmdlIHRoaXMgcGF0Y2guDQo+
ID4gPg0KPiA+ID4gTGVvLCB3aG8gY2FuIG1lcmdlIHRoaXMgcGF0Y2ggYW5kIHJlcXVlc3QgYSBw
dWxsPw0KPiA+DQo+ID4gU2luY2UgaXQgZmFsbHMgdW5kZXIgdGhlIGRyaXZlci9zb2MvZnNsLyBm
b2xkZXIuICBJIGNhbiB0YWtlIGl0Lg0KPiA+DQo+ID4gUmVnYXJkcywNCj4gPiBMZW8NCg==
^ permalink raw reply
* Re: [resend] [PATCH 0/3] powerpc/pseries: use H_BLOCK_REMOVE
From: Michael Ellerman @ 2018-08-03 2:29 UTC (permalink / raw)
To: Nicholas Piggin, Laurent Dufour
Cc: linuxppc-dev, linux-kernel, aneesh.kumar, benh, paulus
In-Reply-To: <20180801115514.441eecc8@roar.ozlabs.ibm.com>
Nicholas Piggin <npiggin@gmail.com> writes:
> On Fri, 27 Jul 2018 15:51:30 +0200
> Laurent Dufour <ldufour@linux.vnet.ibm.com> wrote:
>> [Resending so everyone is getting the cover letter]
>> On very large system we could see soft lockup fired when a process is exiting
>>
>> watchdog: BUG: soft lockup - CPU#851 stuck for 21s! [forkoff:215523]
>> Modules linked in: pseries_rng rng_core xfs raid10 vmx_crypto btrfs libcrc32c xor zstd_decompress zstd_compress xxhash lzo_compress raid6_pq crc32c_vpmsum lpfc crc_t10dif crct10dif_generic crct10dif_common dm_multipath scsi_dh_rdac scsi_dh_alua autofs4
>> CPU: 851 PID: 215523 Comm: forkoff Not tainted 4.17.0 #1
>> NIP: c0000000000b995c LR: c0000000000b8f64 CTR: 000000000000aa18
>> REGS: c00006b0645b7610 TRAP: 0901 Not tainted (4.17.0)
>> MSR: 800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 22042082 XER: 00000000
>> CFAR: 00000000006cf8f0 SOFTE: 0
>> GPR00: 0010000000000000 c00006b0645b7890 c000000000f99200 0000000000000000
>> GPR04: 8e000001a5a4de58 400249cf1bfd5480 8e000001a5a4de50 400249cf1bfd5480
>> GPR08: 8e000001a5a4de48 400249cf1bfd5480 8e000001a5a4de40 400249cf1bfd5480
>> GPR12: ffffffffffffffff c00000001e690800
>> NIP [c0000000000b995c] plpar_hcall9+0x44/0x7c
>> LR [c0000000000b8f64] pSeries_lpar_flush_hash_range+0x324/0x3d0
>> Call Trace:
>> [c00006b0645b7890] [8e000001a5a4dd20] 0x8e000001a5a4dd20 (unreliable)
>> [c00006b0645b7a00] [c00000000006d5b0] flush_hash_range+0x60/0x110
>> [c00006b0645b7a50] [c000000000072a2c] __flush_tlb_pending+0x4c/0xd0
>> [c00006b0645b7a80] [c0000000002eaf44] unmap_page_range+0x984/0xbd0
>> [c00006b0645b7bc0] [c0000000002eb594] unmap_vmas+0x84/0x100
>> [c00006b0645b7c10] [c0000000002f8afc] exit_mmap+0xac/0x1f0
>> [c00006b0645b7cd0] [c0000000000f2638] mmput+0x98/0x1b0
>> [c00006b0645b7d00] [c0000000000fc9d0] do_exit+0x330/0xc00
>> [c00006b0645b7dc0] [c0000000000fd384] do_group_exit+0x64/0x100
>> [c00006b0645b7e00] [c0000000000fd44c] sys_exit_group+0x2c/0x30
>> [c00006b0645b7e30] [c00000000000b960] system_call+0x58/0x6c
>> Instruction dump:
>> 60000000 f8810028 7ca42b78 7cc53378 7ce63b78 7d074378 7d284b78 7d495378
>> e9410060 e9610068 e9810070 44000022 <7d806378> e9810028 f88c0000 f8ac0008
>>
>> This happens when removing the PTE by calling the hypervisor using the
>> H_BULK_REMOVE call. This call is processing up to 4 PTEs but is doing a
>> tlbie for each PTE it is processing. This could lead to long time spent in
>> the hypervisor (sometimes up to 4s) and soft lockup being raised because
>> the scheduler is not called in zap_pte_range().
>>
>> Since the Power7's time, the hypervisor is providing a new hcall
>> H_BLOCK_REMOVE allowing processing up to 8 PTEs with one call to
>> tlbie. By limiting the amount of tlbie generated, this reduces the time
>> spent invalidating the PTEs.
>
> Oh that's a nice feature. I must have an ancient PAPR because I don't
> have it.
The only public document is LoPAPR, which is a stripped down version of
the 2012 PAPR.
cheers
^ permalink raw reply
* Re: [PATCH v4 5/6] powerpc: Add show_user_instructions()
From: Murilo Opsfelder Araujo @ 2018-08-03 0:42 UTC (permalink / raw)
To: Christophe LEROY
Cc: linux-kernel, Alastair D'Silva, Andrew Donnellan,
Balbir Singh, Benjamin Herrenschmidt, Cyril Bur,
Eric W . Biederman, Joe Perches, Michael Ellerman,
Michael Neuling, Nicholas Piggin, Paul Mackerras, Simon Guo,
Sukadev Bhattiprolu, Tobin C . Harding, linuxppc-dev,
Segher Boessenkool
In-Reply-To: <7209fa95-8d40-14ca-f27a-ce3edb64191e@c-s.fr>
Hi, Christophe.
On Thu, Aug 02, 2018 at 07:26:20AM +0200, Christophe LEROY wrote:
>
>
> Le 01/08/2018 à 23:33, Murilo Opsfelder Araujo a écrit :
> > show_user_instructions() is a slightly modified version of
> > show_instructions() that allows userspace instruction dump.
> >
> > This will be useful within show_signal_msg() to dump userspace
> > instructions of the faulty location.
> >
> > Here is a sample of what show_user_instructions() outputs:
> >
> > pandafault[10850]: code: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
> > pandafault[10850]: code: 392988d0 f93f0020 e93f0020 39400048 <99490000> 39200000 7d234b78 383f0040
> >
> > The current->comm and current->pid printed can serve as a glue that
> > links the instructions dump to its originator, allowing messages to be
> > interleaved in the logs.
> >
> > Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
> > ---
> > arch/powerpc/include/asm/stacktrace.h | 13 +++++++++
> > arch/powerpc/kernel/process.c | 40 +++++++++++++++++++++++++++
> > 2 files changed, 53 insertions(+)
> > create mode 100644 arch/powerpc/include/asm/stacktrace.h
> >
> > diff --git a/arch/powerpc/include/asm/stacktrace.h b/arch/powerpc/include/asm/stacktrace.h
> > new file mode 100644
> > index 000000000000..6149b53b3bc8
> > --- /dev/null
> > +++ b/arch/powerpc/include/asm/stacktrace.h
> > @@ -0,0 +1,13 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Stack trace functions.
> > + *
> > + * Copyright 2018, Murilo Opsfelder Araujo, IBM Corporation.
> > + */
> > +
> > +#ifndef _ASM_POWERPC_STACKTRACE_H
> > +#define _ASM_POWERPC_STACKTRACE_H
> > +
> > +void show_user_instructions(struct pt_regs *regs);
> > +
> > +#endif /* _ASM_POWERPC_STACKTRACE_H */
> > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> > index e9533b4d2f08..364645ac732c 100644
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -1299,6 +1299,46 @@ static void show_instructions(struct pt_regs *regs)
> > pr_cont("\n");
> > }
> > +void show_user_instructions(struct pt_regs *regs)
> > +{
> > + int i;
> > + const char *prefix = KERN_INFO "%s[%d]: code: ";
> > + unsigned long pc = regs->nip - (instructions_to_print * 3 / 4 *
> > + sizeof(int));
> > +
> > + printk(prefix, current->comm, current->pid);
>
> Why not use pr_info() and remove KERN_INFO from *prefix ?
Because it doesn't compile:
arch/powerpc/kernel/process.c:1317:10: error: expected ‘)’ before ‘prefix’
pr_info(prefix, current->comm, current->pid);
^
./include/linux/printk.h:288:21: note: in definition of macro ‘pr_fmt’
#define pr_fmt(fmt) fmt
^
`pr_info(prefix, ...)` expands to `printk("\001" "6" prefix, ...)`,
which is an invalid string concatenation.
`pr_info("%s", ...)` expands to `printk("\001" "6" "%s", ...)`, which is
valid.
> > +
> > + for (i = 0; i < instructions_to_print; i++) {
> > + int instr;
> > +
> > + if (!(i % 8) && (i > 0)) {
> > + pr_cont("\n");
> > + printk(prefix, current->comm, current->pid);
> > + }
> > +
> > +#if !defined(CONFIG_BOOKE)
> > + /* If executing with the IMMU off, adjust pc rather
> > + * than print XXXXXXXX.
> > + */
> > + if (!(regs->msr & MSR_IR))
> > + pc = (unsigned long)phys_to_virt(pc);
>
> Shouldn't this be done outside of the loop, only once ?
I don't think so.
pc gets incremented at the bottom of the loop:
pc += sizeof(int);
Adjusting pc is necessary at each iteration. Leaving this block inside
the loop seems correct.
Cheers
Murilo
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Jason Wang @ 2018-08-03 2:41 UTC (permalink / raw)
To: Michael S. Tsirkin, Anshuman Khandual
Cc: virtualization, linux-kernel, linuxppc-dev, aik, robh, joe,
elfring, david, benh, mpe, hch, linuxram, haren, paulus, srikar
In-Reply-To: <20180802235332-mutt-send-email-mst@kernel.org>
On 2018年08月03日 04:55, Michael S. Tsirkin wrote:
> On Fri, Jul 20, 2018 at 09:29:37AM +0530, Anshuman Khandual wrote:
>> This patch series is the follow up on the discussions we had before about
>> the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
>> for virito devices (https://patchwork.kernel.org/patch/10417371/). There
>> were suggestions about doing away with two different paths of transactions
>> with the host/QEMU, first being the direct GPA and the other being the DMA
>> API based translations.
>>
>> First patch attempts to create a direct GPA mapping based DMA operations
>> structure called 'virtio_direct_dma_ops' with exact same implementation
>> of the direct GPA path which virtio core currently has but just wrapped in
>> a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
>> the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
>> existing semantics. The second patch does exactly that inside the function
>> virtio_finalize_features(). The third patch removes the default direct GPA
>> path from virtio core forcing it to use DMA API callbacks for all devices.
>> Now with that change, every device must have a DMA operations structure
>> associated with it. The fourth patch adds an additional hook which gives
>> the platform an opportunity to do yet another override if required. This
>> platform hook can be used on POWER Ultravisor based protected guests to
>> load up SWIOTLB DMA callbacks to do the required (as discussed previously
>> in the above mentioned thread how host is allowed to access only parts of
>> the guest GPA range) bounce buffering into the shared memory for all I/O
>> scatter gather buffers to be consumed on the host side.
>>
>> Please go through these patches and review whether this approach broadly
>> makes sense. I will appreciate suggestions, inputs, comments regarding
>> the patches or the approach in general. Thank you.
> Jason did some work on profiling this. Unfortunately he reports
> about 4% extra overhead from this switch on x86 with no vIOMMU.
The test is rather simple, just run pktgen (pktgen_sample01_simple.sh)
in guest and measure PPS on tap on host.
Thanks
>
> I expect he's writing up the data in more detail, but
> just wanted to let you know this would be one more
> thing to debug before we can just switch to DMA APIs.
>
>
>> Anshuman Khandual (4):
>> virtio: Define virtio_direct_dma_ops structure
>> virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
>> virtio: Force virtio core to use DMA API callbacks for all virtio devices
>> virtio: Add platform specific DMA API translation for virito devices
>>
>> arch/powerpc/include/asm/dma-mapping.h | 6 +++
>> arch/powerpc/platforms/pseries/iommu.c | 6 +++
>> drivers/virtio/virtio.c | 72 ++++++++++++++++++++++++++++++++++
>> drivers/virtio/virtio_pci_common.h | 3 ++
>> drivers/virtio/virtio_ring.c | 65 +-----------------------------
>> 5 files changed, 89 insertions(+), 63 deletions(-)
>>
>> --
>> 2.9.3
^ permalink raw reply
* Re: [PATCH v4 5/6] powerpc: Add show_user_instructions()
From: Joe Perches @ 2018-08-03 1:22 UTC (permalink / raw)
To: Murilo Opsfelder Araujo, Christophe LEROY
Cc: linux-kernel, Alastair D'Silva, Andrew Donnellan,
Balbir Singh, Benjamin Herrenschmidt, Cyril Bur,
Eric W . Biederman, Michael Ellerman, Michael Neuling,
Nicholas Piggin, Paul Mackerras, Simon Guo, Sukadev Bhattiprolu,
Tobin C . Harding, linuxppc-dev, Segher Boessenkool
In-Reply-To: <20180803004201.GA5891@kermit-br-ibm-com>
On Thu, 2018-08-02 at 21:42 -0300, Murilo Opsfelder Araujo wrote:
> > > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
[]
> > > @@ -1299,6 +1299,46 @@ static void show_instructions(struct pt_regs *regs)
> > > pr_cont("\n");
> > > }
> > > +void show_user_instructions(struct pt_regs *regs)
> > > +{
> > > + int i;
> > > + const char *prefix = KERN_INFO "%s[%d]: code: ";
> > > + unsigned long pc = regs->nip - (instructions_to_print * 3 / 4 *
> > > + sizeof(int));
> > > +
> > > + printk(prefix, current->comm, current->pid);
> >
> > Why not use pr_info() and remove KERN_INFO from *prefix ?
>
> Because it doesn't compile:
>
> arch/powerpc/kernel/process.c:1317:10: error: expected ‘)’ before ‘prefix’
> pr_info(prefix, current->comm, current->pid);
> ^
> ./include/linux/printk.h:288:21: note: in definition of macro ‘pr_fmt’
> #define pr_fmt(fmt) fmt
> ^
What being suggested is using:
pr_info("%s[%d]: code: ", current->comm, current->pid);
^ permalink raw reply
* [PATCH] powerpc/powernv: Fix concurrency issue with npu->mmio_atsd_usage
From: Reza Arbab @ 2018-08-03 4:03 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Alistair Popple
We've encountered a performance issue when multiple processors stress
{get,put}_mmio_atsd_reg(). These functions contend for mmio_atsd_usage,
an unsigned long used as a bitmask.
The accesses to mmio_atsd_usage are done using test_and_set_bit_lock()
and clear_bit_unlock(). As implemented, both of these will require a
(successful) stwcx to that same cache line.
What we end up with is thread A, attempting to unlock, being slowed by
other threads repeatedly attempting to lock. A's stwcx instructions fail
and retry because the memory reservation is lost every time a different
thread beats it to the punch.
There may be a long-term way to fix this at a larger scale, but for now
resolve the immediate problem by gating our call to
test_and_set_bit_lock() with one to test_bit(), which is obviously
implemented without using a store.
Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
---
arch/powerpc/platforms/powernv/npu-dma.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 8cdf91f..c773465 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -437,8 +437,9 @@ static int get_mmio_atsd_reg(struct npu *npu)
int i;
for (i = 0; i < npu->mmio_atsd_count; i++) {
- if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
- return i;
+ if (!test_bit(i, &npu->mmio_atsd_usage))
+ if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
+ return i;
}
return -ENOSPC;
--
1.8.3.1
^ permalink raw reply related
* [PATCH 1/2] powerpc/64s: move machine check SLB flushing to mm/slb.c
From: Nicholas Piggin @ 2018-08-03 4:13 UTC (permalink / raw)
To: linuxppc-dev
Cc: Nicholas Piggin, kvm-ppc, Gautham R . Shenoy,
Mahesh Jagannath Salgaonkar, Aneesh Kumar K.V, Akshay Adiga
The machine check code that flushes and restores bolted segments in
real mode belongs in mm/slb.c. This will be used by pseries machine
check and idle code.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/include/asm/book3s/64/mmu-hash.h | 3 ++
arch/powerpc/kernel/mce_power.c | 21 ++--------
arch/powerpc/mm/slb.c | 38 +++++++++++++++++++
3 files changed, 44 insertions(+), 18 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 2f74bdc805e0..d4e398185b3a 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -497,6 +497,9 @@ extern void hpte_init_native(void);
extern void slb_initialize(void);
extern void slb_flush_and_rebolt(void);
+extern void slb_flush_all_realmode(void);
+extern void __slb_restore_bolted_realmode(void);
+extern void slb_restore_bolted_realmode(void);
extern void slb_vmalloc_update(void);
extern void slb_set_size(u16 size);
diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index d6756af6ec78..50f7b9817246 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -62,11 +62,8 @@ static unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
#ifdef CONFIG_PPC_BOOK3S_64
static void flush_and_reload_slb(void)
{
- struct slb_shadow *slb;
- unsigned long i, n;
-
/* Invalidate all SLBs */
- asm volatile("slbmte %0,%0; slbia" : : "r" (0));
+ slb_flush_all_realmode();
#ifdef CONFIG_KVM_BOOK3S_HANDLER
/*
@@ -76,22 +73,10 @@ static void flush_and_reload_slb(void)
if (get_paca()->kvm_hstate.in_guest)
return;
#endif
-
- /* For host kernel, reload the SLBs from shadow SLB buffer. */
- slb = get_slb_shadow();
- if (!slb)
+ if (early_radix_enabled())
return;
- n = min_t(u32, be32_to_cpu(slb->persistent), SLB_MIN_SIZE);
-
- /* Load up the SLB entries from shadow SLB */
- for (i = 0; i < n; i++) {
- unsigned long rb = be64_to_cpu(slb->save_area[i].esid);
- unsigned long rs = be64_to_cpu(slb->save_area[i].vsid);
-
- rb = (rb & ~0xFFFul) | i;
- asm volatile("slbmte %0,%1" : : "r" (rs), "r" (rb));
- }
+ slb_restore_bolted_realmode();
}
#endif
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index cb796724a6fc..136db8652577 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -90,6 +90,44 @@ static inline void create_shadowed_slbe(unsigned long ea, int ssize,
: "memory" );
}
+/*
+ * Insert bolted entries into SLB (which may not be empty).
+ */
+void __slb_restore_bolted_realmode(void)
+{
+ struct slb_shadow *p = get_slb_shadow();
+ enum slb_index index;
+
+ /* No isync needed because realmode. */
+ for (index = 0; index < SLB_NUM_BOLTED; index++) {
+ asm volatile("slbmte %0,%1" :
+ : "r" (be64_to_cpu(p->save_area[index].vsid)),
+ "r" (be64_to_cpu(p->save_area[index].esid)));
+ }
+}
+
+/*
+ * Insert bolted entries into an empty SLB.
+ * This is not the same as rebolt because the bolted segments
+ * (e.g., kstack) are not changed (rebolted).
+ */
+void slb_restore_bolted_realmode(void)
+{
+ __slb_restore_bolted_realmode();
+ get_paca()->slb_cache_ptr = 0;
+}
+
+/*
+ * This flushes all SLB entries including 0, so it must be realmode.
+ */
+void slb_flush_all_realmode(void)
+{
+ /*
+ * This flushes all SLB entries including 0, so it must be realmode.
+ */
+ asm volatile("slbmte %0,%0; slbia" : : "r" (0));
+}
+
static void __slb_flush_and_rebolt(void)
{
/* If you change this make sure you change SLB_NUM_BOLTED
--
2.17.0
^ permalink raw reply related
* [PATCH 2/2] powerpc/64s: reimplement book3s idle code in C
From: Nicholas Piggin @ 2018-08-03 4:13 UTC (permalink / raw)
To: linuxppc-dev
Cc: Nicholas Piggin, kvm-ppc, Gautham R . Shenoy,
Mahesh Jagannath Salgaonkar, Aneesh Kumar K.V, Akshay Adiga
In-Reply-To: <20180803041350.25493-1-npiggin@gmail.com>
Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
speific HV idle code to the powernv platform code. Generic book3s
assembly stubs are kept in common code and used only to save and
restore the stack frame and non-volatile GPRs just before going to
idle and just after waking up, which can return into C code.
Moving the HMI, SPR, OPAL, locking, etc. to C is the only real
way this stuff will cope with non-trivial new CPU implementation
details, firmware changes, etc., without becoming unmaintainable.
This is not a strict translation to C code, there are some
differences.
- Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
but saves and restores them all explicitly.
- The optimisation where EC=ESL=0 idle modes did not have to save
GPRs or change MSR is restored, because it's now simple to do.
State loss modes that did not actually lose GPRs can use this
optimization too.
- KVM secondary entry code is now more of a call/return style
rather than spaghetti. nap_state_lost is not required beccause
KVM always returns via NVGPR restorig path.
This seems pretty solid, so needs more review and testing now. The
KVM changes are pretty significant and complicated. POWER7 needs to
be tested too.
Open question:
- Why does P9 restore some of the PMU SPRs (but not others), and
P8 only zeroes them?
Since RFC v1:
- Now tested and working with POWER9 hash and radix.
- KVM support added. This took a bit of work to untangle and might
still have some issues, but POWER9 seems to work including hash on
radix with dependent threads mode.
- This snowballed a bit because of KVM and other details making it
not feasible to leave POWER7/8 code alone. That's only half done
at the moment.
- So far this trades about 800 lines of asm for 500 of C. With POWER7/8
support done it might be another hundred or so lines of C.
Since RFC v2:
- Fixed deep state SLB reloading
- Now tested and working with POWER8.
- Accounted for most feedback.
Since RFC v3:
- Rebased to powerpc merge + idle state bugfix
- Split SLB flush/restore code out and shared with MCE code (pseries
MCE patches can also use).
- More testing on POWER8 including KVM with secondaries.
- Performance testing looks good. EC=ESL=0 is about 5% faster, other
stop states look a little faster too.
- Adjusted SPR saving to handler POWER7, haven't tested it.
---
arch/powerpc/include/asm/cpuidle.h | 19 +-
arch/powerpc/include/asm/paca.h | 40 +-
arch/powerpc/include/asm/processor.h | 9 +-
arch/powerpc/include/asm/reg.h | 7 +-
arch/powerpc/kernel/asm-offsets.c | 18 -
arch/powerpc/kernel/dt_cpu_ftrs.c | 21 +-
arch/powerpc/kernel/exceptions-64s.S | 17 +-
arch/powerpc/kernel/idle_book3s.S | 998 ++---------------------
arch/powerpc/kernel/setup-common.c | 4 +-
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 90 +-
arch/powerpc/platforms/powernv/idle.c | 816 ++++++++++++++----
arch/powerpc/platforms/powernv/subcore.c | 2 +-
arch/powerpc/xmon/xmon.c | 25 +-
13 files changed, 859 insertions(+), 1207 deletions(-)
diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h
index 43e5f31fe64d..9844b3ded187 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -27,10 +27,11 @@
* the THREAD_WINKLE_BITS are set, which indicate which threads have not
* yet woken from the winkle state.
*/
-#define PNV_CORE_IDLE_LOCK_BIT 0x10000000
+#define NR_PNV_CORE_IDLE_LOCK_BIT 28
+#define PNV_CORE_IDLE_LOCK_BIT (1ULL << NR_PNV_CORE_IDLE_LOCK_BIT)
+#define PNV_CORE_IDLE_WINKLE_COUNT_SHIFT 16
#define PNV_CORE_IDLE_WINKLE_COUNT 0x00010000
-#define PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT 0x00080000
#define PNV_CORE_IDLE_WINKLE_COUNT_BITS 0x000F0000
#define PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT 8
#define PNV_CORE_IDLE_THREAD_WINKLE_BITS 0x0000FF00
@@ -68,16 +69,6 @@
#define ERR_DEEP_STATE_ESL_MISMATCH -2
#ifndef __ASSEMBLY__
-/* Additional SPRs that need to be saved/restored during stop */
-struct stop_sprs {
- u64 pid;
- u64 ldbar;
- u64 fscr;
- u64 hfscr;
- u64 mmcr1;
- u64 mmcr2;
- u64 mmcra;
-};
#define PNV_IDLE_NAME_LEN 16
struct pnv_idle_states_t {
@@ -92,10 +83,6 @@ struct pnv_idle_states_t {
extern struct pnv_idle_states_t *pnv_idle_states;
extern int nr_pnv_idle_states;
-extern u32 pnv_fastsleep_workaround_at_entry[];
-extern u32 pnv_fastsleep_workaround_at_exit[];
-
-extern u64 pnv_first_deep_stop_state;
unsigned long pnv_cpu_offline(unsigned int cpu);
int validate_psscr_val_mask(u64 *psscr_val, u64 *psscr_mask, u32 flags);
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 4e9cede5a7e7..d2cee5ebaaa1 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -168,7 +168,6 @@ struct paca_struct {
u8 irq_happened; /* irq happened while soft-disabled */
u8 io_sync; /* writel() needs spin_unlock sync */
u8 irq_work_pending; /* IRQ_WORK interrupt while soft-disable */
- u8 nap_state_lost; /* NV GPR values lost in power7_idle */
#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
u8 pmcregs_in_use; /* pseries puts this in lppaca */
#endif
@@ -178,23 +177,30 @@ struct paca_struct {
#endif
#ifdef CONFIG_PPC_POWERNV
- /* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
- u32 *core_idle_state_ptr;
- u8 thread_idle_state; /* PNV_THREAD_RUNNING/NAP/SLEEP */
- /* Mask to indicate thread id in core */
- u8 thread_mask;
- /* Mask to denote subcore sibling threads */
- u8 subcore_sibling_mask;
- /* Flag to request this thread not to stop */
- atomic_t dont_stop;
- /* The PSSCR value that the kernel requested before going to stop */
- u64 requested_psscr;
+ /* PowerNV idle fields */
+ /* PNV_CORE_IDLE_* bits, all siblings work on thread 0 paca */
+ unsigned long idle_state;
+ union {
+ /* P7/P8 specific fields */
+ struct {
+ /* PNV_THREAD_RUNNING/NAP/SLEEP */
+ u8 thread_idle_state;
+ /* Mask to indicate thread id in core */
+ u8 thread_mask;
+ /* Mask to denote subcore sibling threads */
+ u8 subcore_sibling_mask;
+ };
- /*
- * Save area for additional SPRs that need to be
- * saved/restored during cpuidle stop.
- */
- struct stop_sprs stop_sprs;
+ /* P9 specific fields */
+ struct {
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+ /* The PSSCR value that the kernel requested before going to stop */
+ u64 requested_psscr;
+ /* Flag to request this thread not to stop */
+ atomic_t dont_stop;
+#endif
+ };
+ };
#endif
#ifdef CONFIG_PPC_BOOK3S_64
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index 52fadded5c1e..3e00d959f628 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -506,14 +506,17 @@ static inline unsigned long get_clean_sp(unsigned long sp, int is_32)
}
#endif
+/* asm stubs */
+extern unsigned long isa3_idle_stop_noloss(unsigned long psscr_val);
+extern unsigned long isa3_idle_stop_mayloss(unsigned long psscr_val);
+extern unsigned long isa206_idle_insn_mayloss(unsigned long type);
+
extern unsigned long cpuidle_disable;
enum idle_boot_override {IDLE_NO_OVERRIDE = 0, IDLE_POWERSAVE_OFF};
extern int powersave_nap; /* set if nap mode can be used in idle loop */
-extern unsigned long power7_idle_insn(unsigned long type); /* PNV_THREAD_NAP/etc*/
+
extern void power7_idle_type(unsigned long type);
-extern unsigned long power9_idle_stop(unsigned long psscr_val);
-extern unsigned long power9_offline_stop(unsigned long psscr_val);
extern void power9_idle_type(unsigned long stop_psscr_val,
unsigned long stop_psscr_mask);
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 486b7c83b8c5..6dd294b1b216 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -752,10 +752,9 @@
#define SRR1_WAKERESET 0x00100000 /* System reset */
#define SRR1_WAKEHDBELL 0x000c0000 /* Hypervisor doorbell on P8 */
#define SRR1_WAKESTATE 0x00030000 /* Powersave exit mask [46:47] */
-#define SRR1_WS_DEEPEST 0x00030000 /* Some resources not maintained,
- * may not be recoverable */
-#define SRR1_WS_DEEPER 0x00020000 /* Some resources not maintained */
-#define SRR1_WS_DEEP 0x00010000 /* All resources maintained */
+#define SRR1_WS_HVLOSS 0x00030000 /* HV resources not maintained */
+#define SRR1_WS_GPRLOSS 0x00020000 /* GPRs not maintained */
+#define SRR1_WS_NOLOSS 0x00010000 /* All resources maintained */
#define SRR1_PROGTM 0x00200000 /* TM Bad Thing */
#define SRR1_PROGFPE 0x00100000 /* Floating Point Enabled */
#define SRR1_PROGILL 0x00080000 /* Illegal instruction */
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 89cf15566c4e..7834256585f1 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -256,7 +256,6 @@ int main(void)
OFFSET(ACCOUNT_USER_TIME, paca_struct, accounting.utime);
OFFSET(ACCOUNT_SYSTEM_TIME, paca_struct, accounting.stime);
OFFSET(PACA_TRAP_SAVE, paca_struct, trap_save);
- OFFSET(PACA_NAPSTATELOST, paca_struct, nap_state_lost);
OFFSET(PACA_SPRG_VDSO, paca_struct, sprg_vdso);
#else /* CONFIG_PPC64 */
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
@@ -761,23 +760,6 @@ int main(void)
OFFSET(VCPU_TIMING_LAST_ENTER_TBL, kvm_vcpu, arch.timing_last_enter.tv32.tbl);
#endif
-#ifdef CONFIG_PPC_POWERNV
- OFFSET(PACA_CORE_IDLE_STATE_PTR, paca_struct, core_idle_state_ptr);
- OFFSET(PACA_THREAD_IDLE_STATE, paca_struct, thread_idle_state);
- OFFSET(PACA_THREAD_MASK, paca_struct, thread_mask);
- OFFSET(PACA_SUBCORE_SIBLING_MASK, paca_struct, subcore_sibling_mask);
- OFFSET(PACA_REQ_PSSCR, paca_struct, requested_psscr);
- OFFSET(PACA_DONT_STOP, paca_struct, dont_stop);
-#define STOP_SPR(x, f) OFFSET(x, paca_struct, stop_sprs.f)
- STOP_SPR(STOP_PID, pid);
- STOP_SPR(STOP_LDBAR, ldbar);
- STOP_SPR(STOP_FSCR, fscr);
- STOP_SPR(STOP_HFSCR, hfscr);
- STOP_SPR(STOP_MMCR1, mmcr1);
- STOP_SPR(STOP_MMCR2, mmcr2);
- STOP_SPR(STOP_MMCRA, mmcra);
-#endif
-
DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
DEFINE(PPC_DBELL_MSGTYPE, PPC_DBELL_MSGTYPE);
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index f432054234a4..d635d78facdc 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -71,7 +71,6 @@ static int hv_mode;
static struct {
u64 lpcr;
- u64 lpcr_clear;
u64 hfscr;
u64 fscr;
} system_registers;
@@ -80,24 +79,7 @@ static void (*init_pmu_registers)(void);
static void __restore_cpu_cpufeatures(void)
{
- u64 lpcr;
-
- /*
- * LPCR is restored by the power on engine already. It can be changed
- * after early init e.g., by radix enable, and we have no unified API
- * for saving and restoring such SPRs.
- *
- * This ->restore hook should really be removed from idle and register
- * restore moved directly into the idle restore code, because this code
- * doesn't know how idle is implemented or what it needs restored here.
- *
- * The best we can do to accommodate secondary boot and idle restore
- * for now is "or" LPCR with existing.
- */
- lpcr = mfspr(SPRN_LPCR);
- lpcr |= system_registers.lpcr;
- lpcr &= ~system_registers.lpcr_clear;
- mtspr(SPRN_LPCR, lpcr);
+ mtspr(SPRN_LPCR, system_registers.lpcr);
if (hv_mode) {
mtspr(SPRN_LPID, 0);
mtspr(SPRN_HFSCR, system_registers.hfscr);
@@ -318,7 +300,6 @@ static int __init feat_enable_mmu_hash_v3(struct dt_cpu_feature *f)
{
u64 lpcr;
- system_registers.lpcr_clear |= (LPCR_ISL | LPCR_UPRT | LPCR_HR);
lpcr = mfspr(SPRN_LPCR);
lpcr &= ~(LPCR_ISL | LPCR_UPRT | LPCR_HR);
mtspr(SPRN_LPCR, lpcr);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 7a672dafd94f..fe1d9b091ac7 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -136,8 +136,9 @@ TRAMP_KVM(PACA_EXNMI, 0x100)
#ifdef CONFIG_PPC_P7_NAP
EXC_COMMON_BEGIN(system_reset_idle_common)
- mfspr r12,SPRN_SRR1
- b pnv_powersave_wakeup
+ mfspr r3,SPRN_SRR1
+ bltlr cr3 /* no state loss, return to idle caller */
+ b idle_return_gpr_loss
#endif
/*
@@ -416,17 +417,17 @@ EXC_COMMON_BEGIN(machine_check_idle_common)
* Then decrement MCE nesting after finishing with the stack.
*/
ld r3,_MSR(r1)
+ ld r4,_LINK(r1)
lhz r11,PACA_IN_MCE(r13)
subi r11,r11,1
sth r11,PACA_IN_MCE(r13)
- /* Turn off the RI bit because SRR1 is used by idle wakeup code. */
- /* Recoverability could be improved by reducing the use of SRR1. */
- li r11,0
- mtmsrd r11,1
-
- b pnv_powersave_wakeup_mce
+ mtlr r4
+ rlwinm r10,r3,47-31,30,31
+ cmpwi cr3,r10,2
+ bltlr cr3 /* no state loss, return to idle caller */
+ b idle_return_gpr_loss
#endif
/*
* Handle machine check early in real mode. We come here with
diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index 7f5ac2e8581b..6375b29d1c12 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -1,6 +1,7 @@
/*
- * This file contains idle entry/exit functions for POWER7,
- * POWER8 and POWER9 CPUs.
+ * This file contains general idle entry/exit functions. The platform / CPU
+ * must call the correct save/restore functions and ensure SPRs are saved
+ * and restored correctly, handle KVM, interrupts, etc.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
@@ -8,220 +9,67 @@
* 2 of the License, or (at your option) any later version.
*/
-#include <linux/threads.h>
-#include <asm/processor.h>
-#include <asm/page.h>
-#include <asm/cputable.h>
-#include <asm/thread_info.h>
#include <asm/ppc_asm.h>
#include <asm/asm-offsets.h>
#include <asm/ppc-opcode.h>
-#include <asm/hw_irq.h>
-#include <asm/kvm_book3s_asm.h>
-#include <asm/opal.h>
#include <asm/cpuidle.h>
-#include <asm/exception-64s.h>
-#include <asm/book3s/64/mmu-hash.h>
-#include <asm/mmu.h>
-#include <asm/asm-compat.h>
-#include <asm/feature-fixups.h>
-
-#undef DEBUG
-
-/*
- * Use unused space in the interrupt stack to save and restore
- * registers for winkle support.
- */
-#define _MMCR0 GPR0
-#define _SDR1 GPR3
-#define _PTCR GPR3
-#define _RPR GPR4
-#define _SPURR GPR5
-#define _PURR GPR6
-#define _TSCR GPR7
-#define _DSCR GPR8
-#define _AMOR GPR9
-#define _WORT GPR10
-#define _WORC GPR11
-#define _LPCR GPR12
-
-#define PSSCR_EC_ESL_MASK_SHIFTED (PSSCR_EC | PSSCR_ESL) >> 16
-
- .text
/*
- * Used by threads before entering deep idle states. Saves SPRs
- * in interrupt stack frame
- */
-save_sprs_to_stack:
- /*
- * Note all register i.e per-core, per-subcore or per-thread is saved
- * here since any thread in the core might wake up first
- */
-BEGIN_FTR_SECTION
- /*
- * Note - SDR1 is dropped in Power ISA v3. Hence not restoring
- * SDR1 here
- */
- mfspr r3,SPRN_PTCR
- std r3,_PTCR(r1)
- mfspr r3,SPRN_LPCR
- std r3,_LPCR(r1)
-FTR_SECTION_ELSE
- mfspr r3,SPRN_SDR1
- std r3,_SDR1(r1)
-ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
- mfspr r3,SPRN_RPR
- std r3,_RPR(r1)
- mfspr r3,SPRN_SPURR
- std r3,_SPURR(r1)
- mfspr r3,SPRN_PURR
- std r3,_PURR(r1)
- mfspr r3,SPRN_TSCR
- std r3,_TSCR(r1)
- mfspr r3,SPRN_DSCR
- std r3,_DSCR(r1)
- mfspr r3,SPRN_AMOR
- std r3,_AMOR(r1)
- mfspr r3,SPRN_WORT
- std r3,_WORT(r1)
- mfspr r3,SPRN_WORC
- std r3,_WORC(r1)
-/*
- * On POWER9, there are idle states such as stop4, invoked via cpuidle,
- * that lose hypervisor resources. In such cases, we need to save
- * additional SPRs before entering those idle states so that they can
- * be restored to their older values on wakeup from the idle state.
+ * Desired PSSCR in r3
*
- * On POWER8, the only such deep idle state is winkle which is used
- * only in the context of CPU-Hotplug, where these additional SPRs are
- * reinitiazed to a sane value. Hence there is no need to save/restore
- * these SPRs.
+ * No state will be lost regardless of wakeup mechanism (interrupt or NIA).
+ * Interrupt driven wakeup may clobber volatiles, and should blr (with LR
+ * unchanged) to return to caller with r3 set according to caller's expected
+ * return code (for Book3S/64 that is SRR1).
*/
-BEGIN_FTR_SECTION
- blr
-END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
-
-power9_save_additional_sprs:
- mfspr r3, SPRN_PID
- mfspr r4, SPRN_LDBAR
- std r3, STOP_PID(r13)
- std r4, STOP_LDBAR(r13)
-
- mfspr r3, SPRN_FSCR
- mfspr r4, SPRN_HFSCR
- std r3, STOP_FSCR(r13)
- std r4, STOP_HFSCR(r13)
-
- mfspr r3, SPRN_MMCRA
- mfspr r4, SPRN_MMCR0
- std r3, STOP_MMCRA(r13)
- std r4, _MMCR0(r1)
-
- mfspr r3, SPRN_MMCR1
- mfspr r4, SPRN_MMCR2
- std r3, STOP_MMCR1(r13)
- std r4, STOP_MMCR2(r13)
- blr
-
-power9_restore_additional_sprs:
- ld r3,_LPCR(r1)
- ld r4, STOP_PID(r13)
- mtspr SPRN_LPCR,r3
- mtspr SPRN_PID, r4
-
- ld r3, STOP_LDBAR(r13)
- ld r4, STOP_FSCR(r13)
- mtspr SPRN_LDBAR, r3
- mtspr SPRN_FSCR, r4
-
- ld r3, STOP_HFSCR(r13)
- ld r4, STOP_MMCRA(r13)
- mtspr SPRN_HFSCR, r3
- mtspr SPRN_MMCRA, r4
-
- ld r3, _MMCR0(r1)
- ld r4, STOP_MMCR1(r13)
- mtspr SPRN_MMCR0, r3
- mtspr SPRN_MMCR1, r4
-
- ld r3, STOP_MMCR2(r13)
- ld r4, PACA_SPRG_VDSO(r13)
- mtspr SPRN_MMCR2, r3
- mtspr SPRN_SPRG3, r4
+_GLOBAL(isa3_idle_stop_noloss)
+ mtspr SPRN_PSSCR,r3
+ PPC_STOP
+ li r3,0
blr
/*
- * Used by threads when the lock bit of core_idle_state is set.
- * Threads will spin in HMT_LOW until the lock bit is cleared.
- * r14 - pointer to core_idle_state
- * r15 - used to load contents of core_idle_state
- * r9 - used as a temporary variable
+ * Desired PSSCR in r3
+ *
+ * GPRs may be lost, so they are saved here. Wakeup is by interrupt only.
+ * Wakeup interrupt returns to the caller by calling isa3_idle_wake_gpr_loss
+ * with r3 set to desired return value.
+ *
+ * A wakeup without GPR loss may alteratively be handled as in
+ * isa3_idle_stop_noloss as an optimisation.
+ *
+ * Caller is responsible for restoring SPRs, MSR, etc.
*/
-
-core_idle_lock_held:
- HMT_LOW
-3: lwz r15,0(r14)
- andis. r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
- bne 3b
- HMT_MEDIUM
- lwarx r15,0,r14
- andis. r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
- bne- core_idle_lock_held
- blr
+_GLOBAL(isa3_idle_stop_mayloss)
+ mtspr SPRN_PSSCR,r3
+ std r1,PACAR1(r13)
+ mflr r4
+ mfcr r5
+ /* use stack red zone rather than a new frame */
+ addi r6,r1,-INT_FRAME_SIZE
+ SAVE_GPR(2, r6)
+ SAVE_NVGPRS(r6)
+ std r4,_LINK(r6)
+ std r5,_CCR(r6)
+ PPC_STOP
+ b . /* catch bugs */
/*
- * Pass requested state in r3:
- * r3 - PNV_THREAD_NAP/SLEEP/WINKLE in POWER8
- * - Requested PSSCR value in POWER9
+ * Desired return value in r3
*
- * Address of idle handler to branch to in realmode in r4
+ * Idle wakeup can call this after calling isa3_idle_stop_loss to
+ * return to its caller with r3 as return code.
*/
-pnv_powersave_common:
- /* Use r3 to pass state nap/sleep/winkle */
- /* NAP is a state loss, we create a regs frame on the
- * stack, fill it up with the state we care about and
- * stick a pointer to it in PACAR1. We really only
- * need to save PC, some CR bits and the NV GPRs,
- * but for now an interrupt frame will do.
- */
- mtctr r4
-
- mflr r0
- std r0,16(r1)
- stdu r1,-INT_FRAME_SIZE(r1)
- std r0,_LINK(r1)
- std r0,_NIP(r1)
-
- /* We haven't lost state ... yet */
- li r0,0
- stb r0,PACA_NAPSTATELOST(r13)
-
- /* Continue saving state */
- SAVE_GPR(2, r1)
- SAVE_NVGPRS(r1)
- mfcr r5
- std r5,_CCR(r1)
- std r1,PACAR1(r13)
-
-BEGIN_FTR_SECTION
- /*
- * POWER9 does not require real mode to stop, and presently does not
- * set hwthread_state for KVM (threads don't share MMU context), so
- * we can remain in virtual mode for this.
- */
- bctr
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
- /*
- * POWER8
- * Go to real mode to do the nap, as required by the architecture.
- * Also, we need to be in real mode before setting hwthread_state,
- * because as soon as we do that, another thread can switch
- * the MMU context to the guest.
- */
- LOAD_REG_IMMEDIATE(r7, MSR_IDLE)
- mtmsrd r7,0
- bctr
+_GLOBAL(idle_return_gpr_loss)
+ ld r1,PACAR1(r13)
+ addi r6,r1,-INT_FRAME_SIZE
+ ld r4,_LINK(r6)
+ ld r5,_CCR(r6)
+ REST_NVGPRS(r6)
+ REST_GPR(2, r6)
+ mtlr r4
+ mtcr r5
+ blr
/*
* This is the sequence required to execute idle instructions, as
@@ -234,723 +82,53 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
ld r0,0(r1); \
236: cmpd cr0,r0,r0; \
bne 236b; \
- IDLE_INST;
-
-
- .globl pnv_enter_arch207_idle_mode
-pnv_enter_arch207_idle_mode:
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
- /* Tell KVM we're entering idle */
- li r4,KVM_HWTHREAD_IN_IDLE
- /******************************************************/
- /* N O T E W E L L ! ! ! N O T E W E L L */
- /* The following store to HSTATE_HWTHREAD_STATE(r13) */
- /* MUST occur in real mode, i.e. with the MMU off, */
- /* and the MMU must stay off until we clear this flag */
- /* and test HSTATE_HWTHREAD_REQ(r13) in */
- /* pnv_powersave_wakeup in this file. */
- /* The reason is that another thread can switch the */
- /* MMU to a guest context whenever this flag is set */
- /* to KVM_HWTHREAD_IN_IDLE, and if the MMU was on, */
- /* that would potentially cause this thread to start */
- /* executing instructions from guest memory in */
- /* hypervisor mode, leading to a host crash or data */
- /* corruption, or worse. */
- /******************************************************/
- stb r4,HSTATE_HWTHREAD_STATE(r13)
-#endif
- stb r3,PACA_THREAD_IDLE_STATE(r13)
- cmpwi cr3,r3,PNV_THREAD_SLEEP
- bge cr3,2f
- IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)
- /* No return */
-2:
- /* Sleep or winkle */
- lbz r7,PACA_THREAD_MASK(r13)
- ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
- li r5,0
- beq cr3,3f
- lis r5,PNV_CORE_IDLE_WINKLE_COUNT@h
-3:
-lwarx_loop1:
- lwarx r15,0,r14
-
- andis. r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
- bnel- core_idle_lock_held
-
- add r15,r15,r5 /* Add if winkle */
- andc r15,r15,r7 /* Clear thread bit */
-
- andi. r9,r15,PNV_CORE_IDLE_THREAD_BITS
-
-/*
- * If cr0 = 0, then current thread is the last thread of the core entering
- * sleep. Last thread needs to execute the hardware bug workaround code if
- * required by the platform.
- * Make the workaround call unconditionally here. The below branch call is
- * patched out when the idle states are discovered if the platform does not
- * require it.
- */
-.global pnv_fastsleep_workaround_at_entry
-pnv_fastsleep_workaround_at_entry:
- beq fastsleep_workaround_at_entry
-
- stwcx. r15,0,r14
- bne- lwarx_loop1
- isync
-
-common_enter: /* common code for all the threads entering sleep or winkle */
- bgt cr3,enter_winkle
- IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
-
-fastsleep_workaround_at_entry:
- oris r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
- stwcx. r15,0,r14
- bne- lwarx_loop1
- isync
-
- /* Fast sleep workaround */
- li r3,1
- li r4,1
- bl opal_config_cpu_idle_state
-
- /* Unlock */
- xoris r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
- lwsync
- stw r15,0(r14)
- b common_enter
-
-enter_winkle:
- bl save_sprs_to_stack
-
- IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
-
-/*
- * r3 - PSSCR value corresponding to the requested stop state.
- */
-power_enter_stop:
-/*
- * Check if we are executing the lite variant with ESL=EC=0
- */
- andis. r4,r3,PSSCR_EC_ESL_MASK_SHIFTED
- clrldi r3,r3,60 /* r3 = Bits[60:63] = Requested Level (RL) */
- bne .Lhandle_esl_ec_set
- PPC_STOP
- li r3,0 /* Since we didn't lose state, return 0 */
- std r3, PACA_REQ_PSSCR(r13)
-
- /*
- * pnv_wakeup_noloss() expects r12 to contain the SRR1 value so
- * it can determine if the wakeup reason is an HMI in
- * CHECK_HMI_INTERRUPT.
- *
- * However, when we wakeup with ESL=0, SRR1 will not contain the wakeup
- * reason, so there is no point setting r12 to SRR1.
- *
- * Further, we clear r12 here, so that we don't accidentally enter the
- * HMI in pnv_wakeup_noloss() if the value of r12[42:45] == WAKE_HMI.
- */
- li r12, 0
- b pnv_wakeup_noloss
-
-.Lhandle_esl_ec_set:
-BEGIN_FTR_SECTION
- /*
- * POWER9 DD2.0 or earlier can incorrectly set PMAO when waking up after
- * a state-loss idle. Saving and restoring MMCR0 over idle is a
- * workaround.
- */
- mfspr r4,SPRN_MMCR0
- std r4,_MMCR0(r1)
-END_FTR_SECTION_IFCLR(CPU_FTR_POWER9_DD2_1)
-
-/*
- * Check if the requested state is a deep idle state.
- */
- LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state)
- ld r4,ADDROFF(pnv_first_deep_stop_state)(r5)
- cmpd r3,r4
- bge .Lhandle_deep_stop
- PPC_STOP /* Does not return (system reset interrupt) */
-
-.Lhandle_deep_stop:
-/*
- * Entering deep idle state.
- * Clear thread bit in PACA_CORE_IDLE_STATE, save SPRs to
- * stack and enter stop
- */
- lbz r7,PACA_THREAD_MASK(r13)
- ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
-
-lwarx_loop_stop:
- lwarx r15,0,r14
- andis. r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
- bnel- core_idle_lock_held
- andc r15,r15,r7 /* Clear thread bit */
-
- stwcx. r15,0,r14
- bne- lwarx_loop_stop
- isync
-
- bl save_sprs_to_stack
-
- PPC_STOP /* Does not return (system reset interrupt) */
-
-/*
- * Entered with MSR[EE]=0 and no soft-masked interrupts pending.
- * r3 contains desired idle state (PNV_THREAD_NAP/SLEEP/WINKLE).
- */
-_GLOBAL(power7_idle_insn)
- /* Now check if user or arch enabled NAP mode */
- LOAD_REG_ADDR(r4, pnv_enter_arch207_idle_mode)
- b pnv_powersave_common
-
-#define CHECK_HMI_INTERRUPT \
-BEGIN_FTR_SECTION_NESTED(66); \
- rlwinm r0,r12,45-31,0xf; /* extract wake reason field (P8) */ \
-FTR_SECTION_ELSE_NESTED(66); \
- rlwinm r0,r12,45-31,0xe; /* P7 wake reason field is 3 bits */ \
-ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66); \
- cmpwi r0,0xa; /* Hypervisor maintenance ? */ \
- bne+ 20f; \
- /* Invoke opal call to handle hmi */ \
- ld r2,PACATOC(r13); \
- ld r1,PACAR1(r13); \
- std r3,ORIG_GPR3(r1); /* Save original r3 */ \
- li r3,0; /* NULL argument */ \
- bl hmi_exception_realmode; \
- nop; \
- ld r3,ORIG_GPR3(r1); /* Restore original r3 */ \
-20: nop;
-
-/*
- * Entered with MSR[EE]=0 and no soft-masked interrupts pending.
- * r3 contains desired PSSCR register value.
- *
- * Offline (CPU unplug) case also must notify KVM that the CPU is
- * idle.
- */
-_GLOBAL(power9_offline_stop)
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
- /*
- * Tell KVM we're entering idle.
- * This does not have to be done in real mode because the P9 MMU
- * is independent per-thread. Some steppings share radix/hash mode
- * between threads, but in that case KVM has a barrier sync in real
- * mode before and after switching between radix and hash.
- */
- li r4,KVM_HWTHREAD_IN_IDLE
- stb r4,HSTATE_HWTHREAD_STATE(r13)
-#endif
- /* fall through */
-
-_GLOBAL(power9_idle_stop)
- std r3, PACA_REQ_PSSCR(r13)
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-BEGIN_FTR_SECTION
- sync
- lwz r5, PACA_DONT_STOP(r13)
- cmpwi r5, 0
- bne 1f
-END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_XER_SO_BUG)
-#endif
- mtspr SPRN_PSSCR,r3
- LOAD_REG_ADDR(r4,power_enter_stop)
- b pnv_powersave_common
- /* No return */
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-1:
- /*
- * We get here when TM / thread reconfiguration bug workaround
- * code wants to get the CPU into SMT4 mode, and therefore
- * we are being asked not to stop.
- */
- li r3, 0
- std r3, PACA_REQ_PSSCR(r13)
- blr /* return 0 for wakeup cause / SRR1 value */
-#endif
+ IDLE_INST; \
+ b . /* catch bugs */
/*
- * Called from machine check handler for powersave wakeups.
- * Low level machine check processing has already been done. Now just
- * go through the wake up path to get everything in order.
+ * Desired instruction type in r3
*
- * r3 - The original SRR1 value.
- * Original SRR[01] have been clobbered.
- * MSR_RI is clear.
- */
-.global pnv_powersave_wakeup_mce
-pnv_powersave_wakeup_mce:
- /* Set cr3 for pnv_powersave_wakeup */
- rlwinm r11,r3,47-31,30,31
- cmpwi cr3,r11,2
-
- /*
- * Now put the original SRR1 with SRR1_WAKEMCE_RESVD as the wake
- * reason into r12, which allows reuse of the system reset wakeup
- * code without being mistaken for another type of wakeup.
- */
- oris r12,r3,SRR1_WAKEMCE_RESVD@h
-
- b pnv_powersave_wakeup
-
-/*
- * Called from reset vector for powersave wakeups.
- * cr3 - set to gt if waking up with partial/complete hypervisor state loss
- * r12 - SRR1
- */
-.global pnv_powersave_wakeup
-pnv_powersave_wakeup:
- ld r2, PACATOC(r13)
-
-BEGIN_FTR_SECTION
- bl pnv_restore_hyp_resource_arch300
-FTR_SECTION_ELSE
- bl pnv_restore_hyp_resource_arch207
-ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
-
- li r0,PNV_THREAD_RUNNING
- stb r0,PACA_THREAD_IDLE_STATE(r13) /* Clear thread state */
-
- mr r3,r12
-
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
- lbz r0,HSTATE_HWTHREAD_STATE(r13)
- cmpwi r0,KVM_HWTHREAD_IN_KERNEL
- beq 0f
- li r0,KVM_HWTHREAD_IN_KERNEL
- stb r0,HSTATE_HWTHREAD_STATE(r13)
- /* Order setting hwthread_state vs. testing hwthread_req */
- sync
-0: lbz r0,HSTATE_HWTHREAD_REQ(r13)
- cmpwi r0,0
- beq 1f
- b kvm_start_guest
-1:
-#endif
-
- /* Return SRR1 from power7_nap() */
- blt cr3,pnv_wakeup_noloss
- b pnv_wakeup_loss
-
-/*
- * Check whether we have woken up with hypervisor state loss.
- * If yes, restore hypervisor state and return back to link.
- *
- * cr3 - set to gt if waking up with partial/complete hypervisor state loss
- */
-pnv_restore_hyp_resource_arch300:
- /*
- * Workaround for POWER9, if we lost resources, the ERAT
- * might have been mixed up and needs flushing. We also need
- * to reload MMCR0 (see comment above). We also need to set
- * then clear bit 60 in MMCRA to ensure the PMU starts running.
- */
- blt cr3,1f
-BEGIN_FTR_SECTION
- PPC_INVALIDATE_ERAT
- ld r1,PACAR1(r13)
- ld r4,_MMCR0(r1)
- mtspr SPRN_MMCR0,r4
-END_FTR_SECTION_IFCLR(CPU_FTR_POWER9_DD2_1)
- mfspr r4,SPRN_MMCRA
- ori r4,r4,(1 << (63-60))
- mtspr SPRN_MMCRA,r4
- xori r4,r4,(1 << (63-60))
- mtspr SPRN_MMCRA,r4
-1:
- /*
- * POWER ISA 3. Use PSSCR to determine if we
- * are waking up from deep idle state
- */
- LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state)
- ld r4,ADDROFF(pnv_first_deep_stop_state)(r5)
-
- /*
- * 0-3 bits correspond to Power-Saving Level Status
- * which indicates the idle state we are waking up from
- */
- mfspr r5, SPRN_PSSCR
- rldicl r5,r5,4,60
- li r0, 0 /* clear requested_psscr to say we're awake */
- std r0, PACA_REQ_PSSCR(r13)
- cmpd cr4,r5,r4
- bge cr4,pnv_wakeup_tb_loss /* returns to caller */
-
- blr /* Waking up without hypervisor state loss. */
-
-/* Same calling convention as arch300 */
-pnv_restore_hyp_resource_arch207:
- /*
- * POWER ISA 2.07 or less.
- * Check if we slept with sleep or winkle.
- */
- lbz r4,PACA_THREAD_IDLE_STATE(r13)
- cmpwi cr2,r4,PNV_THREAD_NAP
- bgt cr2,pnv_wakeup_tb_loss /* Either sleep or Winkle */
-
- /*
- * We fall through here if PACA_THREAD_IDLE_STATE shows we are waking
- * up from nap. At this stage CR3 shouldn't contains 'gt' since that
- * indicates we are waking with hypervisor state loss from nap.
- */
- bgt cr3,.
-
- blr /* Waking up without hypervisor state loss */
-
-/*
- * Called if waking up from idle state which can cause either partial or
- * complete hyp state loss.
- * In POWER8, called if waking up from fastsleep or winkle
- * In POWER9, called if waking up from stop state >= pnv_first_deep_stop_state
+ * GPRs may be lost, so they are saved here. Wakeup is by interrupt only.
+ * Wakeup interrupt returns to the caller by calling isa3_idle_wake_gpr_loss
+ * with r3 set to desired return value.
*
- * r13 - PACA
- * cr3 - gt if waking up with partial/complete hypervisor state loss
+ * A wakeup without GPR loss may alteratively be handled as in
+ * isa3_idle_stop_noloss as an optimisation.
*
- * If ISA300:
- * cr4 - gt or eq if waking up from complete hypervisor state loss.
+ * Caller is responsible for restoring SPRs, MSR, etc.
*
- * If ISA207:
- * r4 - PACA_THREAD_IDLE_STATE
+ * This must be called in real-mode.
*/
-pnv_wakeup_tb_loss:
- ld r1,PACAR1(r13)
- /*
- * Before entering any idle state, the NVGPRs are saved in the stack.
- * If there was a state loss, or PACA_NAPSTATELOST was set, then the
- * NVGPRs are restored. If we are here, it is likely that state is lost,
- * but not guaranteed -- neither ISA207 nor ISA300 tests to reach
- * here are the same as the test to restore NVGPRS:
- * PACA_THREAD_IDLE_STATE test for ISA207, PSSCR test for ISA300,
- * and SRR1 test for restoring NVGPRs.
- *
- * We are about to clobber NVGPRs now, so set NAPSTATELOST to
- * guarantee they will always be restored. This might be tightened
- * with careful reading of specs (particularly for ISA300) but this
- * is already a slow wakeup path and it's simpler to be safe.
- */
- li r0,1
- stb r0,PACA_NAPSTATELOST(r13)
-
- /*
- *
- * Save SRR1 and LR in NVGPRs as they might be clobbered in
- * opal_call() (called in CHECK_HMI_INTERRUPT). SRR1 is required
- * to determine the wakeup reason if we branch to kvm_start_guest. LR
- * is required to return back to reset vector after hypervisor state
- * restore is complete.
- */
- mr r19,r12
- mr r18,r4
- mflr r17
-BEGIN_FTR_SECTION
- CHECK_HMI_INTERRUPT
-END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
-
- ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
- lbz r7,PACA_THREAD_MASK(r13)
-
- /*
- * Take the core lock to synchronize against other threads.
- *
- * Lock bit is set in one of the 2 cases-
- * a. In the sleep/winkle enter path, the last thread is executing
- * fastsleep workaround code.
- * b. In the wake up path, another thread is executing fastsleep
- * workaround undo code or resyncing timebase or restoring context
- * In either case loop until the lock bit is cleared.
- */
-1:
- lwarx r15,0,r14
- andis. r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
- bnel- core_idle_lock_held
- oris r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
- stwcx. r15,0,r14
- bne- 1b
- isync
-
- andi. r9,r15,PNV_CORE_IDLE_THREAD_BITS
- cmpwi cr2,r9,0
-
- /*
- * At this stage
- * cr2 - eq if first thread to wakeup in core
- * cr3- gt if waking up with partial/complete hypervisor state loss
- * ISA300:
- * cr4 - gt or eq if waking up from complete hypervisor state loss.
- */
-
-BEGIN_FTR_SECTION
- /*
- * Were we in winkle?
- * If yes, check if all threads were in winkle, decrement our
- * winkle count, set all thread winkle bits if all were in winkle.
- * Check if our thread has a winkle bit set, and set cr4 accordingly
- * (to match ISA300, above). Pseudo-code for core idle state
- * transitions for ISA207 is as follows (everything happens atomically
- * due to store conditional and/or lock bit):
- *
- * nap_idle() { }
- * nap_wake() { }
- *
- * sleep_idle()
- * {
- * core_idle_state &= ~thread_in_core
- * }
- *
- * sleep_wake()
- * {
- * bool first_in_core, first_in_subcore;
- *
- * first_in_core = (core_idle_state & IDLE_THREAD_BITS) == 0;
- * first_in_subcore = (core_idle_state & SUBCORE_SIBLING_MASK) == 0;
- *
- * core_idle_state |= thread_in_core;
- * }
- *
- * winkle_idle()
- * {
- * core_idle_state &= ~thread_in_core;
- * core_idle_state += 1 << WINKLE_COUNT_SHIFT;
- * }
- *
- * winkle_wake()
- * {
- * bool first_in_core, first_in_subcore, winkle_state_lost;
- *
- * first_in_core = (core_idle_state & IDLE_THREAD_BITS) == 0;
- * first_in_subcore = (core_idle_state & SUBCORE_SIBLING_MASK) == 0;
- *
- * core_idle_state |= thread_in_core;
- *
- * if ((core_idle_state & WINKLE_MASK) == (8 << WINKLE_COUNT_SIHFT))
- * core_idle_state |= THREAD_WINKLE_BITS;
- * core_idle_state -= 1 << WINKLE_COUNT_SHIFT;
- *
- * winkle_state_lost = core_idle_state &
- * (thread_in_core << WINKLE_THREAD_SHIFT);
- * core_idle_state &= ~(thread_in_core << WINKLE_THREAD_SHIFT);
- * }
- *
- */
- cmpwi r18,PNV_THREAD_WINKLE
+_GLOBAL(isa206_idle_insn_mayloss)
+ std r1,PACAR1(r13)
+ mflr r4
+ mfcr r5
+ /* use stack red zone rather than a new frame */
+ addi r6,r1,-INT_FRAME_SIZE
+ SAVE_GPR(2, r6)
+ SAVE_NVGPRS(r6)
+ std r4,_LINK(r6)
+ std r5,_CCR(r6)
+ cmpwi r3,PNV_THREAD_NAP
+ bne 1f
+ IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)
+1: cmpwi r3,PNV_THREAD_SLEEP
bne 2f
- andis. r9,r15,PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT@h
- subis r15,r15,PNV_CORE_IDLE_WINKLE_COUNT@h
- beq 2f
- ori r15,r15,PNV_CORE_IDLE_THREAD_WINKLE_BITS /* all were winkle */
-2:
- /* Shift thread bit to winkle mask, then test if this thread is set,
- * and remove it from the winkle bits */
- slwi r8,r7,8
- and r8,r8,r15
- andc r15,r15,r8
- cmpwi cr4,r8,1 /* cr4 will be gt if our bit is set, lt if not */
-
- lbz r4,PACA_SUBCORE_SIBLING_MASK(r13)
- and r4,r4,r15
- cmpwi r4,0 /* Check if first in subcore */
-
- or r15,r15,r7 /* Set thread bit */
- beq first_thread_in_subcore
-END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
-
- or r15,r15,r7 /* Set thread bit */
- beq cr2,first_thread_in_core
-
- /* Not first thread in core or subcore to wake up */
- b clear_lock
-
-first_thread_in_subcore:
- /*
- * If waking up from sleep, subcore state is not lost. Hence
- * skip subcore state restore
- */
- blt cr4,subcore_state_restored
-
- /* Restore per-subcore state */
- ld r4,_SDR1(r1)
- mtspr SPRN_SDR1,r4
-
- ld r4,_RPR(r1)
- mtspr SPRN_RPR,r4
- ld r4,_AMOR(r1)
- mtspr SPRN_AMOR,r4
-
-subcore_state_restored:
- /*
- * Check if the thread is also the first thread in the core. If not,
- * skip to clear_lock.
- */
- bne cr2,clear_lock
-
-first_thread_in_core:
-
- /*
- * First thread in the core waking up from any state which can cause
- * partial or complete hypervisor state loss. It needs to
- * call the fastsleep workaround code if the platform requires it.
- * Call it unconditionally here. The below branch instruction will
- * be patched out if the platform does not have fastsleep or does not
- * require the workaround. Patching will be performed during the
- * discovery of idle-states.
- */
-.global pnv_fastsleep_workaround_at_exit
-pnv_fastsleep_workaround_at_exit:
- b fastsleep_workaround_at_exit
-
-timebase_resync:
- /*
- * Use cr3 which indicates that we are waking up with atleast partial
- * hypervisor state loss to determine if TIMEBASE RESYNC is needed.
- */
- ble cr3,.Ltb_resynced
- /* Time base re-sync */
- bl opal_resync_timebase;
- /*
- * If waking up from sleep (POWER8), per core state
- * is not lost, skip to clear_lock.
- */
-.Ltb_resynced:
- blt cr4,clear_lock
-
- /*
- * First thread in the core to wake up and its waking up with
- * complete hypervisor state loss. Restore per core hypervisor
- * state.
- */
-BEGIN_FTR_SECTION
- ld r4,_PTCR(r1)
- mtspr SPRN_PTCR,r4
- ld r4,_RPR(r1)
- mtspr SPRN_RPR,r4
- ld r4,_AMOR(r1)
- mtspr SPRN_AMOR,r4
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
-
- ld r4,_TSCR(r1)
- mtspr SPRN_TSCR,r4
- ld r4,_WORC(r1)
- mtspr SPRN_WORC,r4
-
-clear_lock:
- xoris r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
- lwsync
- stw r15,0(r14)
-
-common_exit:
- /*
- * Common to all threads.
- *
- * If waking up from sleep, hypervisor state is not lost. Hence
- * skip hypervisor state restore.
- */
- blt cr4,hypervisor_state_restored
-
- /* Waking up from winkle */
-
-BEGIN_MMU_FTR_SECTION
- b no_segments
-END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)
- /* Restore SLB from PACA */
- ld r8,PACA_SLBSHADOWPTR(r13)
-
- .rept SLB_NUM_BOLTED
- li r3, SLBSHADOW_SAVEAREA
- LDX_BE r5, r8, r3
- addi r3, r3, 8
- LDX_BE r6, r8, r3
- andis. r7,r5,SLB_ESID_V@h
- beq 1f
- slbmte r6,r5
-1: addi r8,r8,16
- .endr
-no_segments:
-
- /* Restore per thread state */
-
- ld r4,_SPURR(r1)
- mtspr SPRN_SPURR,r4
- ld r4,_PURR(r1)
- mtspr SPRN_PURR,r4
- ld r4,_DSCR(r1)
- mtspr SPRN_DSCR,r4
- ld r4,_WORT(r1)
- mtspr SPRN_WORT,r4
+ IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
+ b . /* catch bugs */
+2: IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
+ b . /* catch bugs */
- /* Call cur_cpu_spec->cpu_restore() */
- LOAD_REG_ADDR(r4, cur_cpu_spec)
- ld r4,0(r4)
- ld r12,CPU_SPEC_RESTORE(r4)
-#ifdef PPC64_ELF_ABI_v1
- ld r12,0(r12)
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+_GLOBAL(idle_kvm_start_guest)
+ std r1,PACAR1(r13)
+ mflr r4
+ mfcr r5
+ /* use stack red zone rather than a new frame */
+ addi r6,r1,-INT_FRAME_SIZE
+ SAVE_GPR(2, r6)
+ SAVE_NVGPRS(r6)
+ std r4,_LINK(r6)
+ std r5,_CCR(r6)
+ b kvm_start_guest
#endif
- mtctr r12
- bctrl
-
-/*
- * On POWER9, we can come here on wakeup from a cpuidle stop state.
- * Hence restore the additional SPRs to the saved value.
- *
- * On POWER8, we come here only on winkle. Since winkle is used
- * only in the case of CPU-Hotplug, we don't need to restore
- * the additional SPRs.
- */
-BEGIN_FTR_SECTION
- bl power9_restore_additional_sprs
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
-hypervisor_state_restored:
-
- mr r12,r19
- mtlr r17
- blr /* return to pnv_powersave_wakeup */
-
-fastsleep_workaround_at_exit:
- li r3,1
- li r4,0
- bl opal_config_cpu_idle_state
- b timebase_resync
-
-/*
- * R3 here contains the value that will be returned to the caller
- * of power7_nap.
- * R12 contains SRR1 for CHECK_HMI_INTERRUPT.
- */
-.global pnv_wakeup_loss
-pnv_wakeup_loss:
- ld r1,PACAR1(r13)
-BEGIN_FTR_SECTION
- CHECK_HMI_INTERRUPT
-END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
- REST_NVGPRS(r1)
- REST_GPR(2, r1)
- ld r4,PACAKMSR(r13)
- ld r5,_LINK(r1)
- ld r6,_CCR(r1)
- addi r1,r1,INT_FRAME_SIZE
- mtlr r5
- mtcr r6
- mtmsrd r4
- blr
-
-/*
- * R3 here contains the value that will be returned to the caller
- * of power7_nap.
- * R12 contains SRR1 for CHECK_HMI_INTERRUPT.
- */
-pnv_wakeup_noloss:
- lbz r0,PACA_NAPSTATELOST(r13)
- cmpwi r0,0
- bne pnv_wakeup_loss
- ld r1,PACAR1(r13)
-BEGIN_FTR_SECTION
- CHECK_HMI_INTERRUPT
-END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
- ld r4,PACAKMSR(r13)
- ld r5,_NIP(r1)
- ld r6,_CCR(r1)
- addi r1,r1,INT_FRAME_SIZE
- mtlr r5
- mtcr r6
- mtmsrd r4
- blr
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 40b44bb53a4e..e089da156ef3 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -401,8 +401,8 @@ void __init check_for_initrd(void)
#ifdef CONFIG_SMP
-int threads_per_core, threads_per_subcore, threads_shift;
-cpumask_t threads_core_mask;
+int threads_per_core, threads_per_subcore, threads_shift __read_mostly;
+cpumask_t threads_core_mask __read_mostly;
EXPORT_SYMBOL_GPL(threads_per_core);
EXPORT_SYMBOL_GPL(threads_per_subcore);
EXPORT_SYMBOL_GPL(threads_shift);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 1d14046124a0..8bcbe5ac2754 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -34,6 +34,7 @@
#include <asm/thread_info.h>
#include <asm/asm-compat.h>
#include <asm/feature-fixups.h>
+#include <asm/cpuidle.h>
/* Sign-extend HDEC if not on POWER9 */
#define EXTEND_HDEC(reg) \
@@ -322,43 +323,32 @@ kvm_novcpu_exit:
b kvmhv_switch_to_host
/*
- * We come in here when wakened from nap mode.
- * Relocation is off and most register values are lost.
- * r13 points to the PACA.
+ * We come in here when wakened from Linux offline idle code.
+ * Relocation is off
* r3 contains the SRR1 wakeup value, SRR1 is trashed.
*/
.globl kvm_start_guest
kvm_start_guest:
- /* Set runlatch bit the minute you wake up from nap */
- mfspr r0, SPRN_CTRLF
- ori r0, r0, 1
- mtspr SPRN_CTRLT, r0
-
/*
* Could avoid this and pass it through in r3. For now,
* code expects it to be in SRR1.
*/
mtspr SPRN_SRR1,r3
- ld r2,PACATOC(r13)
-
li r0,0
stb r0,PACA_FTRACE_ENABLED(r13)
li r0,KVM_HWTHREAD_IN_KVM
stb r0,HSTATE_HWTHREAD_STATE(r13)
- /* NV GPR values from power7_idle() will no longer be valid */
- li r0,1
- stb r0,PACA_NAPSTATELOST(r13)
-
- /* were we napping due to cede? */
+ /* cede napping should not come through here */
lbz r0,HSTATE_NAPPING(r13)
- cmpwi r0,NAPPING_CEDE
- beq kvm_end_cede
- cmpwi r0,NAPPING_NOVCPU
- beq kvm_novcpu_wakeup
+ twnei r0,0
+/*
+ * We can come in at this point from KVM nap.
+ */
+do_start_guest:
ld r1,PACAEMERGSP(r13)
subi r1,r1,STACK_FRAME_OVERHEAD
@@ -469,19 +459,17 @@ kvm_no_guest:
lbz r3, HSTATE_HWTHREAD_REQ(r13)
cmpwi r3, 0
bne 54f
-/*
- * We jump to pnv_wakeup_loss, which will return to the caller
- * of power7_nap in the powernv cpu offline loop. The value we
- * put in r3 becomes the return value for power7_nap. pnv_wakeup_loss
- * requires SRR1 in r12.
- */
+
+ /*
+ * Jump to idle_return_gpr_loss, which returns to the
+ * idle_kvm_start_guest caller.
+ */
li r3, LPCR_PECE0
mfspr r4, SPRN_LPCR
rlwimi r4, r3, 0, LPCR_PECE0 | LPCR_PECE1
mtspr SPRN_LPCR, r4
li r3, 0
- mfspr r12,SPRN_SRR1
- b pnv_wakeup_loss
+ b idle_return_gpr_loss
53: HMT_LOW
ld r5, HSTATE_KVM_VCORE(r13)
@@ -2760,21 +2748,47 @@ BEGIN_FTR_SECTION
li r4, LPCR_PECE_HVEE@higher
sldi r4, r4, 32
or r5, r5, r4
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
+FTR_SECTION_ELSE
+ li r3, PNV_THREAD_NAP
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
mtspr SPRN_LPCR,r5
isync
- li r0, 0
- std r0, HSTATE_SCRATCH0(r13)
- ptesync
- ld r0, HSTATE_SCRATCH0(r13)
-1: cmpd r0, r0
- bne 1b
+
+ mr r0, r1
+ ld r1, PACAEMERGSP(r13)
+ subi r1, r1, STACK_FRAME_OVERHEAD
+ std r0, 0(r1)
+ ld r0, PACAR1(r13)
+ std r0, 8(r1)
+
BEGIN_FTR_SECTION
- nap
+ bl isa3_idle_stop_mayloss
FTR_SECTION_ELSE
- PPC_STOP
-ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_300)
- b .
+ bl isa206_idle_insn_mayloss
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
+
+ mfspr r0, SPRN_CTRLF
+ ori r0, r0, 1
+ mtspr SPRN_CTRLT, r0
+
+ ld r0, 8(r1)
+ std r0, PACAR1(r13)
+ ld r1, 0(r1)
+
+ mtspr SPRN_SRR1, r3
+
+ li r0, 0
+ stb r0, PACA_FTRACE_ENABLED(r13)
+
+ li r0, KVM_HWTHREAD_IN_KVM
+ stb r0, HSTATE_HWTHREAD_STATE(r13)
+
+ lbz r0, HSTATE_NAPPING(r13)
+ cmpwi r0, NAPPING_CEDE
+ beq kvm_end_cede
+ cmpwi r0, NAPPING_NOVCPU
+ beq kvm_novcpu_wakeup
+ b do_start_guest
33: mr r4, r3
li r3, 0
diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index ecb002c5db83..446f8cc6070f 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -16,6 +16,7 @@
#include <linux/device.h>
#include <linux/cpu.h>
+#include <asm/asm-prototypes.h>
#include <asm/firmware.h>
#include <asm/machdep.h>
#include <asm/opal.h>
@@ -48,10 +49,10 @@ static u64 pnv_default_stop_mask;
static bool default_stop_found;
/*
- * First deep stop state. Used to figure out when to save/restore
- * hypervisor context.
+ * First stop state levels when HV and TB loss can occur.
*/
-u64 pnv_first_deep_stop_state = MAX_STOP_STATE;
+static u64 pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
+static u64 pnv_first_hv_loss_level = MAX_STOP_STATE + 1;
/*
* psscr value and mask of the deepest stop idle state.
@@ -62,6 +63,8 @@ static u64 pnv_deepest_stop_psscr_mask;
static u64 pnv_deepest_stop_flag;
static bool deepest_stop_found;
+static unsigned long power7_offline_type;
+
static int pnv_save_sprs_for_deep_states(void)
{
int cpu;
@@ -72,12 +75,12 @@ static int pnv_save_sprs_for_deep_states(void)
* all cpus at boot. Get these reg values of current cpu and use the
* same across all cpus.
*/
- uint64_t lpcr_val = mfspr(SPRN_LPCR);
- uint64_t hid0_val = mfspr(SPRN_HID0);
- uint64_t hid1_val = mfspr(SPRN_HID1);
- uint64_t hid4_val = mfspr(SPRN_HID4);
- uint64_t hid5_val = mfspr(SPRN_HID5);
- uint64_t hmeer_val = mfspr(SPRN_HMEER);
+ uint64_t lpcr_val = mfspr(SPRN_LPCR);
+ uint64_t hid0_val = mfspr(SPRN_HID0);
+ uint64_t hid1_val = mfspr(SPRN_HID1);
+ uint64_t hid4_val = mfspr(SPRN_HID4);
+ uint64_t hid5_val = mfspr(SPRN_HID5);
+ uint64_t hmeer_val = mfspr(SPRN_HMEER);
uint64_t msr_val = MSR_IDLE;
uint64_t psscr_val = pnv_deepest_stop_psscr_val;
@@ -137,89 +140,6 @@ static int pnv_save_sprs_for_deep_states(void)
return 0;
}
-static void pnv_alloc_idle_core_states(void)
-{
- int i, j;
- int nr_cores = cpu_nr_cores();
- u32 *core_idle_state;
-
- /*
- * core_idle_state - The lower 8 bits track the idle state of
- * each thread of the core.
- *
- * The most significant bit is the lock bit.
- *
- * Initially all the bits corresponding to threads_per_core
- * are set. They are cleared when the thread enters deep idle
- * state like sleep and winkle/stop.
- *
- * Initially the lock bit is cleared. The lock bit has 2
- * purposes:
- * a. While the first thread in the core waking up from
- * idle is restoring core state, it prevents other
- * threads in the core from switching to process
- * context.
- * b. While the last thread in the core is saving the
- * core state, it prevents a different thread from
- * waking up.
- */
- for (i = 0; i < nr_cores; i++) {
- int first_cpu = i * threads_per_core;
- int node = cpu_to_node(first_cpu);
- size_t paca_ptr_array_size;
-
- core_idle_state = kmalloc_node(sizeof(u32), GFP_KERNEL, node);
- *core_idle_state = (1 << threads_per_core) - 1;
- paca_ptr_array_size = (threads_per_core *
- sizeof(struct paca_struct *));
-
- for (j = 0; j < threads_per_core; j++) {
- int cpu = first_cpu + j;
-
- paca_ptrs[cpu]->core_idle_state_ptr = core_idle_state;
- paca_ptrs[cpu]->thread_idle_state = PNV_THREAD_RUNNING;
- paca_ptrs[cpu]->thread_mask = 1 << j;
- }
- }
-
- update_subcore_sibling_mask();
-
- if (supported_cpuidle_states & OPAL_PM_LOSE_FULL_CONTEXT) {
- int rc = pnv_save_sprs_for_deep_states();
-
- if (likely(!rc))
- return;
-
- /*
- * The stop-api is unable to restore hypervisor
- * resources on wakeup from platform idle states which
- * lose full context. So disable such states.
- */
- supported_cpuidle_states &= ~OPAL_PM_LOSE_FULL_CONTEXT;
- pr_warn("cpuidle-powernv: Disabling idle states that lose full context\n");
- pr_warn("cpuidle-powernv: Idle power-savings, CPU-Hotplug affected\n");
-
- if (cpu_has_feature(CPU_FTR_ARCH_300) &&
- (pnv_deepest_stop_flag & OPAL_PM_LOSE_FULL_CONTEXT)) {
- /*
- * Use the default stop state for CPU-Hotplug
- * if available.
- */
- if (default_stop_found) {
- pnv_deepest_stop_psscr_val =
- pnv_default_stop_val;
- pnv_deepest_stop_psscr_mask =
- pnv_default_stop_mask;
- pr_warn("cpuidle-powernv: Offlined CPUs will stop with psscr = 0x%016llx\n",
- pnv_deepest_stop_psscr_val);
- } else { /* Fallback to snooze loop for CPU-Hotplug */
- deepest_stop_found = false;
- pr_warn("cpuidle-powernv: Offlined CPUs will busy wait\n");
- }
- }
- }
-}
-
u32 pnv_get_supported_cpuidle_states(void)
{
return supported_cpuidle_states;
@@ -238,6 +158,9 @@ static void pnv_fastsleep_workaround_apply(void *info)
*err = 1;
}
+static bool power7_fastsleep_workaround_entry = true;
+static bool power7_fastsleep_workaround_exit = true;
+
/*
* Used to store fastsleep workaround state
* 0 - Workaround applied/undone at fastsleep entry/exit path (Default)
@@ -277,13 +200,7 @@ static ssize_t store_fastsleep_workaround_applyonce(struct device *dev,
* offlined, as last thread of the core entering fastsleep or deeper
* state would have applied workaround.
*/
- err = patch_instruction(
- (unsigned int *)pnv_fastsleep_workaround_at_exit,
- PPC_INST_NOP);
- if (err) {
- pr_err("fastsleep_workaround_applyonce change failed while patching pnv_fastsleep_workaround_at_exit");
- goto fail;
- }
+ power7_fastsleep_workaround_exit = false;
get_online_cpus();
primary_thread_mask = cpu_online_cores_map();
@@ -296,13 +213,7 @@ static ssize_t store_fastsleep_workaround_applyonce(struct device *dev,
goto fail;
}
- err = patch_instruction(
- (unsigned int *)pnv_fastsleep_workaround_at_entry,
- PPC_INST_NOP);
- if (err) {
- pr_err("fastsleep_workaround_applyonce change failed while patching pnv_fastsleep_workaround_at_entry");
- goto fail;
- }
+ power7_fastsleep_workaround_entry = false;
fastsleep_workaround_applyonce = 1;
@@ -315,6 +226,303 @@ static DEVICE_ATTR(fastsleep_workaround_applyonce, 0600,
show_fastsleep_workaround_applyonce,
store_fastsleep_workaround_applyonce);
+static inline void atomic_start_thread_idle(void)
+{
+ int cpu = raw_smp_processor_id();
+ int first = cpu_first_thread_sibling(cpu);
+ int thread_nr = cpu_thread_in_core(cpu);
+ unsigned long *state = &paca_ptrs[first]->idle_state;
+
+ clear_bit(thread_nr, state);
+}
+
+static inline void atomic_stop_thread_idle(void)
+{
+ int cpu = raw_smp_processor_id();
+ int first = cpu_first_thread_sibling(cpu);
+ int thread_nr = cpu_thread_in_core(cpu);
+ unsigned long *state = &paca_ptrs[first]->idle_state;
+
+ set_bit(thread_nr, state);
+}
+
+static inline void atomic_lock_thread_idle(void)
+{
+ int cpu = raw_smp_processor_id();
+ int first = cpu_first_thread_sibling(cpu);
+ unsigned long *state = &paca_ptrs[first]->idle_state;
+
+ while (unlikely(test_and_set_bit_lock(NR_PNV_CORE_IDLE_LOCK_BIT, state)))
+ barrier();
+ isync();
+}
+
+static inline void atomic_unlock_and_stop_thread_idle(void)
+{
+ int cpu = raw_smp_processor_id();
+ int first = cpu_first_thread_sibling(cpu);
+ unsigned long thread = 1UL << cpu_thread_in_core(cpu);
+ unsigned long *state = &paca_ptrs[first]->idle_state;
+ u64 s = READ_ONCE(*state);
+ u64 new, tmp;
+
+ isync();
+
+ BUG_ON(!(s & PNV_CORE_IDLE_LOCK_BIT));
+ BUG_ON(s & thread);
+
+again:
+ new = (s | thread) & ~PNV_CORE_IDLE_LOCK_BIT;
+ tmp = cmpxchg(state, s, new);
+ if (unlikely(tmp != s)) {
+ s = tmp;
+ goto again;
+ }
+}
+
+static inline void atomic_unlock_thread_idle(void)
+{
+ int cpu = raw_smp_processor_id();
+ int first = cpu_first_thread_sibling(cpu);
+ unsigned long *state = &paca_ptrs[first]->idle_state;
+
+ BUG_ON(!test_bit(NR_PNV_CORE_IDLE_LOCK_BIT, state));
+ clear_bit_unlock(NR_PNV_CORE_IDLE_LOCK_BIT, state);
+}
+
+/* P7 and P8 */
+struct p7_sprs {
+ /* per core */
+ u64 tscr;
+ u64 worc;
+
+ /* per subcore */
+ u64 sdr1;
+ u64 rpr;
+ u64 amor;
+
+ /* per thread */
+ u64 lpcr;
+ u64 hfscr;
+ u64 fscr;
+ u64 purr;
+ u64 spurr;
+ u64 dscr;
+ u64 wort;
+};
+
+static unsigned long power7_idle_insn(unsigned long type)
+{
+ int cpu = raw_smp_processor_id();
+ int first = cpu_first_thread_sibling(cpu);
+ unsigned long thread = 1UL << cpu_thread_in_core(cpu);
+ unsigned long *state = &paca_ptrs[first]->idle_state;
+ unsigned long srr1;
+ bool full_winkle;
+ struct p7_sprs sprs;
+ bool sprs_saved = false;
+ int rc;
+
+ memset(&sprs, 0, sizeof(sprs));
+
+ if (unlikely(type != PNV_THREAD_NAP)) {
+ atomic_lock_thread_idle();
+
+ BUG_ON(!(*state & thread));
+ *state &= ~thread;
+
+ if (power7_fastsleep_workaround_entry) {
+ if ((*state & ((1 << threads_per_core) - 1)) == 0) {
+ rc = opal_config_cpu_idle_state(
+ OPAL_CONFIG_IDLE_FASTSLEEP,
+ OPAL_CONFIG_IDLE_APPLY);
+ BUG_ON(rc);
+ }
+ }
+
+ if (type == PNV_THREAD_WINKLE) {
+ sprs.tscr = mfspr(SPRN_TSCR);
+ sprs.worc = mfspr(SPRN_WORC);
+
+ sprs.sdr1 = mfspr(SPRN_SDR1);
+ sprs.rpr = mfspr(SPRN_RPR);
+ sprs.amor = mfspr(SPRN_AMOR);
+
+ sprs.lpcr = mfspr(SPRN_LPCR);
+ if (cpu_has_feature(CPU_FTR_ARCH_207S)) {
+ sprs.hfscr = mfspr(SPRN_HFSCR);
+ sprs.fscr = mfspr(SPRN_FSCR);
+ }
+ sprs.purr = mfspr(SPRN_PURR);
+ sprs.spurr = mfspr(SPRN_SPURR);
+ sprs.dscr = mfspr(SPRN_DSCR);
+ sprs.wort = mfspr(SPRN_WORT);
+
+ sprs_saved = true;
+
+ /*
+ * Increment winkle counter and set all winkle bits if
+ * all threads are winkling. This allows wakeup side to
+ * distinguish between fast sleep and winkle state
+ * loss. Fast sleep still has to resync the timebase so
+ * this may not be a really big win.
+ */
+ *state += 1 << PNV_CORE_IDLE_WINKLE_COUNT_SHIFT;
+ if ((*state & PNV_CORE_IDLE_WINKLE_COUNT_BITS) >> PNV_CORE_IDLE_WINKLE_COUNT_SHIFT == threads_per_core)
+ *state |= PNV_CORE_IDLE_THREAD_WINKLE_BITS;
+ WARN_ON((*state & PNV_CORE_IDLE_WINKLE_COUNT_BITS) == 0);
+ }
+
+ atomic_unlock_thread_idle();
+ }
+
+ srr1 = isa206_idle_insn_mayloss(type);
+
+ WARN_ON_ONCE(!srr1);
+ WARN_ON_ONCE(mfmsr() & (MSR_IR|MSR_DR));
+
+ if (unlikely((srr1 & SRR1_WAKEMASK_P8) == SRR1_WAKEHMI))
+ hmi_exception_realmode(NULL);
+
+ if (likely((srr1 & SRR1_WAKESTATE) != SRR1_WS_HVLOSS)) {
+ if (unlikely(type != PNV_THREAD_NAP)) {
+ atomic_lock_thread_idle();
+ if (type == PNV_THREAD_WINKLE) {
+ WARN_ON((*state & PNV_CORE_IDLE_WINKLE_COUNT_BITS) == 0);
+ *state -= 1 << PNV_CORE_IDLE_WINKLE_COUNT_SHIFT;
+ *state &= ~(thread << PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT);
+ }
+ atomic_unlock_and_stop_thread_idle();
+ }
+ return srr1;
+ }
+
+ /* HV state loss */
+ BUG_ON(type == PNV_THREAD_NAP);
+
+ atomic_lock_thread_idle();
+
+ full_winkle = false;
+ if (type == PNV_THREAD_WINKLE) {
+ WARN_ON((*state & PNV_CORE_IDLE_WINKLE_COUNT_BITS) == 0);
+ *state -= 1 << PNV_CORE_IDLE_WINKLE_COUNT_SHIFT;
+ if (*state & (thread << PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT)) {
+ *state &= ~(thread << PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT);
+ full_winkle = true;
+ BUG_ON(!sprs_saved);
+ }
+ }
+
+ WARN_ON(*state & thread);
+
+ if ((*state & ((1 << threads_per_core) - 1)) != 0)
+ goto core_woken;
+
+ /* Per-core SPRs */
+ if (full_winkle) {
+ mtspr(SPRN_TSCR, sprs.tscr);
+ mtspr(SPRN_WORC, sprs.worc);
+ }
+
+ if (power7_fastsleep_workaround_exit) {
+ rc = opal_config_cpu_idle_state(OPAL_CONFIG_IDLE_FASTSLEEP,
+ OPAL_CONFIG_IDLE_UNDO);
+ BUG_ON(rc);
+ }
+
+ /* TB */
+ if (opal_resync_timebase() != OPAL_SUCCESS)
+ BUG();
+
+core_woken:
+ if (!full_winkle)
+ goto subcore_woken;
+
+ if ((*state & local_paca->subcore_sibling_mask) != 0)
+ goto subcore_woken;
+
+ /* Per-subcore SPRs */
+ mtspr(SPRN_SDR1, sprs.sdr1);
+ mtspr(SPRN_RPR, sprs.rpr);
+ mtspr(SPRN_AMOR, sprs.amor);
+
+subcore_woken:
+ atomic_unlock_and_stop_thread_idle();
+
+ /* Fast sleep does not lose SPRs */
+ if (!full_winkle)
+ return srr1;
+
+ /* Per-thread SPRs */
+ mtspr(SPRN_LPCR, sprs.lpcr);
+ if (cpu_has_feature(CPU_FTR_ARCH_207S)) {
+ mtspr(SPRN_HFSCR, sprs.hfscr);
+ mtspr(SPRN_FSCR, sprs.fscr);
+ }
+ mtspr(SPRN_PURR, sprs.purr);
+ mtspr(SPRN_SPURR, sprs.spurr);
+ mtspr(SPRN_DSCR, sprs.dscr);
+ mtspr(SPRN_WORT, sprs.wort);
+
+ mtspr(SPRN_SPRG3, local_paca->sprg_vdso);
+
+ /*
+ * The SLB has to be restored here, but it sometimes still
+ * contain entries, so the __ variant must be used to prevent
+ * multi hits.
+ */
+ __slb_restore_bolted_realmode();
+
+ return srr1;
+}
+
+extern unsigned long idle_kvm_start_guest(unsigned long srr1);
+
+static unsigned long power7_offline(void)
+{
+ unsigned long srr1;
+
+ mtmsr(MSR_IDLE);
+
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+ /* Tell KVM we're entering idle. */
+ /******************************************************/
+ /* N O T E W E L L ! ! ! N O T E W E L L */
+ /* The following store to HSTATE_HWTHREAD_STATE(r13) */
+ /* MUST occur in real mode, i.e. with the MMU off, */
+ /* and the MMU must stay off until we clear this flag */
+ /* and test HSTATE_HWTHREAD_REQ(r13) in */
+ /* pnv_powersave_wakeup in this file. */
+ /* The reason is that another thread can switch the */
+ /* MMU to a guest context whenever this flag is set */
+ /* to KVM_HWTHREAD_IN_IDLE, and if the MMU was on, */
+ /* that would potentially cause this thread to start */
+ /* executing instructions from guest memory in */
+ /* hypervisor mode, leading to a host crash or data */
+ /* corruption, or worse. */
+ /******************************************************/
+ local_paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_IDLE;
+#endif
+
+ __ppc64_runlatch_off();
+ srr1 = power7_idle_insn(power7_offline_type);
+ __ppc64_runlatch_on();
+
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+ if (local_paca->kvm_hstate.hwthread_state != KVM_HWTHREAD_IN_KERNEL) {
+ local_paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_KERNEL;
+ /* Order setting hwthread_state vs. testing hwthread_req */
+ smp_mb();
+ }
+ if (local_paca->kvm_hstate.hwthread_req)
+ srr1 = idle_kvm_start_guest(srr1);
+#endif
+
+ mtmsr(MSR_KERNEL);
+
+ return srr1;
+}
+
static unsigned long __power7_idle_type(unsigned long type)
{
unsigned long srr1;
@@ -322,9 +530,11 @@ static unsigned long __power7_idle_type(unsigned long type)
if (!prep_irq_for_idle_irqsoff())
return 0;
+ mtmsr(MSR_IDLE);
__ppc64_runlatch_off();
srr1 = power7_idle_insn(type);
__ppc64_runlatch_on();
+ mtmsr(MSR_KERNEL);
fini_irq_for_idle_irqsoff();
@@ -347,6 +557,244 @@ void power7_idle(void)
power7_idle_type(PNV_THREAD_NAP);
}
+struct p9_sprs {
+ /* per core */
+ u64 ptcr;
+ u64 rpr;
+ u64 tscr;
+ u64 ldbar;
+ u64 amor;
+
+ /* per thread */
+ u64 lpcr;
+ u64 hfscr;
+ u64 fscr;
+ u64 pid;
+ u64 purr;
+ u64 spurr;
+ u64 dscr;
+ u64 wort;
+
+ u64 mmcra;
+ u32 mmcr0;
+ u32 mmcr1;
+ u64 mmcr2;
+};
+
+static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
+{
+ int cpu = raw_smp_processor_id();
+ int first = cpu_first_thread_sibling(cpu);
+ unsigned long *state = &paca_ptrs[first]->idle_state;
+ unsigned long srr1;
+ unsigned long mmcr0 = 0;
+ struct p9_sprs sprs;
+ bool sprs_saved = false;
+
+ memset(&sprs, 0, sizeof(sprs));
+
+ if (!(psscr & (PSSCR_EC|PSSCR_ESL))) {
+ BUG_ON(!mmu_on);
+
+ /*
+ * Wake synchronously. SRESET via xscom may still cause
+ * a 0x100 powersave wakeup with SRR1 reason!
+ */
+ srr1 = isa3_idle_stop_noloss(psscr);
+ if (likely(!srr1))
+ return 0;
+
+ /*
+ * Registers not saved, can't recover!
+ * This would be a hardware bug
+ */
+ BUG_ON((srr1 & SRR1_WAKESTATE) != SRR1_WS_NOLOSS);
+
+ goto out;
+ }
+
+ /* EC=ESL=1 case */
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+ if (cpu_has_feature(CPU_FTR_P9_TM_XER_SO_BUG)) {
+ local_paca->requested_psscr = psscr;
+ /* order setting requested_psscr vs testing dont_stop */
+ smp_mb();
+ if (atomic_read(&local_paca->dont_stop)) {
+ local_paca->requested_psscr = 0;
+ return 0;
+ }
+ }
+#endif
+
+ if (!cpu_has_feature(CPU_FTR_POWER9_DD2_1)) {
+ /*
+ * POWER9 DD2 can incorrectly set PMAO when waking up
+ * after a state-loss idle. Saving and restoring MMCR0
+ * over idle is a workaround.
+ */
+ mmcr0 = mfspr(SPRN_MMCR0);
+ }
+ if ((psscr & PSSCR_RL_MASK) >= pnv_first_hv_loss_level) {
+ sprs.lpcr = mfspr(SPRN_LPCR);
+ sprs.hfscr = mfspr(SPRN_HFSCR);
+ sprs.fscr = mfspr(SPRN_FSCR);
+ sprs.pid = mfspr(SPRN_PID);
+ sprs.purr = mfspr(SPRN_PURR);
+ sprs.spurr = mfspr(SPRN_SPURR);
+ sprs.dscr = mfspr(SPRN_DSCR);
+ sprs.wort = mfspr(SPRN_WORT);
+
+ sprs.mmcra = mfspr(SPRN_MMCRA);
+ sprs.mmcr0 = mfspr(SPRN_MMCR0);
+ sprs.mmcr1 = mfspr(SPRN_MMCR1);
+ sprs.mmcr2 = mfspr(SPRN_MMCR2);
+
+ sprs.ptcr = mfspr(SPRN_PTCR);
+ sprs.rpr = mfspr(SPRN_RPR);
+ sprs.tscr = mfspr(SPRN_TSCR);
+ sprs.ldbar = mfspr(SPRN_LDBAR);
+ sprs.amor = mfspr(SPRN_AMOR);
+
+ sprs_saved = true;
+
+ atomic_start_thread_idle();
+ }
+
+ srr1 = isa3_idle_stop_mayloss(psscr);
+
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+ local_paca->requested_psscr = 0;
+#endif
+
+ psscr = mfspr(SPRN_PSSCR);
+
+ WARN_ON_ONCE(!srr1);
+ WARN_ON_ONCE(mfmsr() & (MSR_IR|MSR_DR));
+
+ if ((srr1 & SRR1_WAKESTATE) != SRR1_WS_NOLOSS) {
+ unsigned long mmcra;
+
+ /*
+ * Workaround for POWER9 DD2.0, if we lost resources, the ERAT
+ * might have been corrupted and needs flushing. We also need
+ * to reload MMCR0 (see mmcr0 comment above).
+ */
+ if (!cpu_has_feature(CPU_FTR_POWER9_DD2_1)) {
+ asm volatile(PPC_INVALIDATE_ERAT);
+ mtspr(SPRN_MMCR0, mmcr0);
+ }
+
+ /*
+ * DD2.2 and earlier need to set then clear bit 60 in MMCRA
+ * to ensure the PMU starts running.
+ */
+ mmcra = mfspr(SPRN_MMCRA);
+ mmcra |= PPC_BIT(60);
+ mtspr(SPRN_MMCRA, mmcra);
+ mmcra &= ~PPC_BIT(60);
+ mtspr(SPRN_MMCRA, mmcra);
+ }
+
+ if (unlikely((srr1 & SRR1_WAKEMASK_P8) == SRR1_WAKEHMI))
+ hmi_exception_realmode(NULL);
+
+ /*
+ * On POWER9, SRR1 bits do not match exactly as expected.
+ * SRR1_WS_GPRLOSS (10b) can also result in SPR loss, so
+ * always test PSSCR if there is any state loss.
+ */
+ if (likely((psscr & PSSCR_RL_MASK) < pnv_first_hv_loss_level)) {
+ if (sprs_saved)
+ atomic_stop_thread_idle();
+ goto out;
+ }
+
+ /* HV state loss */
+ BUG_ON(!sprs_saved);
+
+ atomic_lock_thread_idle();
+
+ if ((*state & ((1 << threads_per_core) - 1)) != 0)
+ goto core_woken;
+
+ /* Per-core SPRs */
+ mtspr(SPRN_PTCR, sprs.ptcr);
+ mtspr(SPRN_RPR, sprs.rpr);
+ mtspr(SPRN_TSCR, sprs.tscr);
+ mtspr(SPRN_LDBAR, sprs.ldbar);
+ mtspr(SPRN_AMOR, sprs.amor);
+
+ if ((psscr & PSSCR_RL_MASK) >= pnv_first_tb_loss_level) {
+ /* TB loss */
+ if (opal_resync_timebase() != OPAL_SUCCESS)
+ BUG();
+ }
+
+core_woken:
+ atomic_unlock_and_stop_thread_idle();
+
+ /* Per-thread SPRs */
+ mtspr(SPRN_LPCR, sprs.lpcr);
+ mtspr(SPRN_HFSCR, sprs.hfscr);
+ mtspr(SPRN_FSCR, sprs.fscr);
+ mtspr(SPRN_PID, sprs.pid);
+ mtspr(SPRN_PURR, sprs.purr);
+ mtspr(SPRN_SPURR, sprs.spurr);
+ mtspr(SPRN_DSCR, sprs.dscr);
+ mtspr(SPRN_WORT, sprs.wort);
+
+ mtspr(SPRN_MMCRA, sprs.mmcra);
+ mtspr(SPRN_MMCR0, sprs.mmcr0);
+ mtspr(SPRN_MMCR1, sprs.mmcr1);
+ mtspr(SPRN_MMCR2, sprs.mmcr2);
+
+ mtspr(SPRN_SPRG3, local_paca->sprg_vdso);
+
+ if (!radix_enabled())
+ __slb_restore_bolted_realmode();
+
+out:
+ if (mmu_on)
+ mtmsr(MSR_KERNEL);
+
+ return srr1;
+}
+
+static unsigned long power9_offline_stop(unsigned long psscr)
+{
+ unsigned long srr1;
+
+#ifndef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+ __ppc64_runlatch_off();
+ srr1 = power9_idle_stop(psscr, true);
+ __ppc64_runlatch_on();
+#else
+ /*
+ * Tell KVM we're entering idle.
+ * This does not have to be done in real mode because the P9 MMU
+ * is independent per-thread. Some steppings share radix/hash mode
+ * between threads, but in that case KVM has a barrier sync in real
+ * mode before and after switching between radix and hash.
+ */
+ local_paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_IDLE;
+
+ __ppc64_runlatch_off();
+ srr1 = power9_idle_stop(psscr, false);
+ __ppc64_runlatch_on();
+
+ if (local_paca->kvm_hstate.hwthread_state != KVM_HWTHREAD_IN_KERNEL) {
+ local_paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_KERNEL;
+ /* Order setting hwthread_state vs. testing hwthread_req */
+ smp_mb();
+ }
+ if (local_paca->kvm_hstate.hwthread_req)
+ srr1 = idle_kvm_start_guest(srr1);
+ mtmsr(MSR_KERNEL);
+#endif
+
+ return srr1;
+}
+
static unsigned long __power9_idle_type(unsigned long stop_psscr_val,
unsigned long stop_psscr_mask)
{
@@ -360,7 +808,7 @@ static unsigned long __power9_idle_type(unsigned long stop_psscr_val,
psscr = (psscr & ~stop_psscr_mask) | stop_psscr_val;
__ppc64_runlatch_off();
- srr1 = power9_idle_stop(psscr);
+ srr1 = power9_idle_stop(psscr, true);
__ppc64_runlatch_on();
fini_irq_for_idle_irqsoff();
@@ -409,7 +857,7 @@ void pnv_power9_force_smt4_catch(void)
atomic_inc(&paca_ptrs[cpu0+thr]->dont_stop);
}
/* order setting dont_stop vs testing requested_psscr */
- mb();
+ smp_mb();
for (thr = 0; thr < threads_per_core; ++thr) {
if (!paca_ptrs[cpu0+thr]->requested_psscr)
++awake_threads;
@@ -480,7 +928,6 @@ static void pnv_program_cpu_hotplug_lpcr(unsigned int cpu, u64 lpcr_val)
unsigned long pnv_cpu_offline(unsigned int cpu)
{
unsigned long srr1;
- u32 idle_states = pnv_get_supported_cpuidle_states();
u64 lpcr_val;
/*
@@ -505,15 +952,8 @@ unsigned long pnv_cpu_offline(unsigned int cpu)
psscr = (psscr & ~pnv_deepest_stop_psscr_mask) |
pnv_deepest_stop_psscr_val;
srr1 = power9_offline_stop(psscr);
-
- } else if ((idle_states & OPAL_PM_WINKLE_ENABLED) &&
- (idle_states & OPAL_PM_LOSE_FULL_CONTEXT)) {
- srr1 = power7_idle_insn(PNV_THREAD_WINKLE);
- } else if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
- (idle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
- srr1 = power7_idle_insn(PNV_THREAD_SLEEP);
- } else if (idle_states & OPAL_PM_NAP_ENABLED) {
- srr1 = power7_idle_insn(PNV_THREAD_NAP);
+ } else if (cpu_has_feature(CPU_FTR_ARCH_206) && power7_offline_type) {
+ srr1 = power7_offline();
} else {
/* This is the fallback method. We emulate snooze */
while (!generic_check_cpu_restart(cpu)) {
@@ -619,33 +1059,32 @@ int validate_psscr_val_mask(u64 *psscr_val, u64 *psscr_mask, u32 flags)
* @dt_idle_states: Number of idle state entries
* Returns 0 on success
*/
-static int __init pnv_power9_idle_init(void)
+static void __init pnv_power9_idle_init(void)
{
u64 max_residency_ns = 0;
int i;
/*
- * Set pnv_first_deep_stop_state, pnv_deepest_stop_psscr_{val,mask},
- * and the pnv_default_stop_{val,mask}.
- *
- * pnv_first_deep_stop_state should be set to the first stop
- * level to cause hypervisor state loss.
- *
* pnv_deepest_stop_{val,mask} should be set to values corresponding to
* the deepest stop state.
*
* pnv_default_stop_{val,mask} should be set to values corresponding to
- * the shallowest (OPAL_PM_STOP_INST_FAST) loss-less stop state.
+ * the deepest loss-less (OPAL_PM_STOP_INST_FAST) stop state.
*/
- pnv_first_deep_stop_state = MAX_STOP_STATE;
+ pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
+ pnv_first_hv_loss_level = MAX_STOP_STATE + 1;
for (i = 0; i < nr_pnv_idle_states; i++) {
int err;
struct pnv_idle_states_t *state = &pnv_idle_states[i];
u64 psscr_rl = state->psscr_val & PSSCR_RL_MASK;
+ if ((state->flags & OPAL_PM_TIMEBASE_STOP) &&
+ (pnv_first_tb_loss_level > psscr_rl))
+ pnv_first_tb_loss_level = psscr_rl;
+
if ((state->flags & OPAL_PM_LOSE_FULL_CONTEXT) &&
- pnv_first_deep_stop_state > psscr_rl)
- pnv_first_deep_stop_state = psscr_rl;
+ (pnv_first_hv_loss_level > psscr_rl))
+ pnv_first_hv_loss_level = psscr_rl;
err = validate_psscr_val_mask(&state->psscr_val,
&state->psscr_mask,
@@ -670,6 +1109,7 @@ static int __init pnv_power9_idle_init(void)
pnv_default_stop_val = state->psscr_val;
pnv_default_stop_mask = state->psscr_mask;
default_stop_found = true;
+ WARN_ON(state->flags & OPAL_PM_LOSE_FULL_CONTEXT);
}
}
@@ -689,10 +1129,40 @@ static int __init pnv_power9_idle_init(void)
pnv_deepest_stop_psscr_mask);
}
- pr_info("cpuidle-powernv: Requested Level (RL) value of first deep stop = 0x%llx\n",
- pnv_first_deep_stop_state);
+ pr_info("cpuidle-powernv: First stop level that may lose SPRs = 0x%lld\n",
+ pnv_first_hv_loss_level);
- return 0;
+ pr_info("cpuidle-powernv: First stop level that may lose timebase = 0x%lld\n",
+ pnv_first_tb_loss_level);
+}
+
+static void __init pnv_disable_deep_states(void)
+{
+ /*
+ * The stop-api is unable to restore hypervisor
+ * resources on wakeup from platform idle states which
+ * lose full context. So disable such states.
+ */
+ supported_cpuidle_states &= ~OPAL_PM_LOSE_FULL_CONTEXT;
+ pr_warn("cpuidle-powernv: Disabling idle states that lose full context\n");
+ pr_warn("cpuidle-powernv: Idle power-savings, CPU-Hotplug affected\n");
+
+ if (cpu_has_feature(CPU_FTR_ARCH_300) &&
+ (pnv_deepest_stop_flag & OPAL_PM_LOSE_FULL_CONTEXT)) {
+ /*
+ * Use the default stop state for CPU-Hotplug
+ * if available.
+ */
+ if (default_stop_found) {
+ pnv_deepest_stop_psscr_val = pnv_default_stop_val;
+ pnv_deepest_stop_psscr_mask = pnv_default_stop_mask;
+ pr_warn("cpuidle-powernv: Offlined CPUs will stop with psscr = 0x%016llx\n",
+ pnv_deepest_stop_psscr_val);
+ } else { /* Fallback to snooze loop for CPU-Hotplug */
+ deepest_stop_found = false;
+ pr_warn("cpuidle-powernv: Offlined CPUs will busy wait\n");
+ }
+ }
}
/*
@@ -707,10 +1177,8 @@ static void __init pnv_probe_idle_states(void)
return;
}
- if (cpu_has_feature(CPU_FTR_ARCH_300)) {
- if (pnv_power9_idle_init())
- return;
- }
+ if (cpu_has_feature(CPU_FTR_ARCH_300))
+ pnv_power9_idle_init();
for (i = 0; i < nr_pnv_idle_states; i++)
supported_cpuidle_states |= pnv_idle_states[i].flags;
@@ -818,7 +1286,7 @@ static int pnv_parse_cpuidle_dt(void)
}
for (i = 0; i < nr_idle_states; i++)
strncpy(pnv_idle_states[i].name, temp_string[i],
- PNV_IDLE_NAME_LEN);
+ PNV_IDLE_NAME_LEN - 1);
nr_pnv_idle_states = nr_idle_states;
rc = 0;
out:
@@ -830,11 +1298,34 @@ static int pnv_parse_cpuidle_dt(void)
static int __init pnv_init_idle_states(void)
{
+ int cpu;
int rc = 0;
- supported_cpuidle_states = 0;
+
+ /* Set up PACA fields */
+ for_each_present_cpu(cpu) {
+ struct paca_struct *p = paca_ptrs[cpu];
+
+ p->idle_state = 0;
+ if (cpu == cpu_first_thread_sibling(cpu))
+ p->idle_state = (1 << threads_per_core) - 1;
+
+ if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+ /* P7/P8 nap */
+ p->thread_idle_state = PNV_THREAD_RUNNING;
+ p->thread_mask = 1 << cpu_thread_in_core(cpu);
+ } else {
+ /* P9 stop */
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+ p->requested_psscr = 0;
+ atomic_set(&p->dont_stop, 0);
+#endif
+ }
+ }
/* In case we error out nr_pnv_idle_states will be zero */
nr_pnv_idle_states = 0;
+ supported_cpuidle_states = 0;
+
if (cpuidle_disable != IDLE_NO_OVERRIDE)
goto out;
rc = pnv_parse_cpuidle_dt();
@@ -842,27 +1333,40 @@ static int __init pnv_init_idle_states(void)
return rc;
pnv_probe_idle_states();
- if (!(supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
- patch_instruction(
- (unsigned int *)pnv_fastsleep_workaround_at_entry,
- PPC_INST_NOP);
- patch_instruction(
- (unsigned int *)pnv_fastsleep_workaround_at_exit,
- PPC_INST_NOP);
- } else {
- /*
- * OPAL_PM_SLEEP_ENABLED_ER1 is set. It indicates that
- * workaround is needed to use fastsleep. Provide sysfs
- * control to choose how this workaround has to be applied.
- */
- device_create_file(cpu_subsys.dev_root,
+ if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+ if (!(supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
+ power7_fastsleep_workaround_entry = false;
+ power7_fastsleep_workaround_exit = false;
+ } else {
+ /*
+ * OPAL_PM_SLEEP_ENABLED_ER1 is set. It indicates that
+ * workaround is needed to use fastsleep. Provide sysfs
+ * control to choose how this workaround has to be
+ * applied.
+ */
+ device_create_file(cpu_subsys.dev_root,
&dev_attr_fastsleep_workaround_applyonce);
- }
+ }
+
+ update_subcore_sibling_mask();
- pnv_alloc_idle_core_states();
+ if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED) {
+ ppc_md.power_save = power7_idle;
+ power7_offline_type = PNV_THREAD_NAP;
+ }
- if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED)
- ppc_md.power_save = power7_idle;
+ if ((supported_cpuidle_states & OPAL_PM_WINKLE_ENABLED) &&
+ (supported_cpuidle_states & OPAL_PM_LOSE_FULL_CONTEXT))
+ power7_offline_type = PNV_THREAD_WINKLE;
+ else if ((supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED) ||
+ (supported_cpuidle_states & OPAL_PM_SLEEP_ENABLED_ER1))
+ power7_offline_type = PNV_THREAD_SLEEP;
+ }
+
+ if (supported_cpuidle_states & OPAL_PM_LOSE_FULL_CONTEXT) {
+ if (pnv_save_sprs_for_deep_states())
+ pnv_disable_deep_states();
+ }
out:
return 0;
diff --git a/arch/powerpc/platforms/powernv/subcore.c b/arch/powerpc/platforms/powernv/subcore.c
index 45563004feda..1d7a9fd30dd1 100644
--- a/arch/powerpc/platforms/powernv/subcore.c
+++ b/arch/powerpc/platforms/powernv/subcore.c
@@ -183,7 +183,7 @@ static void unsplit_core(void)
cpu = smp_processor_id();
if (cpu_thread_in_core(cpu) != 0) {
while (mfspr(SPRN_HID0) & mask)
- power7_idle_insn(PNV_THREAD_NAP);
+ power7_idle_type(PNV_THREAD_NAP);
per_cpu(split_state, cpu).step = SYNC_STEP_UNSPLIT;
return;
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index c7324e8469ee..b8167f4d53ef 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2417,7 +2417,6 @@ static void dump_one_paca(int cpu)
DUMP(p, irq_happened, "%#-*x");
DUMP(p, io_sync, "%#-*x");
DUMP(p, irq_work_pending, "%#-*x");
- DUMP(p, nap_state_lost, "%#-*x");
DUMP(p, sprg_vdso, "%#-*llx");
#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
@@ -2425,19 +2424,17 @@ static void dump_one_paca(int cpu)
#endif
#ifdef CONFIG_PPC_POWERNV
- DUMP(p, core_idle_state_ptr, "%-*px");
- DUMP(p, thread_idle_state, "%#-*x");
- DUMP(p, thread_mask, "%#-*x");
- DUMP(p, subcore_sibling_mask, "%#-*x");
- DUMP(p, requested_psscr, "%#-*llx");
- DUMP(p, stop_sprs.pid, "%#-*llx");
- DUMP(p, stop_sprs.ldbar, "%#-*llx");
- DUMP(p, stop_sprs.fscr, "%#-*llx");
- DUMP(p, stop_sprs.hfscr, "%#-*llx");
- DUMP(p, stop_sprs.mmcr1, "%#-*llx");
- DUMP(p, stop_sprs.mmcr2, "%#-*llx");
- DUMP(p, stop_sprs.mmcra, "%#-*llx");
- DUMP(p, dont_stop.counter, "%#-*x");
+ DUMP(p, idle_state, "%#-*lx");
+ if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+ DUMP(p, thread_idle_state, "%#-*x");
+ DUMP(p, thread_mask, "%#-*x");
+ DUMP(p, subcore_sibling_mask, "%#-*x");
+ } else {
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+ DUMP(p, requested_psscr, "%#-*llx");
+ DUMP(p, dont_stop.counter, "%#-*x");
+#endif
+ }
#endif
DUMP(p, accounting.utime, "%#-*lx");
--
2.17.0
^ permalink raw reply related
* Re: [PATCH] powerpc/powernv: Fix concurrency issue with npu->mmio_atsd_usage
From: Alistair Popple @ 2018-08-03 4:29 UTC (permalink / raw)
To: Reza Arbab; +Cc: linuxppc-dev
In-Reply-To: <1533269016-16238-1-git-send-email-arbab@linux.ibm.com>
> There may be a long-term way to fix this at a larger scale, but for now
> resolve the immediate problem by gating our call to
> test_and_set_bit_lock() with one to test_bit(), which is obviously
> implemented without using a store.
I am less sure of this now but am continuing to investigate. However this patch
looks good.
Acked-by: Alistair Popple <alistair@popple.id.au>
> Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
> ---
> arch/powerpc/platforms/powernv/npu-dma.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
> index 8cdf91f..c773465 100644
> --- a/arch/powerpc/platforms/powernv/npu-dma.c
> +++ b/arch/powerpc/platforms/powernv/npu-dma.c
> @@ -437,8 +437,9 @@ static int get_mmio_atsd_reg(struct npu *npu)
> int i;
>
> for (i = 0; i < npu->mmio_atsd_count; i++) {
> - if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
> - return i;
> + if (!test_bit(i, &npu->mmio_atsd_usage))
> + if (!test_and_set_bit_lock(i, &npu->mmio_atsd_usage))
> + return i;
> }
>
> return -ENOSPC;
>
^ permalink raw reply
* Re: [PATCH v5 09/11] hugetlb: Introduce generic version of huge_ptep_set_wrprotect
From: Alex Ghiti @ 2018-08-03 5:24 UTC (permalink / raw)
To: Michael Ellerman, linux-mm, mike.kravetz, linux, catalin.marinas,
will.deacon, tony.luck, fenghua.yu, ralf, paul.burton, jhogan,
jejb, deller, benh, ysato, dalias, davem, tglx, mingo, hpa, x86,
arnd, linux-arm-kernel, linux-kernel, linux-ia64, linux-mips,
linux-parisc, linuxppc-dev, linux-sh, sparclinux, linux-arch,
aneesh.kumar@linux.ibm.com
In-Reply-To: <6acb1389-6998-bafb-cf69-174fd522c04c@ghiti.fr>
Ok, I tried every defconfig available:
- for the nohash/32, I found that I could use mpc885_ads_defconfig and I
activated HUGETLBFS.
I removed the definition of huge_ptep_set_wrprotect from
nohash/32/pgtable.h, add an #error in
include/asm-generic/hugetlb.h right before the generic definition of
huge_ptep_set_wrprotect,
and fell onto it at compile-time:
=> I'm pretty confident then that removing the definition of
huge_ptep_set_wrprotect does not
break anythingin this case.
- regardind book3s/32, I did not find any defconfig with
CONFIG_PPC_BOOK3S_32, CONFIG_PPC32
allowing to enable huge page support (ie CONFIG_SYS_SUPPORTS_HUGETLBFS)
=> Do you have a defconfig that would allow me to try the same as above ?
Thanks,
Alex
On 07/31/2018 11:17 AM, Alexandre Ghiti wrote:
>
> On 07/31/2018 12:06 PM, Michael Ellerman wrote:
>> Alexandre Ghiti <alex@ghiti.fr> writes:
>>
>>> arm, ia64, mips, sh, x86 architectures use the same version
>>> of huge_ptep_set_wrprotect, so move this generic implementation into
>>> asm-generic/hugetlb.h.
>>> Note: powerpc uses twice for book3s/32 and nohash/32 the same
>>> version as
>>> the above architectures, but the modification was not straightforward
>>> and hence has not been done.
>> Do you remember what the problem was there?
>>
>> It looks like you should just be able to drop them like the others. I
>> assume there's some header spaghetti that causes problems though?
>
> Yes, the header spaghetti frightened me a bit. Maybe I should have
> tried harder: I can try to remove them and find the right defconfigs
> to compile both to begin with. And to guarantee the functionality is
> preserved, can I use the testsuite of libhugetlbfs with qemu ?
>
> Alex
>
>>
>> cheers
>>
>>
>>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>>> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>> arch/arm/include/asm/hugetlb-3level.h | 6 ------
>>> arch/arm64/include/asm/hugetlb.h | 1 +
>>> arch/ia64/include/asm/hugetlb.h | 6 ------
>>> arch/mips/include/asm/hugetlb.h | 6 ------
>>> arch/parisc/include/asm/hugetlb.h | 1 +
>>> arch/powerpc/include/asm/book3s/32/pgtable.h | 2 ++
>>> arch/powerpc/include/asm/book3s/64/pgtable.h | 1 +
>>> arch/powerpc/include/asm/nohash/32/pgtable.h | 2 ++
>>> arch/powerpc/include/asm/nohash/64/pgtable.h | 1 +
>>> arch/sh/include/asm/hugetlb.h | 6 ------
>>> arch/sparc/include/asm/hugetlb.h | 1 +
>>> arch/x86/include/asm/hugetlb.h | 6 ------
>>> include/asm-generic/hugetlb.h | 8 ++++++++
>>> 13 files changed, 17 insertions(+), 30 deletions(-)
>>>
>>> diff --git a/arch/arm/include/asm/hugetlb-3level.h
>>> b/arch/arm/include/asm/hugetlb-3level.h
>>> index b897541520ef..8247cd6a2ac6 100644
>>> --- a/arch/arm/include/asm/hugetlb-3level.h
>>> +++ b/arch/arm/include/asm/hugetlb-3level.h
>>> @@ -37,12 +37,6 @@ static inline pte_t huge_ptep_get(pte_t *ptep)
>>> return retval;
>>> }
>>> -static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> - unsigned long addr, pte_t *ptep)
>>> -{
>>> - ptep_set_wrprotect(mm, addr, ptep);
>>> -}
>>> -
>>> static inline int huge_ptep_set_access_flags(struct vm_area_struct
>>> *vma,
>>> unsigned long addr, pte_t *ptep,
>>> pte_t pte, int dirty)
>>> diff --git a/arch/arm64/include/asm/hugetlb.h
>>> b/arch/arm64/include/asm/hugetlb.h
>>> index 3e7f6e69b28d..f4f69ae5466e 100644
>>> --- a/arch/arm64/include/asm/hugetlb.h
>>> +++ b/arch/arm64/include/asm/hugetlb.h
>>> @@ -48,6 +48,7 @@ extern int huge_ptep_set_access_flags(struct
>>> vm_area_struct *vma,
>>> #define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
>>> extern pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
>>> unsigned long addr, pte_t *ptep);
>>> +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>> extern void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> unsigned long addr, pte_t *ptep);
>>> #define __HAVE_ARCH_HUGE_PTEP_CLEAR_FLUSH
>>> diff --git a/arch/ia64/include/asm/hugetlb.h
>>> b/arch/ia64/include/asm/hugetlb.h
>>> index cbe296271030..49d1f7949f3a 100644
>>> --- a/arch/ia64/include/asm/hugetlb.h
>>> +++ b/arch/ia64/include/asm/hugetlb.h
>>> @@ -27,12 +27,6 @@ static inline void huge_ptep_clear_flush(struct
>>> vm_area_struct *vma,
>>> {
>>> }
>>> -static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> - unsigned long addr, pte_t *ptep)
>>> -{
>>> - ptep_set_wrprotect(mm, addr, ptep);
>>> -}
>>> -
>>> static inline int huge_ptep_set_access_flags(struct vm_area_struct
>>> *vma,
>>> unsigned long addr, pte_t *ptep,
>>> pte_t pte, int dirty)
>>> diff --git a/arch/mips/include/asm/hugetlb.h
>>> b/arch/mips/include/asm/hugetlb.h
>>> index 6ff2531cfb1d..3dcf5debf8c4 100644
>>> --- a/arch/mips/include/asm/hugetlb.h
>>> +++ b/arch/mips/include/asm/hugetlb.h
>>> @@ -63,12 +63,6 @@ static inline int huge_pte_none(pte_t pte)
>>> return !val || (val == (unsigned long)invalid_pte_table);
>>> }
>>> -static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> - unsigned long addr, pte_t *ptep)
>>> -{
>>> - ptep_set_wrprotect(mm, addr, ptep);
>>> -}
>>> -
>>> static inline int huge_ptep_set_access_flags(struct vm_area_struct
>>> *vma,
>>> unsigned long addr,
>>> pte_t *ptep, pte_t pte,
>>> diff --git a/arch/parisc/include/asm/hugetlb.h
>>> b/arch/parisc/include/asm/hugetlb.h
>>> index fb7e0fd858a3..9c3950ca2974 100644
>>> --- a/arch/parisc/include/asm/hugetlb.h
>>> +++ b/arch/parisc/include/asm/hugetlb.h
>>> @@ -39,6 +39,7 @@ static inline void huge_ptep_clear_flush(struct
>>> vm_area_struct *vma,
>>> {
>>> }
>>> +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>> void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> unsigned long addr, pte_t *ptep);
>>> diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h
>>> b/arch/powerpc/include/asm/book3s/32/pgtable.h
>>> index 02f5acd7ccc4..d2cd1d0226e9 100644
>>> --- a/arch/powerpc/include/asm/book3s/32/pgtable.h
>>> +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
>>> @@ -228,6 +228,8 @@ static inline void ptep_set_wrprotect(struct
>>> mm_struct *mm, unsigned long addr,
>>> {
>>> pte_update(ptep, (_PAGE_RW | _PAGE_HWWRITE), _PAGE_RO);
>>> }
>>> +
>>> +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>> static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> unsigned long addr, pte_t *ptep)
>>> {
>>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h
>>> b/arch/powerpc/include/asm/book3s/64/pgtable.h
>>> index 42aafba7a308..7d957f7c47cd 100644
>>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>>> @@ -451,6 +451,7 @@ static inline void ptep_set_wrprotect(struct
>>> mm_struct *mm, unsigned long addr,
>>> pte_update(mm, addr, ptep, 0, _PAGE_PRIVILEGED, 0);
>>> }
>>> +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>> static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> unsigned long addr, pte_t *ptep)
>>> {
>>> diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h
>>> b/arch/powerpc/include/asm/nohash/32/pgtable.h
>>> index 7c46a98cc7f4..f39e200d9591 100644
>>> --- a/arch/powerpc/include/asm/nohash/32/pgtable.h
>>> +++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
>>> @@ -249,6 +249,8 @@ static inline void ptep_set_wrprotect(struct
>>> mm_struct *mm, unsigned long addr,
>>> {
>>> pte_update(ptep, (_PAGE_RW | _PAGE_HWWRITE), _PAGE_RO);
>>> }
>>> +
>>> +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>> static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> unsigned long addr, pte_t *ptep)
>>> {
>>> diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h
>>> b/arch/powerpc/include/asm/nohash/64/pgtable.h
>>> index dd0c7236208f..69fbf7e9b4db 100644
>>> --- a/arch/powerpc/include/asm/nohash/64/pgtable.h
>>> +++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
>>> @@ -238,6 +238,7 @@ static inline void ptep_set_wrprotect(struct
>>> mm_struct *mm, unsigned long addr,
>>> pte_update(mm, addr, ptep, _PAGE_RW, 0, 0);
>>> }
>>> +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>> static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> unsigned long addr, pte_t *ptep)
>>> {
>>> diff --git a/arch/sh/include/asm/hugetlb.h
>>> b/arch/sh/include/asm/hugetlb.h
>>> index f1bbd255ee43..8df4004977b9 100644
>>> --- a/arch/sh/include/asm/hugetlb.h
>>> +++ b/arch/sh/include/asm/hugetlb.h
>>> @@ -32,12 +32,6 @@ static inline void huge_ptep_clear_flush(struct
>>> vm_area_struct *vma,
>>> {
>>> }
>>> -static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> - unsigned long addr, pte_t *ptep)
>>> -{
>>> - ptep_set_wrprotect(mm, addr, ptep);
>>> -}
>>> -
>>> static inline int huge_ptep_set_access_flags(struct vm_area_struct
>>> *vma,
>>> unsigned long addr, pte_t *ptep,
>>> pte_t pte, int dirty)
>>> diff --git a/arch/sparc/include/asm/hugetlb.h
>>> b/arch/sparc/include/asm/hugetlb.h
>>> index 2101ea217f33..c41754a113f3 100644
>>> --- a/arch/sparc/include/asm/hugetlb.h
>>> +++ b/arch/sparc/include/asm/hugetlb.h
>>> @@ -32,6 +32,7 @@ static inline void huge_ptep_clear_flush(struct
>>> vm_area_struct *vma,
>>> {
>>> }
>>> +#define __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>> static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> unsigned long addr, pte_t *ptep)
>>> {
>>> diff --git a/arch/x86/include/asm/hugetlb.h
>>> b/arch/x86/include/asm/hugetlb.h
>>> index 59c056adb3c9..a3f781f7a264 100644
>>> --- a/arch/x86/include/asm/hugetlb.h
>>> +++ b/arch/x86/include/asm/hugetlb.h
>>> @@ -13,12 +13,6 @@ static inline int is_hugepage_only_range(struct
>>> mm_struct *mm,
>>> return 0;
>>> }
>>> -static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> - unsigned long addr, pte_t *ptep)
>>> -{
>>> - ptep_set_wrprotect(mm, addr, ptep);
>>> -}
>>> -
>>> static inline int huge_ptep_set_access_flags(struct vm_area_struct
>>> *vma,
>>> unsigned long addr, pte_t *ptep,
>>> pte_t pte, int dirty)
>>> diff --git a/include/asm-generic/hugetlb.h
>>> b/include/asm-generic/hugetlb.h
>>> index 6c0c8b0c71e0..9b9039845278 100644
>>> --- a/include/asm-generic/hugetlb.h
>>> +++ b/include/asm-generic/hugetlb.h
>>> @@ -102,4 +102,12 @@ static inline int prepare_hugepage_range(struct
>>> file *file,
>>> }
>>> #endif
>>> +#ifndef __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
>>> +static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
>>> + unsigned long addr, pte_t *ptep)
>>> +{
>>> + ptep_set_wrprotect(mm, addr, ptep);
>>> +}
>>> +#endif
>>> +
>>> #endif /* _ASM_GENERIC_HUGETLB_H */
>>> --
>>> 2.16.2
>
^ permalink raw reply
* [PATCH 1/2] powerpc: Allow memory that has been hot-removed to be hot-added
From: Rashmica Gupta @ 2018-08-03 6:06 UTC (permalink / raw)
To: linuxppc-dev, mpe, mikey, benh, paulus, bsingharora; +Cc: Rashmica Gupta
This patch allows the memory removed by memtrace to be readded to the
kernel. So now you don't have to reboot your system to add the memory
back to the kernel or to have a different amount of memory removed.
Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com>
---
To remove 1GB from each node:
echo 1073741824 > /sys/kernel/debug/powerpc/memtrace/enable
To add this memory back and remove 2GB:
echo 2147483648 > /sys/kernel/debug/powerpc/memtrace/enable
To just re-add memory:
echo 0 > /sys/kernel/debug/powerpc/memtrace/enable
arch/powerpc/platforms/powernv/memtrace.c | 93 ++++++++++++++++++++++++++++---
1 file changed, 86 insertions(+), 7 deletions(-)
diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index b99283df8584..51fe0862dcab 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -206,8 +206,11 @@ static int memtrace_init_debugfs(void)
snprintf(ent->name, 16, "%08x", ent->nid);
dir = debugfs_create_dir(ent->name, memtrace_debugfs_dir);
- if (!dir)
+ if (!dir) {
+ pr_err("Failed to create debugfs directory for node %d\n",
+ ent->nid);
return -1;
+ }
ent->dir = dir;
debugfs_create_file("trace", 0400, dir, ent, &memtrace_fops);
@@ -218,18 +221,94 @@ static int memtrace_init_debugfs(void)
return ret;
}
+static int online_mem_block(struct memory_block *mem, void *arg)
+{
+ return device_online(&mem->dev);
+}
+
+/*
+ * Iterate through the chunks of memory we have removed from the kernel
+ * and attempt to add them back to the kernel.
+ */
+static int memtrace_online(void)
+{
+ int i, ret = 0;
+ struct memtrace_entry *ent;
+
+ for (i = memtrace_array_nr - 1; i >= 0; i--) {
+ ent = &memtrace_array[i];
+
+ /* We have onlined this chunk previously */
+ if (ent->nid == -1)
+ continue;
+
+ /* Remove from io mappings */
+ if (ent->mem) {
+ iounmap(ent->mem);
+ ent->mem = 0;
+ }
+
+ if (add_memory(ent->nid, ent->start, ent->size)) {
+ pr_err("Failed to add trace memory to node %d\n",
+ ent->nid);
+ ret += 1;
+ continue;
+ }
+
+ /*
+ * If kernel isn't compiled with the auto online option
+ * we need to online the memory ourselves.
+ */
+ if (!memhp_auto_online) {
+ walk_memory_range(PFN_DOWN(ent->start),
+ PFN_UP(ent->start + ent->size - 1),
+ NULL, online_mem_block);
+ }
+
+ /*
+ * Memory was added successfully so clean up references to it
+ * so on reentry we can tell that this chunk was added.
+ */
+ debugfs_remove_recursive(ent->dir);
+ pr_info("Added trace memory back to node %d\n", ent->nid);
+ ent->size = ent->start = ent->nid = -1;
+ }
+ if (ret)
+ return ret;
+
+ /* If all chunks of memory were added successfully, reset globals */
+ kfree(memtrace_array);
+ memtrace_array = NULL;
+ memtrace_size = 0;
+ memtrace_array_nr = 0;
+ return 0;
+
+}
+
static int memtrace_enable_set(void *data, u64 val)
{
- if (memtrace_size)
+ uint64_t bytes;
+
+ /*
+ * Don't attempt to do anything if size isn't aligned to a memory
+ * block or equal to zero.
+ */
+ bytes = memory_block_size_bytes();
+ if (val & (bytes - 1)) {
+ pr_err("Value must be aligned with 0x%llx\n", bytes);
return -EINVAL;
+ }
- if (!val)
- return -EINVAL;
+ /* Re-add/online previously removed/offlined memory */
+ if (memtrace_size) {
+ if (memtrace_online())
+ return -EAGAIN;
+ }
- /* Make sure size is aligned to a memory block */
- if (val & (memory_block_size_bytes() - 1))
- return -EINVAL;
+ if (!val)
+ return 0;
+ /* Offline and remove memory */
if (memtrace_init_regions_runtime(val))
return -EINVAL;
--
2.14.4
^ permalink raw reply related
* [PATCH 2/2] Update documentation on ppc-memtrace
From: Rashmica Gupta @ 2018-08-03 6:06 UTC (permalink / raw)
To: linuxppc-dev, mpe, mikey, benh, paulus, bsingharora; +Cc: Rashmica Gupta
In-Reply-To: <20180803060601.724-1-rashmica.g@gmail.com>
Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com>
---
Documentation/ABI/testing/ppc-memtrace | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/Documentation/ABI/testing/ppc-memtrace b/Documentation/ABI/testing/ppc-memtrace
index 2e8b93741270..9606aed33137 100644
--- a/Documentation/ABI/testing/ppc-memtrace
+++ b/Documentation/ABI/testing/ppc-memtrace
@@ -13,10 +13,11 @@ Contact: linuxppc-dev@lists.ozlabs.org
Description: Write an integer containing the size in bytes of the memory
you want removed from each NUMA node to this file - it must be
aligned to the memblock size. This amount of RAM will be removed
- from the kernel mappings and the following debugfs files will be
- created. This can only be successfully done once per boot. Once
- memory is successfully removed from each node, the following
- files are created.
+ from each NUMA node in the kernel mappings and the following
+ debugfs files will be created. Once memory is successfully
+ removed from each node, the following files are created. To
+ re-add memory to the kernel, echo 0 into this file (it will be
+ automatically onlined).
What: /sys/kernel/debug/powerpc/memtrace/<node-id>
Date: Aug 2017
--
2.14.4
^ permalink raw reply related
* Re: [PATCH] powernv/cpuidle: Fix idle states all being marked invalid
From: Akshay Adiga @ 2018-08-03 6:14 UTC (permalink / raw)
To: Nicholas Piggin; +Cc: linuxppc-dev, Gautham R . Shenoy
In-Reply-To: <20180802153951.3576-1-npiggin@gmail.com>
On Fri, Aug 03, 2018 at 01:39:51AM +1000, Nicholas Piggin wrote:
> Commit 9c7b185ab2 ("powernv/cpuidle: Parse dt idle properties into
> global structure") parses dt idle states into structs, but never
> marks them valid. This results in all idle states being lost.
>
My bad. Thanks nick for fixing this. We definatetely need this.
> Cc: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
> Cc: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Acked-by: Akshay Adiga <akshay.adiga@linux.vnet.ibm.com>
> ---
> arch/powerpc/platforms/powernv/idle.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
> index 3116bab10aa3..ecb002c5db83 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -651,11 +651,12 @@ static int __init pnv_power9_idle_init(void)
> &state->psscr_mask,
> state->flags);
> if (err) {
> - state->valid = false;
> report_invalid_psscr_val(state->psscr_val, err);
> continue;
> }
>
> + state->valid = true;
> +
> if (max_residency_ns < state->residency_ns) {
> max_residency_ns = state->residency_ns;
> pnv_deepest_stop_psscr_val = state->psscr_val;
> --
> 2.17.0
>
^ permalink raw reply
* Re: [PATCH v4 5/6] powerpc: Add show_user_instructions()
From: Christophe LEROY @ 2018-08-03 6:38 UTC (permalink / raw)
To: Murilo Opsfelder Araujo
Cc: linux-kernel, Alastair D'Silva, Andrew Donnellan,
Balbir Singh, Benjamin Herrenschmidt, Cyril Bur,
Eric W . Biederman, Joe Perches, Michael Ellerman,
Michael Neuling, Nicholas Piggin, Paul Mackerras, Simon Guo,
Sukadev Bhattiprolu, Tobin C . Harding, linuxppc-dev,
Segher Boessenkool
In-Reply-To: <20180803004201.GA5891@kermit-br-ibm-com>
Hi Murilo,
Le 03/08/2018 à 02:42, Murilo Opsfelder Araujo a écrit :
> Hi, Christophe.
>
> On Thu, Aug 02, 2018 at 07:26:20AM +0200, Christophe LEROY wrote:
>>
>>
>> Le 01/08/2018 à 23:33, Murilo Opsfelder Araujo a écrit :
>>> show_user_instructions() is a slightly modified version of
>>> show_instructions() that allows userspace instruction dump.
>>>
>>> This will be useful within show_signal_msg() to dump userspace
>>> instructions of the faulty location.
>>>
>>> Here is a sample of what show_user_instructions() outputs:
>>>
>>> pandafault[10850]: code: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
>>> pandafault[10850]: code: 392988d0 f93f0020 e93f0020 39400048 <99490000> 39200000 7d234b78 383f0040
>>>
>>> The current->comm and current->pid printed can serve as a glue that
>>> links the instructions dump to its originator, allowing messages to be
>>> interleaved in the logs.
>>>
>>> Signed-off-by: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
>>> ---
>>> arch/powerpc/include/asm/stacktrace.h | 13 +++++++++
>>> arch/powerpc/kernel/process.c | 40 +++++++++++++++++++++++++++
>>> 2 files changed, 53 insertions(+)
>>> create mode 100644 arch/powerpc/include/asm/stacktrace.h
>>>
>>> diff --git a/arch/powerpc/include/asm/stacktrace.h b/arch/powerpc/include/asm/stacktrace.h
>>> new file mode 100644
>>> index 000000000000..6149b53b3bc8
>>> --- /dev/null
>>> +++ b/arch/powerpc/include/asm/stacktrace.h
>>> @@ -0,0 +1,13 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +/*
>>> + * Stack trace functions.
>>> + *
>>> + * Copyright 2018, Murilo Opsfelder Araujo, IBM Corporation.
>>> + */
>>> +
>>> +#ifndef _ASM_POWERPC_STACKTRACE_H
>>> +#define _ASM_POWERPC_STACKTRACE_H
>>> +
>>> +void show_user_instructions(struct pt_regs *regs);
>>> +
>>> +#endif /* _ASM_POWERPC_STACKTRACE_H */
>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>>> index e9533b4d2f08..364645ac732c 100644
>>> --- a/arch/powerpc/kernel/process.c
>>> +++ b/arch/powerpc/kernel/process.c
>>> @@ -1299,6 +1299,46 @@ static void show_instructions(struct pt_regs *regs)
>>> pr_cont("\n");
>>> }
>>> +void show_user_instructions(struct pt_regs *regs)
>>> +{
>>> + int i;
>>> + const char *prefix = KERN_INFO "%s[%d]: code: ";
>>> + unsigned long pc = regs->nip - (instructions_to_print * 3 / 4 *
>>> + sizeof(int));
>>> +
>>> + printk(prefix, current->comm, current->pid);
>>
>> Why not use pr_info() and remove KERN_INFO from *prefix ?
>
> Because it doesn't compile:
>
> arch/powerpc/kernel/process.c:1317:10: error: expected ‘)’ before ‘prefix’
> pr_info(prefix, current->comm, current->pid);
> ^
> ./include/linux/printk.h:288:21: note: in definition of macro ‘pr_fmt’
> #define pr_fmt(fmt) fmt
> ^
>
> `pr_info(prefix, ...)` expands to `printk("\001" "6" prefix, ...)`,
> which is an invalid string concatenation.
>
> `pr_info("%s", ...)` expands to `printk("\001" "6" "%s", ...)`, which is
> valid.
Then what about using directly:
pr_info("%s[%d]: code: ", ...);
>
>>> +
>>> + for (i = 0; i < instructions_to_print; i++) {
>>> + int instr;
>>> +
>>> + if (!(i % 8) && (i > 0)) {
>>> + pr_cont("\n");
>>> + printk(prefix, current->comm, current->pid);
>>> + }
>>> +
>>> +#if !defined(CONFIG_BOOKE)
>>> + /* If executing with the IMMU off, adjust pc rather
>>> + * than print XXXXXXXX.
>>> + */
>>> + if (!(regs->msr & MSR_IR))
>>> + pc = (unsigned long)phys_to_virt(pc);
>>
>> Shouldn't this be done outside of the loop, only once ?
>
> I don't think so.
>
> pc gets incremented at the bottom of the loop:
>
> pc += sizeof(int);
>
> Adjusting pc is necessary at each iteration. Leaving this block inside
> the loop seems correct.
This looks pretty strange.
The first time, pc is a physical address, that you change to a virtual
address. Then when you increment it it is still a virtual address.
So when you call phys_to_virt(pc) for the second time, pc is already a
virt address, so what happens indeed ?
Christophe
>
> Cheers
> Murilo
>
^ permalink raw reply
* RE: [PATCH v11 00/26] Speculative page faults
From: Song, HaiyanX @ 2018-08-03 6:36 UTC (permalink / raw)
To: Laurent Dufour
Cc: akpm@linux-foundation.org, mhocko@kernel.org,
peterz@infradead.org, kirill@shutemov.name, ak@linux.intel.com,
dave@stgolabs.net, jack@suse.cz, Matthew Wilcox,
khandual@linux.vnet.ibm.com, aneesh.kumar@linux.vnet.ibm.com,
benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org,
Thomas Gleixner, Ingo Molnar, hpa@zytor.com, Will Deacon,
Sergey Senozhatsky, sergey.senozhatsky.work@gmail.com,
Andrea Arcangeli, Alexei Starovoitov, Wang, Kemi, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
haren@linux.vnet.ibm.com, npiggin@gmail.com,
bsingharora@gmail.com, paulmck@linux.vnet.ibm.com, Tim Chen,
linuxppc-dev@lists.ozlabs.org, x86@kernel.org
In-Reply-To: <166434ae-ecaf-05d8-3cc7-7aa75bc3737b@linux.vnet.ibm.com>
[-- Attachment #1: Type: text/plain, Size: 42188 bytes --]
Hi Laurent,
Thanks for your analysis for the last perf results.
Your mentioned ," the major differences at the head of the perf report is the 92% testcase which is weirdly not reported
on the head side", which is a bug of 0-day,and it caused the item is not counted in perf.
I've triggered the test page_fault2 and page_fault3 again only with thread mode of will-it-scale on 0-day (on the same test box,every case tested 3 times).
I checked the perf report have no above mentioned problem.
I have compared them, found some items have difference, such as below case:
page_fault2-thp-always: handle_mm_fault, base: 45.22% head: 29.41%
page_fault3-thp-always: handle_mm_fault, base: 22.95% head: 14.15%
So i attached the perf result in mail again, could your have a look again for checking the difference between base and head commit.
Thanks,
Haiyan, Song
________________________________________
From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
Sent: Tuesday, July 17, 2018 5:36 PM
To: Song, HaiyanX
Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
Subject: Re: [PATCH v11 00/26] Speculative page faults
On 13/07/2018 05:56, Song, HaiyanX wrote:
> Hi Laurent,
Hi Haiyan,
Thanks a lot for sharing this perf reports.
I looked at them closely, and I've to admit that I was not able to found a
major difference between the base and the head report, except that
handle_pte_fault() is no more in-lined in the head one.
As expected, __handle_speculative_fault() is never traced since these tests are
dealing with file mapping, not handled in the speculative way.
When running these test did you seen a major differences in the test's result
between base and head ?
>From the number of cycles counted, the biggest difference is page_fault3 when
run with the THP enabled:
BASE HEAD Delta
page_fault2_base_thp_never 1142252426747 1065866197589 -6.69%
page_fault2_base_THP-Alwasys 1124844374523 1076312228927 -4.31%
page_fault3_base_thp_never 1099387298152 1134118402345 3.16%
page_fault3_base_THP-Always 1059370178101 853985561949 -19.39%
The very weird thing is the difference of the delta cycles reported between
thp never and thp always, because the speculative way is aborted when checking
for the vma->ops field, which is the same in both case, and the thp is never
checked. So there is no code covering differnce, on the speculative path,
between these 2 cases. This leads me to think that there are other interactions
interfering in the measure.
Looking at the perf-profile_page_fault3_*_THP-Always, the major differences at
the head of the perf report is the 92% testcase which is weirdly not reported
on the head side :
92.02% 22.33% page_fault3_processes [.] testcase
92.02% testcase
Then the base reported 37.67% for __do_page_fault() where the head reported
48.41%, but the only difference in this function, between base and head, is the
call to handle_speculative_fault(). But this is a macro checking for the fault
flags, and mm->users and then calling __handle_speculative_fault() if needed.
So this can't explain this difference, except if __handle_speculative_fault()
is inlined in __do_page_fault().
Is this the case on your build ?
Haiyan, do you still have the output of the test to check those numbers too ?
Cheers,
Laurent
> I attached the perf-profile.gz file for case page_fault2 and page_fault3. These files were captured during test the related test case.
> Please help to check on these data if it can help you to find the higher change. Thanks.
>
> File name perf-profile_page_fault2_head_THP-Always.gz, means the perf-profile result get from page_fault2
> tested for head commit (a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12) with THP_always configuration.
>
> Best regards,
> Haiyan Song
>
> ________________________________________
> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
> Sent: Thursday, July 12, 2018 1:05 AM
> To: Song, HaiyanX
> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
> Subject: Re: [PATCH v11 00/26] Speculative page faults
>
> Hi Haiyan,
>
> Do you get a chance to capture some performance cycles on your system ?
> I still can't get these numbers on my hardware.
>
> Thanks,
> Laurent.
>
> On 04/07/2018 09:51, Laurent Dufour wrote:
>> On 04/07/2018 05:23, Song, HaiyanX wrote:
>>> Hi Laurent,
>>>
>>>
>>> For the test result on Intel 4s skylake platform (192 CPUs, 768G Memory), the below test cases all were run 3 times.
>>> I check the test results, only page_fault3_thread/enable THP have 6% stddev for head commit, other tests have lower stddev.
>>
>> Repeating the test only 3 times seems a bit too low to me.
>>
>> I'll focus on the higher change for the moment, but I don't have access to such
>> a hardware.
>>
>> Is possible to provide a diff between base and SPF of the performance cycles
>> measured when running page_fault3 and page_fault2 when the 20% change is detected.
>>
>> Please stay focus on the test case process to see exactly where the series is
>> impacting.
>>
>> Thanks,
>> Laurent.
>>
>>>
>>> And I did not find other high variation on test case result.
>>>
>>> a). Enable THP
>>> testcase base stddev change head stddev metric
>>> page_fault3/enable THP 10519 ± 3% -20.5% 8368 ±6% will-it-scale.per_thread_ops
>>> page_fault2/enalbe THP 8281 ± 2% -18.8% 6728 will-it-scale.per_thread_ops
>>> brk1/eanble THP 998475 -2.2% 976893 will-it-scale.per_process_ops
>>> context_switch1/enable THP 223910 -1.3% 220930 will-it-scale.per_process_ops
>>> context_switch1/enable THP 233722 -1.0% 231288 will-it-scale.per_thread_ops
>>>
>>> b). Disable THP
>>> page_fault3/disable THP 10856 -23.1% 8344 will-it-scale.per_thread_ops
>>> page_fault2/disable THP 8147 -18.8% 6613 will-it-scale.per_thread_ops
>>> brk1/disable THP 957 -7.9% 881 will-it-scale.per_thread_ops
>>> context_switch1/disable THP 237006 -2.2% 231907 will-it-scale.per_thread_ops
>>> brk1/disable THP 997317 -2.0% 977778 will-it-scale.per_process_ops
>>> page_fault3/disable THP 467454 -1.8% 459251 will-it-scale.per_process_ops
>>> context_switch1/disable THP 224431 -1.3% 221567 will-it-scale.per_process_ops
>>>
>>>
>>> Best regards,
>>> Haiyan Song
>>> ________________________________________
>>> From: Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>> Sent: Monday, July 02, 2018 4:59 PM
>>> To: Song, HaiyanX
>>> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>> Subject: Re: [PATCH v11 00/26] Speculative page faults
>>>
>>> On 11/06/2018 09:49, Song, HaiyanX wrote:
>>>> Hi Laurent,
>>>>
>>>> Regression test for v11 patch serials have been run, some regression is found by LKP-tools (linux kernel performance)
>>>> tested on Intel 4s skylake platform. This time only test the cases which have been run and found regressions on
>>>> V9 patch serials.
>>>>
>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>>>> branch: Laurent-Dufour/Speculative-page-faults/20180520-045126
>>>> commit id:
>>>> head commit : a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12
>>>> base commit : ba98a1cdad71d259a194461b3a61471b49b14df1
>>>> Benchmark: will-it-scale
>>>> Download link: https://github.com/antonblanchard/will-it-scale/tree/master
>>>>
>>>> Metrics:
>>>> will-it-scale.per_process_ops=processes/nr_cpu
>>>> will-it-scale.per_thread_ops=threads/nr_cpu
>>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>>>> THP: enable / disable
>>>> nr_task:100%
>>>>
>>>> 1. Regressions:
>>>>
>>>> a). Enable THP
>>>> testcase base change head metric
>>>> page_fault3/enable THP 10519 -20.5% 836 will-it-scale.per_thread_ops
>>>> page_fault2/enalbe THP 8281 -18.8% 6728 will-it-scale.per_thread_ops
>>>> brk1/eanble THP 998475 -2.2% 976893 will-it-scale.per_process_ops
>>>> context_switch1/enable THP 223910 -1.3% 220930 will-it-scale.per_process_ops
>>>> context_switch1/enable THP 233722 -1.0% 231288 will-it-scale.per_thread_ops
>>>>
>>>> b). Disable THP
>>>> page_fault3/disable THP 10856 -23.1% 8344 will-it-scale.per_thread_ops
>>>> page_fault2/disable THP 8147 -18.8% 6613 will-it-scale.per_thread_ops
>>>> brk1/disable THP 957 -7.9% 881 will-it-scale.per_thread_ops
>>>> context_switch1/disable THP 237006 -2.2% 231907 will-it-scale.per_thread_ops
>>>> brk1/disable THP 997317 -2.0% 977778 will-it-scale.per_process_ops
>>>> page_fault3/disable THP 467454 -1.8% 459251 will-it-scale.per_process_ops
>>>> context_switch1/disable THP 224431 -1.3% 221567 will-it-scale.per_process_ops
>>>>
>>>> Notes: for the above values of test result, the higher is better.
>>>
>>> I tried the same tests on my PowerPC victim VM (1024 CPUs, 11TB) and I can't
>>> get reproducible results. The results have huge variation, even on the vanilla
>>> kernel, and I can't state on any changes due to that.
>>>
>>> I tried on smaller node (80 CPUs, 32G), and the tests ran better, but I didn't
>>> measure any changes between the vanilla and the SPF patched ones:
>>>
>>> test THP enabled 4.17.0-rc4-mm1 spf delta
>>> page_fault3_threads 2697.7 2683.5 -0.53%
>>> page_fault2_threads 170660.6 169574.1 -0.64%
>>> context_switch1_threads 6915269.2 6877507.3 -0.55%
>>> context_switch1_processes 6478076.2 6529493.5 0.79%
>>> brk1 243391.2 238527.5 -2.00%
>>>
>>> Tests were run 10 times, no high variation detected.
>>>
>>> Did you see high variation on your side ? How many times the test were run to
>>> compute the average values ?
>>>
>>> Thanks,
>>> Laurent.
>>>
>>>
>>>>
>>>> 2. Improvement: not found improvement based on the selected test cases.
>>>>
>>>>
>>>> Best regards
>>>> Haiyan Song
>>>> ________________________________________
>>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>>> Sent: Monday, May 28, 2018 4:54 PM
>>>> To: Song, HaiyanX
>>>> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>>> Subject: Re: [PATCH v11 00/26] Speculative page faults
>>>>
>>>> On 28/05/2018 10:22, Haiyan Song wrote:
>>>>> Hi Laurent,
>>>>>
>>>>> Yes, these tests are done on V9 patch.
>>>>
>>>> Do you plan to give this V11 a run ?
>>>>
>>>>>
>>>>>
>>>>> Best regards,
>>>>> Haiyan Song
>>>>>
>>>>> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote:
>>>>>> On 28/05/2018 07:23, Song, HaiyanX wrote:
>>>>>>>
>>>>>>> Some regression and improvements is found by LKP-tools(linux kernel performance) on V9 patch series
>>>>>>> tested on Intel 4s Skylake platform.
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks for reporting this benchmark results, but you mentioned the "V9 patch
>>>>>> series" while responding to the v11 header series...
>>>>>> Were these tests done on v9 or v11 ?
>>>>>>
>>>>>> Cheers,
>>>>>> Laurent.
>>>>>>
>>>>>>>
>>>>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>>>>>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series)
>>>>>>> Commit id:
>>>>>>> base commit: d55f34411b1b126429a823d06c3124c16283231f
>>>>>>> head commit: 0355322b3577eeab7669066df42c550a56801110
>>>>>>> Benchmark suite: will-it-scale
>>>>>>> Download link:
>>>>>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests
>>>>>>> Metrics:
>>>>>>> will-it-scale.per_process_ops=processes/nr_cpu
>>>>>>> will-it-scale.per_thread_ops=threads/nr_cpu
>>>>>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>>>>>>> THP: enable / disable
>>>>>>> nr_task: 100%
>>>>>>>
>>>>>>> 1. Regressions:
>>>>>>> a) THP enabled:
>>>>>>> testcase base change head metric
>>>>>>> page_fault3/ enable THP 10092 -17.5% 8323 will-it-scale.per_thread_ops
>>>>>>> page_fault2/ enable THP 8300 -17.2% 6869 will-it-scale.per_thread_ops
>>>>>>> brk1/ enable THP 957.67 -7.6% 885 will-it-scale.per_thread_ops
>>>>>>> page_fault3/ enable THP 172821 -5.3% 163692 will-it-scale.per_process_ops
>>>>>>> signal1/ enable THP 9125 -3.2% 8834 will-it-scale.per_process_ops
>>>>>>>
>>>>>>> b) THP disabled:
>>>>>>> testcase base change head metric
>>>>>>> page_fault3/ disable THP 10107 -19.1% 8180 will-it-scale.per_thread_ops
>>>>>>> page_fault2/ disable THP 8432 -17.8% 6931 will-it-scale.per_thread_ops
>>>>>>> context_switch1/ disable THP 215389 -6.8% 200776 will-it-scale.per_thread_ops
>>>>>>> brk1/ disable THP 939.67 -6.6% 877.33 will-it-scale.per_thread_ops
>>>>>>> page_fault3/ disable THP 173145 -4.7% 165064 will-it-scale.per_process_ops
>>>>>>> signal1/ disable THP 9162 -3.9% 8802 will-it-scale.per_process_ops
>>>>>>>
>>>>>>> 2. Improvements:
>>>>>>> a) THP enabled:
>>>>>>> testcase base change head metric
>>>>>>> malloc1/ enable THP 66.33 +469.8% 383.67 will-it-scale.per_thread_ops
>>>>>>> writeseek3/ enable THP 2531 +4.5% 2646 will-it-scale.per_thread_ops
>>>>>>> signal1/ enable THP 989.33 +2.8% 1016 will-it-scale.per_thread_ops
>>>>>>>
>>>>>>> b) THP disabled:
>>>>>>> testcase base change head metric
>>>>>>> malloc1/ disable THP 90.33 +417.3% 467.33 will-it-scale.per_thread_ops
>>>>>>> read2/ disable THP 58934 +39.2% 82060 will-it-scale.per_thread_ops
>>>>>>> page_fault1/ disable THP 8607 +36.4% 11736 will-it-scale.per_thread_ops
>>>>>>> read1/ disable THP 314063 +12.7% 353934 will-it-scale.per_thread_ops
>>>>>>> writeseek3/ disable THP 2452 +12.5% 2759 will-it-scale.per_thread_ops
>>>>>>> signal1/ disable THP 971.33 +5.5% 1024 will-it-scale.per_thread_ops
>>>>>>>
>>>>>>> Notes: for above values in column "change", the higher value means that the related testcase result
>>>>>>> on head commit is better than that on base commit for this benchmark.
>>>>>>>
>>>>>>>
>>>>>>> Best regards
>>>>>>> Haiyan Song
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>>>>>> Sent: Thursday, May 17, 2018 7:06 PM
>>>>>>> To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi
>>>>>>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>>>>>> Subject: [PATCH v11 00/26] Speculative page faults
>>>>>>>
>>>>>>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle
>>>>>>> page fault without holding the mm semaphore [1].
>>>>>>>
>>>>>>> The idea is to try to handle user space page faults without holding the
>>>>>>> mmap_sem. This should allow better concurrency for massively threaded
>>>>>>> process since the page fault handler will not wait for other threads memory
>>>>>>> layout change to be done, assuming that this change is done in another part
>>>>>>> of the process's memory space. This type page fault is named speculative
>>>>>>> page fault. If the speculative page fault fails because of a concurrency is
>>>>>>> detected or because underlying PMD or PTE tables are not yet allocating, it
>>>>>>> is failing its processing and a classic page fault is then tried.
>>>>>>>
>>>>>>> The speculative page fault (SPF) has to look for the VMA matching the fault
>>>>>>> address without holding the mmap_sem, this is done by introducing a rwlock
>>>>>>> which protects the access to the mm_rb tree. Previously this was done using
>>>>>>> SRCU but it was introducing a lot of scheduling to process the VMA's
>>>>>>> freeing operation which was hitting the performance by 20% as reported by
>>>>>>> Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree is
>>>>>>> limiting the locking contention to these operations which are expected to
>>>>>>> be in a O(log n) order. In addition to ensure that the VMA is not freed in
>>>>>>> our back a reference count is added and 2 services (get_vma() and
>>>>>>> put_vma()) are introduced to handle the reference count. Once a VMA is
>>>>>>> fetched from the RB tree using get_vma(), it must be later freed using
>>>>>>> put_vma(). I can't see anymore the overhead I got while will-it-scale
>>>>>>> benchmark anymore.
>>>>>>>
>>>>>>> The VMA's attributes checked during the speculative page fault processing
>>>>>>> have to be protected against parallel changes. This is done by using a per
>>>>>>> VMA sequence lock. This sequence lock allows the speculative page fault
>>>>>>> handler to fast check for parallel changes in progress and to abort the
>>>>>>> speculative page fault in that case.
>>>>>>>
>>>>>>> Once the VMA has been found, the speculative page fault handler would check
>>>>>>> for the VMA's attributes to verify that the page fault has to be handled
>>>>>>> correctly or not. Thus, the VMA is protected through a sequence lock which
>>>>>>> allows fast detection of concurrent VMA changes. If such a change is
>>>>>>> detected, the speculative page fault is aborted and a *classic* page fault
>>>>>>> is tried. VMA sequence lockings are added when VMA attributes which are
>>>>>>> checked during the page fault are modified.
>>>>>>>
>>>>>>> When the PTE is fetched, the VMA is checked to see if it has been changed,
>>>>>>> so once the page table is locked, the VMA is valid, so any other changes
>>>>>>> leading to touching this PTE will need to lock the page table, so no
>>>>>>> parallel change is possible at this time.
>>>>>>>
>>>>>>> The locking of the PTE is done with interrupts disabled, this allows
>>>>>>> checking for the PMD to ensure that there is not an ongoing collapsing
>>>>>>> operation. Since khugepaged is firstly set the PMD to pmd_none and then is
>>>>>>> waiting for the other CPU to have caught the IPI interrupt, if the pmd is
>>>>>>> valid at the time the PTE is locked, we have the guarantee that the
>>>>>>> collapsing operation will have to wait on the PTE lock to move forward.
>>>>>>> This allows the SPF handler to map the PTE safely. If the PMD value is
>>>>>>> different from the one recorded at the beginning of the SPF operation, the
>>>>>>> classic page fault handler will be called to handle the operation while
>>>>>>> holding the mmap_sem. As the PTE lock is done with the interrupts disabled,
>>>>>>> the lock is done using spin_trylock() to avoid dead lock when handling a
>>>>>>> page fault while a TLB invalidate is requested by another CPU holding the
>>>>>>> PTE.
>>>>>>>
>>>>>>> In pseudo code, this could be seen as:
>>>>>>> speculative_page_fault()
>>>>>>> {
>>>>>>> vma = get_vma()
>>>>>>> check vma sequence count
>>>>>>> check vma's support
>>>>>>> disable interrupt
>>>>>>> check pgd,p4d,...,pte
>>>>>>> save pmd and pte in vmf
>>>>>>> save vma sequence counter in vmf
>>>>>>> enable interrupt
>>>>>>> check vma sequence count
>>>>>>> handle_pte_fault(vma)
>>>>>>> ..
>>>>>>> page = alloc_page()
>>>>>>> pte_map_lock()
>>>>>>> disable interrupt
>>>>>>> abort if sequence counter has changed
>>>>>>> abort if pmd or pte has changed
>>>>>>> pte map and lock
>>>>>>> enable interrupt
>>>>>>> if abort
>>>>>>> free page
>>>>>>> abort
>>>>>>> ...
>>>>>>> }
>>>>>>>
>>>>>>> arch_fault_handler()
>>>>>>> {
>>>>>>> if (speculative_page_fault(&vma))
>>>>>>> goto done
>>>>>>> again:
>>>>>>> lock(mmap_sem)
>>>>>>> vma = find_vma();
>>>>>>> handle_pte_fault(vma);
>>>>>>> if retry
>>>>>>> unlock(mmap_sem)
>>>>>>> goto again;
>>>>>>> done:
>>>>>>> handle fault error
>>>>>>> }
>>>>>>>
>>>>>>> Support for THP is not done because when checking for the PMD, we can be
>>>>>>> confused by an in progress collapsing operation done by khugepaged. The
>>>>>>> issue is that pmd_none() could be true either if the PMD is not already
>>>>>>> populated or if the underlying PTE are in the way to be collapsed. So we
>>>>>>> cannot safely allocate a PMD if pmd_none() is true.
>>>>>>>
>>>>>>> This series add a new software performance event named 'speculative-faults'
>>>>>>> or 'spf'. It counts the number of successful page fault event handled
>>>>>>> speculatively. When recording 'faults,spf' events, the faults one is
>>>>>>> counting the total number of page fault events while 'spf' is only counting
>>>>>>> the part of the faults processed speculatively.
>>>>>>>
>>>>>>> There are some trace events introduced by this series. They allow
>>>>>>> identifying why the page faults were not processed speculatively. This
>>>>>>> doesn't take in account the faults generated by a monothreaded process
>>>>>>> which directly processed while holding the mmap_sem. This trace events are
>>>>>>> grouped in a system named 'pagefault', they are:
>>>>>>> - pagefault:spf_vma_changed : if the VMA has been changed in our back
>>>>>>> - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set.
>>>>>>> - pagefault:spf_vma_notsup : the VMA's type is not supported
>>>>>>> - pagefault:spf_vma_access : the VMA's access right are not respected
>>>>>>> - pagefault:spf_pmd_changed : the upper PMD pointer has changed in our
>>>>>>> back.
>>>>>>>
>>>>>>> To record all the related events, the easier is to run perf with the
>>>>>>> following arguments :
>>>>>>> $ perf stat -e 'faults,spf,pagefault:*' <command>
>>>>>>>
>>>>>>> There is also a dedicated vmstat counter showing the number of successful
>>>>>>> page fault handled speculatively. I can be seen this way:
>>>>>>> $ grep speculative_pgfault /proc/vmstat
>>>>>>>
>>>>>>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is functional
>>>>>>> on x86, PowerPC and arm64.
>>>>>>>
>>>>>>> ---------------------
>>>>>>> Real Workload results
>>>>>>>
>>>>>>> As mentioned in previous email, we did non official runs using a "popular
>>>>>>> in memory multithreaded database product" on 176 cores SMT8 Power system
>>>>>>> which showed a 30% improvements in the number of transaction processed per
>>>>>>> second. This run has been done on the v6 series, but changes introduced in
>>>>>>> this new version should not impact the performance boost seen.
>>>>>>>
>>>>>>> Here are the perf data captured during 2 of these runs on top of the v8
>>>>>>> series:
>>>>>>> vanilla spf
>>>>>>> faults 89.418 101.364 +13%
>>>>>>> spf n/a 97.989
>>>>>>>
>>>>>>> With the SPF kernel, most of the page fault were processed in a speculative
>>>>>>> way.
>>>>>>>
>>>>>>> Ganesh Mahendran had backported the series on top of a 4.9 kernel and gave
>>>>>>> it a try on an android device. He reported that the application launch time
>>>>>>> was improved in average by 6%, and for large applications (~100 threads) by
>>>>>>> 20%.
>>>>>>>
>>>>>>> Here are the launch time Ganesh mesured on Android 8.0 on top of a Qcom
>>>>>>> MSM845 (8 cores) with 6GB (the less is better):
>>>>>>>
>>>>>>> Application 4.9 4.9+spf delta
>>>>>>> com.tencent.mm 416 389 -7%
>>>>>>> com.eg.android.AlipayGphone 1135 986 -13%
>>>>>>> com.tencent.mtt 455 454 0%
>>>>>>> com.qqgame.hlddz 1497 1409 -6%
>>>>>>> com.autonavi.minimap 711 701 -1%
>>>>>>> com.tencent.tmgp.sgame 788 748 -5%
>>>>>>> com.immomo.momo 501 487 -3%
>>>>>>> com.tencent.peng 2145 2112 -2%
>>>>>>> com.smile.gifmaker 491 461 -6%
>>>>>>> com.baidu.BaiduMap 479 366 -23%
>>>>>>> com.taobao.taobao 1341 1198 -11%
>>>>>>> com.baidu.searchbox 333 314 -6%
>>>>>>> com.tencent.mobileqq 394 384 -3%
>>>>>>> com.sina.weibo 907 906 0%
>>>>>>> com.youku.phone 816 731 -11%
>>>>>>> com.happyelements.AndroidAnimal.qq 763 717 -6%
>>>>>>> com.UCMobile 415 411 -1%
>>>>>>> com.tencent.tmgp.ak 1464 1431 -2%
>>>>>>> com.tencent.qqmusic 336 329 -2%
>>>>>>> com.sankuai.meituan 1661 1302 -22%
>>>>>>> com.netease.cloudmusic 1193 1200 1%
>>>>>>> air.tv.douyu.android 4257 4152 -2%
>>>>>>>
>>>>>>> ------------------
>>>>>>> Benchmarks results
>>>>>>>
>>>>>>> Base kernel is v4.17.0-rc4-mm1
>>>>>>> SPF is BASE + this series
>>>>>>>
>>>>>>> Kernbench:
>>>>>>> ----------
>>>>>>> Here are the results on a 16 CPUs X86 guest using kernbench on a 4.15
>>>>>>> kernel (kernel is build 5 times):
>>>>>>>
>>>>>>> Average Half load -j 8
>>>>>>> Run (std deviation)
>>>>>>> BASE SPF
>>>>>>> Elapsed Time 1448.65 (5.72312) 1455.84 (4.84951) 0.50%
>>>>>>> User Time 10135.4 (30.3699) 10148.8 (31.1252) 0.13%
>>>>>>> System Time 900.47 (2.81131) 923.28 (7.52779) 2.53%
>>>>>>> Percent CPU 761.4 (1.14018) 760.2 (0.447214) -0.16%
>>>>>>> Context Switches 85380 (3419.52) 84748 (1904.44) -0.74%
>>>>>>> Sleeps 105064 (1240.96) 105074 (337.612) 0.01%
>>>>>>>
>>>>>>> Average Optimal load -j 16
>>>>>>> Run (std deviation)
>>>>>>> BASE SPF
>>>>>>> Elapsed Time 920.528 (10.1212) 927.404 (8.91789) 0.75%
>>>>>>> User Time 11064.8 (981.142) 11085 (990.897) 0.18%
>>>>>>> System Time 979.904 (84.0615) 1001.14 (82.5523) 2.17%
>>>>>>> Percent CPU 1089.5 (345.894) 1086.1 (343.545) -0.31%
>>>>>>> Context Switches 159488 (78156.4) 158223 (77472.1) -0.79%
>>>>>>> Sleeps 110566 (5877.49) 110388 (5617.75) -0.16%
>>>>>>>
>>>>>>>
>>>>>>> During a run on the SPF, perf events were captured:
>>>>>>> Performance counter stats for '../kernbench -M':
>>>>>>> 526743764 faults
>>>>>>> 210 spf
>>>>>>> 3 pagefault:spf_vma_changed
>>>>>>> 0 pagefault:spf_vma_noanon
>>>>>>> 2278 pagefault:spf_vma_notsup
>>>>>>> 0 pagefault:spf_vma_access
>>>>>>> 0 pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> Very few speculative page faults were recorded as most of the processes
>>>>>>> involved are monothreaded (sounds that on this architecture some threads
>>>>>>> were created during the kernel build processing).
>>>>>>>
>>>>>>> Here are the kerbench results on a 80 CPUs Power8 system:
>>>>>>>
>>>>>>> Average Half load -j 40
>>>>>>> Run (std deviation)
>>>>>>> BASE SPF
>>>>>>> Elapsed Time 117.152 (0.774642) 117.166 (0.476057) 0.01%
>>>>>>> User Time 4478.52 (24.7688) 4479.76 (9.08555) 0.03%
>>>>>>> System Time 131.104 (0.720056) 134.04 (0.708414) 2.24%
>>>>>>> Percent CPU 3934 (19.7104) 3937.2 (19.0184) 0.08%
>>>>>>> Context Switches 92125.4 (576.787) 92581.6 (198.622) 0.50%
>>>>>>> Sleeps 317923 (652.499) 318469 (1255.59) 0.17%
>>>>>>>
>>>>>>> Average Optimal load -j 80
>>>>>>> Run (std deviation)
>>>>>>> BASE SPF
>>>>>>> Elapsed Time 107.73 (0.632416) 107.31 (0.584936) -0.39%
>>>>>>> User Time 5869.86 (1466.72) 5871.71 (1467.27) 0.03%
>>>>>>> System Time 153.728 (23.8573) 157.153 (24.3704) 2.23%
>>>>>>> Percent CPU 5418.6 (1565.17) 5436.7 (1580.91) 0.33%
>>>>>>> Context Switches 223861 (138865) 225032 (139632) 0.52%
>>>>>>> Sleeps 330529 (13495.1) 332001 (14746.2) 0.45%
>>>>>>>
>>>>>>> During a run on the SPF, perf events were captured:
>>>>>>> Performance counter stats for '../kernbench -M':
>>>>>>> 116730856 faults
>>>>>>> 0 spf
>>>>>>> 3 pagefault:spf_vma_changed
>>>>>>> 0 pagefault:spf_vma_noanon
>>>>>>> 476 pagefault:spf_vma_notsup
>>>>>>> 0 pagefault:spf_vma_access
>>>>>>> 0 pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> Most of the processes involved are monothreaded so SPF is not activated but
>>>>>>> there is no impact on the performance.
>>>>>>>
>>>>>>> Ebizzy:
>>>>>>> -------
>>>>>>> The test is counting the number of records per second it can manage, the
>>>>>>> higher is the best. I run it like this 'ebizzy -mTt <nrcpus>'. To get
>>>>>>> consistent result I repeated the test 100 times and measure the average
>>>>>>> result. The number is the record processes per second, the higher is the
>>>>>>> best.
>>>>>>>
>>>>>>> BASE SPF delta
>>>>>>> 16 CPUs x86 VM 742.57 1490.24 100.69%
>>>>>>> 80 CPUs P8 node 13105.4 24174.23 84.46%
>>>>>>>
>>>>>>> Here are the performance counter read during a run on a 16 CPUs x86 VM:
>>>>>>> Performance counter stats for './ebizzy -mTt 16':
>>>>>>> 1706379 faults
>>>>>>> 1674599 spf
>>>>>>> 30588 pagefault:spf_vma_changed
>>>>>>> 0 pagefault:spf_vma_noanon
>>>>>>> 363 pagefault:spf_vma_notsup
>>>>>>> 0 pagefault:spf_vma_access
>>>>>>> 0 pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> And the ones captured during a run on a 80 CPUs Power node:
>>>>>>> Performance counter stats for './ebizzy -mTt 80':
>>>>>>> 1874773 faults
>>>>>>> 1461153 spf
>>>>>>> 413293 pagefault:spf_vma_changed
>>>>>>> 0 pagefault:spf_vma_noanon
>>>>>>> 200 pagefault:spf_vma_notsup
>>>>>>> 0 pagefault:spf_vma_access
>>>>>>> 0 pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> In ebizzy's case most of the page fault were handled in a speculative way,
>>>>>>> leading the ebizzy performance boost.
>>>>>>>
>>>>>>> ------------------
>>>>>>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572):
>>>>>>> - Accounted for all review feedbacks from Punit Agrawal, Ganesh Mahendran
>>>>>>> and Minchan Kim, hopefully.
>>>>>>> - Remove unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in
>>>>>>> __do_page_fault().
>>>>>>> - Loop in pte_spinlock() and pte_map_lock() when pte try lock fails
>>>>>>> instead
>>>>>>> of aborting the speculative page fault handling. Dropping the now
>>>>>>> useless
>>>>>>> trace event pagefault:spf_pte_lock.
>>>>>>> - No more try to reuse the fetched VMA during the speculative page fault
>>>>>>> handling when retrying is needed. This adds a lot of complexity and
>>>>>>> additional tests done didn't show a significant performance improvement.
>>>>>>> - Convert IS_ENABLED(CONFIG_NUMA) back to #ifdef due to build error.
>>>>>>>
>>>>>>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
>>>>>>> [2] https://patchwork.kernel.org/patch/9999687/
>>>>>>>
>>>>>>>
>>>>>>> Laurent Dufour (20):
>>>>>>> mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
>>>>>>> x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>> powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>> mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
>>>>>>> mm: make pte_unmap_same compatible with SPF
>>>>>>> mm: introduce INIT_VMA()
>>>>>>> mm: protect VMA modifications using VMA sequence count
>>>>>>> mm: protect mremap() against SPF hanlder
>>>>>>> mm: protect SPF handler against anon_vma changes
>>>>>>> mm: cache some VMA fields in the vm_fault structure
>>>>>>> mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
>>>>>>> mm: introduce __lru_cache_add_active_or_unevictable
>>>>>>> mm: introduce __vm_normal_page()
>>>>>>> mm: introduce __page_add_new_anon_rmap()
>>>>>>> mm: protect mm_rb tree with a rwlock
>>>>>>> mm: adding speculative page fault failure trace events
>>>>>>> perf: add a speculative page fault sw event
>>>>>>> perf tools: add support for the SPF perf event
>>>>>>> mm: add speculative page fault vmstats
>>>>>>> powerpc/mm: add speculative page fault
>>>>>>>
>>>>>>> Mahendran Ganesh (2):
>>>>>>> arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>> arm64/mm: add speculative page fault
>>>>>>>
>>>>>>> Peter Zijlstra (4):
>>>>>>> mm: prepare for FAULT_FLAG_SPECULATIVE
>>>>>>> mm: VMA sequence count
>>>>>>> mm: provide speculative fault infrastructure
>>>>>>> x86/mm: add speculative pagefault handling
>>>>>>>
>>>>>>> arch/arm64/Kconfig | 1 +
>>>>>>> arch/arm64/mm/fault.c | 12 +
>>>>>>> arch/powerpc/Kconfig | 1 +
>>>>>>> arch/powerpc/mm/fault.c | 16 +
>>>>>>> arch/x86/Kconfig | 1 +
>>>>>>> arch/x86/mm/fault.c | 27 +-
>>>>>>> fs/exec.c | 2 +-
>>>>>>> fs/proc/task_mmu.c | 5 +-
>>>>>>> fs/userfaultfd.c | 17 +-
>>>>>>> include/linux/hugetlb_inline.h | 2 +-
>>>>>>> include/linux/migrate.h | 4 +-
>>>>>>> include/linux/mm.h | 136 +++++++-
>>>>>>> include/linux/mm_types.h | 7 +
>>>>>>> include/linux/pagemap.h | 4 +-
>>>>>>> include/linux/rmap.h | 12 +-
>>>>>>> include/linux/swap.h | 10 +-
>>>>>>> include/linux/vm_event_item.h | 3 +
>>>>>>> include/trace/events/pagefault.h | 80 +++++
>>>>>>> include/uapi/linux/perf_event.h | 1 +
>>>>>>> kernel/fork.c | 5 +-
>>>>>>> mm/Kconfig | 22 ++
>>>>>>> mm/huge_memory.c | 6 +-
>>>>>>> mm/hugetlb.c | 2 +
>>>>>>> mm/init-mm.c | 3 +
>>>>>>> mm/internal.h | 20 ++
>>>>>>> mm/khugepaged.c | 5 +
>>>>>>> mm/madvise.c | 6 +-
>>>>>>> mm/memory.c | 612 +++++++++++++++++++++++++++++-----
>>>>>>> mm/mempolicy.c | 51 ++-
>>>>>>> mm/migrate.c | 6 +-
>>>>>>> mm/mlock.c | 13 +-
>>>>>>> mm/mmap.c | 229 ++++++++++---
>>>>>>> mm/mprotect.c | 4 +-
>>>>>>> mm/mremap.c | 13 +
>>>>>>> mm/nommu.c | 2 +-
>>>>>>> mm/rmap.c | 5 +-
>>>>>>> mm/swap.c | 6 +-
>>>>>>> mm/swap_state.c | 8 +-
>>>>>>> mm/vmstat.c | 5 +-
>>>>>>> tools/include/uapi/linux/perf_event.h | 1 +
>>>>>>> tools/perf/util/evsel.c | 1 +
>>>>>>> tools/perf/util/parse-events.c | 4 +
>>>>>>> tools/perf/util/parse-events.l | 1 +
>>>>>>> tools/perf/util/python.c | 1 +
>>>>>>> 44 files changed, 1161 insertions(+), 211 deletions(-)
>>>>>>> create mode 100644 include/trace/events/pagefault.h
>>>>>>>
>>>>>>> --
>>>>>>> 2.7.4
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
[-- Attachment #2: perf-profile_page_fault2_base_thp_always.gz --]
[-- Type: application/gzip, Size: 12167 bytes --]
[-- Attachment #3: perf-profile_page_fault2_base_thp_never.gz --]
[-- Type: application/gzip, Size: 11543 bytes --]
[-- Attachment #4: perf-profile_page_fault2_head_thp_always.gz --]
[-- Type: application/gzip, Size: 12019 bytes --]
[-- Attachment #5: perf-profile_page_fault3_base_thp_always.gz --]
[-- Type: application/gzip, Size: 12701 bytes --]
[-- Attachment #6: perf-profile_page_fault3_base_thp_always.gz --]
[-- Type: application/gzip, Size: 12701 bytes --]
[-- Attachment #7: perf-profile_page_fault3_base_thp_never.gz --]
[-- Type: application/gzip, Size: 12699 bytes --]
^ permalink raw reply
* RE: [PATCH v11 00/26] Speculative page faults
From: Song, HaiyanX @ 2018-08-03 6:45 UTC (permalink / raw)
To: Laurent Dufour
Cc: akpm@linux-foundation.org, mhocko@kernel.org,
peterz@infradead.org, kirill@shutemov.name, ak@linux.intel.com,
dave@stgolabs.net, jack@suse.cz, Matthew Wilcox,
khandual@linux.vnet.ibm.com, aneesh.kumar@linux.vnet.ibm.com,
benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org,
Thomas Gleixner, Ingo Molnar, hpa@zytor.com, Will Deacon,
Sergey Senozhatsky, sergey.senozhatsky.work@gmail.com,
Andrea Arcangeli, Alexei Starovoitov, Wang, Kemi, Daniel Jordan,
David Rientjes, Jerome Glisse, Ganesh Mahendran, Minchan Kim,
Punit Agrawal, vinayak menon, Yang Shi,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
haren@linux.vnet.ibm.com, npiggin@gmail.com,
bsingharora@gmail.com, paulmck@linux.vnet.ibm.com, Tim Chen,
linuxppc-dev@lists.ozlabs.org, x86@kernel.org
In-Reply-To: <9FE19350E8A7EE45B64D8D63D368C8966B876B4B@SHSMSX101.ccr.corp.intel.com>
[-- Attachment #1: Type: text/plain, Size: 43157 bytes --]
Add another 3 perf file.
________________________________________
From: Song, HaiyanX
Sent: Friday, August 03, 2018 2:36 PM
To: Laurent Dufour
Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
Subject: RE: [PATCH v11 00/26] Speculative page faults
Hi Laurent,
Thanks for your analysis for the last perf results.
Your mentioned ," the major differences at the head of the perf report is the 92% testcase which is weirdly not reported
on the head side", which is a bug of 0-day,and it caused the item is not counted in perf.
I've triggered the test page_fault2 and page_fault3 again only with thread mode of will-it-scale on 0-day (on the same test box,every case tested 3 times).
I checked the perf report have no above mentioned problem.
I have compared them, found some items have difference, such as below case:
page_fault2-thp-always: handle_mm_fault, base: 45.22% head: 29.41%
page_fault3-thp-always: handle_mm_fault, base: 22.95% head: 14.15%
So i attached the perf result in mail again, could your have a look again for checking the difference between base and head commit.
Thanks,
Haiyan, Song
________________________________________
From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
Sent: Tuesday, July 17, 2018 5:36 PM
To: Song, HaiyanX
Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
Subject: Re: [PATCH v11 00/26] Speculative page faults
On 13/07/2018 05:56, Song, HaiyanX wrote:
> Hi Laurent,
Hi Haiyan,
Thanks a lot for sharing this perf reports.
I looked at them closely, and I've to admit that I was not able to found a
major difference between the base and the head report, except that
handle_pte_fault() is no more in-lined in the head one.
As expected, __handle_speculative_fault() is never traced since these tests are
dealing with file mapping, not handled in the speculative way.
When running these test did you seen a major differences in the test's result
between base and head ?
>From the number of cycles counted, the biggest difference is page_fault3 when
run with the THP enabled:
BASE HEAD Delta
page_fault2_base_thp_never 1142252426747 1065866197589 -6.69%
page_fault2_base_THP-Alwasys 1124844374523 1076312228927 -4.31%
page_fault3_base_thp_never 1099387298152 1134118402345 3.16%
page_fault3_base_THP-Always 1059370178101 853985561949 -19.39%
The very weird thing is the difference of the delta cycles reported between
thp never and thp always, because the speculative way is aborted when checking
for the vma->ops field, which is the same in both case, and the thp is never
checked. So there is no code covering differnce, on the speculative path,
between these 2 cases. This leads me to think that there are other interactions
interfering in the measure.
Looking at the perf-profile_page_fault3_*_THP-Always, the major differences at
the head of the perf report is the 92% testcase which is weirdly not reported
on the head side :
92.02% 22.33% page_fault3_processes [.] testcase
92.02% testcase
Then the base reported 37.67% for __do_page_fault() where the head reported
48.41%, but the only difference in this function, between base and head, is the
call to handle_speculative_fault(). But this is a macro checking for the fault
flags, and mm->users and then calling __handle_speculative_fault() if needed.
So this can't explain this difference, except if __handle_speculative_fault()
is inlined in __do_page_fault().
Is this the case on your build ?
Haiyan, do you still have the output of the test to check those numbers too ?
Cheers,
Laurent
> I attached the perf-profile.gz file for case page_fault2 and page_fault3. These files were captured during test the related test case.
> Please help to check on these data if it can help you to find the higher change. Thanks.
>
> File name perf-profile_page_fault2_head_THP-Always.gz, means the perf-profile result get from page_fault2
> tested for head commit (a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12) with THP_always configuration.
>
> Best regards,
> Haiyan Song
>
> ________________________________________
> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
> Sent: Thursday, July 12, 2018 1:05 AM
> To: Song, HaiyanX
> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
> Subject: Re: [PATCH v11 00/26] Speculative page faults
>
> Hi Haiyan,
>
> Do you get a chance to capture some performance cycles on your system ?
> I still can't get these numbers on my hardware.
>
> Thanks,
> Laurent.
>
> On 04/07/2018 09:51, Laurent Dufour wrote:
>> On 04/07/2018 05:23, Song, HaiyanX wrote:
>>> Hi Laurent,
>>>
>>>
>>> For the test result on Intel 4s skylake platform (192 CPUs, 768G Memory), the below test cases all were run 3 times.
>>> I check the test results, only page_fault3_thread/enable THP have 6% stddev for head commit, other tests have lower stddev.
>>
>> Repeating the test only 3 times seems a bit too low to me.
>>
>> I'll focus on the higher change for the moment, but I don't have access to such
>> a hardware.
>>
>> Is possible to provide a diff between base and SPF of the performance cycles
>> measured when running page_fault3 and page_fault2 when the 20% change is detected.
>>
>> Please stay focus on the test case process to see exactly where the series is
>> impacting.
>>
>> Thanks,
>> Laurent.
>>
>>>
>>> And I did not find other high variation on test case result.
>>>
>>> a). Enable THP
>>> testcase base stddev change head stddev metric
>>> page_fault3/enable THP 10519 ± 3% -20.5% 8368 ±6% will-it-scale.per_thread_ops
>>> page_fault2/enalbe THP 8281 ± 2% -18.8% 6728 will-it-scale.per_thread_ops
>>> brk1/eanble THP 998475 -2.2% 976893 will-it-scale.per_process_ops
>>> context_switch1/enable THP 223910 -1.3% 220930 will-it-scale.per_process_ops
>>> context_switch1/enable THP 233722 -1.0% 231288 will-it-scale.per_thread_ops
>>>
>>> b). Disable THP
>>> page_fault3/disable THP 10856 -23.1% 8344 will-it-scale.per_thread_ops
>>> page_fault2/disable THP 8147 -18.8% 6613 will-it-scale.per_thread_ops
>>> brk1/disable THP 957 -7.9% 881 will-it-scale.per_thread_ops
>>> context_switch1/disable THP 237006 -2.2% 231907 will-it-scale.per_thread_ops
>>> brk1/disable THP 997317 -2.0% 977778 will-it-scale.per_process_ops
>>> page_fault3/disable THP 467454 -1.8% 459251 will-it-scale.per_process_ops
>>> context_switch1/disable THP 224431 -1.3% 221567 will-it-scale.per_process_ops
>>>
>>>
>>> Best regards,
>>> Haiyan Song
>>> ________________________________________
>>> From: Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>> Sent: Monday, July 02, 2018 4:59 PM
>>> To: Song, HaiyanX
>>> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>> Subject: Re: [PATCH v11 00/26] Speculative page faults
>>>
>>> On 11/06/2018 09:49, Song, HaiyanX wrote:
>>>> Hi Laurent,
>>>>
>>>> Regression test for v11 patch serials have been run, some regression is found by LKP-tools (linux kernel performance)
>>>> tested on Intel 4s skylake platform. This time only test the cases which have been run and found regressions on
>>>> V9 patch serials.
>>>>
>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>>>> branch: Laurent-Dufour/Speculative-page-faults/20180520-045126
>>>> commit id:
>>>> head commit : a7a8993bfe3ccb54ad468b9f1799649e4ad1ff12
>>>> base commit : ba98a1cdad71d259a194461b3a61471b49b14df1
>>>> Benchmark: will-it-scale
>>>> Download link: https://github.com/antonblanchard/will-it-scale/tree/master
>>>>
>>>> Metrics:
>>>> will-it-scale.per_process_ops=processes/nr_cpu
>>>> will-it-scale.per_thread_ops=threads/nr_cpu
>>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>>>> THP: enable / disable
>>>> nr_task:100%
>>>>
>>>> 1. Regressions:
>>>>
>>>> a). Enable THP
>>>> testcase base change head metric
>>>> page_fault3/enable THP 10519 -20.5% 836 will-it-scale.per_thread_ops
>>>> page_fault2/enalbe THP 8281 -18.8% 6728 will-it-scale.per_thread_ops
>>>> brk1/eanble THP 998475 -2.2% 976893 will-it-scale.per_process_ops
>>>> context_switch1/enable THP 223910 -1.3% 220930 will-it-scale.per_process_ops
>>>> context_switch1/enable THP 233722 -1.0% 231288 will-it-scale.per_thread_ops
>>>>
>>>> b). Disable THP
>>>> page_fault3/disable THP 10856 -23.1% 8344 will-it-scale.per_thread_ops
>>>> page_fault2/disable THP 8147 -18.8% 6613 will-it-scale.per_thread_ops
>>>> brk1/disable THP 957 -7.9% 881 will-it-scale.per_thread_ops
>>>> context_switch1/disable THP 237006 -2.2% 231907 will-it-scale.per_thread_ops
>>>> brk1/disable THP 997317 -2.0% 977778 will-it-scale.per_process_ops
>>>> page_fault3/disable THP 467454 -1.8% 459251 will-it-scale.per_process_ops
>>>> context_switch1/disable THP 224431 -1.3% 221567 will-it-scale.per_process_ops
>>>>
>>>> Notes: for the above values of test result, the higher is better.
>>>
>>> I tried the same tests on my PowerPC victim VM (1024 CPUs, 11TB) and I can't
>>> get reproducible results. The results have huge variation, even on the vanilla
>>> kernel, and I can't state on any changes due to that.
>>>
>>> I tried on smaller node (80 CPUs, 32G), and the tests ran better, but I didn't
>>> measure any changes between the vanilla and the SPF patched ones:
>>>
>>> test THP enabled 4.17.0-rc4-mm1 spf delta
>>> page_fault3_threads 2697.7 2683.5 -0.53%
>>> page_fault2_threads 170660.6 169574.1 -0.64%
>>> context_switch1_threads 6915269.2 6877507.3 -0.55%
>>> context_switch1_processes 6478076.2 6529493.5 0.79%
>>> brk1 243391.2 238527.5 -2.00%
>>>
>>> Tests were run 10 times, no high variation detected.
>>>
>>> Did you see high variation on your side ? How many times the test were run to
>>> compute the average values ?
>>>
>>> Thanks,
>>> Laurent.
>>>
>>>
>>>>
>>>> 2. Improvement: not found improvement based on the selected test cases.
>>>>
>>>>
>>>> Best regards
>>>> Haiyan Song
>>>> ________________________________________
>>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>>> Sent: Monday, May 28, 2018 4:54 PM
>>>> To: Song, HaiyanX
>>>> Cc: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>>> Subject: Re: [PATCH v11 00/26] Speculative page faults
>>>>
>>>> On 28/05/2018 10:22, Haiyan Song wrote:
>>>>> Hi Laurent,
>>>>>
>>>>> Yes, these tests are done on V9 patch.
>>>>
>>>> Do you plan to give this V11 a run ?
>>>>
>>>>>
>>>>>
>>>>> Best regards,
>>>>> Haiyan Song
>>>>>
>>>>> On Mon, May 28, 2018 at 09:51:34AM +0200, Laurent Dufour wrote:
>>>>>> On 28/05/2018 07:23, Song, HaiyanX wrote:
>>>>>>>
>>>>>>> Some regression and improvements is found by LKP-tools(linux kernel performance) on V9 patch series
>>>>>>> tested on Intel 4s Skylake platform.
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks for reporting this benchmark results, but you mentioned the "V9 patch
>>>>>> series" while responding to the v11 header series...
>>>>>> Were these tests done on v9 or v11 ?
>>>>>>
>>>>>> Cheers,
>>>>>> Laurent.
>>>>>>
>>>>>>>
>>>>>>> The regression result is sorted by the metric will-it-scale.per_thread_ops.
>>>>>>> Branch: Laurent-Dufour/Speculative-page-faults/20180316-151833 (V9 patch series)
>>>>>>> Commit id:
>>>>>>> base commit: d55f34411b1b126429a823d06c3124c16283231f
>>>>>>> head commit: 0355322b3577eeab7669066df42c550a56801110
>>>>>>> Benchmark suite: will-it-scale
>>>>>>> Download link:
>>>>>>> https://github.com/antonblanchard/will-it-scale/tree/master/tests
>>>>>>> Metrics:
>>>>>>> will-it-scale.per_process_ops=processes/nr_cpu
>>>>>>> will-it-scale.per_thread_ops=threads/nr_cpu
>>>>>>> test box: lkp-skl-4sp1(nr_cpu=192,memory=768G)
>>>>>>> THP: enable / disable
>>>>>>> nr_task: 100%
>>>>>>>
>>>>>>> 1. Regressions:
>>>>>>> a) THP enabled:
>>>>>>> testcase base change head metric
>>>>>>> page_fault3/ enable THP 10092 -17.5% 8323 will-it-scale.per_thread_ops
>>>>>>> page_fault2/ enable THP 8300 -17.2% 6869 will-it-scale.per_thread_ops
>>>>>>> brk1/ enable THP 957.67 -7.6% 885 will-it-scale.per_thread_ops
>>>>>>> page_fault3/ enable THP 172821 -5.3% 163692 will-it-scale.per_process_ops
>>>>>>> signal1/ enable THP 9125 -3.2% 8834 will-it-scale.per_process_ops
>>>>>>>
>>>>>>> b) THP disabled:
>>>>>>> testcase base change head metric
>>>>>>> page_fault3/ disable THP 10107 -19.1% 8180 will-it-scale.per_thread_ops
>>>>>>> page_fault2/ disable THP 8432 -17.8% 6931 will-it-scale.per_thread_ops
>>>>>>> context_switch1/ disable THP 215389 -6.8% 200776 will-it-scale.per_thread_ops
>>>>>>> brk1/ disable THP 939.67 -6.6% 877.33 will-it-scale.per_thread_ops
>>>>>>> page_fault3/ disable THP 173145 -4.7% 165064 will-it-scale.per_process_ops
>>>>>>> signal1/ disable THP 9162 -3.9% 8802 will-it-scale.per_process_ops
>>>>>>>
>>>>>>> 2. Improvements:
>>>>>>> a) THP enabled:
>>>>>>> testcase base change head metric
>>>>>>> malloc1/ enable THP 66.33 +469.8% 383.67 will-it-scale.per_thread_ops
>>>>>>> writeseek3/ enable THP 2531 +4.5% 2646 will-it-scale.per_thread_ops
>>>>>>> signal1/ enable THP 989.33 +2.8% 1016 will-it-scale.per_thread_ops
>>>>>>>
>>>>>>> b) THP disabled:
>>>>>>> testcase base change head metric
>>>>>>> malloc1/ disable THP 90.33 +417.3% 467.33 will-it-scale.per_thread_ops
>>>>>>> read2/ disable THP 58934 +39.2% 82060 will-it-scale.per_thread_ops
>>>>>>> page_fault1/ disable THP 8607 +36.4% 11736 will-it-scale.per_thread_ops
>>>>>>> read1/ disable THP 314063 +12.7% 353934 will-it-scale.per_thread_ops
>>>>>>> writeseek3/ disable THP 2452 +12.5% 2759 will-it-scale.per_thread_ops
>>>>>>> signal1/ disable THP 971.33 +5.5% 1024 will-it-scale.per_thread_ops
>>>>>>>
>>>>>>> Notes: for above values in column "change", the higher value means that the related testcase result
>>>>>>> on head commit is better than that on base commit for this benchmark.
>>>>>>>
>>>>>>>
>>>>>>> Best regards
>>>>>>> Haiyan Song
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: owner-linux-mm@kvack.org [owner-linux-mm@kvack.org] on behalf of Laurent Dufour [ldufour@linux.vnet.ibm.com]
>>>>>>> Sent: Thursday, May 17, 2018 7:06 PM
>>>>>>> To: akpm@linux-foundation.org; mhocko@kernel.org; peterz@infradead.org; kirill@shutemov.name; ak@linux.intel.com; dave@stgolabs.net; jack@suse.cz; Matthew Wilcox; khandual@linux.vnet.ibm.com; aneesh.kumar@linux.vnet.ibm.com; benh@kernel.crashing.org; mpe@ellerman.id.au; paulus@samba.org; Thomas Gleixner; Ingo Molnar; hpa@zytor.com; Will Deacon; Sergey Senozhatsky; sergey.senozhatsky.work@gmail.com; Andrea Arcangeli; Alexei Starovoitov; Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi
>>>>>>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; haren@linux.vnet.ibm.com; npiggin@gmail.com; bsingharora@gmail.com; paulmck@linux.vnet.ibm.com; Tim Chen; linuxppc-dev@lists.ozlabs.org; x86@kernel.org
>>>>>>> Subject: [PATCH v11 00/26] Speculative page faults
>>>>>>>
>>>>>>> This is a port on kernel 4.17 of the work done by Peter Zijlstra to handle
>>>>>>> page fault without holding the mm semaphore [1].
>>>>>>>
>>>>>>> The idea is to try to handle user space page faults without holding the
>>>>>>> mmap_sem. This should allow better concurrency for massively threaded
>>>>>>> process since the page fault handler will not wait for other threads memory
>>>>>>> layout change to be done, assuming that this change is done in another part
>>>>>>> of the process's memory space. This type page fault is named speculative
>>>>>>> page fault. If the speculative page fault fails because of a concurrency is
>>>>>>> detected or because underlying PMD or PTE tables are not yet allocating, it
>>>>>>> is failing its processing and a classic page fault is then tried.
>>>>>>>
>>>>>>> The speculative page fault (SPF) has to look for the VMA matching the fault
>>>>>>> address without holding the mmap_sem, this is done by introducing a rwlock
>>>>>>> which protects the access to the mm_rb tree. Previously this was done using
>>>>>>> SRCU but it was introducing a lot of scheduling to process the VMA's
>>>>>>> freeing operation which was hitting the performance by 20% as reported by
>>>>>>> Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree is
>>>>>>> limiting the locking contention to these operations which are expected to
>>>>>>> be in a O(log n) order. In addition to ensure that the VMA is not freed in
>>>>>>> our back a reference count is added and 2 services (get_vma() and
>>>>>>> put_vma()) are introduced to handle the reference count. Once a VMA is
>>>>>>> fetched from the RB tree using get_vma(), it must be later freed using
>>>>>>> put_vma(). I can't see anymore the overhead I got while will-it-scale
>>>>>>> benchmark anymore.
>>>>>>>
>>>>>>> The VMA's attributes checked during the speculative page fault processing
>>>>>>> have to be protected against parallel changes. This is done by using a per
>>>>>>> VMA sequence lock. This sequence lock allows the speculative page fault
>>>>>>> handler to fast check for parallel changes in progress and to abort the
>>>>>>> speculative page fault in that case.
>>>>>>>
>>>>>>> Once the VMA has been found, the speculative page fault handler would check
>>>>>>> for the VMA's attributes to verify that the page fault has to be handled
>>>>>>> correctly or not. Thus, the VMA is protected through a sequence lock which
>>>>>>> allows fast detection of concurrent VMA changes. If such a change is
>>>>>>> detected, the speculative page fault is aborted and a *classic* page fault
>>>>>>> is tried. VMA sequence lockings are added when VMA attributes which are
>>>>>>> checked during the page fault are modified.
>>>>>>>
>>>>>>> When the PTE is fetched, the VMA is checked to see if it has been changed,
>>>>>>> so once the page table is locked, the VMA is valid, so any other changes
>>>>>>> leading to touching this PTE will need to lock the page table, so no
>>>>>>> parallel change is possible at this time.
>>>>>>>
>>>>>>> The locking of the PTE is done with interrupts disabled, this allows
>>>>>>> checking for the PMD to ensure that there is not an ongoing collapsing
>>>>>>> operation. Since khugepaged is firstly set the PMD to pmd_none and then is
>>>>>>> waiting for the other CPU to have caught the IPI interrupt, if the pmd is
>>>>>>> valid at the time the PTE is locked, we have the guarantee that the
>>>>>>> collapsing operation will have to wait on the PTE lock to move forward.
>>>>>>> This allows the SPF handler to map the PTE safely. If the PMD value is
>>>>>>> different from the one recorded at the beginning of the SPF operation, the
>>>>>>> classic page fault handler will be called to handle the operation while
>>>>>>> holding the mmap_sem. As the PTE lock is done with the interrupts disabled,
>>>>>>> the lock is done using spin_trylock() to avoid dead lock when handling a
>>>>>>> page fault while a TLB invalidate is requested by another CPU holding the
>>>>>>> PTE.
>>>>>>>
>>>>>>> In pseudo code, this could be seen as:
>>>>>>> speculative_page_fault()
>>>>>>> {
>>>>>>> vma = get_vma()
>>>>>>> check vma sequence count
>>>>>>> check vma's support
>>>>>>> disable interrupt
>>>>>>> check pgd,p4d,...,pte
>>>>>>> save pmd and pte in vmf
>>>>>>> save vma sequence counter in vmf
>>>>>>> enable interrupt
>>>>>>> check vma sequence count
>>>>>>> handle_pte_fault(vma)
>>>>>>> ..
>>>>>>> page = alloc_page()
>>>>>>> pte_map_lock()
>>>>>>> disable interrupt
>>>>>>> abort if sequence counter has changed
>>>>>>> abort if pmd or pte has changed
>>>>>>> pte map and lock
>>>>>>> enable interrupt
>>>>>>> if abort
>>>>>>> free page
>>>>>>> abort
>>>>>>> ...
>>>>>>> }
>>>>>>>
>>>>>>> arch_fault_handler()
>>>>>>> {
>>>>>>> if (speculative_page_fault(&vma))
>>>>>>> goto done
>>>>>>> again:
>>>>>>> lock(mmap_sem)
>>>>>>> vma = find_vma();
>>>>>>> handle_pte_fault(vma);
>>>>>>> if retry
>>>>>>> unlock(mmap_sem)
>>>>>>> goto again;
>>>>>>> done:
>>>>>>> handle fault error
>>>>>>> }
>>>>>>>
>>>>>>> Support for THP is not done because when checking for the PMD, we can be
>>>>>>> confused by an in progress collapsing operation done by khugepaged. The
>>>>>>> issue is that pmd_none() could be true either if the PMD is not already
>>>>>>> populated or if the underlying PTE are in the way to be collapsed. So we
>>>>>>> cannot safely allocate a PMD if pmd_none() is true.
>>>>>>>
>>>>>>> This series add a new software performance event named 'speculative-faults'
>>>>>>> or 'spf'. It counts the number of successful page fault event handled
>>>>>>> speculatively. When recording 'faults,spf' events, the faults one is
>>>>>>> counting the total number of page fault events while 'spf' is only counting
>>>>>>> the part of the faults processed speculatively.
>>>>>>>
>>>>>>> There are some trace events introduced by this series. They allow
>>>>>>> identifying why the page faults were not processed speculatively. This
>>>>>>> doesn't take in account the faults generated by a monothreaded process
>>>>>>> which directly processed while holding the mmap_sem. This trace events are
>>>>>>> grouped in a system named 'pagefault', they are:
>>>>>>> - pagefault:spf_vma_changed : if the VMA has been changed in our back
>>>>>>> - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set.
>>>>>>> - pagefault:spf_vma_notsup : the VMA's type is not supported
>>>>>>> - pagefault:spf_vma_access : the VMA's access right are not respected
>>>>>>> - pagefault:spf_pmd_changed : the upper PMD pointer has changed in our
>>>>>>> back.
>>>>>>>
>>>>>>> To record all the related events, the easier is to run perf with the
>>>>>>> following arguments :
>>>>>>> $ perf stat -e 'faults,spf,pagefault:*' <command>
>>>>>>>
>>>>>>> There is also a dedicated vmstat counter showing the number of successful
>>>>>>> page fault handled speculatively. I can be seen this way:
>>>>>>> $ grep speculative_pgfault /proc/vmstat
>>>>>>>
>>>>>>> This series builds on top of v4.16-mmotm-2018-04-13-17-28 and is functional
>>>>>>> on x86, PowerPC and arm64.
>>>>>>>
>>>>>>> ---------------------
>>>>>>> Real Workload results
>>>>>>>
>>>>>>> As mentioned in previous email, we did non official runs using a "popular
>>>>>>> in memory multithreaded database product" on 176 cores SMT8 Power system
>>>>>>> which showed a 30% improvements in the number of transaction processed per
>>>>>>> second. This run has been done on the v6 series, but changes introduced in
>>>>>>> this new version should not impact the performance boost seen.
>>>>>>>
>>>>>>> Here are the perf data captured during 2 of these runs on top of the v8
>>>>>>> series:
>>>>>>> vanilla spf
>>>>>>> faults 89.418 101.364 +13%
>>>>>>> spf n/a 97.989
>>>>>>>
>>>>>>> With the SPF kernel, most of the page fault were processed in a speculative
>>>>>>> way.
>>>>>>>
>>>>>>> Ganesh Mahendran had backported the series on top of a 4.9 kernel and gave
>>>>>>> it a try on an android device. He reported that the application launch time
>>>>>>> was improved in average by 6%, and for large applications (~100 threads) by
>>>>>>> 20%.
>>>>>>>
>>>>>>> Here are the launch time Ganesh mesured on Android 8.0 on top of a Qcom
>>>>>>> MSM845 (8 cores) with 6GB (the less is better):
>>>>>>>
>>>>>>> Application 4.9 4.9+spf delta
>>>>>>> com.tencent.mm 416 389 -7%
>>>>>>> com.eg.android.AlipayGphone 1135 986 -13%
>>>>>>> com.tencent.mtt 455 454 0%
>>>>>>> com.qqgame.hlddz 1497 1409 -6%
>>>>>>> com.autonavi.minimap 711 701 -1%
>>>>>>> com.tencent.tmgp.sgame 788 748 -5%
>>>>>>> com.immomo.momo 501 487 -3%
>>>>>>> com.tencent.peng 2145 2112 -2%
>>>>>>> com.smile.gifmaker 491 461 -6%
>>>>>>> com.baidu.BaiduMap 479 366 -23%
>>>>>>> com.taobao.taobao 1341 1198 -11%
>>>>>>> com.baidu.searchbox 333 314 -6%
>>>>>>> com.tencent.mobileqq 394 384 -3%
>>>>>>> com.sina.weibo 907 906 0%
>>>>>>> com.youku.phone 816 731 -11%
>>>>>>> com.happyelements.AndroidAnimal.qq 763 717 -6%
>>>>>>> com.UCMobile 415 411 -1%
>>>>>>> com.tencent.tmgp.ak 1464 1431 -2%
>>>>>>> com.tencent.qqmusic 336 329 -2%
>>>>>>> com.sankuai.meituan 1661 1302 -22%
>>>>>>> com.netease.cloudmusic 1193 1200 1%
>>>>>>> air.tv.douyu.android 4257 4152 -2%
>>>>>>>
>>>>>>> ------------------
>>>>>>> Benchmarks results
>>>>>>>
>>>>>>> Base kernel is v4.17.0-rc4-mm1
>>>>>>> SPF is BASE + this series
>>>>>>>
>>>>>>> Kernbench:
>>>>>>> ----------
>>>>>>> Here are the results on a 16 CPUs X86 guest using kernbench on a 4.15
>>>>>>> kernel (kernel is build 5 times):
>>>>>>>
>>>>>>> Average Half load -j 8
>>>>>>> Run (std deviation)
>>>>>>> BASE SPF
>>>>>>> Elapsed Time 1448.65 (5.72312) 1455.84 (4.84951) 0.50%
>>>>>>> User Time 10135.4 (30.3699) 10148.8 (31.1252) 0.13%
>>>>>>> System Time 900.47 (2.81131) 923.28 (7.52779) 2.53%
>>>>>>> Percent CPU 761.4 (1.14018) 760.2 (0.447214) -0.16%
>>>>>>> Context Switches 85380 (3419.52) 84748 (1904.44) -0.74%
>>>>>>> Sleeps 105064 (1240.96) 105074 (337.612) 0.01%
>>>>>>>
>>>>>>> Average Optimal load -j 16
>>>>>>> Run (std deviation)
>>>>>>> BASE SPF
>>>>>>> Elapsed Time 920.528 (10.1212) 927.404 (8.91789) 0.75%
>>>>>>> User Time 11064.8 (981.142) 11085 (990.897) 0.18%
>>>>>>> System Time 979.904 (84.0615) 1001.14 (82.5523) 2.17%
>>>>>>> Percent CPU 1089.5 (345.894) 1086.1 (343.545) -0.31%
>>>>>>> Context Switches 159488 (78156.4) 158223 (77472.1) -0.79%
>>>>>>> Sleeps 110566 (5877.49) 110388 (5617.75) -0.16%
>>>>>>>
>>>>>>>
>>>>>>> During a run on the SPF, perf events were captured:
>>>>>>> Performance counter stats for '../kernbench -M':
>>>>>>> 526743764 faults
>>>>>>> 210 spf
>>>>>>> 3 pagefault:spf_vma_changed
>>>>>>> 0 pagefault:spf_vma_noanon
>>>>>>> 2278 pagefault:spf_vma_notsup
>>>>>>> 0 pagefault:spf_vma_access
>>>>>>> 0 pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> Very few speculative page faults were recorded as most of the processes
>>>>>>> involved are monothreaded (sounds that on this architecture some threads
>>>>>>> were created during the kernel build processing).
>>>>>>>
>>>>>>> Here are the kerbench results on a 80 CPUs Power8 system:
>>>>>>>
>>>>>>> Average Half load -j 40
>>>>>>> Run (std deviation)
>>>>>>> BASE SPF
>>>>>>> Elapsed Time 117.152 (0.774642) 117.166 (0.476057) 0.01%
>>>>>>> User Time 4478.52 (24.7688) 4479.76 (9.08555) 0.03%
>>>>>>> System Time 131.104 (0.720056) 134.04 (0.708414) 2.24%
>>>>>>> Percent CPU 3934 (19.7104) 3937.2 (19.0184) 0.08%
>>>>>>> Context Switches 92125.4 (576.787) 92581.6 (198.622) 0.50%
>>>>>>> Sleeps 317923 (652.499) 318469 (1255.59) 0.17%
>>>>>>>
>>>>>>> Average Optimal load -j 80
>>>>>>> Run (std deviation)
>>>>>>> BASE SPF
>>>>>>> Elapsed Time 107.73 (0.632416) 107.31 (0.584936) -0.39%
>>>>>>> User Time 5869.86 (1466.72) 5871.71 (1467.27) 0.03%
>>>>>>> System Time 153.728 (23.8573) 157.153 (24.3704) 2.23%
>>>>>>> Percent CPU 5418.6 (1565.17) 5436.7 (1580.91) 0.33%
>>>>>>> Context Switches 223861 (138865) 225032 (139632) 0.52%
>>>>>>> Sleeps 330529 (13495.1) 332001 (14746.2) 0.45%
>>>>>>>
>>>>>>> During a run on the SPF, perf events were captured:
>>>>>>> Performance counter stats for '../kernbench -M':
>>>>>>> 116730856 faults
>>>>>>> 0 spf
>>>>>>> 3 pagefault:spf_vma_changed
>>>>>>> 0 pagefault:spf_vma_noanon
>>>>>>> 476 pagefault:spf_vma_notsup
>>>>>>> 0 pagefault:spf_vma_access
>>>>>>> 0 pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> Most of the processes involved are monothreaded so SPF is not activated but
>>>>>>> there is no impact on the performance.
>>>>>>>
>>>>>>> Ebizzy:
>>>>>>> -------
>>>>>>> The test is counting the number of records per second it can manage, the
>>>>>>> higher is the best. I run it like this 'ebizzy -mTt <nrcpus>'. To get
>>>>>>> consistent result I repeated the test 100 times and measure the average
>>>>>>> result. The number is the record processes per second, the higher is the
>>>>>>> best.
>>>>>>>
>>>>>>> BASE SPF delta
>>>>>>> 16 CPUs x86 VM 742.57 1490.24 100.69%
>>>>>>> 80 CPUs P8 node 13105.4 24174.23 84.46%
>>>>>>>
>>>>>>> Here are the performance counter read during a run on a 16 CPUs x86 VM:
>>>>>>> Performance counter stats for './ebizzy -mTt 16':
>>>>>>> 1706379 faults
>>>>>>> 1674599 spf
>>>>>>> 30588 pagefault:spf_vma_changed
>>>>>>> 0 pagefault:spf_vma_noanon
>>>>>>> 363 pagefault:spf_vma_notsup
>>>>>>> 0 pagefault:spf_vma_access
>>>>>>> 0 pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> And the ones captured during a run on a 80 CPUs Power node:
>>>>>>> Performance counter stats for './ebizzy -mTt 80':
>>>>>>> 1874773 faults
>>>>>>> 1461153 spf
>>>>>>> 413293 pagefault:spf_vma_changed
>>>>>>> 0 pagefault:spf_vma_noanon
>>>>>>> 200 pagefault:spf_vma_notsup
>>>>>>> 0 pagefault:spf_vma_access
>>>>>>> 0 pagefault:spf_pmd_changed
>>>>>>>
>>>>>>> In ebizzy's case most of the page fault were handled in a speculative way,
>>>>>>> leading the ebizzy performance boost.
>>>>>>>
>>>>>>> ------------------
>>>>>>> Changes since v10 (https://lkml.org/lkml/2018/4/17/572):
>>>>>>> - Accounted for all review feedbacks from Punit Agrawal, Ganesh Mahendran
>>>>>>> and Minchan Kim, hopefully.
>>>>>>> - Remove unneeded check on CONFIG_SPECULATIVE_PAGE_FAULT in
>>>>>>> __do_page_fault().
>>>>>>> - Loop in pte_spinlock() and pte_map_lock() when pte try lock fails
>>>>>>> instead
>>>>>>> of aborting the speculative page fault handling. Dropping the now
>>>>>>> useless
>>>>>>> trace event pagefault:spf_pte_lock.
>>>>>>> - No more try to reuse the fetched VMA during the speculative page fault
>>>>>>> handling when retrying is needed. This adds a lot of complexity and
>>>>>>> additional tests done didn't show a significant performance improvement.
>>>>>>> - Convert IS_ENABLED(CONFIG_NUMA) back to #ifdef due to build error.
>>>>>>>
>>>>>>> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
>>>>>>> [2] https://patchwork.kernel.org/patch/9999687/
>>>>>>>
>>>>>>>
>>>>>>> Laurent Dufour (20):
>>>>>>> mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT
>>>>>>> x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>> powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>> mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
>>>>>>> mm: make pte_unmap_same compatible with SPF
>>>>>>> mm: introduce INIT_VMA()
>>>>>>> mm: protect VMA modifications using VMA sequence count
>>>>>>> mm: protect mremap() against SPF hanlder
>>>>>>> mm: protect SPF handler against anon_vma changes
>>>>>>> mm: cache some VMA fields in the vm_fault structure
>>>>>>> mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
>>>>>>> mm: introduce __lru_cache_add_active_or_unevictable
>>>>>>> mm: introduce __vm_normal_page()
>>>>>>> mm: introduce __page_add_new_anon_rmap()
>>>>>>> mm: protect mm_rb tree with a rwlock
>>>>>>> mm: adding speculative page fault failure trace events
>>>>>>> perf: add a speculative page fault sw event
>>>>>>> perf tools: add support for the SPF perf event
>>>>>>> mm: add speculative page fault vmstats
>>>>>>> powerpc/mm: add speculative page fault
>>>>>>>
>>>>>>> Mahendran Ganesh (2):
>>>>>>> arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT
>>>>>>> arm64/mm: add speculative page fault
>>>>>>>
>>>>>>> Peter Zijlstra (4):
>>>>>>> mm: prepare for FAULT_FLAG_SPECULATIVE
>>>>>>> mm: VMA sequence count
>>>>>>> mm: provide speculative fault infrastructure
>>>>>>> x86/mm: add speculative pagefault handling
>>>>>>>
>>>>>>> arch/arm64/Kconfig | 1 +
>>>>>>> arch/arm64/mm/fault.c | 12 +
>>>>>>> arch/powerpc/Kconfig | 1 +
>>>>>>> arch/powerpc/mm/fault.c | 16 +
>>>>>>> arch/x86/Kconfig | 1 +
>>>>>>> arch/x86/mm/fault.c | 27 +-
>>>>>>> fs/exec.c | 2 +-
>>>>>>> fs/proc/task_mmu.c | 5 +-
>>>>>>> fs/userfaultfd.c | 17 +-
>>>>>>> include/linux/hugetlb_inline.h | 2 +-
>>>>>>> include/linux/migrate.h | 4 +-
>>>>>>> include/linux/mm.h | 136 +++++++-
>>>>>>> include/linux/mm_types.h | 7 +
>>>>>>> include/linux/pagemap.h | 4 +-
>>>>>>> include/linux/rmap.h | 12 +-
>>>>>>> include/linux/swap.h | 10 +-
>>>>>>> include/linux/vm_event_item.h | 3 +
>>>>>>> include/trace/events/pagefault.h | 80 +++++
>>>>>>> include/uapi/linux/perf_event.h | 1 +
>>>>>>> kernel/fork.c | 5 +-
>>>>>>> mm/Kconfig | 22 ++
>>>>>>> mm/huge_memory.c | 6 +-
>>>>>>> mm/hugetlb.c | 2 +
>>>>>>> mm/init-mm.c | 3 +
>>>>>>> mm/internal.h | 20 ++
>>>>>>> mm/khugepaged.c | 5 +
>>>>>>> mm/madvise.c | 6 +-
>>>>>>> mm/memory.c | 612 +++++++++++++++++++++++++++++-----
>>>>>>> mm/mempolicy.c | 51 ++-
>>>>>>> mm/migrate.c | 6 +-
>>>>>>> mm/mlock.c | 13 +-
>>>>>>> mm/mmap.c | 229 ++++++++++---
>>>>>>> mm/mprotect.c | 4 +-
>>>>>>> mm/mremap.c | 13 +
>>>>>>> mm/nommu.c | 2 +-
>>>>>>> mm/rmap.c | 5 +-
>>>>>>> mm/swap.c | 6 +-
>>>>>>> mm/swap_state.c | 8 +-
>>>>>>> mm/vmstat.c | 5 +-
>>>>>>> tools/include/uapi/linux/perf_event.h | 1 +
>>>>>>> tools/perf/util/evsel.c | 1 +
>>>>>>> tools/perf/util/parse-events.c | 4 +
>>>>>>> tools/perf/util/parse-events.l | 1 +
>>>>>>> tools/perf/util/python.c | 1 +
>>>>>>> 44 files changed, 1161 insertions(+), 211 deletions(-)
>>>>>>> create mode 100644 include/trace/events/pagefault.h
>>>>>>>
>>>>>>> --
>>>>>>> 2.7.4
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
[-- Attachment #2: perf-profile_page_fault3_head_thp_always.gz --]
[-- Type: application/gzip, Size: 12909 bytes --]
[-- Attachment #3: perf-profile_page_fault3_head_thp_never.gz --]
[-- Type: application/gzip, Size: 12535 bytes --]
[-- Attachment #4: perf-profile_page_fault2_head_thp_never.gz --]
[-- Type: application/gzip, Size: 11782 bytes --]
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-03 7:06 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Benjamin Herrenschmidt, Christoph Hellwig, Will Deacon,
Anshuman Khandual, virtualization, linux-kernel, linuxppc-dev,
aik, robh, joe, elfring, david, jasowang, mpe, linuxram, haren,
paulus, srikar, robin.murphy, jean-philippe.brucker, marc.zyngier
In-Reply-To: <20180802235233-mutt-send-email-mst@kernel.org>
On Thu, Aug 02, 2018 at 11:53:08PM +0300, Michael S. Tsirkin wrote:
> > We don't need cache flushing tricks.
>
> You don't but do real devices on same platform need them?
IBMs power plaforms are always cache coherent. There are some powerpc
platforms have not cache coherent DMA, but I guess this scheme isn't
intended for them.
^ permalink raw reply
* Re: [RFC 0/4] Virtio uses DMA API for all devices
From: Christoph Hellwig @ 2018-08-03 7:05 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Michael S. Tsirkin, Christoph Hellwig, Will Deacon,
Anshuman Khandual, virtualization, linux-kernel, linuxppc-dev,
aik, robh, joe, elfring, david, jasowang, mpe, linuxram, haren,
paulus, srikar, robin.murphy, jean-philippe.brucker, marc.zyngier
In-Reply-To: <de4888b6457e220776e16a9c8958ff0886ffc66c.camel@kernel.crashing.org>
On Thu, Aug 02, 2018 at 04:13:09PM -0500, Benjamin Herrenschmidt wrote:
> So let's differenciate the two problems of having an IOMMU (real or
> emulated) which indeeds adds overhead etc... and using the DMA API.
>
> At the moment, virtio does this all over the place:
>
> if (use_dma_api)
> dma_map/alloc_something(...)
> else
> use_pa
>
> The idea of the patch set is to do two, somewhat orthogonal, changes
> that together achieve what we want. Let me know where you think there
> is "a bunch of issues" because I'm missing it:
>
> 1- Replace the above if/else constructs with just calling the DMA API,
> and have virtio, at initialization, hookup its own dma_ops that just
> "return pa" (roughly) when the IOMMU stuff isn't used.
>
> This adds an indirect function call to the path that previously didn't
> have one (the else case above). Is that a significant/measurable
> overhead ?
If you call it often enough it does:
https://www.spinics.net/lists/netdev/msg495413.html
> 2- Make virtio use the DMA API with our custom platform-provided
> swiotlb callbacks when needed, that is when not using IOMMU *and*
> running on a secure VM in our case.
And total NAK the customer platform-provided part of this. We need
a flag passed in from the hypervisor that the device needs all bus
specific dma api treatment, and then just use the normal plaform
dma mapping setup. To get swiotlb you'll need to then use the DT/ACPI
dma-range property to limit the addressable range, and a swiotlb
capable plaform will use swiotlb automatically.
^ permalink raw reply
* Re: [PATCH v4 5/6] powerpc: Add show_user_instructions()
From: Michael Ellerman @ 2018-08-03 8:44 UTC (permalink / raw)
To: Christophe LEROY, Murilo Opsfelder Araujo
Cc: linux-kernel, Alastair D'Silva, Andrew Donnellan,
Balbir Singh, Benjamin Herrenschmidt, Cyril Bur,
Eric W . Biederman, Joe Perches, Michael Neuling, Nicholas Piggin,
Paul Mackerras, Simon Guo, Sukadev Bhattiprolu, Tobin C . Harding,
linuxppc-dev, Segher Boessenkool
In-Reply-To: <69cf990b-d4aa-97e7-be3b-7936caa91688@c-s.fr>
Christophe LEROY <christophe.leroy@c-s.fr> writes:
> Le 03/08/2018 =C3=A0 02:42, Murilo Opsfelder Araujo a =C3=A9crit=C2=A0:
>> Hi, Christophe.
>> On Thu, Aug 02, 2018 at 07:26:20AM +0200, Christophe LEROY wrote:
>>> Le 01/08/2018 =C3=A0 23:33, Murilo Opsfelder Araujo a =C3=A9crit=C2=A0:
>>>> show_user_instructions() is a slightly modified version of
>>>> show_instructions() that allows userspace instruction dump.
>>>>
>>>> This will be useful within show_signal_msg() to dump userspace
>>>> instructions of the faulty location.
>>>>
>>>> Here is a sample of what show_user_instructions() outputs:
>>>>
>>>> pandafault[10850]: code: 4bfffeec 4bfffee8 3c401002 38427f00 fbe1f=
ff8 f821ffc1 7c3f0b78 3d22fffe
>>>> pandafault[10850]: code: 392988d0 f93f0020 e93f0020 39400048 <9949=
0000> 39200000 7d234b78 383f0040
>>>>
>>>> The current->comm and current->pid printed can serve as a glue that
>>>> links the instructions dump to its originator, allowing messages to be
>>>> interleaved in the logs.
>>>>
>>>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/proce=
ss.c
>>>> index e9533b4d2f08..364645ac732c 100644
>>>> --- a/arch/powerpc/kernel/process.c
>>>> +++ b/arch/powerpc/kernel/process.c
>>>> @@ -1299,6 +1299,46 @@ static void show_instructions(struct pt_regs *r=
egs)
>>>> pr_cont("\n");
>>>> }
>>>> +void show_user_instructions(struct pt_regs *regs)
>>>> +{
>>>> + int i;
>>>> + const char *prefix =3D KERN_INFO "%s[%d]: code: ";
>>>> + unsigned long pc =3D regs->nip - (instructions_to_print * 3 / 4 *
>>>> + sizeof(int));
>>>> +
>>>> + printk(prefix, current->comm, current->pid);
>>>
>>> Why not use pr_info() and remove KERN_INFO from *prefix ?
>>=20
>> Because it doesn't compile:
>>=20
>> arch/powerpc/kernel/process.c:1317:10: error: expected =E2=80=98)=E2=
=80=99 before =E2=80=98prefix=E2=80=99
>> pr_info(prefix, current->comm, current->pid);
>> ^
>> ./include/linux/printk.h:288:21: note: in definition of macro =E2=80=
=98pr_fmt=E2=80=99
>> #define pr_fmt(fmt) fmt
>> ^
>>=20
>> `pr_info(prefix, ...)` expands to `printk("\001" "6" prefix, ...)`,
>> which is an invalid string concatenation.
>>=20
>> `pr_info("%s", ...)` expands to `printk("\001" "6" "%s", ...)`, which is
>> valid.
>
> Then what about using directly:
>
> pr_info("%s[%d]: code: ", ...);
Yeah that's better, I'll fix it up when applying.
>>>> +#if !defined(CONFIG_BOOKE)
>>>> + /* If executing with the IMMU off, adjust pc rather
>>>> + * than print XXXXXXXX.
>>>> + */
>>>> + if (!(regs->msr & MSR_IR))
>>>> + pc =3D (unsigned long)phys_to_virt(pc);
>>>
>>> Shouldn't this be done outside of the loop, only once ?
>>=20
>> I don't think so.
>>=20
>> pc gets incremented at the bottom of the loop:
>>=20
>> pc +=3D sizeof(int);
>>=20
>> Adjusting pc is necessary at each iteration. Leaving this block inside
>> the loop seems correct.
>
> This looks pretty strange.
> The first time, pc is a physical address, that you change to a virtual=20
> address. Then when you increment it it is still a virtual address.
> So when you call phys_to_virt(pc) for the second time, pc is already a=20
> virt address, so what happens indeed ?
Yeah that's a bit fishy.
On 64-bit it works because phys_to_virt() =3D=3D __va() which is:
#define __va(x) ((void *)(unsigned long)((phys_addr_t)(x) | PAGE_OFFSET))
ie. it uses bitwise or, so __va(__va(x)) =3D=3D __va(x).
But it looks like on 32-bit it's going to do the wrong thing. Do we ever
actually hit that case though, I'm not sure?
However for this patch I'll just remove the whole thing, because we
don't expect to be dumping user instructions in realmode.
cheers
^ permalink raw reply
* Re: [PATCH v5 09/11] hugetlb: Introduce generic version of huge_ptep_set_wrprotect
From: Michael Ellerman @ 2018-08-03 8:51 UTC (permalink / raw)
To: Alex Ghiti, linux-mm, mike.kravetz, linux, catalin.marinas,
will.deacon, tony.luck, fenghua.yu, ralf, paul.burton, jhogan,
jejb, deller, benh, ysato, dalias, davem, tglx, mingo, hpa, x86,
arnd, linux-arm-kernel, linux-kernel, linux-ia64, linux-mips,
linux-parisc, linuxppc-dev, linux-sh, sparclinux, linux-arch,
aneesh.kumar@linux.ibm.com
In-Reply-To: <90bf556f-144d-24b8-d2f6-70fee4a30559@ghiti.fr>
Hi Alex,
Sorry missed your previous mail.
Alex Ghiti <alex@ghiti.fr> writes:
> Ok, I tried every defconfig available:
>
> - for the nohash/32, I found that I could use mpc885_ads_defconfig and I
> activated HUGETLBFS.
> I removed the definition of huge_ptep_set_wrprotect from
> nohash/32/pgtable.h, add an #error in
> include/asm-generic/hugetlb.h right before the generic definition of
> huge_ptep_set_wrprotect,
> and fell onto it at compile-time:
> => I'm pretty confident then that removing the definition of
> huge_ptep_set_wrprotect does not
> break anythingin this case.
Thanks, that sounds good.
> - regardind book3s/32, I did not find any defconfig with
> CONFIG_PPC_BOOK3S_32, CONFIG_PPC32
> allowing to enable huge page support (ie CONFIG_SYS_SUPPORTS_HUGETLBFS)
> => Do you have a defconfig that would allow me to try the same as above ?
I think you're right, it's dead code AFAICS.
We have:
config PPC_BOOK3S_64
...
select SYS_SUPPORTS_HUGETLBFS
config PPC_FSL_BOOK3E
...
select SYS_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
config PPC_8xx
...
select SYS_SUPPORTS_HUGETLBFS
So we can't ever enable HUGETLBFS for Book3S 32.
Presumably the code got copied when we split the headers apart.
So I think you can just ignore that one, and we'll delete it.
cheers
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox