LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Aleksa Sarai @ 2019-05-15 14:51 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-ia64, linux-sh, linux-kernel, dhowells, linux-kselftest,
	sparclinux, linux-api, elena.reshetova, Yann Droneaud, linux-s390,
	linux-arch, linux-xtensa, keescook, arnd, jannh, linux-m68k, viro,
	luto, oleg, tglx, linux-arm-kernel, linux-parisc, torvalds,
	linux-mips, luto, ebiederm, linux-alpha, akpm, linuxppc-dev
In-Reply-To: <20190515141634.lrc5ynllcmjr64mn@brauner.io>

[-- Attachment #1: Type: text/plain, Size: 805 bytes --]

On 2019-05-15, Christian Brauner <christian@brauner.io> wrote:
> On Wed, May 15, 2019 at 04:00:20PM +0200, Yann Droneaud wrote:
> > Would it be possible to create file descriptor with "restricted"
> > operation ?
> > 
> > - O_RDONLY: waiting for process completion allowed (for example)
> > - O_WRONLY: sending process signal allowed
> 
> Yes, something like this is likely going to be possible in the future.
> We had discussion around this. But mapping this to O_RDONLY and O_WRONLY
> is not the right model. It makes more sense to have specialized flags
> that restrict actions.

Not to mention that the O_* flags have silly values which we shouldn't
replicate in new syscalls IMHO.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH] powerpc/mm/book3s64: Implement STRICT_MODULE_RWX
From: Russell Currey @ 2019-05-16  0:45 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linuxppc-dev
In-Reply-To: <20190515064111.GA15778@infradead.org>

On Tue, 2019-05-14 at 23:41 -0700, Christoph Hellwig wrote:
> > + * This program is free software; you can redistribute it and/or
> > modify it
> > + * under the terms of the GNU General Public License as published
> > by the
> > + * Free Software Foundation; either version 2 of the License, or
> > (at your
> > + * option) any later version.
> 
> This license boilerplate should not be added together with an SPDX
> tag.
> 
> > +// we need this to have a single pointer to pass into
> > apply_to_page_range()
> 
> Please use normal /* - */ style comments.

I was under the impression they're allowed (in powerpc at least, if not
the wider kernel nowadays) but happy to defer on this.


^ permalink raw reply

* Re: [RFC PATCH] powerpc/mm: Implement STRICT_MODULE_RWX
From: Russell Currey @ 2019-05-16  0:51 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <df502ffe07caa38c46b0144fc824fff447f4105b.1557901092.git.christophe.leroy@c-s.fr>

On Wed, 2019-05-15 at 06:20 +0000, Christophe Leroy wrote:
> Strict module RWX is just like strict kernel RWX, but for modules -
> so
> loadable modules aren't marked both writable and executable at the
> same
> time.  This is handled by the generic code in kernel/module.c, and
> simply requires the architecture to implement the set_memory() set of
> functions, declared with ARCH_HAS_SET_MEMORY.
> 
> There's nothing other than these functions required to turn
> ARCH_HAS_STRICT_MODULE_RWX on, so turn that on too.
> 
> With STRICT_MODULE_RWX enabled, there are as many W+X pages at
> runtime
> as there are with CONFIG_MODULES=n (none), so in Russel's testing it
> works
> well on both Hash and Radix book3s64.
> 
> There's a TODO in the code for also applying the page permission
> changes
> to the backing pages in the linear mapping: this is pretty simple for
> Radix and (seemingly) a lot harder for Hash, so I've left it for now
> since there's still a notable security benefit for the patch as-is.
> 
> Technically can be enabled without STRICT_KERNEL_RWX, but
> that doesn't gets you a whole lot, so we should leave it off by
> default
> until we can get STRICT_KERNEL_RWX to the point where it's enabled by
> default.
> 
> Signed-off-by: Russell Currey <ruscur@russell.cc>
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---

Thanks for this, I figured you'd know how to make this work on 32bit
too.  I'll test on my end today.

Note that there are two Ls in my name!  To quote the great Rusty, "This
Russel disease must be stamped out before it becomes widespread".



^ permalink raw reply

* Re: [v4 PATCH 1/2] [PowerPC] Add simd.h implementation
From: Shawn Landden @ 2019-05-16  1:12 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: linuxppc-dev, Paul Mackerras
In-Reply-To: <568f62f6-7810-7179-f9aa-03f9df198fb6@c-s.fr>

On Wed, May 15, 2019 at 1:27 AM Christophe Leroy
<christophe.leroy@c-s.fr> wrote:
> Could you please as usual list here the changes provided by each version
> to ease the review ?
A bunch of embarrassing stuff that caused it not to build on some
set-ups (the functions were under the wrong include guards), and I
added include guards on simd.h so that you can use may_use_simd() even
if you don't have the FPU enabled (ARM's simd.h does this).

^ permalink raw reply

* [PATCH] powerpc/64s: Make boot look nice(r)
From: Nicholas Piggin @ 2019-05-16  2:04 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Radix boot looks like this:

 -----------------------------------------------------
 phys_mem_size     = 0x200000000
 dcache_bsize      = 0x80
 icache_bsize      = 0x80
 cpu_features      = 0x0000c06f8f5fb1a7
   possible        = 0x0000fbffcf5fb1a7
   always          = 0x00000003800081a1
 cpu_user_features = 0xdc0065c2 0xaee00000
 mmu_features      = 0xbc006041
 firmware_features = 0x0000000010000000
 hash-mmu: ppc64_pft_size    = 0x0
 hash-mmu: kernel vmalloc start   = 0xc008000000000000
 hash-mmu: kernel IO start        = 0xc00a000000000000
 hash-mmu: kernel vmemmap start   = 0xc00c000000000000
 -----------------------------------------------------

Fix:

 -----------------------------------------------------
 phys_mem_size     = 0x200000000
 dcache_bsize      = 0x80
 icache_bsize      = 0x80
 cpu_features      = 0x0000c06f8f5fb1a7
   possible        = 0x0000fbffcf5fb1a7
   always          = 0x00000003800081a1
 cpu_user_features = 0xdc0065c2 0xaee00000
 mmu_features      = 0xbc006041
 firmware_features = 0x0000000010000000
 vmalloc start     = 0xc008000000000000
 IO start          = 0xc00a000000000000
 vmemmap start     = 0xc00c000000000000
 -----------------------------------------------------

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kernel/setup-common.c    | 8 +++++++-
 arch/powerpc/mm/book3s64/hash_utils.c | 3 ---
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index aad9f5df6ab6..f2da8c809c85 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -810,9 +810,15 @@ static __init void print_system_info(void)
 	pr_info("mmu_features      = 0x%08x\n", cur_cpu_spec->mmu_features);
 #ifdef CONFIG_PPC64
 	pr_info("firmware_features = 0x%016lx\n", powerpc_firmware_features);
+#ifdef CONFIG_PPC_BOOK3S
+	pr_info("vmalloc start     = 0x%lx\n", KERN_VIRT_START);
+	pr_info("IO start          = 0x%lx\n", KERN_IO_START);
+	pr_info("vmemmap start     = 0x%lx\n", (unsigned long)vmemmap);
+#endif
 #endif
 
-	print_system_hash_info();
+	if (!early_radix_enabled())
+		print_system_hash_info();
 
 	if (PHYSICAL_START > 0)
 		pr_info("physical_start    = 0x%llx\n",
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 919a861a8ec0..8b307b796b83 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1954,7 +1954,4 @@ void __init print_system_hash_info(void)
 
 	if (htab_hash_mask)
 		pr_info("htab_hash_mask    = 0x%lx\n", htab_hash_mask);
-	pr_info("kernel vmalloc start   = 0x%lx\n", KERN_VIRT_START);
-	pr_info("kernel IO start        = 0x%lx\n", KERN_IO_START);
-	pr_info("kernel vmemmap start   = 0x%lx\n", (unsigned long)vmemmap);
 }
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH] crypto: vmx - fix copy-paste error in CTR mode
From: Daniel Axtens @ 2019-05-16  2:12 UTC (permalink / raw)
  To: Herbert Xu
  Cc: leo.barbosa, nayna, Stephan Mueller, Nayna, omosnacek,
	Eric Biggers, marcelo.cerri, pfsmorigo, linux-crypto, leitao,
	George Wilson, linuxppc-dev
In-Reply-To: <874l5w1axv.fsf@dja-thinkpad.axtens.net>

Daniel Axtens <dja@axtens.net> writes:

> Herbert Xu <herbert@gondor.apana.org.au> writes:
>
>> On Wed, May 15, 2019 at 03:35:51AM +1000, Daniel Axtens wrote:
>>>
>>> By all means disable vmx ctr if I don't get an answer to you in a
>>> timeframe you are comfortable with, but I am going to at least try to
>>> have a look.
>>
>> I'm happy to give you guys more time.  How much time do you think
>> you will need?
>>
> Give me till the end of the week: if I haven't solved it by then I will
> probably have to give up and go on to other things anyway.

So as you've hopefully seen, I've nailed it down and posted a patch.
(http://patchwork.ozlabs.org/patch/1099934/)

I'm also seeing issues with ghash with the extended tests:

[    7.582926] alg: hash: p8_ghash test failed (wrong result) on test vector 0, cfg="random: use_final src_divs=[<reimport>9.72%@+39832, <reimport>18.2%@+65504, <reimport,nosimd>45.57%@alignmask+18, <reimport,nosimd>15.6%@+65496, 6.83%@+65514, <reimport,nosimd>1.2%@+25, <reim"

It seems to happen when one of the source divisions has nosimd and the
final result uses the simd finaliser, so that's interesting.

Regards,
Daniel

>
> (FWIW, it seems to happen when encoding greater than 4 but less than 8
> AES blocks - in particular with both 7 and 5 blocks encoded I can see it
> go wrong from block 4 onwards. No idea why yet, and the asm is pretty
> dense, but that's where I'm at at the moment.)
>
> Regards,
> Daniel
>
>> Thanks,
>> -- 
>> Email: Herbert Xu <herbert@gondor.apana.org.au>
>> Home Page: http://gondor.apana.org.au/~herbert/
>> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* Re: [PATCH] crypto: talitos - fix skcipher failure due to wrong output IV
From: Eric Biggers @ 2019-05-16  2:30 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Herbert Xu, Horia Geanta, linux-kernel@vger.kernel.org,
	linux-crypto@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	David S. Miller
In-Reply-To: <29db3f20-f931-efc6-02a8-fe160ab8b484@c-s.fr>

On Wed, May 15, 2019 at 08:49:48PM +0200, Christophe Leroy wrote:
> 
> 
> Le 15/05/2019 à 16:05, Horia Geanta a écrit :
> > On 5/15/2019 3:29 PM, Christophe Leroy wrote:
> > > Selftests report the following:
> > > 
> > > [    2.984845] alg: skcipher: cbc-aes-talitos encryption test failed (wrong output IV) on test vector 0, cfg="in-place"
> > > [    2.995377] 00000000: 3d af ba 42 9d 9e b4 30 b4 22 da 80 2c 9f ac 41
> > > [    3.032673] alg: skcipher: cbc-des-talitos encryption test failed (wrong output IV) on test vector 0, cfg="in-place"
> > > [    3.043185] 00000000: fe dc ba 98 76 54 32 10
> > > [    3.063238] alg: skcipher: cbc-3des-talitos encryption test failed (wrong output IV) on test vector 0, cfg="in-place"
> > > [    3.073818] 00000000: 7d 33 88 93 0f 93 b2 42
> > > 
> > > This above dumps show that the actual output IV is indeed the input IV.
> > > This is due to the IV not being copied back into the request.
> > > 
> > > This patch fixes that.
> > > 
> > > Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> > Reviewed-by: Horia Geantă <horia.geanta@nxp.com>
> 
> It's missing a Fixes: tag and a Cc: to stable.
> 
> I'll resend tomorrow.
> 
> > 
> > While here, could you please check ecb mode (which by definition does not have
> > an IV) is behaving correctly?
> > Looking in driver_algs[] list of crypto algorithms supported by talitos,
> > ecb(aes,des,3des) are declared with ivsize != 0.
> 
> According to /proc/crypto, test are passed for ecb.
> 

Did you try enabling CONFIG_CRYPTO_MANAGER_EXTRA_TESTS?  There is now a check
that the driver's ivsize matches the generic implementation's:

        if (ivsize != crypto_skcipher_ivsize(generic_tfm)) {
                pr_err("alg: skcipher: ivsize for %s (%u) doesn't match generic impl (%u)\n",
                       driver, ivsize, crypto_skcipher_ivsize(generic_tfm));
                err = -EINVAL;
                goto out;
        }

For ECB that means the ivsize must be 0.

AFAICS the talitos driver even accesses the IV for ECB, which is wrong; and the
only reason this isn't crashing the self-tests already is that they are confused
by the declared ivsize being nonzero so they don't pass NULL as they should.

- Eric

^ permalink raw reply

* [PATCH 1/3] powerpc/pseries: Simplify cpu readd to use drc_index
From: Tyrel Datwyler @ 2019-05-16  2:37 UTC (permalink / raw)
  To: mpe; +Cc: nathanl, Tyrel Datwyler, Tyrel Datwyler, mingming.cao,
	linuxppc-dev

The current dlpar_cpu_readd() takes in a cpu_id and uses that to look up
the cpus device_node so that we can get at the ibm,my-drc-index
property. The only user of cpu readd is an OF notifier call back. This
call back already has a reference to the device_node and therefore can
retrieve the drc_index from the device_node.

This patch simplifies dlpar_cpu_readd() to take a drc_index directly and
does away with an uneccsary device_node lookup.

Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/topology.h          |  2 +-
 arch/powerpc/mm/numa.c                       |  6 +++---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 10 +---------
 3 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f85e2b01c3df..c906d9ec9013 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -133,7 +133,7 @@ static inline void shared_proc_topology_init(void) {}
 #define topology_core_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu)		(cpu_to_core_id(cpu))
 
-int dlpar_cpu_readd(int cpu);
+int dlpar_cpu_readd(u32 drc_index);
 #endif
 #endif
 
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 57e64273cb33..40c0b6da12c2 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1479,9 +1479,9 @@ static int dt_update_callback(struct notifier_block *nb,
 	case OF_RECONFIG_UPDATE_PROPERTY:
 		if (of_node_is_type(update->dn, "cpu") &&
 		    !of_prop_cmp(update->prop->name, "ibm,associativity")) {
-			u32 core_id;
-			of_property_read_u32(update->dn, "reg", &core_id);
-			rc = dlpar_cpu_readd(core_id);
+			u32 drc_index;
+			of_property_read_u32(update->dn, "ibm,my-drc-index", &drc_index);
+			rc = dlpar_cpu_readd(drc_index);
 			rc = NOTIFY_OK;
 		}
 		break;
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 97feb6e79f1a..2dfa9416ce54 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -802,18 +802,10 @@ static int dlpar_cpu_add_by_count(u32 cpus_to_add)
 	return rc;
 }
 
-int dlpar_cpu_readd(int cpu)
+int dlpar_cpu_readd(u32 drc_index)
 {
-	struct device_node *dn;
-	struct device *dev;
-	u32 drc_index;
 	int rc;
 
-	dev = get_cpu_device(cpu);
-	dn = dev->of_node;
-
-	rc = of_property_read_u32(dn, "ibm,my-drc-index", &drc_index);
-
 	rc = dlpar_cpu_remove_by_index(drc_index);
 	if (!rc)
 		rc = dlpar_cpu_add(drc_index);
-- 
2.18.1


^ permalink raw reply related

* [PATCH 3/3] powerpc/pseries: Don't update cpu topology after PRRN event
From: Tyrel Datwyler @ 2019-05-16  2:37 UTC (permalink / raw)
  To: mpe; +Cc: nathanl, Tyrel Datwyler, Tyrel Datwyler, mingming.cao,
	linuxppc-dev
In-Reply-To: <20190516023706.50118-1-tyreld@linux.ibm.com>

When we receive a PRRN event through the event-scan interface we call
pseries_devicetree_update() to update the affinty properties in our
device tree via RTAS. Following this our implemenation attempts to both
frob the existing kernel cpu numa affinities of the live system with the
new device tree properties while also performing a full cpu hotplug
readd of the affected cpus in resposne to the a OF property notifier
triggerd by the device tree update.

This patch does away with the topology update call to frob the
associativity since the DLPAR readd will put the cpu in the proper
numa node when its added back.

Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/rtasd.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
index 8a1746d755c9..d3aa3a056d8e 100644
--- a/arch/powerpc/kernel/rtasd.c
+++ b/arch/powerpc/kernel/rtasd.c
@@ -285,7 +285,6 @@ static void handle_prrn_event(s32 scope)
 	 * the RTAS event.
 	 */
 	pseries_devicetree_update(-scope);
-	numa_update_cpu_topology(false);
 }

 static void handle_rtas_event(const struct rtas_error_log *log)
-- 
2.18.1

^ permalink raw reply related

* [PATCH 2/3] powerpc/pseries: Disable PRRN memory device tree trigger
From: Tyrel Datwyler @ 2019-05-16  2:37 UTC (permalink / raw)
  To: mpe; +Cc: nathanl, Tyrel Datwyler, Tyrel Datwyler, mingming.cao,
	linuxppc-dev
In-Reply-To: <20190516023706.50118-1-tyreld@linux.ibm.com>

Memory affintiy updates as currently implemented have proved unstable.

This patch comments out the PRRN hook for the time being while we
investigate how to either stablize the current implementation or find a
better approach.

Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/mobility.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index 88925f8ca8a0..660a2dbc43d7 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -242,13 +242,15 @@ static int add_dt_node(__be32 parent_phandle, __be32 drc_index)
 
 static void prrn_update_node(__be32 phandle)
 {
+	/* PRRN Memory Updates have proved unstable. Disable for the time being.
+	 *
 	struct pseries_hp_errorlog hp_elog;
 	struct device_node *dn;
 
-	/*
+	 *
 	 * If a node is found from a the given phandle, the phandle does not
 	 * represent the drc index of an LMB and we can ignore.
-	 */
+	 *
 	dn = of_find_node_by_phandle(be32_to_cpu(phandle));
 	if (dn) {
 		of_node_put(dn);
@@ -261,6 +263,7 @@ static void prrn_update_node(__be32 phandle)
 	hp_elog._drc_u.drc_index = phandle;
 
 	handle_dlpar_errorlog(&hp_elog);
+	*/
 }
 
 int pseries_devicetree_update(s32 scope)
-- 
2.18.1


^ permalink raw reply related

* Re: [PATCH] crypto: vmx - fix copy-paste error in CTR mode
From: Eric Biggers @ 2019-05-16  2:56 UTC (permalink / raw)
  To: Daniel Axtens
  Cc: leo.barbosa, nayna, Herbert Xu, Stephan Mueller, Nayna, omosnacek,
	marcelo.cerri, pfsmorigo, linux-crypto, leitao, George Wilson,
	linuxppc-dev
In-Reply-To: <871s0z171b.fsf@dja-thinkpad.axtens.net>

On Thu, May 16, 2019 at 12:12:48PM +1000, Daniel Axtens wrote:
> 
> I'm also seeing issues with ghash with the extended tests:
> 
> [    7.582926] alg: hash: p8_ghash test failed (wrong result) on test vector 0, cfg="random: use_final src_divs=[<reimport>9.72%@+39832, <reimport>18.2%@+65504, <reimport,nosimd>45.57%@alignmask+18, <reimport,nosimd>15.6%@+65496, 6.83%@+65514, <reimport,nosimd>1.2%@+25, <reim"
> 
> It seems to happen when one of the source divisions has nosimd and the
> final result uses the simd finaliser, so that's interesting.
> 

The bug is that p8_ghash uses different shash_descs for the SIMD and no-SIMD
cases.  So if you start out doing the hash in SIMD context but then switch to
no-SIMD context or vice versa, the digest will be wrong.  Note that there can be
an ->export() and ->import() in between, so it's not quite as obscure a case as
one might think.

To fix it I think you'll need to make p8_ghash use 'struct ghash_desc_ctx' just
like ghash-generic so that the two code paths can share the same shash_desc.
That's similar to what the various SHA hash algorithms do.

- Eric

^ permalink raw reply

* Re: [PATCH v10 2/2] powerpc/64s: KVM update for reimplement book3s idle code in C
From: Nicholas Piggin @ 2019-05-16  4:43 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Gautham R . Shenoy, linuxppc-dev, kvm-ppc
In-Reply-To: <20190513064207.GA11679@blackberry>

Paul Mackerras's on May 13, 2019 4:42 pm:
> On Sun, Apr 28, 2019 at 09:45:15PM +1000, Nicholas Piggin wrote:
>> This is the KVM update to the new idle code. A few improvements:
>> 
>> - Idle sleepers now always return to caller rather than branch out
>>   to KVM first.
>> - This allows optimisations like very fast return to caller when no
>>   state has been lost.
>> - KVM no longer requires nap_state_lost because it controls NVGPR
>>   save/restore itself on the way in and out.
>> - The heavy idle wakeup KVM request check can be moved out of the
>>   normal host idle code and into the not-performance-critical offline
>>   code.
>> - KVM nap code now returns from where it is called, which makes the
>>   flow a bit easier to follow.
> 
> One question below...
> 
>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> index 58d0f1ba845d..f66191d8f841 100644
>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> ...
>> @@ -2656,6 +2662,9 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
>>  
>>  	lis	r3, LPCR_PECEDP@h	/* Do wake on privileged doorbell */
>>  
>> +	/* Go back to host stack */
>> +	ld	r1, HSTATE_HOST_R1(r13)
> 
> At this point we are in kvmppc_h_cede, which we branched to from
> hcall_try_real_mode, which came from the guest exit path, where we
> have already loaded r1 from HSTATE_HOST_R1(r13).  So if there is a
> path to get here with r1 not already set to HSTATE_HOST_R1(r13), then
> I missed it - please point it out to me.  Otherwise this statement
> seems superfluous.

I'm not sure why I put that there. I think you're right it could
be removed.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH] powerpc/64s: Make boot look nice(r)
From: Christophe Leroy @ 2019-05-16  4:47 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
In-Reply-To: <20190516020437.11783-1-npiggin@gmail.com>



Le 16/05/2019 à 04:04, Nicholas Piggin a écrit :
> Radix boot looks like this:
> 
>   -----------------------------------------------------
>   phys_mem_size     = 0x200000000
>   dcache_bsize      = 0x80
>   icache_bsize      = 0x80
>   cpu_features      = 0x0000c06f8f5fb1a7
>     possible        = 0x0000fbffcf5fb1a7
>     always          = 0x00000003800081a1
>   cpu_user_features = 0xdc0065c2 0xaee00000
>   mmu_features      = 0xbc006041
>   firmware_features = 0x0000000010000000
>   hash-mmu: ppc64_pft_size    = 0x0
>   hash-mmu: kernel vmalloc start   = 0xc008000000000000
>   hash-mmu: kernel IO start        = 0xc00a000000000000
>   hash-mmu: kernel vmemmap start   = 0xc00c000000000000
>   -----------------------------------------------------
> 
> Fix:
> 
>   -----------------------------------------------------
>   phys_mem_size     = 0x200000000
>   dcache_bsize      = 0x80
>   icache_bsize      = 0x80
>   cpu_features      = 0x0000c06f8f5fb1a7
>     possible        = 0x0000fbffcf5fb1a7
>     always          = 0x00000003800081a1
>   cpu_user_features = 0xdc0065c2 0xaee00000
>   mmu_features      = 0xbc006041
>   firmware_features = 0x0000000010000000
>   vmalloc start     = 0xc008000000000000
>   IO start          = 0xc00a000000000000
>   vmemmap start     = 0xc00c000000000000
>   -----------------------------------------------------
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

I fear your change defeats most of the purpose of commit 
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20190515&id=e4dccf9092ab48a6f902003b9558c0e45d0e849a

As far as I understand, the main issue is the "hash-mmu:" prefix ?
That's due to the following define in top of book3s64/hash_utils.c:

#define pr_fmt(fmt) "hash-mmu: " fmt

Could we simply undef it just before print_system_hash_info() ?
Or move print_system_hash_info() in another book3s64 specific file which 
doesn't set pr_fmt ?

Christophe

> ---
>   arch/powerpc/kernel/setup-common.c    | 8 +++++++-
>   arch/powerpc/mm/book3s64/hash_utils.c | 3 ---
>   2 files changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
> index aad9f5df6ab6..f2da8c809c85 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -810,9 +810,15 @@ static __init void print_system_info(void)
>   	pr_info("mmu_features      = 0x%08x\n", cur_cpu_spec->mmu_features);
>   #ifdef CONFIG_PPC64
>   	pr_info("firmware_features = 0x%016lx\n", powerpc_firmware_features);
> +#ifdef CONFIG_PPC_BOOK3S
> +	pr_info("vmalloc start     = 0x%lx\n", KERN_VIRT_START);
> +	pr_info("IO start          = 0x%lx\n", KERN_IO_START);
> +	pr_info("vmemmap start     = 0x%lx\n", (unsigned long)vmemmap);
> +#endif
>   #endif
>   
> -	print_system_hash_info();
> +	if (!early_radix_enabled())
> +		print_system_hash_info();
>   
>   	if (PHYSICAL_START > 0)
>   		pr_info("physical_start    = 0x%llx\n",
> diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
> index 919a861a8ec0..8b307b796b83 100644
> --- a/arch/powerpc/mm/book3s64/hash_utils.c
> +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> @@ -1954,7 +1954,4 @@ void __init print_system_hash_info(void)
>   
>   	if (htab_hash_mask)
>   		pr_info("htab_hash_mask    = 0x%lx\n", htab_hash_mask);
> -	pr_info("kernel vmalloc start   = 0x%lx\n", KERN_VIRT_START);
> -	pr_info("kernel IO start        = 0x%lx\n", KERN_IO_START);
> -	pr_info("kernel vmemmap start   = 0x%lx\n", (unsigned long)vmemmap);
>   }
> 

^ permalink raw reply

* Re: [PATCH 0/1] Forced-wakeup for stop lite states on Powernv
From: Nicholas Piggin @ 2019-05-16  4:55 UTC (permalink / raw)
  To: Abhishek, linux-kernel, linux-pm, linuxppc-dev
  Cc: ego, daniel.lezcano, rjw, dja
In-Reply-To: <b2fcf69a-aecd-ea81-b497-737642354736@linux.vnet.ibm.com>

Abhishek's on May 13, 2019 7:49 pm:
> On 05/08/2019 10:29 AM, Nicholas Piggin wrote:
>> Abhishek Goel's on April 22, 2019 4:32 pm:
>>> Currently, the cpuidle governors determine what idle state a idling CPU
>>> should enter into based on heuristics that depend on the idle history on
>>> that CPU. Given that no predictive heuristic is perfect, there are cases
>>> where the governor predicts a shallow idle state, hoping that the CPU will
>>> be busy soon. However, if no new workload is scheduled on that CPU in the
>>> near future, the CPU will end up in the shallow state.
>>>
>>> Motivation
>>> ----------
>>> In case of POWER, this is problematic, when the predicted state in the
>>> aforementioned scenario is a lite stop state, as such lite states will
>>> inhibit SMT folding, thereby depriving the other threads in the core from
>>> using the core resources.
>>>
>>> So we do not want to get stucked in such states for longer duration. To
>>> address this, the cpuidle-core can queue timer to correspond with the
>>> residency value of the next available state. This timer will forcefully
>>> wakeup the cpu. Few such iterations will essentially train the governor to
>>> select a deeper state for that cpu, as the timer here corresponds to the
>>> next available cpuidle state residency. Cpu will be kicked out of the lite
>>> state and end up in a non-lite state.
>>>
>>> Experiment
>>> ----------
>>> I performed experiments for three scenarios to collect some data.
>>>
>>> case 1 :
>>> Without this patch and without tick retained, i.e. in a upstream kernel,
>>> It would spend more than even a second to get out of stop0_lite.
>>>
>>> case 2 : With tick retained in a upstream kernel -
>>>
>>> Generally, we have a sched tick at 4ms(CONF_HZ = 250). Ideally I expected
>>> it to take 8 sched tick to get out of stop0_lite. Experimentally,
>>> observation was
>>>
>>> =========================================================
>>> sample          min            max           99percentile
>>> 20              4ms            12ms          4ms
>>> =========================================================
>>>
>>> It would take atleast one sched tick to get out of stop0_lite.
>>>
>>> case 2 :  With this patch (not stopping tick, but explicitly queuing a
>>>            timer)
>>>
>>> ============================================================
>>> sample          min             max             99percentile
>>> ============================================================
>>> 20              144us           192us           144us
>>> ============================================================
>>>
>>> In this patch, we queue a timer just before entering into a stop0_lite
>>> state. The timer fires at (residency of next available state + exit latency
>>> of next available state * 2). Let's say if next state(stop0) is available
>>> which has residency of 20us, it should get out in as low as (20+2*2)*8
>>> [Based on the forumla (residency + 2xlatency)*history length] microseconds
>>> = 192us. Ideally we would expect 8 iterations, it was observed to get out
>>> in 6-7 iterations. Even if let's say stop2 is next available state(stop0
>>> and stop1 both are unavailable), it would take (100+2*10)*8 = 960us to get
>>> into stop2.
>>>
>>> So, We are able to get out of stop0_lite generally in 150us(with this
>>> patch) as compared to 4ms(with tick retained). As stated earlier, we do not
>>> want to get stuck into stop0_lite as it inhibits SMT folding for other
>>> sibling threads, depriving them of core resources. Current patch is using
>>> forced-wakeup only for stop0_lite, as it gives performance benefit(primary
>>> reason) along with lowering down power consumption. We may extend this
>>> model for other states in future.
>> I still have to wonder, between our snooze loop and stop0, what does
>> stop0_lite buy us.
>>
>> That said, the problem you're solving here is a generic one that all
>> stop states have, I think. Doesn't the same thing apply going from
>> stop0 to stop5? You might under estimate the sleep time and lose power
>> savings and therefore performance there too. Shouldn't we make it
>> generic for all stop states?
>>
>> Thanks,
>> Nick
>>
>>
> When a cpu is in snooze, it takes both space and time of core. When in 
> stop0_lite,
> it free up time but it still takes space.

True, but snooze should only be taking less than 1% of front end
cycles. I appreciate there is some non-zero difference here, I just
wonder in practice what exactly we gain by it.

We should always have fewer states unless proven otherwise.

That said, we enable it today so I don't want to argue this point
here, because it is a different issue from your patch.

> When it is in stop0 or deeper, 
> it free up both
> space and time slice of core.
> In stop0_lite, cpu doesn't free up the core resources and thus inhibits 
> thread
> folding. When a cpu goes to stop0, it will free up the core resources 
> thus increasing
> the single thread performance of other sibling thread.
> Hence, we do not want to get stuck in stop0_lite for long duration, and 
> want to quickly
> move onto the next state.
> If we get stuck in any other state we would possibly be losing on to 
> power saving,
> but will still be able to gain the performance benefits for other 
> sibling threads.

That's true, but stop0 -> deeper stop is also a benefit (for
performance if we have some power/thermal constraints, and/or for power
usage).

Sure it may not be so noticable as the SMT switch, but I just wonder
if the infrastructure should be there for the same reason.

I was testing interrupt frequency on some tickless workloads configs,
and without too much trouble you can get CPUs to sleep with no
interrupts for many minutes. Hours even. We wouldn't want the CPU to
stay in stop0 for that long.

Just thinking about the patch itself, I wonder do you need a full
kernel timer, or could we just set the decrementer? Is there much 
performance cost here?

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH] powerpc/book3s/mm: Clear MMU_FTR_HPTE_TABLE when radix is enabled.
From: Nicholas Piggin @ 2019-05-16  5:04 UTC (permalink / raw)
  To: Aneesh Kumar K.V, mpe, paulus; +Cc: linuxppc-dev
In-Reply-To: <20190514060205.20887-1-aneesh.kumar@linux.ibm.com>

Aneesh Kumar K.V's on May 14, 2019 4:02 pm:
> Avoids confusion when printing Oops message like below
> 
>  Faulting instruction address: 0xc00000000008bdb4
>  Oops: Kernel access of bad area, sig: 11 [#1]
>  LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
> 
> Either ibm,pa-features or ibm,powerpc-cpu-features can be used to enable the
> MMU features. We don't clear related MMU feature bits there. We use the kernel
> commandline to determine what translation mode we want to use and clear the
> HPTE or radix bit accordingly. On LPAR we do have to renable HASH bit if the
> hypervisor can't do radix.

Well we have the HPTE feature: the CPU supports hash MMU mode. It's
just the the kernel is booted in radix mode.

Could make a difference for KVM, if it will support an HPT guest or
not.

That's all highly theoretical and we have other inconsistencies
already in this stuff, I'd just like to try make things a bit better
in the long term.

Can we just add an early_radix_enabled() in the oops printing code
to select radix or hash MMU?

> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index b97aee03924f..0fa6cac3fe82 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -77,9 +77,6 @@ static struct page *maybe_pte_to_page(pte_t pte)
>  
>  static pte_t set_pte_filter_hash(pte_t pte)
>  {
> -	if (radix_enabled())
> -		return pte;
> -
>  	pte = __pte(pte_val(pte) & ~_PAGE_HPTEFLAGS);
>  	if (pte_looks_normal(pte) && !(cpu_has_feature(CPU_FTR_COHERENT_ICACHE) ||
>  				       cpu_has_feature(CPU_FTR_NOEXECUTE))) {
> @@ -110,6 +107,8 @@ static pte_t set_pte_filter(pte_t pte)
>  
>  	if (mmu_has_feature(MMU_FTR_HPTE_TABLE))
>  		return set_pte_filter_hash(pte);
> +	else if (radix_enabled())
> +		return pte;
>  
>  	/* No exec permission in the first place, move on */
>  	if (!pte_exec(pte) || !pte_looks_normal(pte))
> @@ -140,7 +139,7 @@ static pte_t set_access_flags_filter(pte_t pte, struct vm_area_struct *vma,
>  {
>  	struct page *pg;
>  
> -	if (mmu_has_feature(MMU_FTR_HPTE_TABLE))
> +	if (mmu_has_feature(MMU_FTR_HPTE_TABLE) || radix_enabled())
>  		return pte;
>  
>  	/* So here, we only care about exec faults, as we use them

These would still be good cleanup to make the HPTE_TABLE feature
independent from radix.

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH] crypto: vmx - fix copy-paste error in CTR mode
From: Daniel Axtens @ 2019-05-16  5:28 UTC (permalink / raw)
  To: Eric Biggers
  Cc: leo.barbosa, nayna, Herbert Xu, Stephan Mueller, Nayna, omosnacek,
	marcelo.cerri, pfsmorigo, linux-crypto, leitao, George Wilson,
	linuxppc-dev
In-Reply-To: <20190516025603.GB23200@sol.localdomain>

Eric Biggers <ebiggers@kernel.org> writes:

> On Thu, May 16, 2019 at 12:12:48PM +1000, Daniel Axtens wrote:
>> 
>> I'm also seeing issues with ghash with the extended tests:
>> 
>> [    7.582926] alg: hash: p8_ghash test failed (wrong result) on test vector 0, cfg="random: use_final src_divs=[<reimport>9.72%@+39832, <reimport>18.2%@+65504, <reimport,nosimd>45.57%@alignmask+18, <reimport,nosimd>15.6%@+65496, 6.83%@+65514, <reimport,nosimd>1.2%@+25, <reim"
>> 
>> It seems to happen when one of the source divisions has nosimd and the
>> final result uses the simd finaliser, so that's interesting.
>> 
>
> The bug is that p8_ghash uses different shash_descs for the SIMD and no-SIMD
> cases.  So if you start out doing the hash in SIMD context but then switch to
> no-SIMD context or vice versa, the digest will be wrong.  Note that there can be
> an ->export() and ->import() in between, so it's not quite as obscure a case as
> one might think.

Ah cool, I was just in the process of figuring this out for myself -
always lovely to have my theory confirmed!

> To fix it I think you'll need to make p8_ghash use 'struct ghash_desc_ctx' just
> like ghash-generic so that the two code paths can share the same shash_desc.
> That's similar to what the various SHA hash algorithms do.

This is very helpful, thank you. I guess I will do that then.

Regards,
Daniel

>
> - Eric

^ permalink raw reply

* Re: [PATCH 0/1] Forced-wakeup for stop lite states on Powernv
From: Gautham R Shenoy @ 2019-05-16  5:36 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: ego, linux-pm, daniel.lezcano, rjw, linux-kernel, Abhishek,
	linuxppc-dev, dja
In-Reply-To: <1557981860.eltms77ctp.astroid@bobo.none>

Hello Nicholas,


On Thu, May 16, 2019 at 02:55:42PM +1000, Nicholas Piggin wrote:
> Abhishek's on May 13, 2019 7:49 pm:
> > On 05/08/2019 10:29 AM, Nicholas Piggin wrote:
> >> Abhishek Goel's on April 22, 2019 4:32 pm:
> >>> Currently, the cpuidle governors determine what idle state a idling CPU
> >>> should enter into based on heuristics that depend on the idle history on
> >>> that CPU. Given that no predictive heuristic is perfect, there are cases
> >>> where the governor predicts a shallow idle state, hoping that the CPU will
> >>> be busy soon. However, if no new workload is scheduled on that CPU in the
> >>> near future, the CPU will end up in the shallow state.
> >>>
> >>> Motivation
> >>> ----------
> >>> In case of POWER, this is problematic, when the predicted state in the
> >>> aforementioned scenario is a lite stop state, as such lite states will
> >>> inhibit SMT folding, thereby depriving the other threads in the core from
> >>> using the core resources.
> >>>
> >>> So we do not want to get stucked in such states for longer duration. To
> >>> address this, the cpuidle-core can queue timer to correspond with the
> >>> residency value of the next available state. This timer will forcefully
> >>> wakeup the cpu. Few such iterations will essentially train the governor to
> >>> select a deeper state for that cpu, as the timer here corresponds to the
> >>> next available cpuidle state residency. Cpu will be kicked out of the lite
> >>> state and end up in a non-lite state.
> >>>
> >>> Experiment
> >>> ----------
> >>> I performed experiments for three scenarios to collect some data.
> >>>
> >>> case 1 :
> >>> Without this patch and without tick retained, i.e. in a upstream kernel,
> >>> It would spend more than even a second to get out of stop0_lite.
> >>>
> >>> case 2 : With tick retained in a upstream kernel -
> >>>
> >>> Generally, we have a sched tick at 4ms(CONF_HZ = 250). Ideally I expected
> >>> it to take 8 sched tick to get out of stop0_lite. Experimentally,
> >>> observation was
> >>>
> >>> =========================================================
> >>> sample          min            max           99percentile
> >>> 20              4ms            12ms          4ms
> >>> =========================================================
> >>>
> >>> It would take atleast one sched tick to get out of stop0_lite.
> >>>
> >>> case 2 :  With this patch (not stopping tick, but explicitly queuing a
> >>>            timer)
> >>>
> >>> ============================================================
> >>> sample          min             max             99percentile
> >>> ============================================================
> >>> 20              144us           192us           144us
> >>> ============================================================
> >>>
> >>> In this patch, we queue a timer just before entering into a stop0_lite
> >>> state. The timer fires at (residency of next available state + exit latency
> >>> of next available state * 2). Let's say if next state(stop0) is available
> >>> which has residency of 20us, it should get out in as low as (20+2*2)*8
> >>> [Based on the forumla (residency + 2xlatency)*history length] microseconds
> >>> = 192us. Ideally we would expect 8 iterations, it was observed to get out
> >>> in 6-7 iterations. Even if let's say stop2 is next available state(stop0
> >>> and stop1 both are unavailable), it would take (100+2*10)*8 = 960us to get
> >>> into stop2.
> >>>
> >>> So, We are able to get out of stop0_lite generally in 150us(with this
> >>> patch) as compared to 4ms(with tick retained). As stated earlier, we do not
> >>> want to get stuck into stop0_lite as it inhibits SMT folding for other
> >>> sibling threads, depriving them of core resources. Current patch is using
> >>> forced-wakeup only for stop0_lite, as it gives performance benefit(primary
> >>> reason) along with lowering down power consumption. We may extend this
> >>> model for other states in future.
> >> I still have to wonder, between our snooze loop and stop0, what does
> >> stop0_lite buy us.
> >>
> >> That said, the problem you're solving here is a generic one that all
> >> stop states have, I think. Doesn't the same thing apply going from
> >> stop0 to stop5? You might under estimate the sleep time and lose power
> >> savings and therefore performance there too. Shouldn't we make it
> >> generic for all stop states?
> >>
> >> Thanks,
> >> Nick
> >>
> >>
> > When a cpu is in snooze, it takes both space and time of core. When in 
> > stop0_lite,
> > it free up time but it still takes space.
> 
> True, but snooze should only be taking less than 1% of front end
> cycles. I appreciate there is some non-zero difference here, I just
> wonder in practice what exactly we gain by it.

The idea behind implementing a lite-state was that on the future
platforms it can be made to wait on a flag and hence act as a
replacement for snooze. On POWER9 we don't have this feature.

The motivation behind this patch was a HPC customer issue where they
were observing some CPUs in the core getting stuck at stop0_lite
state, thereby lowering the performance on the other CPUs of the core
which were running the application.

Disabling stop0_lite via sysfs didn't help since we would fallback to
snooze and it would make matters worse.

> 
> We should always have fewer states unless proven otherwise.

I agree.

> 
> That said, we enable it today so I don't want to argue this point
> here, because it is a different issue from your patch.
> 
> > When it is in stop0 or deeper, 
> > it free up both
> > space and time slice of core.
> > In stop0_lite, cpu doesn't free up the core resources and thus inhibits 
> > thread
> > folding. When a cpu goes to stop0, it will free up the core resources 
> > thus increasing
> > the single thread performance of other sibling thread.
> > Hence, we do not want to get stuck in stop0_lite for long duration, and 
> > want to quickly
> > move onto the next state.
> > If we get stuck in any other state we would possibly be losing on to 
> > power saving,
> > but will still be able to gain the performance benefits for other 
> > sibling threads.
> 
> That's true, but stop0 -> deeper stop is also a benefit (for
> performance if we have some power/thermal constraints, and/or for power
> usage).
> 
> Sure it may not be so noticable as the SMT switch, but I just wonder
> if the infrastructure should be there for the same reason.
> 
> I was testing interrupt frequency on some tickless workloads configs,
> and without too much trouble you can get CPUs to sleep with no
> interrupts for many minutes. Hours even. We wouldn't want the CPU to
> stay in stop0 for that long.

If it stays in stop0 or even stop2 for that long, we would want to
"promote" it to a deeper state, such as say STOP5 which allows the
other cores to run at higher frequencies.

> 
> Just thinking about the patch itself, I wonder do you need a full
> kernel timer, or could we just set the decrementer? Is there much 
> performance cost here?
>

Good point. A decrementer would do actually.

> Thanks,
> Nick

--
Thanks and Regards
gautham.


^ permalink raw reply

* Re: [RFC PATCH] powerpc/64/ftrace: mprofile-kernel patch out mflr
From: Nicholas Piggin @ 2019-05-16  5:49 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman, Naveen N. Rao
In-Reply-To: <1557821918.xbleq18bk2.naveen@linux.ibm.com>

Naveen N. Rao's on May 14, 2019 6:32 pm:
> Michael Ellerman wrote:
>> "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> writes:
>>> Michael Ellerman wrote:
>>>> Nicholas Piggin <npiggin@gmail.com> writes:
>>>>> The new mprofile-kernel mcount sequence is
>>>>>
>>>>>   mflr	r0
>>>>>   bl	_mcount
>>>>>
>>>>> Dynamic ftrace patches the branch instruction with a noop, but leaves
>>>>> the mflr. mflr is executed by the branch unit that can only execute one
>>>>> per cycle on POWER9 and shared with branches, so it would be nice to
>>>>> avoid it where possible.
>>>>>
>>>>> This patch is a hacky proof of concept to nop out the mflr. Can we do
>>>>> this or are there races or other issues with it?
>>>> 
>>>> There's a race, isn't there?
>>>> 
>>>> We have a function foo which currently has tracing disabled, so the mflr
>>>> and bl are nop'ed out.
>>>> 
>>>>   CPU 0			CPU 1
>>>>   ==================================
>>>>   bl foo
>>>>   nop (ie. not mflr)
>>>>   -> interrupt
>>>>   something else	enable tracing for foo
>>>>   ...			patch mflr and branch
>>>>   <- rfi
>>>>   bl _mcount
>>>> 
>>>> So we end up in _mcount() but with r0 not populated.
>>>
>>> Good catch! Looks like we need to patch the mflr with a "b +8" similar 
>>> to what we do in __ftrace_make_nop().
>> 
>> Would that actually make it any faster though? Nick?
> 
> Ok, how about doing this as a 2-step process?
> 1. patch 'mflr r0' with a 'b +8'
>    synchronize_rcu_tasks()
> 2. convert 'b +8' to a 'nop'

Good idea. Well the mflr r0 is harmless, so you can leave that in.
You just need to ensure it's not removed before the bl is. So nop
the bl _mcount, then synchronize_rcu_tasks(), then nop the mflr?

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH] powerpc/64s: Make boot look nice(r)
From: Nicholas Piggin @ 2019-05-16  6:08 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev
In-Reply-To: <d65ae686-c117-ae1c-1d48-498fdd1ea0eb@c-s.fr>

Christophe Leroy's on May 16, 2019 2:47 pm:
> 
> 
> Le 16/05/2019 à 04:04, Nicholas Piggin a écrit :
>> Radix boot looks like this:
>> 
>>   -----------------------------------------------------
>>   phys_mem_size     = 0x200000000
>>   dcache_bsize      = 0x80
>>   icache_bsize      = 0x80
>>   cpu_features      = 0x0000c06f8f5fb1a7
>>     possible        = 0x0000fbffcf5fb1a7
>>     always          = 0x00000003800081a1
>>   cpu_user_features = 0xdc0065c2 0xaee00000
>>   mmu_features      = 0xbc006041
>>   firmware_features = 0x0000000010000000
>>   hash-mmu: ppc64_pft_size    = 0x0
>>   hash-mmu: kernel vmalloc start   = 0xc008000000000000
>>   hash-mmu: kernel IO start        = 0xc00a000000000000
>>   hash-mmu: kernel vmemmap start   = 0xc00c000000000000
>>   -----------------------------------------------------
>> 
>> Fix:
>> 
>>   -----------------------------------------------------
>>   phys_mem_size     = 0x200000000
>>   dcache_bsize      = 0x80
>>   icache_bsize      = 0x80
>>   cpu_features      = 0x0000c06f8f5fb1a7
>>     possible        = 0x0000fbffcf5fb1a7
>>     always          = 0x00000003800081a1
>>   cpu_user_features = 0xdc0065c2 0xaee00000
>>   mmu_features      = 0xbc006041
>>   firmware_features = 0x0000000010000000
>>   vmalloc start     = 0xc008000000000000
>>   IO start          = 0xc00a000000000000
>>   vmemmap start     = 0xc00c000000000000
>>   -----------------------------------------------------
>> 
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> 
> I fear your change defeats most of the purpose of commit 
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20190515&id=e4dccf9092ab48a6f902003b9558c0e45d0e849a

I think it's still a significant improvement without introducing
the regression :)

> As far as I understand, the main issue is the "hash-mmu:" prefix ?
> That's due to the following define in top of book3s64/hash_utils.c:
> 
> #define pr_fmt(fmt) "hash-mmu: " fmt
> 
> Could we simply undef it just before print_system_hash_info() ?

Little bit fragile I think.

> Or move print_system_hash_info() in another book3s64 specific file which 
> doesn't set pr_fmt ?

print_system_info() would be okay for me and allow getting rid of
that PPC64 config. Although it also needs to go in a file without
pr_fmt I guess that's not so hard.

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH 0/1] Forced-wakeup for stop lite states on Powernv
From: Nicholas Piggin @ 2019-05-16  6:13 UTC (permalink / raw)
  To: ego
  Cc: linux-pm, daniel.lezcano, rjw, linux-kernel, Abhishek,
	linuxppc-dev, dja
In-Reply-To: <20190516053659.GA20396@in.ibm.com>

Gautham R Shenoy's on May 16, 2019 3:36 pm:
> Hello Nicholas,
> 
> 
> On Thu, May 16, 2019 at 02:55:42PM +1000, Nicholas Piggin wrote:
>> Abhishek's on May 13, 2019 7:49 pm:
>> > On 05/08/2019 10:29 AM, Nicholas Piggin wrote:
>> >> Abhishek Goel's on April 22, 2019 4:32 pm:
>> >>> Currently, the cpuidle governors determine what idle state a idling CPU
>> >>> should enter into based on heuristics that depend on the idle history on
>> >>> that CPU. Given that no predictive heuristic is perfect, there are cases
>> >>> where the governor predicts a shallow idle state, hoping that the CPU will
>> >>> be busy soon. However, if no new workload is scheduled on that CPU in the
>> >>> near future, the CPU will end up in the shallow state.
>> >>>
>> >>> Motivation
>> >>> ----------
>> >>> In case of POWER, this is problematic, when the predicted state in the
>> >>> aforementioned scenario is a lite stop state, as such lite states will
>> >>> inhibit SMT folding, thereby depriving the other threads in the core from
>> >>> using the core resources.
>> >>>
>> >>> So we do not want to get stucked in such states for longer duration. To
>> >>> address this, the cpuidle-core can queue timer to correspond with the
>> >>> residency value of the next available state. This timer will forcefully
>> >>> wakeup the cpu. Few such iterations will essentially train the governor to
>> >>> select a deeper state for that cpu, as the timer here corresponds to the
>> >>> next available cpuidle state residency. Cpu will be kicked out of the lite
>> >>> state and end up in a non-lite state.
>> >>>
>> >>> Experiment
>> >>> ----------
>> >>> I performed experiments for three scenarios to collect some data.
>> >>>
>> >>> case 1 :
>> >>> Without this patch and without tick retained, i.e. in a upstream kernel,
>> >>> It would spend more than even a second to get out of stop0_lite.
>> >>>
>> >>> case 2 : With tick retained in a upstream kernel -
>> >>>
>> >>> Generally, we have a sched tick at 4ms(CONF_HZ = 250). Ideally I expected
>> >>> it to take 8 sched tick to get out of stop0_lite. Experimentally,
>> >>> observation was
>> >>>
>> >>> =========================================================
>> >>> sample          min            max           99percentile
>> >>> 20              4ms            12ms          4ms
>> >>> =========================================================
>> >>>
>> >>> It would take atleast one sched tick to get out of stop0_lite.
>> >>>
>> >>> case 2 :  With this patch (not stopping tick, but explicitly queuing a
>> >>>            timer)
>> >>>
>> >>> ============================================================
>> >>> sample          min             max             99percentile
>> >>> ============================================================
>> >>> 20              144us           192us           144us
>> >>> ============================================================
>> >>>
>> >>> In this patch, we queue a timer just before entering into a stop0_lite
>> >>> state. The timer fires at (residency of next available state + exit latency
>> >>> of next available state * 2). Let's say if next state(stop0) is available
>> >>> which has residency of 20us, it should get out in as low as (20+2*2)*8
>> >>> [Based on the forumla (residency + 2xlatency)*history length] microseconds
>> >>> = 192us. Ideally we would expect 8 iterations, it was observed to get out
>> >>> in 6-7 iterations. Even if let's say stop2 is next available state(stop0
>> >>> and stop1 both are unavailable), it would take (100+2*10)*8 = 960us to get
>> >>> into stop2.
>> >>>
>> >>> So, We are able to get out of stop0_lite generally in 150us(with this
>> >>> patch) as compared to 4ms(with tick retained). As stated earlier, we do not
>> >>> want to get stuck into stop0_lite as it inhibits SMT folding for other
>> >>> sibling threads, depriving them of core resources. Current patch is using
>> >>> forced-wakeup only for stop0_lite, as it gives performance benefit(primary
>> >>> reason) along with lowering down power consumption. We may extend this
>> >>> model for other states in future.
>> >> I still have to wonder, between our snooze loop and stop0, what does
>> >> stop0_lite buy us.
>> >>
>> >> That said, the problem you're solving here is a generic one that all
>> >> stop states have, I think. Doesn't the same thing apply going from
>> >> stop0 to stop5? You might under estimate the sleep time and lose power
>> >> savings and therefore performance there too. Shouldn't we make it
>> >> generic for all stop states?
>> >>
>> >> Thanks,
>> >> Nick
>> >>
>> >>
>> > When a cpu is in snooze, it takes both space and time of core. When in 
>> > stop0_lite,
>> > it free up time but it still takes space.
>> 
>> True, but snooze should only be taking less than 1% of front end
>> cycles. I appreciate there is some non-zero difference here, I just
>> wonder in practice what exactly we gain by it.
> 
> The idea behind implementing a lite-state was that on the future
> platforms it can be made to wait on a flag and hence act as a
> replacement for snooze. On POWER9 we don't have this feature.

Right. I mean for POWER9.

> The motivation behind this patch was a HPC customer issue where they
> were observing some CPUs in the core getting stuck at stop0_lite
> state, thereby lowering the performance on the other CPUs of the core
> which were running the application.
> 
> Disabling stop0_lite via sysfs didn't help since we would fallback to
> snooze and it would make matters worse.

snooze has the timeout though, so it should kick into stop0 properly
(and if it doesn't that's another issue that should be fixed in this
series).

I'm not questioning the patch for stop0_lite, to be clear. I think
the logic is sound. I just raise one urelated issue that happens to
be for stop0_lite as well (should we even enable it on P9?), and one
peripheral issue (should we make a similar fix for deeper stop states?)

> 
>> 
>> We should always have fewer states unless proven otherwise.
> 
> I agree.
> 
>> 
>> That said, we enable it today so I don't want to argue this point
>> here, because it is a different issue from your patch.
>> 
>> > When it is in stop0 or deeper, 
>> > it free up both
>> > space and time slice of core.
>> > In stop0_lite, cpu doesn't free up the core resources and thus inhibits 
>> > thread
>> > folding. When a cpu goes to stop0, it will free up the core resources 
>> > thus increasing
>> > the single thread performance of other sibling thread.
>> > Hence, we do not want to get stuck in stop0_lite for long duration, and 
>> > want to quickly
>> > move onto the next state.
>> > If we get stuck in any other state we would possibly be losing on to 
>> > power saving,
>> > but will still be able to gain the performance benefits for other 
>> > sibling threads.
>> 
>> That's true, but stop0 -> deeper stop is also a benefit (for
>> performance if we have some power/thermal constraints, and/or for power
>> usage).
>> 
>> Sure it may not be so noticable as the SMT switch, but I just wonder
>> if the infrastructure should be there for the same reason.
>> 
>> I was testing interrupt frequency on some tickless workloads configs,
>> and without too much trouble you can get CPUs to sleep with no
>> interrupts for many minutes. Hours even. We wouldn't want the CPU to
>> stay in stop0 for that long.
> 
> If it stays in stop0 or even stop2 for that long, we would want to
> "promote" it to a deeper state, such as say STOP5 which allows the
> other cores to run at higher frequencies.

So we would want this same logic for all but the deepest runtime
stop state?

>> Just thinking about the patch itself, I wonder do you need a full
>> kernel timer, or could we just set the decrementer? Is there much 
>> performance cost here?
>>
> 
> Good point. A decrementer would do actually.

That would be good if it does, might save a few cycles.

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH 4.4 247/266] cpu/speculation: Add mitigations= cmdline option
From: Geert Uytterhoeven @ 2019-05-16  7:04 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Ben Hutchings
  Cc: Peter Zijlstra, Heiko Carstens, Paul Mackerras, H . Peter Anvin,
	Andrea Arcangeli, linux-s390, Steven Price, Linux ARM,
	Catalin Marinas, Waiman Long, Linux-Arch, Will Deacon,
	Jiri Kosina, linuxppc-dev, Borislav Petkov, Andy Lutomirski,
	Josh Poimboeuf, Thomas Gleixner, Jon Masters, Phil Auld,
	Jiri Kosina, Randy Dunlap, Linux Kernel Mailing List, stable,
	Tyler Hicks, Martin Schwidefsky, Linus Torvalds
In-Reply-To: <20190515090731.364702401@linuxfoundation.org>

Hi Greg, Ben,

On Wed, May 15, 2019 at 1:12 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> From: Josh Poimboeuf <jpoimboe@redhat.com>
>
> commit 98af8452945c55652de68536afdde3b520fec429 upstream.
>
> Keeping track of the number of mitigations for all the CPU speculation
> bugs has become overwhelming for many users.  It's getting more and more
> complicated to decide which mitigations are needed for a given
> architecture.  Complicating matters is the fact that each arch tends to
> have its own custom way to mitigate the same vulnerability.
>
> Most users fall into a few basic categories:
>
> a) they want all mitigations off;
>
> b) they want all reasonable mitigations on, with SMT enabled even if
>    it's vulnerable; or
>
> c) they want all reasonable mitigations on, with SMT disabled if
>    vulnerable.
>
> Define a set of curated, arch-independent options, each of which is an
> aggregation of existing options:
>
> - mitigations=off: Disable all mitigations.
>
> - mitigations=auto: [default] Enable all the default mitigations, but
>   leave SMT enabled, even if it's vulnerable.
>
> - mitigations=auto,nosmt: Enable all the default mitigations, disabling
>   SMT if needed by a mitigation.
>
> Currently, these options are placeholders which don't actually do
> anything.  They will be fleshed out in upcoming patches.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

> [bwh: Backported to 4.4:
>  - Drop the auto,nosmt option which we can't support

This doesn't really stand out. I.e. I completely missed it, and started
wondering why "auto,nosmt" was not documented in
kernel-parameters.txt below...

> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2173,6 +2173,25 @@ bytes respectively. Such letter suffixes
>                         in the "bleeding edge" mini2440 support kernel at
>                         http://repo.or.cz/w/linux-2.6/mini2440.git
>
> +       mitigations=
> +                       Control optional mitigations for CPU vulnerabilities.
> +                       This is a set of curated, arch-independent options, each
> +                       of which is an aggregation of existing arch-specific
> +                       options.
> +
> +                       off
> +                               Disable all optional CPU mitigations.  This
> +                               improves system performance, but it may also
> +                               expose users to several CPU vulnerabilities.
> +
> +                       auto (default)
> +                               Mitigate all CPU vulnerabilities, but leave SMT
> +                               enabled, even if it's vulnerable.  This is for
> +                               users who don't want to be surprised by SMT
> +                               getting disabled across kernel upgrades, or who
> +                               have other ways of avoiding SMT-based attacks.
> +                               This is the default behavior.
> +
>         mminit_loglevel=
>                         [KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this
>                         parameter allows control of the logging verbosity for

> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -842,3 +842,16 @@ void init_cpu_online(const struct cpumas
>  {
>         cpumask_copy(to_cpumask(cpu_online_bits), src);
>  }
> +
> +enum cpu_mitigations cpu_mitigations = CPU_MITIGATIONS_AUTO;
> +
> +static int __init mitigations_parse_cmdline(char *arg)
> +{
> +       if (!strcmp(arg, "off"))
> +               cpu_mitigations = CPU_MITIGATIONS_OFF;
> +       else if (!strcmp(arg, "auto"))
> +               cpu_mitigations = CPU_MITIGATIONS_AUTO;

Perhaps

    else
            pr_crit("mitigations=%s is not supported\n", arg);

?

Actually that makes sense on mainline, too.
Cooking a patch...

> +
> +       return 0;
> +}
> +early_param("mitigations", mitigations_parse_cmdline);

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [PATCH 0/1] Forced-wakeup for stop lite states on Powernv
From: Gautham R Shenoy @ 2019-05-16  8:22 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: ego, linux-pm, daniel.lezcano, rjw, linux-kernel, Abhishek,
	linuxppc-dev, dja
In-Reply-To: <1557986956.6pmjz10b9z.astroid@bobo.none>

Hi Nicholas,

On Thu, May 16, 2019 at 04:13:17PM +1000, Nicholas Piggin wrote:

> 
> > The motivation behind this patch was a HPC customer issue where they
> > were observing some CPUs in the core getting stuck at stop0_lite
> > state, thereby lowering the performance on the other CPUs of the core
> > which were running the application.
> > 
> > Disabling stop0_lite via sysfs didn't help since we would fallback to
> > snooze and it would make matters worse.
> 
> snooze has the timeout though, so it should kick into stop0 properly
> (and if it doesn't that's another issue that should be fixed in this
> series).
>
> I'm not questioning the patch for stop0_lite, to be clear. I think
> the logic is sound. I just raise one urelated issue that happens to
> be for stop0_lite as well (should we even enable it on P9?), and one
> peripheral issue (should we make a similar fix for deeper stop states?)
>

I think it makes sense to generalize this from the point of view of
CPUs remaining in shallower idle states for long durations on tickless
kernels.

> > 
> >> 
> >> We should always have fewer states unless proven otherwise.
> > 
> > I agree.
> > 
> >> 
> >> That said, we enable it today so I don't want to argue this point
> >> here, because it is a different issue from your patch.
> >> 
> >> > When it is in stop0 or deeper, 
> >> > it free up both
> >> > space and time slice of core.
> >> > In stop0_lite, cpu doesn't free up the core resources and thus inhibits 
> >> > thread
> >> > folding. When a cpu goes to stop0, it will free up the core resources 
> >> > thus increasing
> >> > the single thread performance of other sibling thread.
> >> > Hence, we do not want to get stuck in stop0_lite for long duration, and 
> >> > want to quickly
> >> > move onto the next state.
> >> > If we get stuck in any other state we would possibly be losing on to 
> >> > power saving,
> >> > but will still be able to gain the performance benefits for other 
> >> > sibling threads.
> >> 
> >> That's true, but stop0 -> deeper stop is also a benefit (for
> >> performance if we have some power/thermal constraints, and/or for power
> >> usage).
> >> 
> >> Sure it may not be so noticable as the SMT switch, but I just wonder
> >> if the infrastructure should be there for the same reason.
> >> 
> >> I was testing interrupt frequency on some tickless workloads configs,
> >> and without too much trouble you can get CPUs to sleep with no
> >> interrupts for many minutes. Hours even. We wouldn't want the CPU to
> >> stay in stop0 for that long.
> > 
> > If it stays in stop0 or even stop2 for that long, we would want to
> > "promote" it to a deeper state, such as say STOP5 which allows the
> > other cores to run at higher frequencies.
> 
> So we would want this same logic for all but the deepest runtime
> stop state?

Yes. We can, in steps, promote individual threads of the core to
eventually request a deeper state such as stop4/5. On a completely
idle tickless system, eventually we should see the core go to the
deeper idle state.

> 
> >> Just thinking about the patch itself, I wonder do you need a full
> >> kernel timer, or could we just set the decrementer? Is there much 
> >> performance cost here?
> >>
> > 
> > Good point. A decrementer would do actually.
> 
> That would be good if it does, might save a few cycles.
> 
> Thanks,
> Nick
>

--
Thanks and Regards
gautham.


^ permalink raw reply

* [PATCH AUTOSEL 5.0 12/34] KVM: PPC: Book3S: Protect memslots while validating user address
From: Sasha Levin @ 2019-05-16 11:39 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Alexey Kardashevskiy, linuxppc-dev, kvm-ppc, Sasha Levin
In-Reply-To: <20190516113932.8348-1-sashal@kernel.org>

From: Alexey Kardashevskiy <aik@ozlabs.ru>

[ Upstream commit 345077c8e172c255ea0707214303ccd099e5656b ]

Guest physical to user address translation uses KVM memslots and reading
these requires holding the kvm->srcu lock. However recently introduced
kvmppc_tce_validate() broke the rule (see the lockdep warning below).

This moves srcu_read_lock(&vcpu->kvm->srcu) earlier to protect
kvmppc_tce_validate() as well.

=============================
WARNING: suspicious RCU usage
5.1.0-rc2-le_nv2_aikATfstn1-p1 #380 Not tainted
-----------------------------
include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by qemu-system-ppc/8020:
 #0: 0000000094972fe9 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0xdc/0x850 [kvm]

stack backtrace:
CPU: 44 PID: 8020 Comm: qemu-system-ppc Not tainted 5.1.0-rc2-le_nv2_aikATfstn1-p1 #380
Call Trace:
[c000003fece8f740] [c000000000bcc134] dump_stack+0xe8/0x164 (unreliable)
[c000003fece8f790] [c000000000181be0] lockdep_rcu_suspicious+0x130/0x170
[c000003fece8f810] [c0000000000d5f50] kvmppc_tce_to_ua+0x280/0x290
[c000003fece8f870] [c00800001a7e2c78] kvmppc_tce_validate+0x80/0x1b0 [kvm]
[c000003fece8f8e0] [c00800001a7e3fac] kvmppc_h_put_tce+0x94/0x3e4 [kvm]
[c000003fece8f9a0] [c00800001a8baac4] kvmppc_pseries_do_hcall+0x30c/0xce0 [kvm_hv]
[c000003fece8fa10] [c00800001a8bd89c] kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv]
[c000003fece8fae0] [c00800001a7d95dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[c000003fece8fb00] [c00800001a7d56bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
[c000003fece8fb90] [c00800001a7c3618] kvm_vcpu_ioctl+0x460/0x850 [kvm]
[c000003fece8fd00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930
[c000003fece8fdb0] [c00000000041ce04] ksys_ioctl+0xc4/0x110
[c000003fece8fe00] [c00000000041ce78] sys_ioctl+0x28/0x80
[c000003fece8fe20] [c00000000000b5a4] system_call+0x5c/0x70

Fixes: 42de7b9e2167 ("KVM: PPC: Validate TCEs against preregistered memory page sizes", 2018-09-10)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/powerpc/kvm/book3s_64_vio.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 532ab79734c7a..d43e8fe6d4241 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -543,14 +543,14 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+
 	ret = kvmppc_tce_validate(stt, tce);
 	if (ret != H_SUCCESS)
-		return ret;
+		goto unlock_exit;
 
 	dir = iommu_tce_direction(tce);
 
-	idx = srcu_read_lock(&vcpu->kvm->srcu);
-
 	if ((dir != DMA_NONE) && kvmppc_tce_to_ua(vcpu->kvm, tce, &ua, NULL)) {
 		ret = H_PARAMETER;
 		goto unlock_exit;
-- 
2.20.1


^ permalink raw reply related

* [PATCH AUTOSEL 5.0 11/34] KVM: PPC: Book3S HV: Perserve PSSCR FAKE_SUSPEND bit on guest exit
From: Sasha Levin @ 2019-05-16 11:39 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Sasha Levin, linuxppc-dev, kvm-ppc, Suraj Jitindar Singh
In-Reply-To: <20190516113932.8348-1-sashal@kernel.org>

From: Suraj Jitindar Singh <sjitindarsingh@gmail.com>

[ Upstream commit 7cb9eb106d7a4efab6bcf30ec9503f1d703c77f5 ]

There is a hardware bug in some POWER9 processors where a treclaim in
fake suspend mode can cause an inconsistency in the XER[SO] bit across
the threads of a core, the workaround being to force the core into SMT4
when doing the treclaim.

The FAKE_SUSPEND bit (bit 10) in the PSSCR is used to control whether a
thread is in fake suspend or real suspend. The important difference here
being that thread reconfiguration is blocked in real suspend but not
fake suspend mode.

When we exit a guest which was in fake suspend mode, we force the core
into SMT4 while we do the treclaim in kvmppc_save_tm_hv().
However on the new exit path introduced with the function
kvmhv_run_single_vcpu() we restore the host PSSCR before calling
kvmppc_save_tm_hv() which means that if we were in fake suspend mode we
put the thread into real suspend mode when we clear the
PSSCR[FAKE_SUSPEND] bit. This means that we block thread reconfiguration
and the thread which is trying to get the core into SMT4 before it can
do the treclaim spins forever since it itself is blocking thread
reconfiguration. The result is that that core is essentially lost.

This results in a trace such as:
[   93.512904] CPU: 7 PID: 13352 Comm: qemu-system-ppc Not tainted 5.0.0 #4
[   93.512905] NIP:  c000000000098a04 LR: c0000000000cc59c CTR: 0000000000000000
[   93.512908] REGS: c000003fffd2bd70 TRAP: 0100   Not tainted  (5.0.0)
[   93.512908] MSR:  9000000302883033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[SE]>  CR: 22222444  XER: 00000000
[   93.512914] CFAR: c000000000098a5c IRQMASK: 3
[   93.512915] PACATMSCRATCH: 0000000000000001
[   93.512916] GPR00: 0000000000000001 c000003f6cc1b830 c000000001033100 0000000000000004
[   93.512928] GPR04: 0000000000000004 0000000000000002 0000000000000004 0000000000000007
[   93.512930] GPR08: 0000000000000000 0000000000000004 0000000000000000 0000000000000004
[   93.512932] GPR12: c000203fff7fc000 c000003fffff9500 0000000000000000 0000000000000000
[   93.512935] GPR16: 2000000000300375 000000000000059f 0000000000000000 0000000000000000
[   93.512951] GPR20: 0000000000000000 0000000000080053 004000000256f41f c000003f6aa88ef0
[   93.512953] GPR24: c000003f6aa89100 0000000000000010 0000000000000000 0000000000000000
[   93.512956] GPR28: c000003f9e9a0800 0000000000000000 0000000000000001 c000203fff7fc000
[   93.512959] NIP [c000000000098a04] pnv_power9_force_smt4_catch+0x1b4/0x2c0
[   93.512960] LR [c0000000000cc59c] kvmppc_save_tm_hv+0x40/0x88
[   93.512960] Call Trace:
[   93.512961] [c000003f6cc1b830] [0000000000080053] 0x80053 (unreliable)
[   93.512965] [c000003f6cc1b8a0] [c00800001e9cb030] kvmhv_p9_guest_entry+0x508/0x6b0 [kvm_hv]
[   93.512967] [c000003f6cc1b940] [c00800001e9cba44] kvmhv_run_single_vcpu+0x2dc/0xb90 [kvm_hv]
[   93.512968] [c000003f6cc1ba10] [c00800001e9cc948] kvmppc_vcpu_run_hv+0x650/0xb90 [kvm_hv]
[   93.512969] [c000003f6cc1bae0] [c00800001e8f620c] kvmppc_vcpu_run+0x34/0x48 [kvm]
[   93.512971] [c000003f6cc1bb00] [c00800001e8f2d4c] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
[   93.512972] [c000003f6cc1bb90] [c00800001e8e3918] kvm_vcpu_ioctl+0x460/0x7d0 [kvm]
[   93.512974] [c000003f6cc1bd00] [c0000000003ae2c0] do_vfs_ioctl+0xe0/0x8e0
[   93.512975] [c000003f6cc1bdb0] [c0000000003aeb24] ksys_ioctl+0x64/0xe0
[   93.512978] [c000003f6cc1be00] [c0000000003aebc8] sys_ioctl+0x28/0x80
[   93.512981] [c000003f6cc1be20] [c00000000000b3a4] system_call+0x5c/0x70
[   93.512983] Instruction dump:
[   93.512986] 419dffbc e98c0000 2e8b0000 38000001 60000000 60000000 60000000 40950068
[   93.512993] 392bffff 39400000 79290020 39290001 <7d2903a6> 60000000 60000000 7d235214

To fix this we preserve the PSSCR[FAKE_SUSPEND] bit until we call
kvmppc_save_tm_hv() which will mean the core can get into SMT4 and
perform the treclaim. Note kvmppc_save_tm_hv() clears the
PSSCR[FAKE_SUSPEND] bit again so there is no need to explicitly do that.

Fixes: 95a6432ce9038 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/powerpc/kvm/book3s_hv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 5a066fc299e17..f17065f2c962f 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3407,7 +3407,9 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
 	vcpu->arch.shregs.sprg2 = mfspr(SPRN_SPRG2);
 	vcpu->arch.shregs.sprg3 = mfspr(SPRN_SPRG3);
 
-	mtspr(SPRN_PSSCR, host_psscr);
+	/* Preserve PSSCR[FAKE_SUSPEND] until we've called kvmppc_save_tm_hv */
+	mtspr(SPRN_PSSCR, host_psscr |
+	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
 	mtspr(SPRN_HFSCR, host_hfscr);
 	mtspr(SPRN_CIABR, host_ciabr);
 	mtspr(SPRN_DAWR, host_dawr);
-- 
2.20.1


^ permalink raw reply related

* [PATCH] powerpc/mm/hash: Improve address limit checks
From: Aneesh Kumar K.V @ 2019-05-16 11:50 UTC (permalink / raw)
  To: npiggin, paulus, mpe; +Cc: Aneesh Kumar K.V, linuxppc-dev

Different parts of the code do the limit check by ignoring the top nibble
of EA. ie. we do checks like

	if ((ea & EA_MASK)  >= H_PGTABLE_RANGE)
	   	error

This patch makes sure we don't insert SLB entries for addresses whose top nibble
doesn't match the ignored bits.

With an address like 0x4000000008000000, we can result in wrong slb entries like

13 4000000008000000 400ea1b217000510   1T ESID=   400000 VSID=   ea1b217000 LLP:110

without this patch we will map that EA with LINEAR_MAP_REGION_ID and further
those addr limit check will return false.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/hash.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index 5486087e64ea..1060fadb4a56 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -29,10 +29,8 @@
 #define H_PGTABLE_EADDR_SIZE	(H_PTE_INDEX_SIZE + H_PMD_INDEX_SIZE + \
 				 H_PUD_INDEX_SIZE + H_PGD_INDEX_SIZE + PAGE_SHIFT)
 #define H_PGTABLE_RANGE		(ASM_CONST(1) << H_PGTABLE_EADDR_SIZE)
-/*
- * Top 2 bits are ignored in page table walk.
- */
-#define EA_MASK			(~(0xcUL << 60))
+
+#define EA_MASK			(~PAGE_OFFSET)
 
 /*
  * We store the slot details in the second half of page table.
@@ -93,6 +91,7 @@
 #define VMALLOC_REGION_ID	NON_LINEAR_REGION_ID(H_VMALLOC_START)
 #define IO_REGION_ID		NON_LINEAR_REGION_ID(H_KERN_IO_START)
 #define VMEMMAP_REGION_ID	NON_LINEAR_REGION_ID(H_VMEMMAP_START)
+#define INVALID_REGION_ID	(VMEMMAP_REGION_ID + 1)
 
 /*
  * Defines the address of the vmemap area, in its own region on
@@ -119,6 +118,9 @@ static inline int get_region_id(unsigned long ea)
 	if (id == 0)
 		return USER_REGION_ID;
 
+	if (id != (PAGE_OFFSET >> 60))
+		return INVALID_REGION_ID;
+
 	if (ea < H_KERN_VIRT_START)
 		return LINEAR_MAP_REGION_ID;
 
-- 
2.21.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox