LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [Bug 203609] Build error: implicit declaration of function 'cpu_mitigations_off'
From: bugzilla-daemon @ 2019-05-16 13:49 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <bug-203609-206035@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=203609

Michael Ellerman (michael@ellerman.id.au) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
                 CC|                            |michael@ellerman.id.au

--- Comment #1 from Michael Ellerman (michael@ellerman.id.au) ---
This should be fixed in 4.19.44 and 4.14.120 which will be released soon.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply

* Re: [PATCH] powerpc/book3s/mm: Clear MMU_FTR_HPTE_TABLE when radix is enabled.
From: Aneesh Kumar K.V @ 2019-05-16 13:36 UTC (permalink / raw)
  To: Nicholas Piggin, mpe, paulus; +Cc: linuxppc-dev
In-Reply-To: <1557982690.pk1t7llmyy.astroid@bobo.none>

On 5/16/19 10:34 AM, Nicholas Piggin wrote:
> Aneesh Kumar K.V's on May 14, 2019 4:02 pm:
>> Avoids confusion when printing Oops message like below
>>
>>   Faulting instruction address: 0xc00000000008bdb4
>>   Oops: Kernel access of bad area, sig: 11 [#1]
>>   LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
>>
>> Either ibm,pa-features or ibm,powerpc-cpu-features can be used to enable the
>> MMU features. We don't clear related MMU feature bits there. We use the kernel
>> commandline to determine what translation mode we want to use and clear the
>> HPTE or radix bit accordingly. On LPAR we do have to renable HASH bit if the
>> hypervisor can't do radix.
> 
> Well we have the HPTE feature: the CPU supports hash MMU mode. It's
> just the the kernel is booted in radix mode.
> 

We are not using mmu_features to indicate the capability of the hardware 
right? ie, mmu_features is an indication of current running config. We 
set MMU_FTR_TYPE_RADIX if the kernel is running in radix translation 
mode and on similar lines we should set MMU_FTR_HPTE_TABLE if the kernel 
is running in only hash translation mode. Whether the hardware support 
these translation mode is different from which mode is currently used.



> Could make a difference for KVM, if it will support an HPT guest or
> not.
> 

kvm should not depend on MMU_FTR_HPTE_TABLE to identify whether the 
hardware supports hash page table translation. I don't think we do that.


> That's all highly theoretical and we have other inconsistencies
> already in this stuff, I'd just like to try make things a bit better
> in the long term.
> 


-aneesh


^ permalink raw reply

* Re: [PATCH] crypto: talitos - fix skcipher failure due to wrong output IV
From: Christophe Leroy @ 2019-05-16 13:32 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Herbert Xu, Horia Geanta, linux-kernel@vger.kernel.org,
	linux-crypto@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	David S. Miller
In-Reply-To: <20190516023050.GA23200@sol.localdomain>



Le 16/05/2019 à 04:30, Eric Biggers a écrit :
> On Wed, May 15, 2019 at 08:49:48PM +0200, Christophe Leroy wrote:
>>
>>
>> Le 15/05/2019 à 16:05, Horia Geanta a écrit :
>>> On 5/15/2019 3:29 PM, Christophe Leroy wrote:
>>>> Selftests report the following:
>>>>
>>>> [    2.984845] alg: skcipher: cbc-aes-talitos encryption test failed (wrong output IV) on test vector 0, cfg="in-place"
>>>> [    2.995377] 00000000: 3d af ba 42 9d 9e b4 30 b4 22 da 80 2c 9f ac 41
>>>> [    3.032673] alg: skcipher: cbc-des-talitos encryption test failed (wrong output IV) on test vector 0, cfg="in-place"
>>>> [    3.043185] 00000000: fe dc ba 98 76 54 32 10
>>>> [    3.063238] alg: skcipher: cbc-3des-talitos encryption test failed (wrong output IV) on test vector 0, cfg="in-place"
>>>> [    3.073818] 00000000: 7d 33 88 93 0f 93 b2 42
>>>>
>>>> This above dumps show that the actual output IV is indeed the input IV.
>>>> This is due to the IV not being copied back into the request.
>>>>
>>>> This patch fixes that.
>>>>
>>>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>>> Reviewed-by: Horia Geantă <horia.geanta@nxp.com>
>>
>> It's missing a Fixes: tag and a Cc: to stable.
>>
>> I'll resend tomorrow.
>>
>>>
>>> While here, could you please check ecb mode (which by definition does not have
>>> an IV) is behaving correctly?
>>> Looking in driver_algs[] list of crypto algorithms supported by talitos,
>>> ecb(aes,des,3des) are declared with ivsize != 0.
>>
>> According to /proc/crypto, test are passed for ecb.
>>
> 
> Did you try enabling CONFIG_CRYPTO_MANAGER_EXTRA_TESTS?  There is now a check
> that the driver's ivsize matches the generic implementation's:
> 
>          if (ivsize != crypto_skcipher_ivsize(generic_tfm)) {
>                  pr_err("alg: skcipher: ivsize for %s (%u) doesn't match generic impl (%u)\n",
>                         driver, ivsize, crypto_skcipher_ivsize(generic_tfm));
>                  err = -EINVAL;
>                  goto out;
>          }
> 
> For ECB that means the ivsize must be 0.
> 
> AFAICS the talitos driver even accesses the IV for ECB, which is wrong; and the
> only reason this isn't crashing the self-tests already is that they are confused
> by the declared ivsize being nonzero so they don't pass NULL as they should.
> 

Ok, thanks. I'll try and run EXTRA TESTS as soon as I get the current 
test all fixed.

For the time being, I'm having a problem that I'm a bit lost with:

AEAD decryption fails at the moment for out-of-line tests, and the 
reason is that the ICV (used to do the SW compare with the expected one) 
is generated after the decrypted data.
It works perfectly when src == dst, because the src has space for it, 
but when the dst is different, the dst length is smaller so the ICV is 
generated outside the sg list, and the comparison fails because the 
comparison is done with the last bytes of the last segment of dst sg 
list (which corresponds to the end of decrypted data in that case).

What I'm having hard time with it that it seems that when the sg list 
has several elements, it uses an out of line area for generating the ICV 
but not when the sg list has only one element. I'm really wondering why.

Thanks
Christophe

^ permalink raw reply

* Re: Latest Git kernel: Section mismatch in reference from the variable start_here_multiplatform to the function .init.text:.early_setup()
From: Shawn Landden @ 2019-05-16 11:06 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: Christian Zigotzky, linuxppc-dev, Christian Zigotzky
In-Reply-To: <dbfbd2e0-eca8-8ecc-793b-a6f1471ce2ee@c-s.fr>

[-- Attachment #1: Type: text/plain, Size: 1422 bytes --]

Hi

El mié., 15 may. 2019 5:17, Christophe Leroy <christophe.leroy@c-s.fr>
escribió:

> Hi,
>
> Le 15/05/2019 à 12:09, Christian Zigotzky a écrit :
> > Hi All,
> >
> > I got the following error messages with the latest Git kernel today:
> >
> > GEN     .version
> >    CHK     include/generated/compile.h
> >    LD      vmlinux.o
> >    MODPOST vmlinux.o
> > WARNING: vmlinux.o(.text+0x302a): Section mismatch in reference from the
> > variable start_here_multiplatform to the function
> .init.text:.early_setup()
> > The function start_here_multiplatform() references
> > the function __init .early_setup().
> > This is often because start_here_multiplatform lacks a __init
> > annotation or the annotation of .early_setup is wrong.
> >
> >    MODINFO modules.builtin.modinfo
> >    KSYM    .tmp_kallsyms1.o
> >    KSYM    .tmp_kallsyms2.o
> >    LD      vmlinux
> >    SORTEX  vmlinux
> >    SYSMAP  System.map
> >    CHKHEAD vmlinux
> >
> > What does it mean?
>
The best first thing to do with these type of errors is to Google them. If
you see there are patches just assume somebody else is working on it. That
saves everybody time, because these things tend to get duplicate reports.

>
> I proposed a patch for it at https://patchwork.ozlabs.org/patch/1097845/
>
> Christophe
>
> >
> > Please find attached the kernel config.
> >
> > Thanks,
> > Christian
> >
>

[-- Attachment #2: Type: text/html, Size: 2199 bytes --]

^ permalink raw reply

* Re: [PATCH 1/2] pid: add pidfd_open()
From: Christian Brauner @ 2019-05-16 13:08 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: linux-ia64, linux-sh, linux-kernel, David Howells,
	open list:KERNEL SELFTEST FRAMEWORK, sparclinux, Linux API,
	elena.reshetova, linux-arch, linux-s390, linux-xtensa, Kees Cook,
	Arnd Bergmann, Jann Horn, linux-m68k, Al Viro, Andy Lutomirski,
	Oleg Nesterov, Thomas Gleixner, linux-arm-kernel, linux-parisc,
	Aleksa Sarai, Linus Torvalds, linux-mips, Andy Lutomirski,
	Eric W. Biederman, linux-alpha, Andrew Morton, linuxppc-dev
In-Reply-To: <CAKOZuesPF+ftwqsNDMBy1LpwJgWTNuQm9-E=C90sSTBYEEsDww@mail.gmail.com>

On Wed, May 15, 2019 at 10:45:06AM -0700, Daniel Colascione wrote:
> On Wed, May 15, 2019 at 3:04 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
> > pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
> > process that is created via traditional fork()/clone() calls that is only
> > referenced by a PID:
> 
> Thanks for doing this work. I'm really looking forward to this new
> approach to process management.

Thanks! Glad to hear!

> 
> > int pidfd = pidfd_open(1234, 0);
> > ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
> >
> > With the introduction of pidfds through CLONE_PIDFD it is possible to
> > created pidfds at process creation time.
> > However, a lot of processes get created with traditional PID-based calls
> > such as fork() or clone() (without CLONE_PIDFD). For these processes a
> > caller can currently not create a pollable pidfd. This is a huge problem
> > for Android's low memory killer (LMK) and service managers such as systemd.
> > Both are examples of tools that want to make use of pidfds to get reliable
> > notification of process exit for non-parents (pidfd polling) and race-free
> > signal sending (pidfd_send_signal()). They intend to switch to this API for
> > process supervision/management as soon as possible. Having no way to get
> > pollable pidfds from PID-only processes is one of the biggest blockers for
> > them in adopting this api. With pidfd_open() making it possible to retrieve
> > pidfd for PID-based processes we enable them to adopt this api.
> >
> > In line with Arnd's recent changes to consolidate syscall numbers across
> > architectures, I have added the pidfd_open() syscall to all architectures
> > at the same time.
> 
> I'm glad it's easier now.
> 
> >  arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
> >  arch/arm64/include/asm/unistd32.h           |  2 +
> >  arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
> >  arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
> >  arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
> >  arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
> >  arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
> >  arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
> >  arch/s390/kernel/syscalls/syscall.tbl       |  1 +
> >  arch/sh/kernel/syscalls/syscall.tbl         |  1 +
> >  arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
> >  arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
> >  arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
> 
> It'd be nice to arrange the system call tables so that we need to
> change only one file when adding a new system call.
> 
> [Snip system call wiring]
> 
> > --- a/include/linux/pid.h
> > +++ b/include/linux/pid.h
> > @@ -67,6 +67,7 @@ struct pid
> >  extern struct pid init_struct_pid;
> >
> >  extern const struct file_operations pidfd_fops;
> > +extern int pidfd_create(struct pid *pid);
> >
> >  static inline struct pid *get_pid(struct pid *pid)
> >  {
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index e2870fe1be5b..989055e0b501 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -929,6 +929,7 @@ asmlinkage long sys_clock_adjtime32(clockid_t which_clock,
> >                                 struct old_timex32 __user *tx);
> >  asmlinkage long sys_syncfs(int fd);
> >  asmlinkage long sys_setns(int fd, int nstype);
> > +asmlinkage long sys_pidfd_open(pid_t pid, unsigned int flags);
> >  asmlinkage long sys_sendmmsg(int fd, struct mmsghdr __user *msg,
> >                              unsigned int vlen, unsigned flags);
> >  asmlinkage long sys_process_vm_readv(pid_t pid,
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index dee7292e1df6..94a257a93d20 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -832,9 +832,11 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> >  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> >  #define __NR_io_uring_register 427
> >  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> > +#define __NR_pidfd_open 428
> > +__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
> >
> >  #undef __NR_syscalls
> > -#define __NR_syscalls 428
> > +#define __NR_syscalls 429
> >
> >  /*
> >   * 32 bit systems traditionally used different
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 737db1828437..980cc1d2b8d4 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1714,7 +1714,7 @@ const struct file_operations pidfd_fops = {
> >   * Return: On success, a cloexec pidfd is returned.
> >   *         On error, a negative errno number will be returned.
> >   */
> > -static int pidfd_create(struct pid *pid)
> > +int pidfd_create(struct pid *pid)
> >  {
> >         int fd;
> >
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 20881598bdfa..237d18d6ecb8 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -38,6 +38,7 @@
> >  #include <linux/syscalls.h>
> >  #include <linux/proc_ns.h>
> >  #include <linux/proc_fs.h>
> > +#include <linux/sched/signal.h>
> >  #include <linux/sched/task.h>
> >  #include <linux/idr.h>
> >
> > @@ -451,6 +452,53 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
> >         return idr_get_next(&ns->idr, &nr);
> >  }
> >
> > +/**
> > + * pidfd_open() - Open new pid file descriptor.
> > + *
> > + * @pid:   pid for which to retrieve a pidfd
> > + * @flags: flags to pass
> > + *
> > + * This creates a new pid file descriptor with the O_CLOEXEC flag set for
> > + * the process identified by @pid. Currently, the process identified by
> > + * @pid must be a thread-group leader. This restriction currently exists
> > + * for all aspects of pidfds including pidfd creation (CLONE_PIDFD cannot
> > + * be used with CLONE_THREAD) and pidfd polling (only supports thread group
> > + * leaders).
> > + *
> > + * Return: On success, a cloexec pidfd is returned.
> > + *         On error, a negative errno number will be returned.
> > + */
> > +SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> > +{
> > +       int fd, ret;
> > +       struct pid *p;
> > +       struct task_struct *tsk;
> > +
> > +       if (flags)
> > +               return -EINVAL;
> 
> If we support blocking operations on pidfds, we'll want to be able to
> put them in non-blocking mode. Does it make sense to accept and ignore
> O_NONBLOCK here now?

Hm, is there a race condition you see that would prevent you from using
fcntl()? If you're ok with using fcntl() I would argue we should hold on
tight to every bit we have for flags.

> 
> > +       if (pid <= 0)
> > +               return -EINVAL;
> 
> WDYT of defining pid == 0 to mean "open myself"?

I'm torn. It be a nice shortcut of course but pid being 0 is usually an
indicator for child processes. So unless the getpid() before
pidfd_open() is an issue I'd say let's leave it as is. If you really
want the shortcut might -1 be better?

> 
> > +       p = find_get_pid(pid);
> > +       if (!p)
> > +               return -ESRCH;
> > +
> > +       rcu_read_lock();
> > +       tsk = pid_task(p, PIDTYPE_PID);
> > +       if (!tsk)
> > +               ret = -ESRCH;
> > +       else if (unlikely(!thread_group_leader(tsk)))
> > +               ret = -EINVAL;
> > +       else
> > +               ret = 0;
> > +       rcu_read_unlock();
> > +
> > +       fd = ret ?: pidfd_create(p);
> > +       put_pid(p);
> > +       return fd;
> > +}

^ permalink raw reply

* [PATCH] powerpc/mm/hash: Improve address limit checks
From: Aneesh Kumar K.V @ 2019-05-16 11:50 UTC (permalink / raw)
  To: npiggin, paulus, mpe; +Cc: Aneesh Kumar K.V, linuxppc-dev

Different parts of the code do the limit check by ignoring the top nibble
of EA. ie. we do checks like

	if ((ea & EA_MASK)  >= H_PGTABLE_RANGE)
	   	error

This patch makes sure we don't insert SLB entries for addresses whose top nibble
doesn't match the ignored bits.

With an address like 0x4000000008000000, we can result in wrong slb entries like

13 4000000008000000 400ea1b217000510   1T ESID=   400000 VSID=   ea1b217000 LLP:110

without this patch we will map that EA with LINEAR_MAP_REGION_ID and further
those addr limit check will return false.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/book3s/64/hash.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index 5486087e64ea..1060fadb4a56 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -29,10 +29,8 @@
 #define H_PGTABLE_EADDR_SIZE	(H_PTE_INDEX_SIZE + H_PMD_INDEX_SIZE + \
 				 H_PUD_INDEX_SIZE + H_PGD_INDEX_SIZE + PAGE_SHIFT)
 #define H_PGTABLE_RANGE		(ASM_CONST(1) << H_PGTABLE_EADDR_SIZE)
-/*
- * Top 2 bits are ignored in page table walk.
- */
-#define EA_MASK			(~(0xcUL << 60))
+
+#define EA_MASK			(~PAGE_OFFSET)
 
 /*
  * We store the slot details in the second half of page table.
@@ -93,6 +91,7 @@
 #define VMALLOC_REGION_ID	NON_LINEAR_REGION_ID(H_VMALLOC_START)
 #define IO_REGION_ID		NON_LINEAR_REGION_ID(H_KERN_IO_START)
 #define VMEMMAP_REGION_ID	NON_LINEAR_REGION_ID(H_VMEMMAP_START)
+#define INVALID_REGION_ID	(VMEMMAP_REGION_ID + 1)
 
 /*
  * Defines the address of the vmemap area, in its own region on
@@ -119,6 +118,9 @@ static inline int get_region_id(unsigned long ea)
 	if (id == 0)
 		return USER_REGION_ID;
 
+	if (id != (PAGE_OFFSET >> 60))
+		return INVALID_REGION_ID;
+
 	if (ea < H_KERN_VIRT_START)
 		return LINEAR_MAP_REGION_ID;
 
-- 
2.21.0


^ permalink raw reply related

* [PATCH AUTOSEL 5.0 11/34] KVM: PPC: Book3S HV: Perserve PSSCR FAKE_SUSPEND bit on guest exit
From: Sasha Levin @ 2019-05-16 11:39 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Sasha Levin, linuxppc-dev, kvm-ppc, Suraj Jitindar Singh
In-Reply-To: <20190516113932.8348-1-sashal@kernel.org>

From: Suraj Jitindar Singh <sjitindarsingh@gmail.com>

[ Upstream commit 7cb9eb106d7a4efab6bcf30ec9503f1d703c77f5 ]

There is a hardware bug in some POWER9 processors where a treclaim in
fake suspend mode can cause an inconsistency in the XER[SO] bit across
the threads of a core, the workaround being to force the core into SMT4
when doing the treclaim.

The FAKE_SUSPEND bit (bit 10) in the PSSCR is used to control whether a
thread is in fake suspend or real suspend. The important difference here
being that thread reconfiguration is blocked in real suspend but not
fake suspend mode.

When we exit a guest which was in fake suspend mode, we force the core
into SMT4 while we do the treclaim in kvmppc_save_tm_hv().
However on the new exit path introduced with the function
kvmhv_run_single_vcpu() we restore the host PSSCR before calling
kvmppc_save_tm_hv() which means that if we were in fake suspend mode we
put the thread into real suspend mode when we clear the
PSSCR[FAKE_SUSPEND] bit. This means that we block thread reconfiguration
and the thread which is trying to get the core into SMT4 before it can
do the treclaim spins forever since it itself is blocking thread
reconfiguration. The result is that that core is essentially lost.

This results in a trace such as:
[   93.512904] CPU: 7 PID: 13352 Comm: qemu-system-ppc Not tainted 5.0.0 #4
[   93.512905] NIP:  c000000000098a04 LR: c0000000000cc59c CTR: 0000000000000000
[   93.512908] REGS: c000003fffd2bd70 TRAP: 0100   Not tainted  (5.0.0)
[   93.512908] MSR:  9000000302883033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[SE]>  CR: 22222444  XER: 00000000
[   93.512914] CFAR: c000000000098a5c IRQMASK: 3
[   93.512915] PACATMSCRATCH: 0000000000000001
[   93.512916] GPR00: 0000000000000001 c000003f6cc1b830 c000000001033100 0000000000000004
[   93.512928] GPR04: 0000000000000004 0000000000000002 0000000000000004 0000000000000007
[   93.512930] GPR08: 0000000000000000 0000000000000004 0000000000000000 0000000000000004
[   93.512932] GPR12: c000203fff7fc000 c000003fffff9500 0000000000000000 0000000000000000
[   93.512935] GPR16: 2000000000300375 000000000000059f 0000000000000000 0000000000000000
[   93.512951] GPR20: 0000000000000000 0000000000080053 004000000256f41f c000003f6aa88ef0
[   93.512953] GPR24: c000003f6aa89100 0000000000000010 0000000000000000 0000000000000000
[   93.512956] GPR28: c000003f9e9a0800 0000000000000000 0000000000000001 c000203fff7fc000
[   93.512959] NIP [c000000000098a04] pnv_power9_force_smt4_catch+0x1b4/0x2c0
[   93.512960] LR [c0000000000cc59c] kvmppc_save_tm_hv+0x40/0x88
[   93.512960] Call Trace:
[   93.512961] [c000003f6cc1b830] [0000000000080053] 0x80053 (unreliable)
[   93.512965] [c000003f6cc1b8a0] [c00800001e9cb030] kvmhv_p9_guest_entry+0x508/0x6b0 [kvm_hv]
[   93.512967] [c000003f6cc1b940] [c00800001e9cba44] kvmhv_run_single_vcpu+0x2dc/0xb90 [kvm_hv]
[   93.512968] [c000003f6cc1ba10] [c00800001e9cc948] kvmppc_vcpu_run_hv+0x650/0xb90 [kvm_hv]
[   93.512969] [c000003f6cc1bae0] [c00800001e8f620c] kvmppc_vcpu_run+0x34/0x48 [kvm]
[   93.512971] [c000003f6cc1bb00] [c00800001e8f2d4c] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
[   93.512972] [c000003f6cc1bb90] [c00800001e8e3918] kvm_vcpu_ioctl+0x460/0x7d0 [kvm]
[   93.512974] [c000003f6cc1bd00] [c0000000003ae2c0] do_vfs_ioctl+0xe0/0x8e0
[   93.512975] [c000003f6cc1bdb0] [c0000000003aeb24] ksys_ioctl+0x64/0xe0
[   93.512978] [c000003f6cc1be00] [c0000000003aebc8] sys_ioctl+0x28/0x80
[   93.512981] [c000003f6cc1be20] [c00000000000b3a4] system_call+0x5c/0x70
[   93.512983] Instruction dump:
[   93.512986] 419dffbc e98c0000 2e8b0000 38000001 60000000 60000000 60000000 40950068
[   93.512993] 392bffff 39400000 79290020 39290001 <7d2903a6> 60000000 60000000 7d235214

To fix this we preserve the PSSCR[FAKE_SUSPEND] bit until we call
kvmppc_save_tm_hv() which will mean the core can get into SMT4 and
perform the treclaim. Note kvmppc_save_tm_hv() clears the
PSSCR[FAKE_SUSPEND] bit again so there is no need to explicitly do that.

Fixes: 95a6432ce9038 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/powerpc/kvm/book3s_hv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 5a066fc299e17..f17065f2c962f 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3407,7 +3407,9 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
 	vcpu->arch.shregs.sprg2 = mfspr(SPRN_SPRG2);
 	vcpu->arch.shregs.sprg3 = mfspr(SPRN_SPRG3);
 
-	mtspr(SPRN_PSSCR, host_psscr);
+	/* Preserve PSSCR[FAKE_SUSPEND] until we've called kvmppc_save_tm_hv */
+	mtspr(SPRN_PSSCR, host_psscr |
+	      (local_paca->kvm_hstate.fake_suspend << PSSCR_FAKE_SUSPEND_LG));
 	mtspr(SPRN_HFSCR, host_hfscr);
 	mtspr(SPRN_CIABR, host_ciabr);
 	mtspr(SPRN_DAWR, host_dawr);
-- 
2.20.1


^ permalink raw reply related

* [PATCH AUTOSEL 5.0 12/34] KVM: PPC: Book3S: Protect memslots while validating user address
From: Sasha Levin @ 2019-05-16 11:39 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Alexey Kardashevskiy, linuxppc-dev, kvm-ppc, Sasha Levin
In-Reply-To: <20190516113932.8348-1-sashal@kernel.org>

From: Alexey Kardashevskiy <aik@ozlabs.ru>

[ Upstream commit 345077c8e172c255ea0707214303ccd099e5656b ]

Guest physical to user address translation uses KVM memslots and reading
these requires holding the kvm->srcu lock. However recently introduced
kvmppc_tce_validate() broke the rule (see the lockdep warning below).

This moves srcu_read_lock(&vcpu->kvm->srcu) earlier to protect
kvmppc_tce_validate() as well.

=============================
WARNING: suspicious RCU usage
5.1.0-rc2-le_nv2_aikATfstn1-p1 #380 Not tainted
-----------------------------
include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by qemu-system-ppc/8020:
 #0: 0000000094972fe9 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0xdc/0x850 [kvm]

stack backtrace:
CPU: 44 PID: 8020 Comm: qemu-system-ppc Not tainted 5.1.0-rc2-le_nv2_aikATfstn1-p1 #380
Call Trace:
[c000003fece8f740] [c000000000bcc134] dump_stack+0xe8/0x164 (unreliable)
[c000003fece8f790] [c000000000181be0] lockdep_rcu_suspicious+0x130/0x170
[c000003fece8f810] [c0000000000d5f50] kvmppc_tce_to_ua+0x280/0x290
[c000003fece8f870] [c00800001a7e2c78] kvmppc_tce_validate+0x80/0x1b0 [kvm]
[c000003fece8f8e0] [c00800001a7e3fac] kvmppc_h_put_tce+0x94/0x3e4 [kvm]
[c000003fece8f9a0] [c00800001a8baac4] kvmppc_pseries_do_hcall+0x30c/0xce0 [kvm_hv]
[c000003fece8fa10] [c00800001a8bd89c] kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv]
[c000003fece8fae0] [c00800001a7d95dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[c000003fece8fb00] [c00800001a7d56bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
[c000003fece8fb90] [c00800001a7c3618] kvm_vcpu_ioctl+0x460/0x850 [kvm]
[c000003fece8fd00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930
[c000003fece8fdb0] [c00000000041ce04] ksys_ioctl+0xc4/0x110
[c000003fece8fe00] [c00000000041ce78] sys_ioctl+0x28/0x80
[c000003fece8fe20] [c00000000000b5a4] system_call+0x5c/0x70

Fixes: 42de7b9e2167 ("KVM: PPC: Validate TCEs against preregistered memory page sizes", 2018-09-10)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/powerpc/kvm/book3s_64_vio.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 532ab79734c7a..d43e8fe6d4241 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -543,14 +543,14 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
 	if (ret != H_SUCCESS)
 		return ret;
 
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+
 	ret = kvmppc_tce_validate(stt, tce);
 	if (ret != H_SUCCESS)
-		return ret;
+		goto unlock_exit;
 
 	dir = iommu_tce_direction(tce);
 
-	idx = srcu_read_lock(&vcpu->kvm->srcu);
-
 	if ((dir != DMA_NONE) && kvmppc_tce_to_ua(vcpu->kvm, tce, &ua, NULL)) {
 		ret = H_PARAMETER;
 		goto unlock_exit;
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH 0/1] Forced-wakeup for stop lite states on Powernv
From: Gautham R Shenoy @ 2019-05-16  8:22 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: ego, linux-pm, daniel.lezcano, rjw, linux-kernel, Abhishek,
	linuxppc-dev, dja
In-Reply-To: <1557986956.6pmjz10b9z.astroid@bobo.none>

Hi Nicholas,

On Thu, May 16, 2019 at 04:13:17PM +1000, Nicholas Piggin wrote:

> 
> > The motivation behind this patch was a HPC customer issue where they
> > were observing some CPUs in the core getting stuck at stop0_lite
> > state, thereby lowering the performance on the other CPUs of the core
> > which were running the application.
> > 
> > Disabling stop0_lite via sysfs didn't help since we would fallback to
> > snooze and it would make matters worse.
> 
> snooze has the timeout though, so it should kick into stop0 properly
> (and if it doesn't that's another issue that should be fixed in this
> series).
>
> I'm not questioning the patch for stop0_lite, to be clear. I think
> the logic is sound. I just raise one urelated issue that happens to
> be for stop0_lite as well (should we even enable it on P9?), and one
> peripheral issue (should we make a similar fix for deeper stop states?)
>

I think it makes sense to generalize this from the point of view of
CPUs remaining in shallower idle states for long durations on tickless
kernels.

> > 
> >> 
> >> We should always have fewer states unless proven otherwise.
> > 
> > I agree.
> > 
> >> 
> >> That said, we enable it today so I don't want to argue this point
> >> here, because it is a different issue from your patch.
> >> 
> >> > When it is in stop0 or deeper, 
> >> > it free up both
> >> > space and time slice of core.
> >> > In stop0_lite, cpu doesn't free up the core resources and thus inhibits 
> >> > thread
> >> > folding. When a cpu goes to stop0, it will free up the core resources 
> >> > thus increasing
> >> > the single thread performance of other sibling thread.
> >> > Hence, we do not want to get stuck in stop0_lite for long duration, and 
> >> > want to quickly
> >> > move onto the next state.
> >> > If we get stuck in any other state we would possibly be losing on to 
> >> > power saving,
> >> > but will still be able to gain the performance benefits for other 
> >> > sibling threads.
> >> 
> >> That's true, but stop0 -> deeper stop is also a benefit (for
> >> performance if we have some power/thermal constraints, and/or for power
> >> usage).
> >> 
> >> Sure it may not be so noticable as the SMT switch, but I just wonder
> >> if the infrastructure should be there for the same reason.
> >> 
> >> I was testing interrupt frequency on some tickless workloads configs,
> >> and without too much trouble you can get CPUs to sleep with no
> >> interrupts for many minutes. Hours even. We wouldn't want the CPU to
> >> stay in stop0 for that long.
> > 
> > If it stays in stop0 or even stop2 for that long, we would want to
> > "promote" it to a deeper state, such as say STOP5 which allows the
> > other cores to run at higher frequencies.
> 
> So we would want this same logic for all but the deepest runtime
> stop state?

Yes. We can, in steps, promote individual threads of the core to
eventually request a deeper state such as stop4/5. On a completely
idle tickless system, eventually we should see the core go to the
deeper idle state.

> 
> >> Just thinking about the patch itself, I wonder do you need a full
> >> kernel timer, or could we just set the decrementer? Is there much 
> >> performance cost here?
> >>
> > 
> > Good point. A decrementer would do actually.
> 
> That would be good if it does, might save a few cycles.
> 
> Thanks,
> Nick
>

--
Thanks and Regards
gautham.


^ permalink raw reply

* Re: [PATCH 4.4 247/266] cpu/speculation: Add mitigations= cmdline option
From: Geert Uytterhoeven @ 2019-05-16  7:04 UTC (permalink / raw)
  To: Greg Kroah-Hartman, Ben Hutchings
  Cc: Peter Zijlstra, Heiko Carstens, Paul Mackerras, H . Peter Anvin,
	Andrea Arcangeli, linux-s390, Steven Price, Linux ARM,
	Catalin Marinas, Waiman Long, Linux-Arch, Will Deacon,
	Jiri Kosina, linuxppc-dev, Borislav Petkov, Andy Lutomirski,
	Josh Poimboeuf, Thomas Gleixner, Jon Masters, Phil Auld,
	Jiri Kosina, Randy Dunlap, Linux Kernel Mailing List, stable,
	Tyler Hicks, Martin Schwidefsky, Linus Torvalds
In-Reply-To: <20190515090731.364702401@linuxfoundation.org>

Hi Greg, Ben,

On Wed, May 15, 2019 at 1:12 PM Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> From: Josh Poimboeuf <jpoimboe@redhat.com>
>
> commit 98af8452945c55652de68536afdde3b520fec429 upstream.
>
> Keeping track of the number of mitigations for all the CPU speculation
> bugs has become overwhelming for many users.  It's getting more and more
> complicated to decide which mitigations are needed for a given
> architecture.  Complicating matters is the fact that each arch tends to
> have its own custom way to mitigate the same vulnerability.
>
> Most users fall into a few basic categories:
>
> a) they want all mitigations off;
>
> b) they want all reasonable mitigations on, with SMT enabled even if
>    it's vulnerable; or
>
> c) they want all reasonable mitigations on, with SMT disabled if
>    vulnerable.
>
> Define a set of curated, arch-independent options, each of which is an
> aggregation of existing options:
>
> - mitigations=off: Disable all mitigations.
>
> - mitigations=auto: [default] Enable all the default mitigations, but
>   leave SMT enabled, even if it's vulnerable.
>
> - mitigations=auto,nosmt: Enable all the default mitigations, disabling
>   SMT if needed by a mitigation.
>
> Currently, these options are placeholders which don't actually do
> anything.  They will be fleshed out in upcoming patches.
>
> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

> [bwh: Backported to 4.4:
>  - Drop the auto,nosmt option which we can't support

This doesn't really stand out. I.e. I completely missed it, and started
wondering why "auto,nosmt" was not documented in
kernel-parameters.txt below...

> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2173,6 +2173,25 @@ bytes respectively. Such letter suffixes
>                         in the "bleeding edge" mini2440 support kernel at
>                         http://repo.or.cz/w/linux-2.6/mini2440.git
>
> +       mitigations=
> +                       Control optional mitigations for CPU vulnerabilities.
> +                       This is a set of curated, arch-independent options, each
> +                       of which is an aggregation of existing arch-specific
> +                       options.
> +
> +                       off
> +                               Disable all optional CPU mitigations.  This
> +                               improves system performance, but it may also
> +                               expose users to several CPU vulnerabilities.
> +
> +                       auto (default)
> +                               Mitigate all CPU vulnerabilities, but leave SMT
> +                               enabled, even if it's vulnerable.  This is for
> +                               users who don't want to be surprised by SMT
> +                               getting disabled across kernel upgrades, or who
> +                               have other ways of avoiding SMT-based attacks.
> +                               This is the default behavior.
> +
>         mminit_loglevel=
>                         [KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this
>                         parameter allows control of the logging verbosity for

> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -842,3 +842,16 @@ void init_cpu_online(const struct cpumas
>  {
>         cpumask_copy(to_cpumask(cpu_online_bits), src);
>  }
> +
> +enum cpu_mitigations cpu_mitigations = CPU_MITIGATIONS_AUTO;
> +
> +static int __init mitigations_parse_cmdline(char *arg)
> +{
> +       if (!strcmp(arg, "off"))
> +               cpu_mitigations = CPU_MITIGATIONS_OFF;
> +       else if (!strcmp(arg, "auto"))
> +               cpu_mitigations = CPU_MITIGATIONS_AUTO;

Perhaps

    else
            pr_crit("mitigations=%s is not supported\n", arg);

?

Actually that makes sense on mainline, too.
Cooking a patch...

> +
> +       return 0;
> +}
> +early_param("mitigations", mitigations_parse_cmdline);

Gr{oetje,eeting}s,

                        Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [PATCH 0/1] Forced-wakeup for stop lite states on Powernv
From: Nicholas Piggin @ 2019-05-16  6:13 UTC (permalink / raw)
  To: ego
  Cc: linux-pm, daniel.lezcano, rjw, linux-kernel, Abhishek,
	linuxppc-dev, dja
In-Reply-To: <20190516053659.GA20396@in.ibm.com>

Gautham R Shenoy's on May 16, 2019 3:36 pm:
> Hello Nicholas,
> 
> 
> On Thu, May 16, 2019 at 02:55:42PM +1000, Nicholas Piggin wrote:
>> Abhishek's on May 13, 2019 7:49 pm:
>> > On 05/08/2019 10:29 AM, Nicholas Piggin wrote:
>> >> Abhishek Goel's on April 22, 2019 4:32 pm:
>> >>> Currently, the cpuidle governors determine what idle state a idling CPU
>> >>> should enter into based on heuristics that depend on the idle history on
>> >>> that CPU. Given that no predictive heuristic is perfect, there are cases
>> >>> where the governor predicts a shallow idle state, hoping that the CPU will
>> >>> be busy soon. However, if no new workload is scheduled on that CPU in the
>> >>> near future, the CPU will end up in the shallow state.
>> >>>
>> >>> Motivation
>> >>> ----------
>> >>> In case of POWER, this is problematic, when the predicted state in the
>> >>> aforementioned scenario is a lite stop state, as such lite states will
>> >>> inhibit SMT folding, thereby depriving the other threads in the core from
>> >>> using the core resources.
>> >>>
>> >>> So we do not want to get stucked in such states for longer duration. To
>> >>> address this, the cpuidle-core can queue timer to correspond with the
>> >>> residency value of the next available state. This timer will forcefully
>> >>> wakeup the cpu. Few such iterations will essentially train the governor to
>> >>> select a deeper state for that cpu, as the timer here corresponds to the
>> >>> next available cpuidle state residency. Cpu will be kicked out of the lite
>> >>> state and end up in a non-lite state.
>> >>>
>> >>> Experiment
>> >>> ----------
>> >>> I performed experiments for three scenarios to collect some data.
>> >>>
>> >>> case 1 :
>> >>> Without this patch and without tick retained, i.e. in a upstream kernel,
>> >>> It would spend more than even a second to get out of stop0_lite.
>> >>>
>> >>> case 2 : With tick retained in a upstream kernel -
>> >>>
>> >>> Generally, we have a sched tick at 4ms(CONF_HZ = 250). Ideally I expected
>> >>> it to take 8 sched tick to get out of stop0_lite. Experimentally,
>> >>> observation was
>> >>>
>> >>> =========================================================
>> >>> sample          min            max           99percentile
>> >>> 20              4ms            12ms          4ms
>> >>> =========================================================
>> >>>
>> >>> It would take atleast one sched tick to get out of stop0_lite.
>> >>>
>> >>> case 2 :  With this patch (not stopping tick, but explicitly queuing a
>> >>>            timer)
>> >>>
>> >>> ============================================================
>> >>> sample          min             max             99percentile
>> >>> ============================================================
>> >>> 20              144us           192us           144us
>> >>> ============================================================
>> >>>
>> >>> In this patch, we queue a timer just before entering into a stop0_lite
>> >>> state. The timer fires at (residency of next available state + exit latency
>> >>> of next available state * 2). Let's say if next state(stop0) is available
>> >>> which has residency of 20us, it should get out in as low as (20+2*2)*8
>> >>> [Based on the forumla (residency + 2xlatency)*history length] microseconds
>> >>> = 192us. Ideally we would expect 8 iterations, it was observed to get out
>> >>> in 6-7 iterations. Even if let's say stop2 is next available state(stop0
>> >>> and stop1 both are unavailable), it would take (100+2*10)*8 = 960us to get
>> >>> into stop2.
>> >>>
>> >>> So, We are able to get out of stop0_lite generally in 150us(with this
>> >>> patch) as compared to 4ms(with tick retained). As stated earlier, we do not
>> >>> want to get stuck into stop0_lite as it inhibits SMT folding for other
>> >>> sibling threads, depriving them of core resources. Current patch is using
>> >>> forced-wakeup only for stop0_lite, as it gives performance benefit(primary
>> >>> reason) along with lowering down power consumption. We may extend this
>> >>> model for other states in future.
>> >> I still have to wonder, between our snooze loop and stop0, what does
>> >> stop0_lite buy us.
>> >>
>> >> That said, the problem you're solving here is a generic one that all
>> >> stop states have, I think. Doesn't the same thing apply going from
>> >> stop0 to stop5? You might under estimate the sleep time and lose power
>> >> savings and therefore performance there too. Shouldn't we make it
>> >> generic for all stop states?
>> >>
>> >> Thanks,
>> >> Nick
>> >>
>> >>
>> > When a cpu is in snooze, it takes both space and time of core. When in 
>> > stop0_lite,
>> > it free up time but it still takes space.
>> 
>> True, but snooze should only be taking less than 1% of front end
>> cycles. I appreciate there is some non-zero difference here, I just
>> wonder in practice what exactly we gain by it.
> 
> The idea behind implementing a lite-state was that on the future
> platforms it can be made to wait on a flag and hence act as a
> replacement for snooze. On POWER9 we don't have this feature.

Right. I mean for POWER9.

> The motivation behind this patch was a HPC customer issue where they
> were observing some CPUs in the core getting stuck at stop0_lite
> state, thereby lowering the performance on the other CPUs of the core
> which were running the application.
> 
> Disabling stop0_lite via sysfs didn't help since we would fallback to
> snooze and it would make matters worse.

snooze has the timeout though, so it should kick into stop0 properly
(and if it doesn't that's another issue that should be fixed in this
series).

I'm not questioning the patch for stop0_lite, to be clear. I think
the logic is sound. I just raise one urelated issue that happens to
be for stop0_lite as well (should we even enable it on P9?), and one
peripheral issue (should we make a similar fix for deeper stop states?)

> 
>> 
>> We should always have fewer states unless proven otherwise.
> 
> I agree.
> 
>> 
>> That said, we enable it today so I don't want to argue this point
>> here, because it is a different issue from your patch.
>> 
>> > When it is in stop0 or deeper, 
>> > it free up both
>> > space and time slice of core.
>> > In stop0_lite, cpu doesn't free up the core resources and thus inhibits 
>> > thread
>> > folding. When a cpu goes to stop0, it will free up the core resources 
>> > thus increasing
>> > the single thread performance of other sibling thread.
>> > Hence, we do not want to get stuck in stop0_lite for long duration, and 
>> > want to quickly
>> > move onto the next state.
>> > If we get stuck in any other state we would possibly be losing on to 
>> > power saving,
>> > but will still be able to gain the performance benefits for other 
>> > sibling threads.
>> 
>> That's true, but stop0 -> deeper stop is also a benefit (for
>> performance if we have some power/thermal constraints, and/or for power
>> usage).
>> 
>> Sure it may not be so noticable as the SMT switch, but I just wonder
>> if the infrastructure should be there for the same reason.
>> 
>> I was testing interrupt frequency on some tickless workloads configs,
>> and without too much trouble you can get CPUs to sleep with no
>> interrupts for many minutes. Hours even. We wouldn't want the CPU to
>> stay in stop0 for that long.
> 
> If it stays in stop0 or even stop2 for that long, we would want to
> "promote" it to a deeper state, such as say STOP5 which allows the
> other cores to run at higher frequencies.

So we would want this same logic for all but the deepest runtime
stop state?

>> Just thinking about the patch itself, I wonder do you need a full
>> kernel timer, or could we just set the decrementer? Is there much 
>> performance cost here?
>>
> 
> Good point. A decrementer would do actually.

That would be good if it does, might save a few cycles.

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH] powerpc/64s: Make boot look nice(r)
From: Nicholas Piggin @ 2019-05-16  6:08 UTC (permalink / raw)
  To: Christophe Leroy, linuxppc-dev
In-Reply-To: <d65ae686-c117-ae1c-1d48-498fdd1ea0eb@c-s.fr>

Christophe Leroy's on May 16, 2019 2:47 pm:
> 
> 
> Le 16/05/2019 à 04:04, Nicholas Piggin a écrit :
>> Radix boot looks like this:
>> 
>>   -----------------------------------------------------
>>   phys_mem_size     = 0x200000000
>>   dcache_bsize      = 0x80
>>   icache_bsize      = 0x80
>>   cpu_features      = 0x0000c06f8f5fb1a7
>>     possible        = 0x0000fbffcf5fb1a7
>>     always          = 0x00000003800081a1
>>   cpu_user_features = 0xdc0065c2 0xaee00000
>>   mmu_features      = 0xbc006041
>>   firmware_features = 0x0000000010000000
>>   hash-mmu: ppc64_pft_size    = 0x0
>>   hash-mmu: kernel vmalloc start   = 0xc008000000000000
>>   hash-mmu: kernel IO start        = 0xc00a000000000000
>>   hash-mmu: kernel vmemmap start   = 0xc00c000000000000
>>   -----------------------------------------------------
>> 
>> Fix:
>> 
>>   -----------------------------------------------------
>>   phys_mem_size     = 0x200000000
>>   dcache_bsize      = 0x80
>>   icache_bsize      = 0x80
>>   cpu_features      = 0x0000c06f8f5fb1a7
>>     possible        = 0x0000fbffcf5fb1a7
>>     always          = 0x00000003800081a1
>>   cpu_user_features = 0xdc0065c2 0xaee00000
>>   mmu_features      = 0xbc006041
>>   firmware_features = 0x0000000010000000
>>   vmalloc start     = 0xc008000000000000
>>   IO start          = 0xc00a000000000000
>>   vmemmap start     = 0xc00c000000000000
>>   -----------------------------------------------------
>> 
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> 
> I fear your change defeats most of the purpose of commit 
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20190515&id=e4dccf9092ab48a6f902003b9558c0e45d0e849a

I think it's still a significant improvement without introducing
the regression :)

> As far as I understand, the main issue is the "hash-mmu:" prefix ?
> That's due to the following define in top of book3s64/hash_utils.c:
> 
> #define pr_fmt(fmt) "hash-mmu: " fmt
> 
> Could we simply undef it just before print_system_hash_info() ?

Little bit fragile I think.

> Or move print_system_hash_info() in another book3s64 specific file which 
> doesn't set pr_fmt ?

print_system_info() would be okay for me and allow getting rid of
that PPC64 config. Although it also needs to go in a file without
pr_fmt I guess that's not so hard.

Thanks,
Nick


^ permalink raw reply

* Re: [RFC PATCH] powerpc/64/ftrace: mprofile-kernel patch out mflr
From: Nicholas Piggin @ 2019-05-16  5:49 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman, Naveen N. Rao
In-Reply-To: <1557821918.xbleq18bk2.naveen@linux.ibm.com>

Naveen N. Rao's on May 14, 2019 6:32 pm:
> Michael Ellerman wrote:
>> "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> writes:
>>> Michael Ellerman wrote:
>>>> Nicholas Piggin <npiggin@gmail.com> writes:
>>>>> The new mprofile-kernel mcount sequence is
>>>>>
>>>>>   mflr	r0
>>>>>   bl	_mcount
>>>>>
>>>>> Dynamic ftrace patches the branch instruction with a noop, but leaves
>>>>> the mflr. mflr is executed by the branch unit that can only execute one
>>>>> per cycle on POWER9 and shared with branches, so it would be nice to
>>>>> avoid it where possible.
>>>>>
>>>>> This patch is a hacky proof of concept to nop out the mflr. Can we do
>>>>> this or are there races or other issues with it?
>>>> 
>>>> There's a race, isn't there?
>>>> 
>>>> We have a function foo which currently has tracing disabled, so the mflr
>>>> and bl are nop'ed out.
>>>> 
>>>>   CPU 0			CPU 1
>>>>   ==================================
>>>>   bl foo
>>>>   nop (ie. not mflr)
>>>>   -> interrupt
>>>>   something else	enable tracing for foo
>>>>   ...			patch mflr and branch
>>>>   <- rfi
>>>>   bl _mcount
>>>> 
>>>> So we end up in _mcount() but with r0 not populated.
>>>
>>> Good catch! Looks like we need to patch the mflr with a "b +8" similar 
>>> to what we do in __ftrace_make_nop().
>> 
>> Would that actually make it any faster though? Nick?
> 
> Ok, how about doing this as a 2-step process?
> 1. patch 'mflr r0' with a 'b +8'
>    synchronize_rcu_tasks()
> 2. convert 'b +8' to a 'nop'

Good idea. Well the mflr r0 is harmless, so you can leave that in.
You just need to ensure it's not removed before the bl is. So nop
the bl _mcount, then synchronize_rcu_tasks(), then nop the mflr?

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH 0/1] Forced-wakeup for stop lite states on Powernv
From: Gautham R Shenoy @ 2019-05-16  5:36 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: ego, linux-pm, daniel.lezcano, rjw, linux-kernel, Abhishek,
	linuxppc-dev, dja
In-Reply-To: <1557981860.eltms77ctp.astroid@bobo.none>

Hello Nicholas,


On Thu, May 16, 2019 at 02:55:42PM +1000, Nicholas Piggin wrote:
> Abhishek's on May 13, 2019 7:49 pm:
> > On 05/08/2019 10:29 AM, Nicholas Piggin wrote:
> >> Abhishek Goel's on April 22, 2019 4:32 pm:
> >>> Currently, the cpuidle governors determine what idle state a idling CPU
> >>> should enter into based on heuristics that depend on the idle history on
> >>> that CPU. Given that no predictive heuristic is perfect, there are cases
> >>> where the governor predicts a shallow idle state, hoping that the CPU will
> >>> be busy soon. However, if no new workload is scheduled on that CPU in the
> >>> near future, the CPU will end up in the shallow state.
> >>>
> >>> Motivation
> >>> ----------
> >>> In case of POWER, this is problematic, when the predicted state in the
> >>> aforementioned scenario is a lite stop state, as such lite states will
> >>> inhibit SMT folding, thereby depriving the other threads in the core from
> >>> using the core resources.
> >>>
> >>> So we do not want to get stucked in such states for longer duration. To
> >>> address this, the cpuidle-core can queue timer to correspond with the
> >>> residency value of the next available state. This timer will forcefully
> >>> wakeup the cpu. Few such iterations will essentially train the governor to
> >>> select a deeper state for that cpu, as the timer here corresponds to the
> >>> next available cpuidle state residency. Cpu will be kicked out of the lite
> >>> state and end up in a non-lite state.
> >>>
> >>> Experiment
> >>> ----------
> >>> I performed experiments for three scenarios to collect some data.
> >>>
> >>> case 1 :
> >>> Without this patch and without tick retained, i.e. in a upstream kernel,
> >>> It would spend more than even a second to get out of stop0_lite.
> >>>
> >>> case 2 : With tick retained in a upstream kernel -
> >>>
> >>> Generally, we have a sched tick at 4ms(CONF_HZ = 250). Ideally I expected
> >>> it to take 8 sched tick to get out of stop0_lite. Experimentally,
> >>> observation was
> >>>
> >>> =========================================================
> >>> sample          min            max           99percentile
> >>> 20              4ms            12ms          4ms
> >>> =========================================================
> >>>
> >>> It would take atleast one sched tick to get out of stop0_lite.
> >>>
> >>> case 2 :  With this patch (not stopping tick, but explicitly queuing a
> >>>            timer)
> >>>
> >>> ============================================================
> >>> sample          min             max             99percentile
> >>> ============================================================
> >>> 20              144us           192us           144us
> >>> ============================================================
> >>>
> >>> In this patch, we queue a timer just before entering into a stop0_lite
> >>> state. The timer fires at (residency of next available state + exit latency
> >>> of next available state * 2). Let's say if next state(stop0) is available
> >>> which has residency of 20us, it should get out in as low as (20+2*2)*8
> >>> [Based on the forumla (residency + 2xlatency)*history length] microseconds
> >>> = 192us. Ideally we would expect 8 iterations, it was observed to get out
> >>> in 6-7 iterations. Even if let's say stop2 is next available state(stop0
> >>> and stop1 both are unavailable), it would take (100+2*10)*8 = 960us to get
> >>> into stop2.
> >>>
> >>> So, We are able to get out of stop0_lite generally in 150us(with this
> >>> patch) as compared to 4ms(with tick retained). As stated earlier, we do not
> >>> want to get stuck into stop0_lite as it inhibits SMT folding for other
> >>> sibling threads, depriving them of core resources. Current patch is using
> >>> forced-wakeup only for stop0_lite, as it gives performance benefit(primary
> >>> reason) along with lowering down power consumption. We may extend this
> >>> model for other states in future.
> >> I still have to wonder, between our snooze loop and stop0, what does
> >> stop0_lite buy us.
> >>
> >> That said, the problem you're solving here is a generic one that all
> >> stop states have, I think. Doesn't the same thing apply going from
> >> stop0 to stop5? You might under estimate the sleep time and lose power
> >> savings and therefore performance there too. Shouldn't we make it
> >> generic for all stop states?
> >>
> >> Thanks,
> >> Nick
> >>
> >>
> > When a cpu is in snooze, it takes both space and time of core. When in 
> > stop0_lite,
> > it free up time but it still takes space.
> 
> True, but snooze should only be taking less than 1% of front end
> cycles. I appreciate there is some non-zero difference here, I just
> wonder in practice what exactly we gain by it.

The idea behind implementing a lite-state was that on the future
platforms it can be made to wait on a flag and hence act as a
replacement for snooze. On POWER9 we don't have this feature.

The motivation behind this patch was a HPC customer issue where they
were observing some CPUs in the core getting stuck at stop0_lite
state, thereby lowering the performance on the other CPUs of the core
which were running the application.

Disabling stop0_lite via sysfs didn't help since we would fallback to
snooze and it would make matters worse.

> 
> We should always have fewer states unless proven otherwise.

I agree.

> 
> That said, we enable it today so I don't want to argue this point
> here, because it is a different issue from your patch.
> 
> > When it is in stop0 or deeper, 
> > it free up both
> > space and time slice of core.
> > In stop0_lite, cpu doesn't free up the core resources and thus inhibits 
> > thread
> > folding. When a cpu goes to stop0, it will free up the core resources 
> > thus increasing
> > the single thread performance of other sibling thread.
> > Hence, we do not want to get stuck in stop0_lite for long duration, and 
> > want to quickly
> > move onto the next state.
> > If we get stuck in any other state we would possibly be losing on to 
> > power saving,
> > but will still be able to gain the performance benefits for other 
> > sibling threads.
> 
> That's true, but stop0 -> deeper stop is also a benefit (for
> performance if we have some power/thermal constraints, and/or for power
> usage).
> 
> Sure it may not be so noticable as the SMT switch, but I just wonder
> if the infrastructure should be there for the same reason.
> 
> I was testing interrupt frequency on some tickless workloads configs,
> and without too much trouble you can get CPUs to sleep with no
> interrupts for many minutes. Hours even. We wouldn't want the CPU to
> stay in stop0 for that long.

If it stays in stop0 or even stop2 for that long, we would want to
"promote" it to a deeper state, such as say STOP5 which allows the
other cores to run at higher frequencies.

> 
> Just thinking about the patch itself, I wonder do you need a full
> kernel timer, or could we just set the decrementer? Is there much 
> performance cost here?
>

Good point. A decrementer would do actually.

> Thanks,
> Nick

--
Thanks and Regards
gautham.


^ permalink raw reply

* Re: [PATCH] crypto: vmx - fix copy-paste error in CTR mode
From: Daniel Axtens @ 2019-05-16  5:28 UTC (permalink / raw)
  To: Eric Biggers
  Cc: leo.barbosa, nayna, Herbert Xu, Stephan Mueller, Nayna, omosnacek,
	marcelo.cerri, pfsmorigo, linux-crypto, leitao, George Wilson,
	linuxppc-dev
In-Reply-To: <20190516025603.GB23200@sol.localdomain>

Eric Biggers <ebiggers@kernel.org> writes:

> On Thu, May 16, 2019 at 12:12:48PM +1000, Daniel Axtens wrote:
>> 
>> I'm also seeing issues with ghash with the extended tests:
>> 
>> [    7.582926] alg: hash: p8_ghash test failed (wrong result) on test vector 0, cfg="random: use_final src_divs=[<reimport>9.72%@+39832, <reimport>18.2%@+65504, <reimport,nosimd>45.57%@alignmask+18, <reimport,nosimd>15.6%@+65496, 6.83%@+65514, <reimport,nosimd>1.2%@+25, <reim"
>> 
>> It seems to happen when one of the source divisions has nosimd and the
>> final result uses the simd finaliser, so that's interesting.
>> 
>
> The bug is that p8_ghash uses different shash_descs for the SIMD and no-SIMD
> cases.  So if you start out doing the hash in SIMD context but then switch to
> no-SIMD context or vice versa, the digest will be wrong.  Note that there can be
> an ->export() and ->import() in between, so it's not quite as obscure a case as
> one might think.

Ah cool, I was just in the process of figuring this out for myself -
always lovely to have my theory confirmed!

> To fix it I think you'll need to make p8_ghash use 'struct ghash_desc_ctx' just
> like ghash-generic so that the two code paths can share the same shash_desc.
> That's similar to what the various SHA hash algorithms do.

This is very helpful, thank you. I guess I will do that then.

Regards,
Daniel

>
> - Eric

^ permalink raw reply

* Re: [PATCH] powerpc/book3s/mm: Clear MMU_FTR_HPTE_TABLE when radix is enabled.
From: Nicholas Piggin @ 2019-05-16  5:04 UTC (permalink / raw)
  To: Aneesh Kumar K.V, mpe, paulus; +Cc: linuxppc-dev
In-Reply-To: <20190514060205.20887-1-aneesh.kumar@linux.ibm.com>

Aneesh Kumar K.V's on May 14, 2019 4:02 pm:
> Avoids confusion when printing Oops message like below
> 
>  Faulting instruction address: 0xc00000000008bdb4
>  Oops: Kernel access of bad area, sig: 11 [#1]
>  LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
> 
> Either ibm,pa-features or ibm,powerpc-cpu-features can be used to enable the
> MMU features. We don't clear related MMU feature bits there. We use the kernel
> commandline to determine what translation mode we want to use and clear the
> HPTE or radix bit accordingly. On LPAR we do have to renable HASH bit if the
> hypervisor can't do radix.

Well we have the HPTE feature: the CPU supports hash MMU mode. It's
just the the kernel is booted in radix mode.

Could make a difference for KVM, if it will support an HPT guest or
not.

That's all highly theoretical and we have other inconsistencies
already in this stuff, I'd just like to try make things a bit better
in the long term.

Can we just add an early_radix_enabled() in the oops printing code
to select radix or hash MMU?

> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index b97aee03924f..0fa6cac3fe82 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -77,9 +77,6 @@ static struct page *maybe_pte_to_page(pte_t pte)
>  
>  static pte_t set_pte_filter_hash(pte_t pte)
>  {
> -	if (radix_enabled())
> -		return pte;
> -
>  	pte = __pte(pte_val(pte) & ~_PAGE_HPTEFLAGS);
>  	if (pte_looks_normal(pte) && !(cpu_has_feature(CPU_FTR_COHERENT_ICACHE) ||
>  				       cpu_has_feature(CPU_FTR_NOEXECUTE))) {
> @@ -110,6 +107,8 @@ static pte_t set_pte_filter(pte_t pte)
>  
>  	if (mmu_has_feature(MMU_FTR_HPTE_TABLE))
>  		return set_pte_filter_hash(pte);
> +	else if (radix_enabled())
> +		return pte;
>  
>  	/* No exec permission in the first place, move on */
>  	if (!pte_exec(pte) || !pte_looks_normal(pte))
> @@ -140,7 +139,7 @@ static pte_t set_access_flags_filter(pte_t pte, struct vm_area_struct *vma,
>  {
>  	struct page *pg;
>  
> -	if (mmu_has_feature(MMU_FTR_HPTE_TABLE))
> +	if (mmu_has_feature(MMU_FTR_HPTE_TABLE) || radix_enabled())
>  		return pte;
>  
>  	/* So here, we only care about exec faults, as we use them

These would still be good cleanup to make the HPTE_TABLE feature
independent from radix.

Thanks,
Nick


^ permalink raw reply

* Re: [PATCH 0/1] Forced-wakeup for stop lite states on Powernv
From: Nicholas Piggin @ 2019-05-16  4:55 UTC (permalink / raw)
  To: Abhishek, linux-kernel, linux-pm, linuxppc-dev
  Cc: ego, daniel.lezcano, rjw, dja
In-Reply-To: <b2fcf69a-aecd-ea81-b497-737642354736@linux.vnet.ibm.com>

Abhishek's on May 13, 2019 7:49 pm:
> On 05/08/2019 10:29 AM, Nicholas Piggin wrote:
>> Abhishek Goel's on April 22, 2019 4:32 pm:
>>> Currently, the cpuidle governors determine what idle state a idling CPU
>>> should enter into based on heuristics that depend on the idle history on
>>> that CPU. Given that no predictive heuristic is perfect, there are cases
>>> where the governor predicts a shallow idle state, hoping that the CPU will
>>> be busy soon. However, if no new workload is scheduled on that CPU in the
>>> near future, the CPU will end up in the shallow state.
>>>
>>> Motivation
>>> ----------
>>> In case of POWER, this is problematic, when the predicted state in the
>>> aforementioned scenario is a lite stop state, as such lite states will
>>> inhibit SMT folding, thereby depriving the other threads in the core from
>>> using the core resources.
>>>
>>> So we do not want to get stucked in such states for longer duration. To
>>> address this, the cpuidle-core can queue timer to correspond with the
>>> residency value of the next available state. This timer will forcefully
>>> wakeup the cpu. Few such iterations will essentially train the governor to
>>> select a deeper state for that cpu, as the timer here corresponds to the
>>> next available cpuidle state residency. Cpu will be kicked out of the lite
>>> state and end up in a non-lite state.
>>>
>>> Experiment
>>> ----------
>>> I performed experiments for three scenarios to collect some data.
>>>
>>> case 1 :
>>> Without this patch and without tick retained, i.e. in a upstream kernel,
>>> It would spend more than even a second to get out of stop0_lite.
>>>
>>> case 2 : With tick retained in a upstream kernel -
>>>
>>> Generally, we have a sched tick at 4ms(CONF_HZ = 250). Ideally I expected
>>> it to take 8 sched tick to get out of stop0_lite. Experimentally,
>>> observation was
>>>
>>> =========================================================
>>> sample          min            max           99percentile
>>> 20              4ms            12ms          4ms
>>> =========================================================
>>>
>>> It would take atleast one sched tick to get out of stop0_lite.
>>>
>>> case 2 :  With this patch (not stopping tick, but explicitly queuing a
>>>            timer)
>>>
>>> ============================================================
>>> sample          min             max             99percentile
>>> ============================================================
>>> 20              144us           192us           144us
>>> ============================================================
>>>
>>> In this patch, we queue a timer just before entering into a stop0_lite
>>> state. The timer fires at (residency of next available state + exit latency
>>> of next available state * 2). Let's say if next state(stop0) is available
>>> which has residency of 20us, it should get out in as low as (20+2*2)*8
>>> [Based on the forumla (residency + 2xlatency)*history length] microseconds
>>> = 192us. Ideally we would expect 8 iterations, it was observed to get out
>>> in 6-7 iterations. Even if let's say stop2 is next available state(stop0
>>> and stop1 both are unavailable), it would take (100+2*10)*8 = 960us to get
>>> into stop2.
>>>
>>> So, We are able to get out of stop0_lite generally in 150us(with this
>>> patch) as compared to 4ms(with tick retained). As stated earlier, we do not
>>> want to get stuck into stop0_lite as it inhibits SMT folding for other
>>> sibling threads, depriving them of core resources. Current patch is using
>>> forced-wakeup only for stop0_lite, as it gives performance benefit(primary
>>> reason) along with lowering down power consumption. We may extend this
>>> model for other states in future.
>> I still have to wonder, between our snooze loop and stop0, what does
>> stop0_lite buy us.
>>
>> That said, the problem you're solving here is a generic one that all
>> stop states have, I think. Doesn't the same thing apply going from
>> stop0 to stop5? You might under estimate the sleep time and lose power
>> savings and therefore performance there too. Shouldn't we make it
>> generic for all stop states?
>>
>> Thanks,
>> Nick
>>
>>
> When a cpu is in snooze, it takes both space and time of core. When in 
> stop0_lite,
> it free up time but it still takes space.

True, but snooze should only be taking less than 1% of front end
cycles. I appreciate there is some non-zero difference here, I just
wonder in practice what exactly we gain by it.

We should always have fewer states unless proven otherwise.

That said, we enable it today so I don't want to argue this point
here, because it is a different issue from your patch.

> When it is in stop0 or deeper, 
> it free up both
> space and time slice of core.
> In stop0_lite, cpu doesn't free up the core resources and thus inhibits 
> thread
> folding. When a cpu goes to stop0, it will free up the core resources 
> thus increasing
> the single thread performance of other sibling thread.
> Hence, we do not want to get stuck in stop0_lite for long duration, and 
> want to quickly
> move onto the next state.
> If we get stuck in any other state we would possibly be losing on to 
> power saving,
> but will still be able to gain the performance benefits for other 
> sibling threads.

That's true, but stop0 -> deeper stop is also a benefit (for
performance if we have some power/thermal constraints, and/or for power
usage).

Sure it may not be so noticable as the SMT switch, but I just wonder
if the infrastructure should be there for the same reason.

I was testing interrupt frequency on some tickless workloads configs,
and without too much trouble you can get CPUs to sleep with no
interrupts for many minutes. Hours even. We wouldn't want the CPU to
stay in stop0 for that long.

Just thinking about the patch itself, I wonder do you need a full
kernel timer, or could we just set the decrementer? Is there much 
performance cost here?

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH] powerpc/64s: Make boot look nice(r)
From: Christophe Leroy @ 2019-05-16  4:47 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
In-Reply-To: <20190516020437.11783-1-npiggin@gmail.com>



Le 16/05/2019 à 04:04, Nicholas Piggin a écrit :
> Radix boot looks like this:
> 
>   -----------------------------------------------------
>   phys_mem_size     = 0x200000000
>   dcache_bsize      = 0x80
>   icache_bsize      = 0x80
>   cpu_features      = 0x0000c06f8f5fb1a7
>     possible        = 0x0000fbffcf5fb1a7
>     always          = 0x00000003800081a1
>   cpu_user_features = 0xdc0065c2 0xaee00000
>   mmu_features      = 0xbc006041
>   firmware_features = 0x0000000010000000
>   hash-mmu: ppc64_pft_size    = 0x0
>   hash-mmu: kernel vmalloc start   = 0xc008000000000000
>   hash-mmu: kernel IO start        = 0xc00a000000000000
>   hash-mmu: kernel vmemmap start   = 0xc00c000000000000
>   -----------------------------------------------------
> 
> Fix:
> 
>   -----------------------------------------------------
>   phys_mem_size     = 0x200000000
>   dcache_bsize      = 0x80
>   icache_bsize      = 0x80
>   cpu_features      = 0x0000c06f8f5fb1a7
>     possible        = 0x0000fbffcf5fb1a7
>     always          = 0x00000003800081a1
>   cpu_user_features = 0xdc0065c2 0xaee00000
>   mmu_features      = 0xbc006041
>   firmware_features = 0x0000000010000000
>   vmalloc start     = 0xc008000000000000
>   IO start          = 0xc00a000000000000
>   vmemmap start     = 0xc00c000000000000
>   -----------------------------------------------------
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

I fear your change defeats most of the purpose of commit 
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20190515&id=e4dccf9092ab48a6f902003b9558c0e45d0e849a

As far as I understand, the main issue is the "hash-mmu:" prefix ?
That's due to the following define in top of book3s64/hash_utils.c:

#define pr_fmt(fmt) "hash-mmu: " fmt

Could we simply undef it just before print_system_hash_info() ?
Or move print_system_hash_info() in another book3s64 specific file which 
doesn't set pr_fmt ?

Christophe

> ---
>   arch/powerpc/kernel/setup-common.c    | 8 +++++++-
>   arch/powerpc/mm/book3s64/hash_utils.c | 3 ---
>   2 files changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
> index aad9f5df6ab6..f2da8c809c85 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -810,9 +810,15 @@ static __init void print_system_info(void)
>   	pr_info("mmu_features      = 0x%08x\n", cur_cpu_spec->mmu_features);
>   #ifdef CONFIG_PPC64
>   	pr_info("firmware_features = 0x%016lx\n", powerpc_firmware_features);
> +#ifdef CONFIG_PPC_BOOK3S
> +	pr_info("vmalloc start     = 0x%lx\n", KERN_VIRT_START);
> +	pr_info("IO start          = 0x%lx\n", KERN_IO_START);
> +	pr_info("vmemmap start     = 0x%lx\n", (unsigned long)vmemmap);
> +#endif
>   #endif
>   
> -	print_system_hash_info();
> +	if (!early_radix_enabled())
> +		print_system_hash_info();
>   
>   	if (PHYSICAL_START > 0)
>   		pr_info("physical_start    = 0x%llx\n",
> diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
> index 919a861a8ec0..8b307b796b83 100644
> --- a/arch/powerpc/mm/book3s64/hash_utils.c
> +++ b/arch/powerpc/mm/book3s64/hash_utils.c
> @@ -1954,7 +1954,4 @@ void __init print_system_hash_info(void)
>   
>   	if (htab_hash_mask)
>   		pr_info("htab_hash_mask    = 0x%lx\n", htab_hash_mask);
> -	pr_info("kernel vmalloc start   = 0x%lx\n", KERN_VIRT_START);
> -	pr_info("kernel IO start        = 0x%lx\n", KERN_IO_START);
> -	pr_info("kernel vmemmap start   = 0x%lx\n", (unsigned long)vmemmap);
>   }
> 

^ permalink raw reply

* Re: [PATCH v10 2/2] powerpc/64s: KVM update for reimplement book3s idle code in C
From: Nicholas Piggin @ 2019-05-16  4:43 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Gautham R . Shenoy, linuxppc-dev, kvm-ppc
In-Reply-To: <20190513064207.GA11679@blackberry>

Paul Mackerras's on May 13, 2019 4:42 pm:
> On Sun, Apr 28, 2019 at 09:45:15PM +1000, Nicholas Piggin wrote:
>> This is the KVM update to the new idle code. A few improvements:
>> 
>> - Idle sleepers now always return to caller rather than branch out
>>   to KVM first.
>> - This allows optimisations like very fast return to caller when no
>>   state has been lost.
>> - KVM no longer requires nap_state_lost because it controls NVGPR
>>   save/restore itself on the way in and out.
>> - The heavy idle wakeup KVM request check can be moved out of the
>>   normal host idle code and into the not-performance-critical offline
>>   code.
>> - KVM nap code now returns from where it is called, which makes the
>>   flow a bit easier to follow.
> 
> One question below...
> 
>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> index 58d0f1ba845d..f66191d8f841 100644
>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> ...
>> @@ -2656,6 +2662,9 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
>>  
>>  	lis	r3, LPCR_PECEDP@h	/* Do wake on privileged doorbell */
>>  
>> +	/* Go back to host stack */
>> +	ld	r1, HSTATE_HOST_R1(r13)
> 
> At this point we are in kvmppc_h_cede, which we branched to from
> hcall_try_real_mode, which came from the guest exit path, where we
> have already loaded r1 from HSTATE_HOST_R1(r13).  So if there is a
> path to get here with r1 not already set to HSTATE_HOST_R1(r13), then
> I missed it - please point it out to me.  Otherwise this statement
> seems superfluous.

I'm not sure why I put that there. I think you're right it could
be removed.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH] crypto: vmx - fix copy-paste error in CTR mode
From: Eric Biggers @ 2019-05-16  2:56 UTC (permalink / raw)
  To: Daniel Axtens
  Cc: leo.barbosa, nayna, Herbert Xu, Stephan Mueller, Nayna, omosnacek,
	marcelo.cerri, pfsmorigo, linux-crypto, leitao, George Wilson,
	linuxppc-dev
In-Reply-To: <871s0z171b.fsf@dja-thinkpad.axtens.net>

On Thu, May 16, 2019 at 12:12:48PM +1000, Daniel Axtens wrote:
> 
> I'm also seeing issues with ghash with the extended tests:
> 
> [    7.582926] alg: hash: p8_ghash test failed (wrong result) on test vector 0, cfg="random: use_final src_divs=[<reimport>9.72%@+39832, <reimport>18.2%@+65504, <reimport,nosimd>45.57%@alignmask+18, <reimport,nosimd>15.6%@+65496, 6.83%@+65514, <reimport,nosimd>1.2%@+25, <reim"
> 
> It seems to happen when one of the source divisions has nosimd and the
> final result uses the simd finaliser, so that's interesting.
> 

The bug is that p8_ghash uses different shash_descs for the SIMD and no-SIMD
cases.  So if you start out doing the hash in SIMD context but then switch to
no-SIMD context or vice versa, the digest will be wrong.  Note that there can be
an ->export() and ->import() in between, so it's not quite as obscure a case as
one might think.

To fix it I think you'll need to make p8_ghash use 'struct ghash_desc_ctx' just
like ghash-generic so that the two code paths can share the same shash_desc.
That's similar to what the various SHA hash algorithms do.

- Eric

^ permalink raw reply

* [PATCH 2/3] powerpc/pseries: Disable PRRN memory device tree trigger
From: Tyrel Datwyler @ 2019-05-16  2:37 UTC (permalink / raw)
  To: mpe; +Cc: nathanl, Tyrel Datwyler, Tyrel Datwyler, mingming.cao,
	linuxppc-dev
In-Reply-To: <20190516023706.50118-1-tyreld@linux.ibm.com>

Memory affintiy updates as currently implemented have proved unstable.

This patch comments out the PRRN hook for the time being while we
investigate how to either stablize the current implementation or find a
better approach.

Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/mobility.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index 88925f8ca8a0..660a2dbc43d7 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -242,13 +242,15 @@ static int add_dt_node(__be32 parent_phandle, __be32 drc_index)
 
 static void prrn_update_node(__be32 phandle)
 {
+	/* PRRN Memory Updates have proved unstable. Disable for the time being.
+	 *
 	struct pseries_hp_errorlog hp_elog;
 	struct device_node *dn;
 
-	/*
+	 *
 	 * If a node is found from a the given phandle, the phandle does not
 	 * represent the drc index of an LMB and we can ignore.
-	 */
+	 *
 	dn = of_find_node_by_phandle(be32_to_cpu(phandle));
 	if (dn) {
 		of_node_put(dn);
@@ -261,6 +263,7 @@ static void prrn_update_node(__be32 phandle)
 	hp_elog._drc_u.drc_index = phandle;
 
 	handle_dlpar_errorlog(&hp_elog);
+	*/
 }
 
 int pseries_devicetree_update(s32 scope)
-- 
2.18.1


^ permalink raw reply related

* [PATCH 3/3] powerpc/pseries: Don't update cpu topology after PRRN event
From: Tyrel Datwyler @ 2019-05-16  2:37 UTC (permalink / raw)
  To: mpe; +Cc: nathanl, Tyrel Datwyler, Tyrel Datwyler, mingming.cao,
	linuxppc-dev
In-Reply-To: <20190516023706.50118-1-tyreld@linux.ibm.com>

When we receive a PRRN event through the event-scan interface we call
pseries_devicetree_update() to update the affinty properties in our
device tree via RTAS. Following this our implemenation attempts to both
frob the existing kernel cpu numa affinities of the live system with the
new device tree properties while also performing a full cpu hotplug
readd of the affected cpus in resposne to the a OF property notifier
triggerd by the device tree update.

This patch does away with the topology update call to frob the
associativity since the DLPAR readd will put the cpu in the proper
numa node when its added back.

Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/rtasd.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
index 8a1746d755c9..d3aa3a056d8e 100644
--- a/arch/powerpc/kernel/rtasd.c
+++ b/arch/powerpc/kernel/rtasd.c
@@ -285,7 +285,6 @@ static void handle_prrn_event(s32 scope)
 	 * the RTAS event.
 	 */
 	pseries_devicetree_update(-scope);
-	numa_update_cpu_topology(false);
 }
 
 static void handle_rtas_event(const struct rtas_error_log *log)
-- 
2.18.1


^ permalink raw reply related

* [PATCH 1/3] powerpc/pseries: Simplify cpu readd to use drc_index
From: Tyrel Datwyler @ 2019-05-16  2:37 UTC (permalink / raw)
  To: mpe; +Cc: nathanl, Tyrel Datwyler, Tyrel Datwyler, mingming.cao,
	linuxppc-dev

The current dlpar_cpu_readd() takes in a cpu_id and uses that to look up
the cpus device_node so that we can get at the ibm,my-drc-index
property. The only user of cpu readd is an OF notifier call back. This
call back already has a reference to the device_node and therefore can
retrieve the drc_index from the device_node.

This patch simplifies dlpar_cpu_readd() to take a drc_index directly and
does away with an uneccsary device_node lookup.

Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/topology.h          |  2 +-
 arch/powerpc/mm/numa.c                       |  6 +++---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 10 +---------
 3 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f85e2b01c3df..c906d9ec9013 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -133,7 +133,7 @@ static inline void shared_proc_topology_init(void) {}
 #define topology_core_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
 #define topology_core_id(cpu)		(cpu_to_core_id(cpu))
 
-int dlpar_cpu_readd(int cpu);
+int dlpar_cpu_readd(u32 drc_index);
 #endif
 #endif
 
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 57e64273cb33..40c0b6da12c2 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1479,9 +1479,9 @@ static int dt_update_callback(struct notifier_block *nb,
 	case OF_RECONFIG_UPDATE_PROPERTY:
 		if (of_node_is_type(update->dn, "cpu") &&
 		    !of_prop_cmp(update->prop->name, "ibm,associativity")) {
-			u32 core_id;
-			of_property_read_u32(update->dn, "reg", &core_id);
-			rc = dlpar_cpu_readd(core_id);
+			u32 drc_index;
+			of_property_read_u32(update->dn, "ibm,my-drc-index", &drc_index);
+			rc = dlpar_cpu_readd(drc_index);
 			rc = NOTIFY_OK;
 		}
 		break;
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 97feb6e79f1a..2dfa9416ce54 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -802,18 +802,10 @@ static int dlpar_cpu_add_by_count(u32 cpus_to_add)
 	return rc;
 }
 
-int dlpar_cpu_readd(int cpu)
+int dlpar_cpu_readd(u32 drc_index)
 {
-	struct device_node *dn;
-	struct device *dev;
-	u32 drc_index;
 	int rc;
 
-	dev = get_cpu_device(cpu);
-	dn = dev->of_node;
-
-	rc = of_property_read_u32(dn, "ibm,my-drc-index", &drc_index);
-
 	rc = dlpar_cpu_remove_by_index(drc_index);
 	if (!rc)
 		rc = dlpar_cpu_add(drc_index);
-- 
2.18.1


^ permalink raw reply related

* Re: [PATCH] crypto: talitos - fix skcipher failure due to wrong output IV
From: Eric Biggers @ 2019-05-16  2:30 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Herbert Xu, Horia Geanta, linux-kernel@vger.kernel.org,
	linux-crypto@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	David S. Miller
In-Reply-To: <29db3f20-f931-efc6-02a8-fe160ab8b484@c-s.fr>

On Wed, May 15, 2019 at 08:49:48PM +0200, Christophe Leroy wrote:
> 
> 
> Le 15/05/2019 à 16:05, Horia Geanta a écrit :
> > On 5/15/2019 3:29 PM, Christophe Leroy wrote:
> > > Selftests report the following:
> > > 
> > > [    2.984845] alg: skcipher: cbc-aes-talitos encryption test failed (wrong output IV) on test vector 0, cfg="in-place"
> > > [    2.995377] 00000000: 3d af ba 42 9d 9e b4 30 b4 22 da 80 2c 9f ac 41
> > > [    3.032673] alg: skcipher: cbc-des-talitos encryption test failed (wrong output IV) on test vector 0, cfg="in-place"
> > > [    3.043185] 00000000: fe dc ba 98 76 54 32 10
> > > [    3.063238] alg: skcipher: cbc-3des-talitos encryption test failed (wrong output IV) on test vector 0, cfg="in-place"
> > > [    3.073818] 00000000: 7d 33 88 93 0f 93 b2 42
> > > 
> > > This above dumps show that the actual output IV is indeed the input IV.
> > > This is due to the IV not being copied back into the request.
> > > 
> > > This patch fixes that.
> > > 
> > > Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> > Reviewed-by: Horia Geantă <horia.geanta@nxp.com>
> 
> It's missing a Fixes: tag and a Cc: to stable.
> 
> I'll resend tomorrow.
> 
> > 
> > While here, could you please check ecb mode (which by definition does not have
> > an IV) is behaving correctly?
> > Looking in driver_algs[] list of crypto algorithms supported by talitos,
> > ecb(aes,des,3des) are declared with ivsize != 0.
> 
> According to /proc/crypto, test are passed for ecb.
> 

Did you try enabling CONFIG_CRYPTO_MANAGER_EXTRA_TESTS?  There is now a check
that the driver's ivsize matches the generic implementation's:

        if (ivsize != crypto_skcipher_ivsize(generic_tfm)) {
                pr_err("alg: skcipher: ivsize for %s (%u) doesn't match generic impl (%u)\n",
                       driver, ivsize, crypto_skcipher_ivsize(generic_tfm));
                err = -EINVAL;
                goto out;
        }

For ECB that means the ivsize must be 0.

AFAICS the talitos driver even accesses the IV for ECB, which is wrong; and the
only reason this isn't crashing the self-tests already is that they are confused
by the declared ivsize being nonzero so they don't pass NULL as they should.

- Eric

^ permalink raw reply

* Re: [PATCH] crypto: vmx - fix copy-paste error in CTR mode
From: Daniel Axtens @ 2019-05-16  2:12 UTC (permalink / raw)
  To: Herbert Xu
  Cc: leo.barbosa, nayna, Stephan Mueller, Nayna, omosnacek,
	Eric Biggers, marcelo.cerri, pfsmorigo, linux-crypto, leitao,
	George Wilson, linuxppc-dev
In-Reply-To: <874l5w1axv.fsf@dja-thinkpad.axtens.net>

Daniel Axtens <dja@axtens.net> writes:

> Herbert Xu <herbert@gondor.apana.org.au> writes:
>
>> On Wed, May 15, 2019 at 03:35:51AM +1000, Daniel Axtens wrote:
>>>
>>> By all means disable vmx ctr if I don't get an answer to you in a
>>> timeframe you are comfortable with, but I am going to at least try to
>>> have a look.
>>
>> I'm happy to give you guys more time.  How much time do you think
>> you will need?
>>
> Give me till the end of the week: if I haven't solved it by then I will
> probably have to give up and go on to other things anyway.

So as you've hopefully seen, I've nailed it down and posted a patch.
(http://patchwork.ozlabs.org/patch/1099934/)

I'm also seeing issues with ghash with the extended tests:

[    7.582926] alg: hash: p8_ghash test failed (wrong result) on test vector 0, cfg="random: use_final src_divs=[<reimport>9.72%@+39832, <reimport>18.2%@+65504, <reimport,nosimd>45.57%@alignmask+18, <reimport,nosimd>15.6%@+65496, 6.83%@+65514, <reimport,nosimd>1.2%@+25, <reim"

It seems to happen when one of the source divisions has nosimd and the
final result uses the simd finaliser, so that's interesting.

Regards,
Daniel

>
> (FWIW, it seems to happen when encoding greater than 4 but less than 8
> AES blocks - in particular with both 7 and 5 blocks encoded I can see it
> go wrong from block 4 onwards. No idea why yet, and the asm is pretty
> dense, but that's where I'm at at the moment.)
>
> Regards,
> Daniel
>
>> Thanks,
>> -- 
>> Email: Herbert Xu <herbert@gondor.apana.org.au>
>> Home Page: http://gondor.apana.org.au/~herbert/
>> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox