LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 19/31] riscv: use asm-generic/cacheflush.h
From: Christoph Hellwig @ 2020-05-13  6:23 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: linux-ia64, linux-sh, zippel, linux-mips, linux-mm, sparclinux,
	linux-riscv, Christoph Hellwig, linux-arch, linux-c6x-dev,
	linux-hexagon, x86, linux-xtensa, Arnd Bergmann, jeyu, linux-um,
	linux-m68k, openrisc, linux-arm-kernel, monstr, linux-kernel,
	linux-alpha, linux-fsdevel, akpm, linuxppc-dev
In-Reply-To: <mhng-8adbedbc-0f91-4291-9471-2df5eb7b802b@palmerdabbelt-glaptop1>

On Tue, May 12, 2020 at 04:00:26PM -0700, Palmer Dabbelt wrote:
> Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
>
> Were you trying to get these all in at once, or do you want me to take it into
> my tree?

Except for the small fixups at the beginning of the series this needs
to go in together.  I'll have to do at least another resend, and after
that I hope Andrew is going to pick it up.

^ permalink raw reply

* Re: [PATCH v8 5/5] powerpc/hv-24x7: Update post_mobility_fixup() to handle migration
From: kajoljain @ 2020-05-13  6:20 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: ravi.bangoria, maddy, anju, peterz, gregkh, suka,
	alexander.shishkin, mingo, mpetlan, yao.jin, ak, mamatha4, acme,
	jmario, namhyung, linuxppc-dev, jolsa, kan.liang
In-Reply-To: <87eerq2rcc.fsf@linux.ibm.com>



On 5/12/20 1:10 AM, Nathan Lynch wrote:
> Hello,
> 
> Kajol Jain <kjain@linux.ibm.com> writes:
>> Function 'read_sys_info_pseries()' is added to get system parameter
>> values like number of sockets and chips per socket.
>> and it gets these details via rtas_call with token
>> "PROCESSOR_MODULE_INFO".
>>
>> Incase lpar migrate from one system to another, system
>> parameter details like chips per sockets or number of sockets might
>> change. So, it needs to be re-initialized otherwise, these values
>> corresponds to previous system values.
>> This patch adds a call to 'read_sys_info_pseries()' from
>> 'post-mobility_fixup()' to re-init the physsockets and physchips values
>>
>> Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
>> ---
>>  arch/powerpc/platforms/pseries/mobility.c | 16 ++++++++++++++++
>>  1 file changed, 16 insertions(+)
> 
> Please cc me on patches for this code, thanks.

Hi Nathan,
	Thanks for reviewing the patch. I will cc you on next version of this patchset.
> 
> I see no technical problems with how this patch handles partition
> migration. However:
> 
> "Update post_mobility_fixup() to handle migration" is not an appropriate
> summary for this change. post_mobility_fixup() already handles
> migration. A better summary would be
> 
> "powerpc/pseries: update hv-24x7 info after migration"

Will update.

> 
> 
>> diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
>> index b571285f6c14..0fb8f1e6e9d2 100644
>> --- a/arch/powerpc/platforms/pseries/mobility.c
>> +++ b/arch/powerpc/platforms/pseries/mobility.c
>> @@ -42,6 +42,12 @@ struct update_props_workarea {
>>  #define MIGRATION_SCOPE	(1)
>>  #define PRRN_SCOPE -2
>>  
>> +#ifdef CONFIG_HV_PERF_CTRS
>> +void read_sys_info_pseries(void);
>> +#else
>> +static inline void read_sys_info_pseries(void) { }
>> +#endif
> 
> This should go in a header.
> 
> 
>>  static int mobility_rtas_call(int token, char *buf, s32 scope)
>>  {
>>  	int rc;
>> @@ -371,6 +377,16 @@ void post_mobility_fixup(void)
>>  	/* Possibly switch to a new RFI flush type */
>>  	pseries_setup_rfi_flush();
>>  
>> +	/*
>> +	 * In case an Lpar migrates from one system to another, system
>> +	 * parameter details like chips per sockets, cores per chip and
>> +	 * number of sockets details might change.
>> +	 * So, they needs to be re-initialized otherwise the
>> +	 * values will correspond to the previous system.
>> +	 * Call read_sys_info_pseries() to reinitialise the values.
>> +	 */
> 
> This is needlessly verbose; any literate reader of this code knows this
> is used immediately after resuming from a suspend (migration). If you
> give your hook a more descriptive name, the comment becomes unnecessary.
> 

Yes make sense, will update.

Thanks,
Kajol Jain

> 
>> +	read_sys_info_pseries();
>> +
> 

^ permalink raw reply

* Re: [PATCH 1/1] powerpc/rtas: Implement reentrant rtas call
From: Leonardo Bras @ 2020-05-13  5:23 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: Gautham R. Shenoy, Greg Kroah-Hartman, linux-kernel,
	Nicholas Piggin, Paul Mackerras, Nadav Amit, Thomas Gleixner,
	linuxppc-dev, Allison Randal
In-Reply-To: <87ftdb87jf.fsf@linux.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 191 bytes --]

v2: 
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200513044025.105379-2-leobras.c@gmail.com/

(Series:
http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=176534) 

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 862 bytes --]

^ permalink raw reply

* Re: [PATCH 08/12] mm: pgtable: add shortcuts for accessing kernel PMD and PTE
From: Mike Rapoport @ 2020-05-13  5:23 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Rich Felker, linux-ia64, linux-sh, Catalin Marinas,
	Heiko Carstens, linux-mips, Max Filippov, Guo Ren, linux-csky,
	sparclinux, linux-hexagon, linux-riscv, Vincent Chen, Will Deacon,
	Greg Ungerer, linux-arch, linux-s390, linux-c6x-dev, Brian Cain,
	Helge Deller, x86, Russell King, Ley Foon Tan, Mike Rapoport,
	Ingo Molnar, Geert Uytterhoeven, linux-parisc, Mark Salter,
	Matt Turner, linux-snps-arc, linux-xtensa, Arnd Bergmann,
	linux-alpha, linux-um, linux-m68k, Tony Luck, Borislav Petkov,
	Greentime Hu, Paul Walmsley, Stafford Horne, Guan Xuetao,
	linux-arm-kernel, Chris Zankel, Michal Simek, Thomas Bogendoerfer,
	Yoshinori Sato, Nick Hu, linux-mm, Vineet Gupta, linux-kernel,
	openrisc, Thomas Gleixner, Richard Weinberger, Andrew Morton,
	linuxppc-dev, David S. Miller
In-Reply-To: <20200512192441.GZ16070@bombadil.infradead.org>

On Tue, May 12, 2020 at 12:24:41PM -0700, Matthew Wilcox wrote:
> On Tue, May 12, 2020 at 09:44:18PM +0300, Mike Rapoport wrote:
> > +++ b/include/linux/pgtable.h
> > @@ -28,6 +28,24 @@
> >  #define USER_PGTABLES_CEILING	0UL
> >  #endif
> >  
> > +/* FIXME: */
> 
> Fix you what?  Add documentation?

Ouch, indeed :)

> > +static inline pmd_t *pmd_off(struct mm_struct *mm, unsigned long va)
> > +{
> > +	return pmd_offset(pud_offset(p4d_offset(pgd_offset(mm, va), va), va), va);
> > +}

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH 03/12] mm: reorder includes after introduction of linux/pgtable.h
From: Mike Rapoport @ 2020-05-13  5:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Rich Felker, linux-ia64, linux-sh, Catalin Marinas,
	Heiko Carstens, linux-mips, Max Filippov, Guo Ren, linux-csky,
	sparclinux, linux-hexagon, linux-riscv, Vincent Chen, Will Deacon,
	Greg Ungerer, linux-arch, linux-s390, linux-c6x-dev, Brian Cain,
	Helge Deller, x86, Russell King, Ley Foon Tan, Mike Rapoport,
	Ingo Molnar, Geert Uytterhoeven, linux-parisc, Mark Salter,
	Matt Turner, linux-snps-arc, linux-xtensa, Arnd Bergmann,
	linux-alpha, linux-um, linux-m68k, Tony Luck, Borislav Petkov,
	Greentime Hu, Paul Walmsley, Stafford Horne, Guan Xuetao,
	linux-arm-kernel, Chris Zankel, Michal Simek, Thomas Bogendoerfer,
	Yoshinori Sato, Nick Hu, linux-mm, Vineet Gupta, linux-kernel,
	openrisc, Thomas Gleixner, Richard Weinberger, Andrew Morton,
	linuxppc-dev, David S. Miller
In-Reply-To: <20200512192013.GY16070@bombadil.infradead.org>

On Tue, May 12, 2020 at 12:20:13PM -0700, Matthew Wilcox wrote:
> On Tue, May 12, 2020 at 09:44:13PM +0300, Mike Rapoport wrote:
> > diff --git a/arch/alpha/kernel/proto.h b/arch/alpha/kernel/proto.h
> > index a093cd45ec79..701a05090141 100644
> > --- a/arch/alpha/kernel/proto.h
> > +++ b/arch/alpha/kernel/proto.h
> > @@ -2,8 +2,6 @@
> >  #include <linux/interrupt.h>
> >  #include <linux/io.h>
> >  
> > -#include <linux/pgtable.h>
> > -
> >  /* Prototypes of functions used across modules here in this directory.  */
> >  
> >  #define vucp	volatile unsigned char  *
> 
> Looks like your script has a bug if linux/pgtable.h is the last include
> in the file?

Script indeed cannot handle all the corner case, but this is not one of
them.
I've started initially to look into removing asm/pgtable.h if it was not
needed, but I've run out of patience very soon. This file is what
sneaked in from that attempt.

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v2 1/1] powerpc/crash: Use NMI context for printk when starting to crash
From: Leonardo Bras @ 2020-05-13  5:16 UTC (permalink / raw)
  To: Nicholas Piggin, Alexios Zavras, Benjamin Herrenschmidt,
	Christophe Leroy, Greg Kroah-Hartman, Enrico Weigelt,
	Michael Ellerman, Paul Mackerras, Thomas Gleixner
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <1589344247.2akwhmzwhg.astroid@bobo.none>

Hello Nick, thanks for your feedback.
Comments inline:

On Wed, 2020-05-13 at 14:36 +1000, Nicholas Piggin wrote:
> Excerpts from Leonardo Bras's message of May 13, 2020 7:45 am:
> > Currently, if printk lock (logbuf_lock) is held by other thread during
> > crash, there is a chance of deadlocking the crash on next printk, and
> > blocking a possibly desired kdump.
> > 
> > At the start of default_machine_crash_shutdown, make printk enter
> > NMI context, as it will use per-cpu buffers to store the message,
> > and avoid locking logbuf_lock.
> 
> printk_nmi_enter is used in one other place outside nmi_enter.
> 
> Is there a different/better way to handle this? What do other 
> architectures do?

To be honest, I was unaware of nmi_enter() and I have yet to study what
other architectures do here.

> Other subsystems get put into an nmi-mode when we call nmi_enter
> (lockdep, ftrace, rcu etc). It seems like those would be useful for 
> similar reasons, so at least explaining why that is not used in a 
> comment would be good.

My reasoning for using printk_nmi_enter() here was only to keep it from
using printk regular buffer (and locking logbuf_lock) at this point of
the crash.

I have yet to see how nmi_enter() extra functions would happen to
interfere with the crash at this point. 

(In a quick look at x86, (native_machine_crash_shutdown) I could not
see it using any printk, so it may not be necessary).

> Aside from that, I welcome any effort to make our crashes more reliable
> so thanks for working on this stuff.
> 
> Thanks,
> Nick

Thank you, it means a lot.

Leonardo Bras


^ permalink raw reply

* Re: [PATCH 1/1] powerpc/rtas: Implement reentrant rtas call
From: Leonardo Bras @ 2020-05-13  4:31 UTC (permalink / raw)
  To: Nathan Lynch
  Cc: Gautham R. Shenoy, Greg Kroah-Hartman, linux-kernel,
	Nicholas Piggin, Paul Mackerras, Nadav Amit, Thomas Gleixner,
	linuxppc-dev, Allison Randal
In-Reply-To: <87ftdb87jf.fsf@linux.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 3666 bytes --]

Hello Nathan, thanks for the feedback!

On Fri, 2020-04-10 at 14:28 -0500, Nathan Lynch wrote:
> Leonardo Bras <leonardo@linux.ibm.com> writes:
> > Implement rtas_call_reentrant() for reentrant rtas-calls:
> > "ibm,int-on", "ibm,int-off",ibm,get-xive" and  "ibm,set-xive".
> > 
> > On LoPAPR Version 1.1 (March 24, 2016), from 7.3.10.1 to 7.3.10.4,
> > items 2 and 3 say:
> > 
> > 2 - For the PowerPC External Interrupt option: The * call must be
> > reentrant to the number of processors on the platform.
> > 3 - For the PowerPC External Interrupt option: The * argument call
> > buffer for each simultaneous call must be physically unique.
> > 
> > So, these rtas-calls can be called in a lockless way, if using
> > a different buffer for each call.
> > 

> From the language in the spec it's clear that these calls are intended
> to be reentrant with respect to themselves, but it's less clear to me
> that they are safe to call simultaneously with respect to each other or
> arbitrary other RTAS methods.

In my viewpoint, being reentrant to themselves, without being reentrant
to others would be very difficult to do, considering the way the
rtas_call is crafted to work.

I mean, I have no experience in rtas code, it's my viewpoint. In my
thoughts there is something like this:

common_path -> selects function by token -> reentrant function
					|-> non-reentrant function

If there is one function that is reentrant, it means the common_path
and function selection by token would need to be reentrant too.

> > This can be useful to avoid deadlocks in crashing, where rtas-calls are
> > needed, but some other thread crashed holding the rtas.lock.
> 
> Are these calls commonly used in the crash-handling path? Is this
> addressing a real issue you've seen?
> 

Yes, I noticed deadlocks during crashes, like this one:
#0 arch_spin_lock
#1  lock_rtas () 
#2  rtas_call (token=8204, nargs=1, nret=1, outputs=0x0)
#3  ics_rtas_mask_real_irq (hw_irq=4100) 
#4  machine_kexec_mask_interrupts
#5  default_machine_crash_shutdown
#6  machine_crash_shutdown 
#7  __crash_kexec
#8  crash_kexec
#9  oops_end

On ics_rtas_mask_real_irq() we have both ibm_int_off and ibm_set_xive,
so it makes sense to also add ibm_int_on and ibm_get_xive as reentrant
too.

Full discussion available on this thread:
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200401000020.590447-1-leonardo@linux.ibm.com/

> 
> > +/*
> > + * Used for reentrant rtas calls.
> > + * According to LoPAR documentation, only "ibm,int-on", "ibm,int-off",
> > + * "ibm,get-xive" and "ibm,set-xive" are currently reentrant.
> > + * Reentrant calls need their own rtas_args buffer, so not using rtas.args.
> > + */
> 
> Please use kernel-doc format in new code.

Sure, v2 is going to be fixed.

> 
> 
> > +int rtas_call_reentrant(int token, int nargs, int nret, int *outputs, ...)
> > +{
> > +	va_list list;
> > +	struct rtas_args rtas_args;
> > +
> > +	if (!rtas.entry || token == RTAS_UNKNOWN_SERVICE)
> > +		return -1;
> > +
> > +	va_start(list, outputs);
> > +	va_rtas_call_unlocked(&rtas_args, token, nargs, nret, list);
> > +	va_end(list);
> 
> No, I don't think you can place the RTAS argument buffer on the stack:
> 
>   7.2.7, Software Implementation Note:
>   | The OS must be aware that the effective address range for RTAS is 4
>   | GB when instantiated in 32-bit mode and the OS should not pass RTAS
>   | addresses or blocks of data which might fall outside of this range.

Agree, moved to PACA.

I will send a v2 soon, it will be a 2-patch patchset.

Best regards,
Leonardo Bras

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v4 2/2] powerpc/64s/hash: Add stress_hpt kernel boot option to increase hash faults
From: Nicholas Piggin @ 2020-05-13  4:50 UTC (permalink / raw)
  To: linuxppc-dev, Michael Ellerman
In-Reply-To: <20200511125825.3081305-2-mpe@ellerman.id.au>

Excerpts from Michael Ellerman's message of May 11, 2020 10:58 pm:
> +void hpt_do_stress(unsigned long ea, unsigned long access,
> +		   unsigned long rflags, unsigned long hpte_group)
> +{
> +	unsigned long last_group;
> +	int cpu = raw_smp_processor_id();
> +
> +	last_group = stress_hpt_last_group[cpu];
> +	if (last_group != -1UL) {
> +		while (mmu_hash_ops.hpte_remove(last_group) != -1)
> +			;

This seems to have issues causing hangs and livelocking, particularly on 
SMP guests. I think another CPU taking a fault and inserting an HPTE
into this group can get stuck when it has its entry removed before it can
return from the fault and load its TLB, so it faults again.

The hpte_remove hypercall must be slow enough on a guest that this loop 
doesn't finish before the other CPU comes in and puts another entry in 
there. Which explains why I didn't see it on bare metal.

Removing the loop doesn't end up generating a lot of faults because most 
HPTEs are userspace, so they end up overwhelming the kernel HPTE 
removal.

Using hpte_invalidate to invalidate the specific entry might be the go,
although it removes some element of randomness at least on PowerNV -- 
the kernel TLB is still there and will fault "some time" when the TLB 
entry drops out.

Maybe hpt stress should go into the hash implementation. I'm thinking 
about what to do.

Better drop this patch for now, but the SLB one should be good to go.

Thanks,
Nick

^ permalink raw reply

* [PATCH v2 2/2] powerpc/rtas: Implement reentrant rtas call
From: Leonardo Bras @ 2020-05-13  4:40 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Allison Randal, Leonardo Bras, Greg Kroah-Hartman,
	Thomas Gleixner, Nicholas Piggin, Nathan Lynch, Gautham R. Shenoy,
	Nadav Amit
  Cc: Leonardo Bras, linuxppc-dev, linux-kernel
In-Reply-To: <20200513044025.105379-1-leobras.c@gmail.com>

Implement rtas_call_reentrant() for reentrant rtas-calls:
"ibm,int-on", "ibm,int-off",ibm,get-xive" and  "ibm,set-xive".

On LoPAPR Version 1.1 (March 24, 2016), from 7.3.10.1 to 7.3.10.4,
items 2 and 3 say:

2 - For the PowerPC External Interrupt option: The * call must be
reentrant to the number of processors on the platform.
3 - For the PowerPC External Interrupt option: The * argument call
buffer for each simultaneous call must be physically unique.

So, these rtas-calls can be called in a lockless way, if using
a different buffer for each call.

This can be useful to avoid deadlocks in crashing, where rtas-calls are
needed, but some other thread crashed holding the rtas.lock.

This is an example backtrace of deadlock noticed:

  #0 arch_spin_lock
  #1  lock_rtas () 
  #2  rtas_call (token=8204, nargs=1, nret=1, outputs=0x0)
  #3  ics_rtas_mask_real_irq (hw_irq=4100) 
  #4  machine_kexec_mask_interrupts
  #5  default_machine_crash_shutdown
  #6  machine_crash_shutdown 
  #7  __crash_kexec
  #8  crash_kexec
  #9  oops_end


Signed-off-by: Leonardo Bras <leobras.c@gmail.com>

---
Changes since v1:
- Moved buffer from stack to PACA (as suggested by Paul Mackerras)
- Added missing output bits
- Improve documentation following kernel-doc format (as suggested by
  Nathan Lynch)
---
 arch/powerpc/include/asm/paca.h     |  2 ++
 arch/powerpc/include/asm/rtas.h     |  1 +
 arch/powerpc/kernel/rtas.c          | 42 +++++++++++++++++++++++++++++
 arch/powerpc/sysdev/xics/ics-rtas.c | 22 +++++++--------
 4 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index e3cc9eb9204d..5a76ba50b40f 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -29,6 +29,7 @@
 #include <asm/hmi.h>
 #include <asm/cpuidle.h>
 #include <asm/atomic.h>
+#include <asm/rtas-types.h>
 
 #include <asm-generic/mmiowb_types.h>
 
@@ -270,6 +271,7 @@ struct paca_struct {
 #ifdef CONFIG_MMIOWB
 	struct mmiowb_state mmiowb_state;
 #endif
+	struct rtas_args reentrant_args;
 } ____cacheline_aligned;
 
 extern void copy_mm_to_paca(struct mm_struct *mm);
diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index c35c5350b7e4..fa7509c85881 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -236,6 +236,7 @@ extern struct rtas_t rtas;
 extern int rtas_token(const char *service);
 extern int rtas_service_present(const char *service);
 extern int rtas_call(int token, int, int, int *, ...);
+int rtas_call_reentrant(int token, int nargs, int nret, int *outputs, ...);
 void rtas_call_unlocked(struct rtas_args *args, int token, int nargs,
 			int nret, ...);
 extern void __noreturn rtas_restart(char *cmd);
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index c5fa251b8950..d426b5c4856c 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -41,6 +41,7 @@
 #include <asm/time.h>
 #include <asm/mmu.h>
 #include <asm/topology.h>
+#include <asm/paca.h>
 
 /* This is here deliberately so it's only used in this file */
 void enter_rtas(unsigned long);
@@ -483,6 +484,47 @@ int rtas_call(int token, int nargs, int nret, int *outputs, ...)
 }
 EXPORT_SYMBOL(rtas_call);
 
+/**
+ * rtas_call_reentrant() - Used for reentrant rtas calls
+ * @token:	Token for desired reentrant RTAS call
+ * @nargs:	Number of Input Parameters
+ * @nret:	Number of Output Parameters
+ * @outputs:	Array of outputs
+ * @...:	Inputs for desired RTAS call
+ *
+ * According to LoPAR documentation, only "ibm,int-on", "ibm,int-off",
+ * "ibm,get-xive" and "ibm,set-xive" are currently reentrant.
+ * Reentrant calls need their own rtas_args buffer, so not using rtas.args, but
+ * PACA one instead.
+ *
+ * Return:	-1 on error,
+ *		First output value of RTAS call if (nret > 0),
+ *		0 otherwise,
+ */
+
+int rtas_call_reentrant(int token, int nargs, int nret, int *outputs, ...)
+{
+	va_list list;
+	struct rtas_args *args;
+	int i;
+
+	if (!rtas.entry || token == RTAS_UNKNOWN_SERVICE)
+		return -1;
+
+	/* We use the per-cpu (PACA) rtas args buffer */
+	args = &local_paca->reentrant_args;
+
+	va_start(list, outputs);
+	va_rtas_call_unlocked(args, token, nargs, nret, list);
+	va_end(list);
+
+	if (nret > 1 && outputs)
+		for (i = 0; i < nret - 1; ++i)
+			outputs[i] = be32_to_cpu(args->rets[i + 1]);
+
+	return (nret > 0) ? be32_to_cpu(args->rets[0]) : 0;
+}
+
 /* For RTAS_BUSY (-2), delay for 1 millisecond.  For an extended busy status
  * code of 990n, perform the hinted delay of 10^n (last digit) milliseconds.
  */
diff --git a/arch/powerpc/sysdev/xics/ics-rtas.c b/arch/powerpc/sysdev/xics/ics-rtas.c
index 6aabc74688a6..4cf18000f07c 100644
--- a/arch/powerpc/sysdev/xics/ics-rtas.c
+++ b/arch/powerpc/sysdev/xics/ics-rtas.c
@@ -50,8 +50,8 @@ static void ics_rtas_unmask_irq(struct irq_data *d)
 
 	server = xics_get_irq_server(d->irq, irq_data_get_affinity_mask(d), 0);
 
-	call_status = rtas_call(ibm_set_xive, 3, 1, NULL, hw_irq, server,
-				DEFAULT_PRIORITY);
+	call_status = rtas_call_reentrant(ibm_set_xive, 3, 1, NULL, hw_irq,
+					  server, DEFAULT_PRIORITY);
 	if (call_status != 0) {
 		printk(KERN_ERR
 			"%s: ibm_set_xive irq %u server %x returned %d\n",
@@ -60,7 +60,7 @@ static void ics_rtas_unmask_irq(struct irq_data *d)
 	}
 
 	/* Now unmask the interrupt (often a no-op) */
-	call_status = rtas_call(ibm_int_on, 1, 1, NULL, hw_irq);
+	call_status = rtas_call_reentrant(ibm_int_on, 1, 1, NULL, hw_irq);
 	if (call_status != 0) {
 		printk(KERN_ERR "%s: ibm_int_on irq=%u returned %d\n",
 			__func__, hw_irq, call_status);
@@ -91,7 +91,7 @@ static void ics_rtas_mask_real_irq(unsigned int hw_irq)
 	if (hw_irq == XICS_IPI)
 		return;
 
-	call_status = rtas_call(ibm_int_off, 1, 1, NULL, hw_irq);
+	call_status = rtas_call_reentrant(ibm_int_off, 1, 1, NULL, hw_irq);
 	if (call_status != 0) {
 		printk(KERN_ERR "%s: ibm_int_off irq=%u returned %d\n",
 			__func__, hw_irq, call_status);
@@ -99,8 +99,8 @@ static void ics_rtas_mask_real_irq(unsigned int hw_irq)
 	}
 
 	/* Have to set XIVE to 0xff to be able to remove a slot */
-	call_status = rtas_call(ibm_set_xive, 3, 1, NULL, hw_irq,
-				xics_default_server, 0xff);
+	call_status = rtas_call_reentrant(ibm_set_xive, 3, 1, NULL, hw_irq,
+					  xics_default_server, 0xff);
 	if (call_status != 0) {
 		printk(KERN_ERR "%s: ibm_set_xive(0xff) irq=%u returned %d\n",
 			__func__, hw_irq, call_status);
@@ -131,7 +131,7 @@ static int ics_rtas_set_affinity(struct irq_data *d,
 	if (hw_irq == XICS_IPI || hw_irq == XICS_IRQ_SPURIOUS)
 		return -1;
 
-	status = rtas_call(ibm_get_xive, 1, 3, xics_status, hw_irq);
+	status = rtas_call_reentrant(ibm_get_xive, 1, 3, xics_status, hw_irq);
 
 	if (status) {
 		printk(KERN_ERR "%s: ibm,get-xive irq=%u returns %d\n",
@@ -146,8 +146,8 @@ static int ics_rtas_set_affinity(struct irq_data *d,
 		return -1;
 	}
 
-	status = rtas_call(ibm_set_xive, 3, 1, NULL,
-			   hw_irq, irq_server, xics_status[1]);
+	status = rtas_call_reentrant(ibm_set_xive, 3, 1, NULL,
+				     hw_irq, irq_server, xics_status[1]);
 
 	if (status) {
 		printk(KERN_ERR "%s: ibm,set-xive irq=%u returns %d\n",
@@ -179,7 +179,7 @@ static int ics_rtas_map(struct ics *ics, unsigned int virq)
 		return -EINVAL;
 
 	/* Check if RTAS knows about this interrupt */
-	rc = rtas_call(ibm_get_xive, 1, 3, status, hw_irq);
+	rc = rtas_call_reentrant(ibm_get_xive, 1, 3, status, hw_irq);
 	if (rc)
 		return -ENXIO;
 
@@ -198,7 +198,7 @@ static long ics_rtas_get_server(struct ics *ics, unsigned long vec)
 {
 	int rc, status[2];
 
-	rc = rtas_call(ibm_get_xive, 1, 3, status, vec);
+	rc = rtas_call_reentrant(ibm_get_xive, 1, 3, status, vec);
 	if (rc)
 		return -1;
 	return status[0];
-- 
2.25.4


^ permalink raw reply related

* [PATCH v2 1/2] powerpc/rtas: Move type/struct definitions from rtas.h into rtas-types.h
From: Leonardo Bras @ 2020-05-13  4:40 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Allison Randal, Leonardo Bras, Greg Kroah-Hartman,
	Thomas Gleixner, Nicholas Piggin, Nathan Lynch, Gautham R. Shenoy,
	Nadav Amit
  Cc: Leonardo Bras, linuxppc-dev, linux-kernel

In order to get any rtas* struct into other headers, including rtas.h
may cause a lot of errors, regarding include dependency needed for
inline functions.

Create rtas-types.h and move there all type/struct definitions
from rtas.h, then include rtas-types.h into rtas.h.

Also, as suggested by checkpath.pl, replace uint8_t for u8, and keep
the same type pattern for the whole file, as they are the same
according to powerpc/boot/types.h.

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
---
 arch/powerpc/include/asm/rtas-types.h | 124 ++++++++++++++++++++++++++
 arch/powerpc/include/asm/rtas.h       | 118 +-----------------------
 2 files changed, 125 insertions(+), 117 deletions(-)
 create mode 100644 arch/powerpc/include/asm/rtas-types.h

diff --git a/arch/powerpc/include/asm/rtas-types.h b/arch/powerpc/include/asm/rtas-types.h
new file mode 100644
index 000000000000..59b0b4b25b7a
--- /dev/null
+++ b/arch/powerpc/include/asm/rtas-types.h
@@ -0,0 +1,124 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef _POWERPC_RTAS_TYPES_H
+#define _POWERPC_RTAS_TYPES_H
+#ifdef __KERNEL__
+
+typedef __be32 rtas_arg_t;
+
+struct rtas_args {
+	__be32 token;
+	__be32 nargs;
+	__be32 nret;
+	rtas_arg_t args[16];
+	rtas_arg_t *rets;     /* Pointer to return values in args[]. */
+};
+
+struct rtas_t {
+	unsigned long entry;		/* physical address pointer */
+	unsigned long base;		/* physical address pointer */
+	unsigned long size;
+	arch_spinlock_t lock;
+	struct rtas_args args;
+	struct device_node *dev;	/* virtual address pointer */
+};
+
+struct rtas_suspend_me_data {
+	atomic_t working; /* number of cpus accessing this struct */
+	atomic_t done;
+	int token; /* ibm,suspend-me */
+	atomic_t error;
+	struct completion *complete; /* wait on this until working == 0 */
+};
+
+struct rtas_error_log {
+	/* Byte 0 */
+	u8		byte0;			/* Architectural version */
+
+	/* Byte 1 */
+	u8		byte1;
+	/* XXXXXXXX
+	 * XXX		3: Severity level of error
+	 *    XX	2: Degree of recovery
+	 *      X	1: Extended log present?
+	 *       XX	2: Reserved
+	 */
+
+	/* Byte 2 */
+	u8		byte2;
+	/* XXXXXXXX
+	 * XXXX		4: Initiator of event
+	 *     XXXX	4: Target of failed operation
+	 */
+	u8		byte3;			/* General event or error*/
+	__be32		extended_log_length;	/* length in bytes */
+	unsigned char	buffer[1];		/* Start of extended log */
+						/* Variable length.      */
+};
+
+/* RTAS general extended event log, Version 6. The extended log starts
+ * from "buffer" field of struct rtas_error_log defined above.
+ */
+struct rtas_ext_event_log_v6 {
+	/* Byte 0 */
+	u8 byte0;
+	/* XXXXXXXX
+	 * X		1: Log valid
+	 *  X		1: Unrecoverable error
+	 *   X		1: Recoverable (correctable or successfully retried)
+	 *    X		1: Bypassed unrecoverable error (degraded operation)
+	 *     X	1: Predictive error
+	 *      X	1: "New" log (always 1 for data returned from RTAS)
+	 *       X	1: Big Endian
+	 *        X	1: Reserved
+	 */
+
+	/* Byte 1 */
+	u8 byte1;			/* reserved */
+
+	/* Byte 2 */
+	u8 byte2;
+	/* XXXXXXXX
+	 * X		1: Set to 1 (indicating log is in PowerPC format)
+	 *  XXX		3: Reserved
+	 *     XXXX	4: Log format used for bytes 12-2047
+	 */
+
+	/* Byte 3 */
+	u8 byte3;			/* reserved */
+	/* Byte 4-11 */
+	u8 reserved[8];			/* reserved */
+	/* Byte 12-15 */
+	__be32  company_id;		/* Company ID of the company	*/
+					/* that defines the format for	*/
+					/* the vendor specific log type	*/
+	/* Byte 16-end of log */
+	u8 vendor_log[1];		/* Start of vendor specific log	*/
+					/* Variable length.		*/
+};
+
+/* Vendor specific Platform Event Log Format, Version 6, section header */
+struct pseries_errorlog {
+	__be16 id;			/* 0x00 2-byte ASCII section ID	*/
+	__be16 length;			/* 0x02 Section length in bytes	*/
+	u8 version;			/* 0x04 Section version		*/
+	u8 subtype;			/* 0x05 Section subtype		*/
+	__be16 creator_component;	/* 0x06 Creator component ID	*/
+	u8 data[];			/* 0x08 Start of section data	*/
+};
+
+/* RTAS pseries hotplug errorlog section */
+struct pseries_hp_errorlog {
+	u8	resource;
+	u8	action;
+	u8	id_type;
+	u8	reserved;
+	union {
+		__be32	drc_index;
+		__be32	drc_count;
+		struct { __be32 count, index; } ic;
+		char	drc_name[1];
+	} _drc_u;
+};
+
+#endif /* __KERNEL__ */
+#endif /* _POWERPC_RTAS_TYPES_H */
diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 3c1887351c71..c35c5350b7e4 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -5,6 +5,7 @@
 
 #include <linux/spinlock.h>
 #include <asm/page.h>
+#include <asm/rtas-types.h>
 #include <linux/time.h>
 #include <linux/cpumask.h>
 
@@ -42,33 +43,6 @@
  *
  */
 
-typedef __be32 rtas_arg_t;
-
-struct rtas_args {
-	__be32 token;
-	__be32 nargs;
-	__be32 nret; 
-	rtas_arg_t args[16];
-	rtas_arg_t *rets;     /* Pointer to return values in args[]. */
-};  
-
-struct rtas_t {
-	unsigned long entry;		/* physical address pointer */
-	unsigned long base;		/* physical address pointer */
-	unsigned long size;
-	arch_spinlock_t lock;
-	struct rtas_args args;
-	struct device_node *dev;	/* virtual address pointer */
-};
-
-struct rtas_suspend_me_data {
-	atomic_t working; /* number of cpus accessing this struct */
-	atomic_t done;
-	int token; /* ibm,suspend-me */
-	atomic_t error;
-	struct completion *complete; /* wait on this until working == 0 */
-};
-
 /* RTAS event classes */
 #define RTAS_INTERNAL_ERROR		0x80000000 /* set bit 0 */
 #define RTAS_EPOW_WARNING		0x40000000 /* set bit 1 */
@@ -148,31 +122,6 @@ struct rtas_suspend_me_data {
 /* RTAS check-exception vector offset */
 #define RTAS_VECTOR_EXTERNAL_INTERRUPT	0x500
 
-struct rtas_error_log {
-	/* Byte 0 */
-	uint8_t		byte0;			/* Architectural version */
-
-	/* Byte 1 */
-	uint8_t		byte1;
-	/* XXXXXXXX
-	 * XXX		3: Severity level of error
-	 *    XX	2: Degree of recovery
-	 *      X	1: Extended log present?
-	 *       XX	2: Reserved
-	 */
-
-	/* Byte 2 */
-	uint8_t		byte2;
-	/* XXXXXXXX
-	 * XXXX		4: Initiator of event
-	 *     XXXX	4: Target of failed operation
-	 */
-	uint8_t		byte3;			/* General event or error*/
-	__be32		extended_log_length;	/* length in bytes */
-	unsigned char	buffer[1];		/* Start of extended log */
-						/* Variable length.      */
-};
-
 static inline uint8_t rtas_error_severity(const struct rtas_error_log *elog)
 {
 	return (elog->byte1 & 0xE0) >> 5;
@@ -212,47 +161,6 @@ uint32_t rtas_error_extended_log_length(const struct rtas_error_log *elog)
 
 #define RTAS_V6EXT_COMPANY_ID_IBM	(('I' << 24) | ('B' << 16) | ('M' << 8))
 
-/* RTAS general extended event log, Version 6. The extended log starts
- * from "buffer" field of struct rtas_error_log defined above.
- */
-struct rtas_ext_event_log_v6 {
-	/* Byte 0 */
-	uint8_t byte0;
-	/* XXXXXXXX
-	 * X		1: Log valid
-	 *  X		1: Unrecoverable error
-	 *   X		1: Recoverable (correctable or successfully retried)
-	 *    X		1: Bypassed unrecoverable error (degraded operation)
-	 *     X	1: Predictive error
-	 *      X	1: "New" log (always 1 for data returned from RTAS)
-	 *       X	1: Big Endian
-	 *        X	1: Reserved
-	 */
-
-	/* Byte 1 */
-	uint8_t byte1;			/* reserved */
-
-	/* Byte 2 */
-	uint8_t byte2;
-	/* XXXXXXXX
-	 * X		1: Set to 1 (indicating log is in PowerPC format)
-	 *  XXX		3: Reserved
-	 *     XXXX	4: Log format used for bytes 12-2047
-	 */
-
-	/* Byte 3 */
-	uint8_t byte3;			/* reserved */
-	/* Byte 4-11 */
-	uint8_t reserved[8];		/* reserved */
-	/* Byte 12-15 */
-	__be32  company_id;		/* Company ID of the company	*/
-					/* that defines the format for	*/
-					/* the vendor specific log type	*/
-	/* Byte 16-end of log */
-	uint8_t vendor_log[1];		/* Start of vendor specific log	*/
-					/* Variable length.		*/
-};
-
 static
 inline uint8_t rtas_ext_event_log_format(struct rtas_ext_event_log_v6 *ext_log)
 {
@@ -287,16 +195,6 @@ inline uint32_t rtas_ext_event_company_id(struct rtas_ext_event_log_v6 *ext_log)
 #define PSERIES_ELOG_SECT_ID_HOTPLUG		(('H' << 8) | 'P')
 #define PSERIES_ELOG_SECT_ID_MCE		(('M' << 8) | 'C')
 
-/* Vendor specific Platform Event Log Format, Version 6, section header */
-struct pseries_errorlog {
-	__be16 id;			/* 0x00 2-byte ASCII section ID	*/
-	__be16 length;			/* 0x02 Section length in bytes	*/
-	uint8_t version;		/* 0x04 Section version		*/
-	uint8_t subtype;		/* 0x05 Section subtype		*/
-	__be16 creator_component;	/* 0x06 Creator component ID	*/
-	uint8_t data[];			/* 0x08 Start of section data	*/
-};
-
 static
 inline uint16_t pseries_errorlog_id(struct pseries_errorlog *sect)
 {
@@ -309,20 +207,6 @@ inline uint16_t pseries_errorlog_length(struct pseries_errorlog *sect)
 	return be16_to_cpu(sect->length);
 }
 
-/* RTAS pseries hotplug errorlog section */
-struct pseries_hp_errorlog {
-	u8	resource;
-	u8	action;
-	u8	id_type;
-	u8	reserved;
-	union {
-		__be32	drc_index;
-		__be32	drc_count;
-		struct { __be32 count, index; } ic;
-		char	drc_name[1];
-	} _drc_u;
-};
-
 #define PSERIES_HP_ELOG_RESOURCE_CPU	1
 #define PSERIES_HP_ELOG_RESOURCE_MEM	2
 #define PSERIES_HP_ELOG_RESOURCE_SLOT	3
-- 
2.25.4


^ permalink raw reply related

* Re: [PATCH v2 1/1] powerpc/crash: Use NMI context for printk when starting to crash
From: Nicholas Piggin @ 2020-05-13  4:36 UTC (permalink / raw)
  To: Alexios Zavras, Benjamin Herrenschmidt, Christophe Leroy,
	Greg Kroah-Hartman, Enrico Weigelt, Leonardo Bras,
	Michael Ellerman, Paul Mackerras, Thomas Gleixner
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200512214533.93878-1-leobras.c@gmail.com>

Excerpts from Leonardo Bras's message of May 13, 2020 7:45 am:
> Currently, if printk lock (logbuf_lock) is held by other thread during
> crash, there is a chance of deadlocking the crash on next printk, and
> blocking a possibly desired kdump.
> 
> At the start of default_machine_crash_shutdown, make printk enter
> NMI context, as it will use per-cpu buffers to store the message,
> and avoid locking logbuf_lock.

printk_nmi_enter is used in one other place outside nmi_enter.

Is there a different/better way to handle this? What do other 
architectures do?

Other subsystems get put into an nmi-mode when we call nmi_enter
(lockdep, ftrace, rcu etc). It seems like those would be useful for 
similar reasons, so at least explaining why that is not used in a 
comment would be good.

Aside from that, I welcome any effort to make our crashes more reliable
so thanks for working on this stuff.

Thanks,
Nick

> 
> Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
> Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> 
> ---
> Changes since v1:
> - Added in-code comment explaining the need of context change
> - Function moved to the start of default_machine_crash_shutdown,
>   to avoid locking any printk on crashing routine.
> - Title was 'Use NMI context for printk after crashing other CPUs'
> 
> ---
>  arch/powerpc/kexec/crash.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
> index d488311efab1..c9a889880214 100644
> --- a/arch/powerpc/kexec/crash.c
> +++ b/arch/powerpc/kexec/crash.c
> @@ -311,6 +311,9 @@ void default_machine_crash_shutdown(struct pt_regs *regs)
>  	unsigned int i;
>  	int (*old_handler)(struct pt_regs *regs);
>  
> +	/* Avoid hardlocking with irresponsive CPU holding logbuf_lock */
> +	printk_nmi_enter();
> +
>  	/*
>  	 * This function is only called after the system
>  	 * has panicked or is otherwise in a critical state.
> -- 
> 2.25.4
> 
> 

^ permalink raw reply

* Re: [PATCH] powerpc/book3s64/radix/tlb: Determine hugepage flush correctly
From: Nicholas Piggin @ 2020-05-13  4:30 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linuxppc-dev, mpe; +Cc: Bharata B Rao
In-Reply-To: <20200513030616.152288-1-aneesh.kumar@linux.ibm.com>

Excerpts from Aneesh Kumar K.V's message of May 13, 2020 1:06 pm:
> With a 64K page size flush with start and end value as below
> (start, end) = (721f680d0000, 721f680e0000) results in
> (hstart, hend) = (721f68200000, 721f68000000)
> 
> Avoid doing a __tlbie_va_range with the wrong hstart and hend value in this
> case.
> 
> __tlbie_va_range will skip the actual tlbie operation for start > end.
> But we still end up doing the tlbie fixup.

Hm, good catch.

> Reported-by: Bharata B Rao <bharata@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/mm/book3s64/radix_tlb.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
> index 758ade2c2b6e..d592f9e1c037 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -884,10 +884,10 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
>  		if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>  			hstart = (start + PMD_SIZE - 1) & PMD_MASK;
>  			hend = end & PMD_MASK;
> -			if (hstart == hend)
> -				hflush = false;
> -			else
> +			if (hstart < hend)
>  				hflush = true;
> +			else
> +				hflush = false;

We can probably get rid of the else part since it is already false.

Otherwise

Reviewed-by: Nicholas Piggin <npiggin@gmail.com>


^ permalink raw reply

* Re: [PATCH] powerpc/kvm: silence kmemleak false positives
From: Michael Ellerman @ 2020-05-13  4:05 UTC (permalink / raw)
  To: Qian Cai; +Cc: linux-kernel, kvm-ppc, Qian Cai, catalin.marinas, linuxppc-dev
In-Reply-To: <20200509015538.3183-1-cai@lca.pw>

Qian Cai <cai@lca.pw> writes:
> kvmppc_pmd_alloc() and kvmppc_pte_alloc() allocate some memory but then
> pud_populate() and pmd_populate() will use __pa() to reference the newly
> allocated memory. The same is in xive_native_provision_pages().

Can you please split this into two patches, one for the KVM cases and
one for xive.

That way the KVM patch can go via the kvm-ppc tree, and I'll take the
xive one via powerpc.

> Since kmemleak is unable to track the physical memory resulting in false
> positives, silence those by using kmemleak_ignore().
>
> unreferenced object 0xc000201c382a1000 (size 4096):
>   comm "qemu-kvm", pid 124828, jiffies 4295733767 (age 341.250s)
>   hex dump (first 32 bytes):
>     c0 00 20 09 f4 60 03 87 c0 00 20 10 72 a0 03 87  .. ..`.... .r...
>     c0 00 20 0e 13 a0 03 87 c0 00 20 1b dc c0 03 87  .. ....... .....
>   backtrace:
>     [<000000004cc2790f>] kvmppc_create_pte+0x838/0xd20 [kvm_hv]
>     kvmppc_pmd_alloc at arch/powerpc/kvm/book3s_64_mmu_radix.c:366
>     (inlined by) kvmppc_create_pte at arch/powerpc/kvm/book3s_64_mmu_radix.c:590
>     [<00000000d123c49a>] kvmppc_book3s_instantiate_page+0x2e0/0x8c0 [kvm_hv]
>     [<00000000bb549087>] kvmppc_book3s_radix_page_fault+0x1b4/0x2b0 [kvm_hv]
>     [<0000000086dddc0e>] kvmppc_book3s_hv_page_fault+0x214/0x12a0 [kvm_hv]
>     [<000000005ae9ccc2>] kvmppc_vcpu_run_hv+0xc5c/0x15f0 [kvm_hv]
>     [<00000000d22162ff>] kvmppc_vcpu_run+0x34/0x48 [kvm]
>     [<00000000d6953bc4>] kvm_arch_vcpu_ioctl_run+0x314/0x420 [kvm]
>     [<000000002543dd54>] kvm_vcpu_ioctl+0x33c/0x950 [kvm]
>     [<0000000048155cd6>] ksys_ioctl+0xd8/0x130
>     [<0000000041ffeaa7>] sys_ioctl+0x28/0x40
>     [<000000004afc4310>] system_call_exception+0x114/0x1e0
>     [<00000000fb70a873>] system_call_common+0xf0/0x278
> unreferenced object 0xc0002001f0c03900 (size 256):
>   comm "qemu-kvm", pid 124830, jiffies 4295735235 (age 326.570s)
>   hex dump (first 32 bytes):
>     c0 00 20 10 fa a0 03 87 c0 00 20 10 fa a1 03 87  .. ....... .....
>     c0 00 20 10 fa a2 03 87 c0 00 20 10 fa a3 03 87  .. ....... .....
>   backtrace:
>     [<0000000023f675b8>] kvmppc_create_pte+0x854/0xd20 [kvm_hv]
>     kvmppc_pte_alloc at arch/powerpc/kvm/book3s_64_mmu_radix.c:356
>     (inlined by) kvmppc_create_pte at arch/powerpc/kvm/book3s_64_mmu_radix.c:593
>     [<00000000d123c49a>] kvmppc_book3s_instantiate_page+0x2e0/0x8c0 [kvm_hv]
>     [<00000000bb549087>] kvmppc_book3s_radix_page_fault+0x1b4/0x2b0 [kvm_hv]
>     [<0000000086dddc0e>] kvmppc_book3s_hv_page_fault+0x214/0x12a0 [kvm_hv]
>     [<000000005ae9ccc2>] kvmppc_vcpu_run_hv+0xc5c/0x15f0 [kvm_hv]
>     [<00000000d22162ff>] kvmppc_vcpu_run+0x34/0x48 [kvm]
>     [<00000000d6953bc4>] kvm_arch_vcpu_ioctl_run+0x314/0x420 [kvm]
>     [<000000002543dd54>] kvm_vcpu_ioctl+0x33c/0x950 [kvm]
>     [<0000000048155cd6>] ksys_ioctl+0xd8/0x130
>     [<0000000041ffeaa7>] sys_ioctl+0x28/0x40
>     [<000000004afc4310>] system_call_exception+0x114/0x1e0
>     [<00000000fb70a873>] system_call_common+0xf0/0x278
> unreferenced object 0xc000201b53e90000 (size 65536):
>   comm "qemu-kvm", pid 124557, jiffies 4295650285 (age 364.370s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>     [<00000000acc2fb77>] xive_native_alloc_vp_block+0x168/0x210
>     xive_native_provision_pages at arch/powerpc/sysdev/xive/native.c:645
>     (inlined by) xive_native_alloc_vp_block at arch/powerpc/sysdev/xive/native.c:674
>     [<000000004d5c7964>] kvmppc_xive_compute_vp_id+0x20c/0x3b0 [kvm]
>     [<0000000055317cd2>] kvmppc_xive_connect_vcpu+0xa4/0x4a0 [kvm]
>     [<0000000093dfc014>] kvm_arch_vcpu_ioctl+0x388/0x508 [kvm]
>     [<00000000d25aea0f>] kvm_vcpu_ioctl+0x15c/0x950 [kvm]
>     [<0000000048155cd6>] ksys_ioctl+0xd8/0x130
>     [<0000000041ffeaa7>] sys_ioctl+0x28/0x40
>     [<000000004afc4310>] system_call_exception+0x114/0x1e0
>     [<00000000fb70a873>] system_call_common+0xf0/0x278
>
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>  arch/powerpc/kvm/book3s_64_mmu_radix.c | 16 ++++++++++++++--
>  arch/powerpc/sysdev/xive/native.c      |  4 ++++
>  2 files changed, 18 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index aa12cd4078b3..bc6c1aa3d0e9 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -353,7 +353,13 @@ static struct kmem_cache *kvm_pmd_cache;

This should probably also have an include of <linux/kmemleak.h> ?

>  static pte_t *kvmppc_pte_alloc(void)
>  {
> -	return kmem_cache_alloc(kvm_pte_cache, GFP_KERNEL);
> +	pte_t *pte;
> +
> +	pte = kmem_cache_alloc(kvm_pte_cache, GFP_KERNEL);
> +	/* pmd_populate() will only reference _pa(pte). */
> +	kmemleak_ignore(pte);
> +
> +	return pte;
>  }
>  
>  static void kvmppc_pte_free(pte_t *ptep)


cheers

^ permalink raw reply

* Re: [PATCH] powerpc/kvm: silence kmemleak false positives
From: Michael Ellerman @ 2020-05-13  4:00 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: linux-kernel, kvm-ppc, Qian Cai, linuxppc-dev
In-Reply-To: <20200511112829.GB19176@gaia>

Catalin Marinas <catalin.marinas@arm.com> writes:
> On Mon, May 11, 2020 at 09:15:55PM +1000, Michael Ellerman wrote:
>> Qian Cai <cai@lca.pw> writes:
>> > kvmppc_pmd_alloc() and kvmppc_pte_alloc() allocate some memory but then
>> > pud_populate() and pmd_populate() will use __pa() to reference the newly
>> > allocated memory. The same is in xive_native_provision_pages().
>> >
>> > Since kmemleak is unable to track the physical memory resulting in false
>> > positives, silence those by using kmemleak_ignore().
>> 
>> There is kmemleak_alloc_phys(), which according to the docs can be used
>> for tracking a phys address.
>
> This won't help. While kmemleak_alloc_phys() allows passing a physical
> address, it doesn't track physical address references to this object. It
> still expects VA pointing to it, otherwise the object would be reported
> as a leak.

OK, thanks for clarifying that.

> We currently only call this from the memblock code with a min_count of
> 0, meaning it will not be reported as a leak if no references are found.
>
> We don't have this issue with page tables on other architectures since
> most of them use whole page allocations which aren't tracked by
> kmemleak. These powerpc functions use kmem_cache_alloc() which would be
> tracked automatically by kmemleak. While we could add a phys alias to
> kmemleak (another search tree), I think the easiest is as per Qian's
> patch, just ignore those objects.

Agreed.

cheers

^ permalink raw reply

* Re: [PATCH v4 2/3] powerpc/numa: Prefer node id queried from vphn
From: Gautham R Shenoy @ 2020-05-13  3:58 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Gautham R Shenoy, Michal Hocko, David Hildenbrand, Linus Torvalds,
	linux-kernel, linux-mm, Satheesh Rajendran, Mel Gorman,
	Kirill A. Shutemov, Andrew Morton, linuxppc-dev,
	Christopher Lameter, Vlastimil Babka
In-Reply-To: <20200512132937.19295-3-srikar@linux.vnet.ibm.com>

On Tue, May 12, 2020 at 06:59:36PM +0530, Srikar Dronamraju wrote:
> Node id queried from the static device tree may not
> be correct. For example: it may always show 0 on a shared processor.
> Hence prefer the node id queried from vphn and fallback on the device tree
> based node id if vphn query fails.
> 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Looks good to me.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

> ---
> Changelog v2:->v3:
> - Resolved comments from Gautham.
> Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u
> 
> Changelog v1:->v2:
> - Rebased to v5.7-rc3
> 
>  arch/powerpc/mm/numa.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index b3615b7..2815313 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -719,20 +719,20 @@ static int __init parse_numa_properties(void)
>  	 */
>  	for_each_present_cpu(i) {
>  		struct device_node *cpu;
> -		int nid;
> -
> -		cpu = of_get_cpu_node(i, NULL);
> -		BUG_ON(!cpu);
> -		nid = of_node_to_nid_single(cpu);
> -		of_node_put(cpu);
> +		int nid = vphn_get_nid(i);
> 
>  		/*
>  		 * Don't fall back to default_nid yet -- we will plug
>  		 * cpus into nodes once the memory scan has discovered
>  		 * the topology.
>  		 */
> -		if (nid < 0)
> -			continue;
> -		node_set_online(nid);
> +		if (nid == NUMA_NO_NODE) {
> +			cpu = of_get_cpu_node(i, NULL);
> +			BUG_ON(!cpu);
> +			nid = of_node_to_nid_single(cpu);
> +			of_node_put(cpu);
> +		}
> +
> +		if (likely(nid > 0))
> +			node_set_online(nid);
>  	}
> 
>  	get_n_mem_cells(&n_mem_addr_cells, &n_mem_size_cells);
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* [PATCH v2 5/5] powerpc/pmem: Avoid the barrier in flush routines
From: Aneesh Kumar K.V @ 2020-05-13  3:47 UTC (permalink / raw)
  To: linuxppc-dev, mpe, linux-nvdimm
  Cc: alistair, dan.j.williams, oohall, Aneesh Kumar K.V
In-Reply-To: <20200513034705.172983-1-aneesh.kumar@linux.ibm.com>

nvdimm expect the flush routines to just mark the cache clean. The barrier
that mark the store globally visible is done in nvdimm_flush().

Update the papr_scm driver to a simplified nvdim_flush callback that do
only the required barrier.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/lib/pmem.c                   | 34 +++++++++++++++++------
 arch/powerpc/platforms/pseries/papr_scm.c | 13 +++++++++
 2 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c
index 076d75efb57a..3ef15cfa925b 100644
--- a/arch/powerpc/lib/pmem.c
+++ b/arch/powerpc/lib/pmem.c
@@ -9,7 +9,7 @@
 
 #include <asm/cacheflush.h>
 
-static inline void clean_pmem_range_isa310(unsigned long start, unsigned long stop)
+static inline void __clean_pmem_range(unsigned long start, unsigned long stop)
 {
 	unsigned long shift = l1_dcache_shift();
 	unsigned long bytes = l1_dcache_bytes();
@@ -18,13 +18,22 @@ static inline void clean_pmem_range_isa310(unsigned long start, unsigned long st
 	unsigned long i;
 
 	for (i = 0; i < size >> shift; i++, addr += bytes)
-		asm volatile(PPC_DCBSTPS(%0, %1): :"i"(0), "r"(addr): "memory");
+		dcbf(addr);
+}
 
+static inline void __flush_pmem_range(unsigned long start, unsigned long stop)
+{
+	unsigned long shift = l1_dcache_shift();
+	unsigned long bytes = l1_dcache_bytes();
+	void *addr = (void *)(start & ~(bytes - 1));
+	unsigned long size = stop - (unsigned long)addr + (bytes - 1);
+	unsigned long i;
 
-	asm volatile(PPC_PHWSYNC ::: "memory");
+	for (i = 0; i < size >> shift; i++, addr += bytes)
+		dcbf(addr);
 }
 
-static inline void flush_pmem_range_isa310(unsigned long start, unsigned long stop)
+static inline void clean_pmem_range_isa310(unsigned long start, unsigned long stop)
 {
 	unsigned long shift = l1_dcache_shift();
 	unsigned long bytes = l1_dcache_bytes();
@@ -33,24 +42,33 @@ static inline void flush_pmem_range_isa310(unsigned long start, unsigned long st
 	unsigned long i;
 
 	for (i = 0; i < size >> shift; i++, addr += bytes)
-		asm volatile(PPC_DCBFPS(%0, %1): :"i"(0), "r"(addr): "memory");
+		asm volatile(PPC_DCBSTPS(%0, %1): :"i"(0), "r"(addr): "memory");
+}
 
+static inline void flush_pmem_range_isa310(unsigned long start, unsigned long stop)
+{
+	unsigned long shift = l1_dcache_shift();
+	unsigned long bytes = l1_dcache_bytes();
+	void *addr = (void *)(start & ~(bytes - 1));
+	unsigned long size = stop - (unsigned long)addr + (bytes - 1);
+	unsigned long i;
 
-	asm volatile(PPC_PHWSYNC ::: "memory");
+	for (i = 0; i < size >> shift; i++, addr += bytes)
+		asm volatile(PPC_DCBFPS(%0, %1): :"i"(0), "r"(addr): "memory");
 }
 
 static inline void clean_pmem_range(unsigned long start, unsigned long stop)
 {
 	if (cpu_has_feature(CPU_FTR_ARCH_31))
 		return clean_pmem_range_isa310(start, stop);
-	return flush_dcache_range(start, stop);
+	return __clean_pmem_range(start, stop);
 }
 
 static inline void flush_pmem_range(unsigned long start, unsigned long stop)
 {
 	if (cpu_has_feature(CPU_FTR_ARCH_31))
 		return flush_pmem_range_isa310(start, stop);
-	return flush_dcache_range(start, stop);
+	return __flush_pmem_range(start, stop);
 }
 
 /*
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index f35592423380..ad506e7003c9 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -285,6 +285,18 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc,
 
 	return 0;
 }
+/*
+ * We have made sure the pmem writes are done such that before calling this
+ * all the caches are flushed/clean. We use dcbf/dcbfps to ensure this. Here
+ * we just need to add the necessary barrier to make sure the above flushes
+ * are have updated persistent storage before any data access or data transfer
+ * caused by subsequent instructions is initiated.
+ */
+static int papr_scm_flush_sync(struct nd_region *nd_region, struct bio *bio)
+{
+	arch_pmem_flush_barrier();
+	return 0;
+}
 
 static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 {
@@ -340,6 +352,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 	ndr_desc.mapping = &mapping;
 	ndr_desc.num_mappings = 1;
 	ndr_desc.nd_set = &p->nd_set;
+	ndr_desc.flush = papr_scm_flush_sync;
 
 	if (p->is_volatile)
 		p->region = nvdimm_volatile_region_create(p->bus, &ndr_desc);
-- 
2.26.2


^ permalink raw reply related

* [PATCH v2 4/5] powerpc/pmem/of_pmem: Update of_pmem to use the new barrier instruction.
From: Aneesh Kumar K.V @ 2020-05-13  3:47 UTC (permalink / raw)
  To: linuxppc-dev, mpe, linux-nvdimm
  Cc: alistair, dan.j.williams, oohall, Aneesh Kumar K.V
In-Reply-To: <20200513034705.172983-1-aneesh.kumar@linux.ibm.com>

of_pmem on POWER10 can now use phwsync instead of hwsync to ensure
all previous writes are architecturally visible for the platform
buffer flush.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/cacheflush.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/include/asm/cacheflush.h b/arch/powerpc/include/asm/cacheflush.h
index e92191b390f3..f22057dc9dd0 100644
--- a/arch/powerpc/include/asm/cacheflush.h
+++ b/arch/powerpc/include/asm/cacheflush.h
@@ -119,6 +119,16 @@ static inline void invalidate_dcache_range(unsigned long start,
 #define copy_from_user_page(vma, page, vaddr, dst, src, len) \
 	memcpy(dst, src, len)
 
+
+#define arch_pmem_flush_barrier arch_pmem_flush_barrier
+static inline void  arch_pmem_flush_barrier(void)
+{
+	if (cpu_has_feature(CPU_FTR_ARCH_31))
+		asm volatile(PPC_PHWSYNC ::: "memory");
+	else
+		asm volatile("hwsync" ::: "memory");
+}
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_POWERPC_CACHEFLUSH_H */
-- 
2.26.2


^ permalink raw reply related

* [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier
From: Aneesh Kumar K.V @ 2020-05-13  3:47 UTC (permalink / raw)
  To: linuxppc-dev, mpe, linux-nvdimm
  Cc: alistair, dan.j.williams, oohall, Aneesh Kumar K.V
In-Reply-To: <20200513034705.172983-1-aneesh.kumar@linux.ibm.com>

Architectures like ppc64 provide persistent memory specific barriers
that will ensure that all stores for which the modifications are
written to persistent storage by preceding dcbfps and dcbstps
instructions have updated persistent storage before any data
access or data transfer caused by subsequent instructions is initiated.
This is in addition to the ordering done by wmb()

Update nvdimm core such that architecture can use barriers other than
wmb to ensure all previous writes are architecturally visible for
the platform buffer flush.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/nvdimm/region_devs.c | 8 ++++----
 include/linux/libnvdimm.h    | 4 ++++
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index ccbb5b43b8b2..88ea34a9c7fd 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1216,13 +1216,13 @@ int generic_nvdimm_flush(struct nd_region *nd_region)
 	idx = this_cpu_add_return(flush_idx, hash_32(current->pid + idx, 8));
 
 	/*
-	 * The first wmb() is needed to 'sfence' all previous writes
-	 * such that they are architecturally visible for the platform
-	 * buffer flush.  Note that we've already arranged for pmem
+	 * The first arch_pmem_flush_barrier() is needed to 'sfence' all
+	 * previous writes such that they are architecturally visible for
+	 * the platform buffer flush. Note that we've already arranged for pmem
 	 * writes to avoid the cache via memcpy_flushcache().  The final
 	 * wmb() ensures ordering for the NVDIMM flush write.
 	 */
-	wmb();
+	arch_pmem_flush_barrier();
 	for (i = 0; i < nd_region->ndr_mappings; i++)
 		if (ndrd_get_flush_wpq(ndrd, i, 0))
 			writeq(1, ndrd_get_flush_wpq(ndrd, i, idx));
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 18da4059be09..66f6c65bd789 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -286,4 +286,8 @@ static inline void arch_invalidate_pmem(void *addr, size_t size)
 }
 #endif
 
+#ifndef arch_pmem_flush_barrier
+#define arch_pmem_flush_barrier() wmb()
+#endif
+
 #endif /* __LIBNVDIMM_H__ */
-- 
2.26.2


^ permalink raw reply related

* [PATCH v2 1/5] powerpc/pmem: Add new instructions for persistent storage and sync
From: Aneesh Kumar K.V @ 2020-05-13  3:47 UTC (permalink / raw)
  To: linuxppc-dev, mpe, linux-nvdimm
  Cc: alistair, dan.j.williams, oohall, Aneesh Kumar K.V

POWER10 introduces two new variants of dcbf instructions (dcbstps and dcbfps)
that can be used to write modified locations back to persistent storage.

Additionally, POWER10 also introduce phwsync and plwsync which can be used
to establish order of these writes to persistent storage.

This patch exposes these instructions to the rest of the kernel. The existing
dcbf and hwsync instructions in P9 are adequate to enable appropriate
synchronization with OpenCAPI-hosted persistent storage. Hence the new
instructions are added as a variant of the old ones that old hardware
won't differentiate.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/ppc-opcode.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index c1df75edde44..45eccd842f84 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -216,6 +216,8 @@
 #define PPC_INST_STWCX			0x7c00012d
 #define PPC_INST_LWSYNC			0x7c2004ac
 #define PPC_INST_SYNC			0x7c0004ac
+#define PPC_INST_PHWSYNC		0x7c8004ac
+#define PPC_INST_PLWSYNC		0x7ca004ac
 #define PPC_INST_SYNC_MASK		0xfc0007fe
 #define PPC_INST_ISYNC			0x4c00012c
 #define PPC_INST_LXVD2X			0x7c000698
@@ -281,6 +283,8 @@
 #define PPC_INST_TABORT			0x7c00071d
 #define PPC_INST_TSR			0x7c0005dd
 
+#define PPC_INST_DCBF			0x7c0000ac
+
 #define PPC_INST_NAP			0x4c000364
 #define PPC_INST_SLEEP			0x4c0003a4
 #define PPC_INST_WINKLE			0x4c0003e4
@@ -529,6 +533,14 @@
 #define STBCIX(s,a,b)		stringify_in_c(.long PPC_INST_STBCIX | \
 				       __PPC_RS(s) | __PPC_RA(a) | __PPC_RB(b))
 
+#define	PPC_DCBFPS(a, b)	stringify_in_c(.long PPC_INST_DCBF |	\
+				       ___PPC_RA(a) | ___PPC_RB(b) | (4 << 21))
+#define	PPC_DCBSTPS(a, b)	stringify_in_c(.long PPC_INST_DCBF |	\
+				       ___PPC_RA(a) | ___PPC_RB(b) | (6 << 21))
+
+#define	PPC_PHWSYNC		stringify_in_c(.long PPC_INST_PHWSYNC)
+#define	PPC_PLWSYNC		stringify_in_c(.long PPC_INST_PLWSYNC)
+
 /*
  * Define what the VSX XX1 form instructions will look like, then add
  * the 128 bit load store instructions based on that.
-- 
2.26.2


^ permalink raw reply related

* [PATCH v2 2/5] powerpc/pmem: Add flush routines using new pmem store and sync instruction
From: Aneesh Kumar K.V @ 2020-05-13  3:47 UTC (permalink / raw)
  To: linuxppc-dev, mpe, linux-nvdimm
  Cc: alistair, dan.j.williams, oohall, Aneesh Kumar K.V
In-Reply-To: <20200513034705.172983-1-aneesh.kumar@linux.ibm.com>

Start using dcbstps; phwsync; sequence for flushing persistent memory range.
Even though the new instructions are implemented as a variant of dcbf and hwsync and on
POWER9 they will be executed as those instructions, we still avoid using them on
older hardware. This helps to avoid difficult to debug bugs.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/lib/pmem.c | 52 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 48 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c
index 0666a8d29596..076d75efb57a 100644
--- a/arch/powerpc/lib/pmem.c
+++ b/arch/powerpc/lib/pmem.c
@@ -9,20 +9,64 @@
 
 #include <asm/cacheflush.h>
 
+static inline void clean_pmem_range_isa310(unsigned long start, unsigned long stop)
+{
+	unsigned long shift = l1_dcache_shift();
+	unsigned long bytes = l1_dcache_bytes();
+	void *addr = (void *)(start & ~(bytes - 1));
+	unsigned long size = stop - (unsigned long)addr + (bytes - 1);
+	unsigned long i;
+
+	for (i = 0; i < size >> shift; i++, addr += bytes)
+		asm volatile(PPC_DCBSTPS(%0, %1): :"i"(0), "r"(addr): "memory");
+
+
+	asm volatile(PPC_PHWSYNC ::: "memory");
+}
+
+static inline void flush_pmem_range_isa310(unsigned long start, unsigned long stop)
+{
+	unsigned long shift = l1_dcache_shift();
+	unsigned long bytes = l1_dcache_bytes();
+	void *addr = (void *)(start & ~(bytes - 1));
+	unsigned long size = stop - (unsigned long)addr + (bytes - 1);
+	unsigned long i;
+
+	for (i = 0; i < size >> shift; i++, addr += bytes)
+		asm volatile(PPC_DCBFPS(%0, %1): :"i"(0), "r"(addr): "memory");
+
+
+	asm volatile(PPC_PHWSYNC ::: "memory");
+}
+
+static inline void clean_pmem_range(unsigned long start, unsigned long stop)
+{
+	if (cpu_has_feature(CPU_FTR_ARCH_31))
+		return clean_pmem_range_isa310(start, stop);
+	return flush_dcache_range(start, stop);
+}
+
+static inline void flush_pmem_range(unsigned long start, unsigned long stop)
+{
+	if (cpu_has_feature(CPU_FTR_ARCH_31))
+		return flush_pmem_range_isa310(start, stop);
+	return flush_dcache_range(start, stop);
+}
+
 /*
  * CONFIG_ARCH_HAS_PMEM_API symbols
  */
 void arch_wb_cache_pmem(void *addr, size_t size)
 {
 	unsigned long start = (unsigned long) addr;
-	flush_dcache_range(start, start + size);
+	clean_pmem_range(start, start + size);
 }
 EXPORT_SYMBOL_GPL(arch_wb_cache_pmem);
 
 void arch_invalidate_pmem(void *addr, size_t size)
 {
 	unsigned long start = (unsigned long) addr;
-	flush_dcache_range(start, start + size);
+	flush_pmem_range(start, start + size);
 }
 EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
 
@@ -35,7 +79,7 @@ long __copy_from_user_flushcache(void *dest, const void __user *src,
 	unsigned long copied, start = (unsigned long) dest;
 
 	copied = __copy_from_user(dest, src, size);
-	flush_dcache_range(start, start + size);
+	clean_pmem_range(start, start + size);
 
 	return copied;
 }
@@ -45,7 +89,7 @@ void *memcpy_flushcache(void *dest, const void *src, size_t size)
 	unsigned long start = (unsigned long) dest;
 
 	memcpy(dest, src, size);
-	flush_dcache_range(start, start + size);
+	clean_pmem_range(start, start + size);
 
 	return dest;
 }
-- 
2.26.2


^ permalink raw reply related

* [PATCH] powerpc/book3s64/radix/tlb: Determine hugepage flush correctly
From: Aneesh Kumar K.V @ 2020-05-13  3:06 UTC (permalink / raw)
  To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V, npiggin, Bharata B Rao

With a 64K page size flush with start and end value as below
(start, end) = (721f680d0000, 721f680e0000) results in
(hstart, hend) = (721f68200000, 721f68000000)

Avoid doing a __tlbie_va_range with the wrong hstart and hend value in this
case.

__tlbie_va_range will skip the actual tlbie operation for start > end.
But we still end up doing the tlbie fixup.

Reported-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/radix_tlb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 758ade2c2b6e..d592f9e1c037 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -884,10 +884,10 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 		if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
 			hstart = (start + PMD_SIZE - 1) & PMD_MASK;
 			hend = end & PMD_MASK;
-			if (hstart == hend)
-				hflush = false;
-			else
+			if (hstart < hend)
 				hflush = true;
+			else
+				hflush = false;
 		}
 
 		if (local) {
-- 
2.26.2


^ permalink raw reply related

* Re: [RFC][PATCH 1/2] Add a layer of abstraction to use the memory reserved by device tree for ima buffer pass.
From: Rob Herring @ 2020-05-12 23:09 UTC (permalink / raw)
  To: Prakhar Srivastava
  Cc: kstewart, mark.rutland, gregkh, bhsharma, tao.li, zohar, paulus,
	vincenzo.frascino, will, nramas, frowand.list, masahiroy, jmorris,
	takahiro.akashi, linux-arm-kernel, catalin.marinas, serge,
	devicetree, pasha.tatashin, hsinyi, tusharsu, tglx, allison,
	mbrugger, balajib, dmitry.kasatkin, linux-kernel,
	linux-security-module, james.morse, linux-integrity, linuxppc-dev
In-Reply-To: <20200504203829.6330-2-prsriva@linux.microsoft.com>

On Mon, May 04, 2020 at 01:38:28PM -0700, Prakhar Srivastava wrote:
> Introduce a device tree layer for to read and store ima buffer
> from the reserved memory section of a device tree.

But why do I need 'a layer of abstraction'? I don't like them.

> Signed-off-by: Prakhar Srivastava <prsriva@linux.microsoft.com>
> ---
>  drivers/of/Kconfig  |   6 ++
>  drivers/of/Makefile |   1 +
>  drivers/of/of_ima.c | 165 ++++++++++++++++++++++++++++++++++++++++++++

Who are the users of this code and why does it need to be here? Most 
code for specific bindings are not in drivers/of/ but with the user. It 
doesn't sound like there's more than 1 user.

>  include/linux/of.h  |  34 +++++++++
>  4 files changed, 206 insertions(+)
>  create mode 100644 drivers/of/of_ima.c

^ permalink raw reply

* Re: [RFC][PATCH 0/2] Add support for using reserved memory for ima buffer pass
From: Rob Herring @ 2020-05-12 23:05 UTC (permalink / raw)
  To: Prakhar Srivastava
  Cc: Mark Rutland, kstewart, gregkh, bhsharma, tao.li, zohar, paulus,
	vincenzo.frascino, will, nramas, frowand.list, masahiroy, jmorris,
	takahiro.akashi, linux-arm-kernel, catalin.marinas, serge,
	devicetree, pasha.tatashin, hsinyi, tusharsu, tglx, allison,
	mbrugger, balajib, dmitry.kasatkin, linux-kernel,
	linux-security-module, james.morse, linux-integrity, linuxppc-dev
In-Reply-To: <e8c7d74e-74bf-caa3-452d-23faa649e825@linux.microsoft.com>

On Wed, May 06, 2020 at 10:50:04PM -0700, Prakhar Srivastava wrote:
> Hi Mark,

Please don't top post.

> This patch set currently only address the Pure DT implementation.
> EFI and ACPI implementations will be posted in subsequent patchsets.
> 
> The logs are intended to be carried over the kexec and once read the
> logs are no longer needed and in prior conversation with James(
> https://lore.kernel.org/linux-arm-kernel/0053eb68-0905-4679-c97a-00c5cb6f1abb@arm.com/)
> the apporach of using a chosen node doesn't
> support the case.
> 
> The DT entries make the reservation permanent and thus doesnt need kernel
> segments to be used for this, however using a chosen-node with
> reserved memory only changes the node information but memory still is
> reserved via reserved-memory section.

I think Mark's point was whether it needs to be permanent. We don't 
hardcode the initrd address for example.

> On 5/5/20 2:59 AM, Mark Rutland wrote:
> > Hi Prakhar,
> > 
> > On Mon, May 04, 2020 at 01:38:27PM -0700, Prakhar Srivastava wrote:
> > > IMA during kexec(kexec file load) verifies the kernel signature and measures

What's IMA?

> > > the signature of the kernel. The signature in the logs can be used to verfiy the
> > > authenticity of the kernel. The logs don not get carried over kexec and thus
> > > remote attesation cannot verify the signature of the running kernel.
> > > 
> > > Introduce an ABI to carry forward the ima logs over kexec.
> > > Memory reserved via device tree reservation can be used to store and read
> > > via the of_* functions.
> > 
> > This flow needs to work for:
> > 
> > 1) Pure DT
> > 2) DT + EFI memory map
> > 3) ACPI + EFI memory map
> > 
> > ... and if this is just for transiently passing the log, I don't think
> > that a reserved memory region is the right thing to use, since they're
> > supposed to be more permanent.
> > 
> > This sounds analogous to passing the initrd, and should probably use
> > properties under the chosen node (which can be used for all three boot
> > flows above).
> > 
> > For reference, how big is the IMA log likely to be? Does it need
> > physically contiguous space?
> 
> It purely depends on the policy used and the modules/files that are accessed
> for my local testing over a kexec session the log in
> about 30KB.
> 
> Current implementation expects enough contiguous memory to allocated to
> carry forward the logs. If the log size exceeds the reserved memory the
> call will fail.
> 
> Thanks,
> Prakhar Srivastava
> > 
> > Thanks,
> > Mark.
> > 
> > > 
> > > Reserved memory stores the size(sizeof(size_t)) of the buffer in the starting
> > > address, followed by the IMA log contents.
> > > 
> > > Tested on:
> > >    arm64 with Uboot
> > > 
> > > Prakhar Srivastava (2):
> > >    Add a layer of abstraction to use the memory reserved by device tree
> > >      for ima buffer pass.
> > >    Add support for ima buffer pass using reserved memory for arm64 kexec.
> > >      Update the arch sepcific code path in kexec file load to store the
> > >      ima buffer in the reserved memory. The same reserved memory is read
> > >      on kexec or cold boot.
> > > 
> > >   arch/arm64/Kconfig                     |   1 +
> > >   arch/arm64/include/asm/ima.h           |  22 ++++
> > >   arch/arm64/include/asm/kexec.h         |   5 +
> > >   arch/arm64/kernel/Makefile             |   1 +
> > >   arch/arm64/kernel/ima_kexec.c          |  64 ++++++++++
> > >   arch/arm64/kernel/machine_kexec_file.c |   1 +
> > >   arch/powerpc/include/asm/ima.h         |   3 +-
> > >   arch/powerpc/kexec/ima.c               |  14 ++-
> > >   drivers/of/Kconfig                     |   6 +
> > >   drivers/of/Makefile                    |   1 +
> > >   drivers/of/of_ima.c                    | 165 +++++++++++++++++++++++++
> > >   include/linux/of.h                     |  34 +++++
> > >   security/integrity/ima/ima_kexec.c     |  15 ++-
> > >   13 files changed, 325 insertions(+), 7 deletions(-)
> > >   create mode 100644 arch/arm64/include/asm/ima.h
> > >   create mode 100644 arch/arm64/kernel/ima_kexec.c
> > >   create mode 100644 drivers/of/of_ima.c
> > > 
> > > -- 
> > > 2.25.1
> > > 

^ permalink raw reply

* Re: [PATCH 19/31] riscv: use asm-generic/cacheflush.h
From: Palmer Dabbelt @ 2020-05-12 23:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-ia64, linux-sh, zippel, linux-mips, linux-mm, sparclinux,
	linux-riscv, linux-arch, linux-c6x-dev, linux-hexagon, x86,
	linux-xtensa, Arnd Bergmann, jeyu, linux-um, linux-m68k, openrisc,
	linux-arm-kernel, monstr, linux-kernel, linux-alpha,
	linux-fsdevel, akpm, linuxppc-dev
In-Reply-To: <20200510075510.987823-20-hch@lst.de>

On Sun, 10 May 2020 00:54:58 PDT (-0700), Christoph Hellwig wrote:
> RISC-V needs almost no cache flushing routines of its own.  Rely on
> asm-generic/cacheflush.h for the defaults.
>
> Also remove the pointless __KERNEL__ ifdef while we're at it.
> ---
>  arch/riscv/include/asm/cacheflush.h | 62 ++---------------------------
>  1 file changed, 3 insertions(+), 59 deletions(-)
>
> diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> index c8677c75f82cb..a167b4fbdf007 100644
> --- a/arch/riscv/include/asm/cacheflush.h
> +++ b/arch/riscv/include/asm/cacheflush.h
> @@ -8,65 +8,6 @@
>
>  #include <linux/mm.h>
>
> -#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 0
> -
> -/*
> - * The cache doesn't need to be flushed when TLB entries change when
> - * the cache is mapped to physical memory, not virtual memory
> - */
> -static inline void flush_cache_all(void)
> -{
> -}
> -
> -static inline void flush_cache_mm(struct mm_struct *mm)
> -{
> -}
> -
> -static inline void flush_cache_dup_mm(struct mm_struct *mm)
> -{
> -}
> -
> -static inline void flush_cache_range(struct vm_area_struct *vma,
> -				     unsigned long start,
> -				     unsigned long end)
> -{
> -}
> -
> -static inline void flush_cache_page(struct vm_area_struct *vma,
> -				    unsigned long vmaddr,
> -				    unsigned long pfn)
> -{
> -}
> -
> -static inline void flush_dcache_mmap_lock(struct address_space *mapping)
> -{
> -}
> -
> -static inline void flush_dcache_mmap_unlock(struct address_space *mapping)
> -{
> -}
> -
> -static inline void flush_icache_page(struct vm_area_struct *vma,
> -				     struct page *page)
> -{
> -}
> -
> -static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> -{
> -}
> -
> -static inline void flush_cache_vunmap(unsigned long start, unsigned long end)
> -{
> -}
> -
> -#define copy_to_user_page(vma, page, vaddr, dst, src, len) \
> -	do { \
> -		memcpy(dst, src, len); \
> -		flush_icache_user_range(vma, page, vaddr, len); \
> -	} while (0)
> -#define copy_from_user_page(vma, page, vaddr, dst, src, len) \
> -	memcpy(dst, src, len)
> -
>  static inline void local_flush_icache_all(void)
>  {
>  	asm volatile ("fence.i" ::: "memory");
> @@ -79,6 +20,7 @@ static inline void flush_dcache_page(struct page *page)
>  	if (test_bit(PG_dcache_clean, &page->flags))
>  		clear_bit(PG_dcache_clean, &page->flags);
>  }
> +#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
>
>  /*
>   * RISC-V doesn't have an instruction to flush parts of the instruction cache,
> @@ -105,4 +47,6 @@ void flush_icache_mm(struct mm_struct *mm, bool local);
>  #define SYS_RISCV_FLUSH_ICACHE_LOCAL 1UL
>  #define SYS_RISCV_FLUSH_ICACHE_ALL   (SYS_RISCV_FLUSH_ICACHE_LOCAL)
>
> +#include <asm-generic/cacheflush.h>
> +
>  #endif /* _ASM_RISCV_CACHEFLUSH_H */

Thanks!

Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>

Were you trying to get these all in at once, or do you want me to take it into
my tree?

^ permalink raw reply

* [PATCH v2 1/1] powerpc/crash: Use NMI context for printk when starting to crash
From: Leonardo Bras @ 2020-05-12 21:45 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	Thomas Gleixner, Alexios Zavras, Enrico Weigelt,
	Greg Kroah-Hartman, Leonardo Bras, Christophe Leroy
  Cc: linuxppc-dev, linux-kernel

Currently, if printk lock (logbuf_lock) is held by other thread during
crash, there is a chance of deadlocking the crash on next printk, and
blocking a possibly desired kdump.

At the start of default_machine_crash_shutdown, make printk enter
NMI context, as it will use per-cpu buffers to store the message,
and avoid locking logbuf_lock.

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Leonardo Bras <leobras.c@gmail.com>

---
Changes since v1:
- Added in-code comment explaining the need of context change
- Function moved to the start of default_machine_crash_shutdown,
  to avoid locking any printk on crashing routine.
- Title was 'Use NMI context for printk after crashing other CPUs'

---
 arch/powerpc/kexec/crash.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
index d488311efab1..c9a889880214 100644
--- a/arch/powerpc/kexec/crash.c
+++ b/arch/powerpc/kexec/crash.c
@@ -311,6 +311,9 @@ void default_machine_crash_shutdown(struct pt_regs *regs)
 	unsigned int i;
 	int (*old_handler)(struct pt_regs *regs);
 
+	/* Avoid hardlocking with irresponsive CPU holding logbuf_lock */
+	printk_nmi_enter();
+
 	/*
 	 * This function is only called after the system
 	 * has panicked or is otherwise in a critical state.
-- 
2.25.4


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox