LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v1] soc: fsl: rcpm: Add ACPI support
From: Ard Biesheuvel @ 2020-09-16  6:10 UTC (permalink / raw)
  To: Ran Wang, kuldip dwivedi, Leo Li, linuxppc-dev@lists.ozlabs.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
  Cc: Biwen Li, Samer El-Haj-Mahmoud, Arokia Samy, Paul Yang,
	Varun Sethi, tanveer
In-Reply-To: <AM6PR04MB5413903EAAEDB2EED2E254C6F1210@AM6PR04MB5413.eurprd04.prod.outlook.com>

On 9/16/20 3:32 AM, Ran Wang wrote:
> Hi Ard,
> 
> On Tuesday, September 15, 2020 7:10 PM, Ard Biesheuvel wrote:
>> Subject: Re: [PATCH v1] soc: fsl: rcpm: Add ACPI support
>>
>> On 9/15/20 1:06 PM, kuldip dwivedi wrote:
>>> Add ACPI support in fsl RCPM driver. This is required to support ACPI
>>> S3 state. S3 is the ACPI sleep state that is known as "sleep" or
>>> "suspend to RAM".
>>> It essentially turns off most power of the system but keeps memory
>>> powered.
>>>
>>> Signed-off-by: tanveer <tanveer.alam@puresoftware.com>
>>> Signed-off-by: kuldip dwivedi <kuldip.dwivedi@puresoftware.com>
>>
>> Why does the OS need to program this device? Can't this be done by
>> firmware?
> 
> This device is use to tell HW which IP (such as USB, SDHC, SATA, etc) should not be
> clock gated during system enter low power state (to allow that IP work as a
> wakeup source). And user does this configuration in device tree.

The point of ACPI is *not* to describe a DT topology using a table 
format that is not suited for it. The point of ACPI is to describe a 
machine that is more abstracted from the hardware than is typically 
possible with DT, where the abstractions are implemented by AML code 
that is provided by the firmware, but executed in the context of the OS.

So the idea is *not* finding the shortest possible path to get your 
existing DT driver code running on a system that boots via ACPI. 
Instead, you should carefully think about the abstract ACPI machine that 
you will expose to the OS, and hide everything else in firmware.

In this particular case, it seems like your USB, SDHC and SATA device 
objects may need power state dependent AML methods that program this 
block directly.



> So implement
> this RCPM driver to do it in kernel rather than firmware.
> 
> Regards,
> Ran
> 
>>> ---
>>>
>>> Notes:
>>>       1. Add ACPI match table
>>>       2. NXP team members are added for confirming HID changes
>>>       3. There is only one node in ACPI so no need to check for
>>>          current device explicitly
>>>       4. These changes are tested on LX2160A and LS1046A platforms
>>>
>>>    drivers/soc/fsl/rcpm.c | 22 +++++++++++++++++++---
>>>    1 file changed, 19 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/soc/fsl/rcpm.c b/drivers/soc/fsl/rcpm.c index
>>> a093dbe6d2cb..e75a436fb159 100644
>>> --- a/drivers/soc/fsl/rcpm.c
>>> +++ b/drivers/soc/fsl/rcpm.c
>>> @@ -2,10 +2,12 @@
>>>    //
>>>    // rcpm.c - Freescale QorIQ RCPM driver
>>>    //
>>> -// Copyright 2019 NXP
>>> +// Copyright 2019-2020 NXP
>>> +// Copyright 2020 Puresoftware Ltd.
>>>    //
>>>    // Author: Ran Wang <ran.wang_1@nxp.com>
>>>
>>> +#include <linux/acpi.h>
>>>    #include <linux/init.h>
>>>    #include <linux/module.h>
>>>    #include <linux/platform_device.h>
>>> @@ -57,8 +59,13 @@ static int rcpm_pm_prepare(struct device *dev)
>>>    				rcpm->wakeup_cells + 1);
>>>
>>>    		/*  Wakeup source should refer to current rcpm device */
>>> -		if (ret || (np->phandle != value[0]))
>>> -			continue;
>>> +		if (is_acpi_node(dev->fwnode)) {
>>> +			if (ret)
>>> +				continue;
>>> +		} else {
>>> +			if (ret || (np->phandle != value[0]))
>>> +				continue;
>>> +		}
>>>
>>>    		/* Property "#fsl,rcpm-wakeup-cells" of rcpm node defines the
>>>    		 * number of IPPDEXPCR register cells, and "fsl,rcpm-wakeup"
>>> @@ -139,10 +146,19 @@ static const struct of_device_id rcpm_of_match[]
>> = {
>>>    };
>>>    MODULE_DEVICE_TABLE(of, rcpm_of_match);
>>>
>>> +#ifdef CONFIG_ACPI
>>> +static const struct acpi_device_id rcpm_acpi_match[] = {
>>> +	{ "NXP0015", },
>>> +	{ }
>>> +};
>>> +MODULE_DEVICE_TABLE(acpi, rcpm_acpi_match); #endif
>>> +
>>>    static struct platform_driver rcpm_driver = {
>>>    	.driver = {
>>>    		.name = "rcpm",
>>>    		.of_match_table = rcpm_of_match,
>>> +		.acpi_match_table = ACPI_PTR(rcpm_acpi_match),
>>>    		.pm	= &rcpm_pm_ops,
>>>    	},
>>>    	.probe = rcpm_probe,
>>>
> 


^ permalink raw reply

* Re: [PATCH v2 2/7] powerpc/prom: Introduce early_reserve_mem_old()
From: Cédric Le Goater @ 2020-09-16  5:56 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: Christophe Leroy, linuxppc-dev
In-Reply-To: <20200915184607.Horde._j-BRtSmJ6vRGSRwLWoN7Q2@messagerie.si.c-s.fr>

On 9/15/20 6:46 PM, Christophe Leroy wrote:
> Cédric Le Goater <clg@kaod.org> a écrit :
> 
>> and condition its call with IS_ENABLED(CONFIG_PPC32). This fixes a
>> compile error with W=1.
>>
>> arch/powerpc/kernel/prom.c: In function ‘early_reserve_mem’:
>> arch/powerpc/kernel/prom.c:625:10: error: variable ‘reserve_map’ set but not used [-Werror=unused-but-set-variable]
>>   __be64 *reserve_map;
>>           ^~~~~~~~~~~
>> cc1: all warnings being treated as errors
>>
>> Cc: Christophe Leroy <christophe.leroy@c-s.fr>
> 
> @csgroup.eu instead of @c-s.fr please
> 
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/kernel/prom.c | 37 ++++++++++++++++++++-----------------
>>  1 file changed, 20 insertions(+), 17 deletions(-)
> 
> That's a lot of changes for a tiny warning.
> 
> You could make it easy by just replacing the #ifdef by:
> 
>         if (!IS_ENABLED(CONFIG_PPC32))
>                 return;

It's equivalent and it moves out the reserve_map variable of the main routine
which I think is better.

>>
>> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
>> index d8a2fb87ba0c..c958b67cf1a5 100644
>> --- a/arch/powerpc/kernel/prom.c
>> +++ b/arch/powerpc/kernel/prom.c
>> @@ -620,27 +620,14 @@ static void __init early_reserve_mem_dt(void)
>>      }
>>  }
>>
>> -static void __init early_reserve_mem(void)
>> +static void __init early_reserve_mem_old(void)
> 
> Why _old ? Do you mean ppc32 are old ? Modern ADSL boxes like for instance the famous French freebox have powerpc32 microcontroller.
> Eventually you could name it _ppc32, but I don't think that's the good way, see above.

I choose old because of the comment ' ... booting from an old kexec ... ', 
but I agree _ppc32 might be a better choice.

Thanks,

C. 

> Christophe
> 
>>  {
>>      __be64 *reserve_map;
>>
>>      reserve_map = (__be64 *)(((unsigned long)initial_boot_params) +
>>              fdt_off_mem_rsvmap(initial_boot_params));
>>
>> -    /* Look for the new "reserved-regions" property in the DT */
>> -    early_reserve_mem_dt();
>> -
>> -#ifdef CONFIG_BLK_DEV_INITRD
>> -    /* Then reserve the initrd, if any */
>> -    if (initrd_start && (initrd_end > initrd_start)) {
>> -        memblock_reserve(ALIGN_DOWN(__pa(initrd_start), PAGE_SIZE),
>> -            ALIGN(initrd_end, PAGE_SIZE) -
>> -            ALIGN_DOWN(initrd_start, PAGE_SIZE));
>> -    }
>> -#endif /* CONFIG_BLK_DEV_INITRD */
>> -
>> -#ifdef CONFIG_PPC32
>> -    /*
>> +    /*
>>       * Handle the case where we might be booting from an old kexec
>>       * image that setup the mem_rsvmap as pairs of 32-bit values
>>       */
>> @@ -658,9 +645,25 @@ static void __init early_reserve_mem(void)
>>              DBG("reserving: %x -> %x\n", base_32, size_32);
>>              memblock_reserve(base_32, size_32);
>>          }
>> -        return;
>>      }
>> -#endif
>> +}
>> +
>> +static void __init early_reserve_mem(void)
>> +{
>> +    /* Look for the new "reserved-regions" property in the DT */
>> +    early_reserve_mem_dt();
>> +
>> +#ifdef CONFIG_BLK_DEV_INITRD
>> +    /* Then reserve the initrd, if any */
>> +    if (initrd_start && (initrd_end > initrd_start)) {
>> +        memblock_reserve(ALIGN_DOWN(__pa(initrd_start), PAGE_SIZE),
>> +            ALIGN(initrd_end, PAGE_SIZE) -
>> +            ALIGN_DOWN(initrd_start, PAGE_SIZE));
>> +    }
>> +#endif /* CONFIG_BLK_DEV_INITRD */
>> +
>> +    if (IS_ENABLED(CONFIG_PPC32))
>> +        early_reserve_mem_old();
>>  }
>>
>>  #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
>> -- 
>> 2.25.4
> 
> 


^ permalink raw reply

* [PATCH v2 2/2] powerpc/64s: Add cp_abort after tlbiel to invalidate copy-buffer address
From: Nicholas Piggin @ 2020-09-16  3:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin
In-Reply-To: <20200916030234.4110379-1-npiggin@gmail.com>

The copy buffer is implemented as a real address in the nest which is
translated from EA by copy, and used for memory access by paste. This
requires that it be invalidated by TLB invalidation.

TLBIE does invalidate the copy buffer, but TLBIEL does not. Add cp_abort
to the tlbiel sequence.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
v2:
- Untangle headers that were causing build failures.
- Improve the comment a bit.
- Exempt POWER9 from the workaround, as described by the comment (we
  worked this out already but I forgot about it when doing v1!)

 arch/powerpc/include/asm/synch.h       | 19 ++++++++++++++++++-
 arch/powerpc/mm/book3s64/hash_native.c |  8 ++++----
 arch/powerpc/mm/book3s64/radix_tlb.c   | 12 ++++++------
 3 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/synch.h b/arch/powerpc/include/asm/synch.h
index aca70fb43147..a2e89f27d547 100644
--- a/arch/powerpc/include/asm/synch.h
+++ b/arch/powerpc/include/asm/synch.h
@@ -3,8 +3,9 @@
 #define _ASM_POWERPC_SYNCH_H 
 #ifdef __KERNEL__
 
+#include <asm/cputable.h>
 #include <asm/feature-fixups.h>
-#include <asm/asm-const.h>
+#include <asm/ppc-opcode.h>
 
 #ifndef __ASSEMBLY__
 extern unsigned int __start___lwsync_fixup, __stop___lwsync_fixup;
@@ -20,6 +21,22 @@ static inline void isync(void)
 {
 	__asm__ __volatile__ ("isync" : : : "memory");
 }
+
+static inline void ppc_after_tlbiel_barrier(void)
+{
+        asm volatile("ptesync": : :"memory");
+	/*
+	 * POWER9, POWER10 need a cp_abort after tlbiel to ensure the copy
+	 * is invalidated correctly. If this is not done, the paste can take
+	 * data from the physical address that was translated at copy time.
+	 *
+	 * POWER9 in practice does not need this, because address spaces
+	 * with accelerators mapped will use tlbie (which does invalidate
+	 * the copy) to invalidate translations. It's not possible to limit
+	 * POWER10 this way due to local copy-paste.
+	 */
+        asm volatile(ASM_FTR_IFSET(PPC_CP_ABORT, "", %0) : : "i" (CPU_FTR_ARCH_31) : "memory");
+}
 #endif /* __ASSEMBLY__ */
 
 #if defined(__powerpc64__)
diff --git a/arch/powerpc/mm/book3s64/hash_native.c b/arch/powerpc/mm/book3s64/hash_native.c
index cf20e5229ce1..0203cdf48c54 100644
--- a/arch/powerpc/mm/book3s64/hash_native.c
+++ b/arch/powerpc/mm/book3s64/hash_native.c
@@ -82,7 +82,7 @@ static void tlbiel_all_isa206(unsigned int num_sets, unsigned int is)
 	for (set = 0; set < num_sets; set++)
 		tlbiel_hash_set_isa206(set, is);
 
-	asm volatile("ptesync": : :"memory");
+	ppc_after_tlbiel_barrier();
 }
 
 static void tlbiel_all_isa300(unsigned int num_sets, unsigned int is)
@@ -110,7 +110,7 @@ static void tlbiel_all_isa300(unsigned int num_sets, unsigned int is)
 	 */
 	tlbiel_hash_set_isa300(0, is, 0, 2, 1);
 
-	asm volatile("ptesync": : :"memory");
+	ppc_after_tlbiel_barrier();
 
 	asm volatile(PPC_ISA_3_0_INVALIDATE_ERAT "; isync" : : :"memory");
 }
@@ -303,7 +303,7 @@ static inline void tlbie(unsigned long vpn, int psize, int apsize,
 	asm volatile("ptesync": : :"memory");
 	if (use_local) {
 		__tlbiel(vpn, psize, apsize, ssize);
-		asm volatile("ptesync": : :"memory");
+		ppc_after_tlbiel_barrier();
 	} else {
 		__tlbie(vpn, psize, apsize, ssize);
 		fixup_tlbie_vpn(vpn, psize, apsize, ssize);
@@ -879,7 +879,7 @@ static void native_flush_hash_range(unsigned long number, int local)
 				__tlbiel(vpn, psize, psize, ssize);
 			} pte_iterate_hashed_end();
 		}
-		asm volatile("ptesync":::"memory");
+		ppc_after_tlbiel_barrier();
 	} else {
 		int lock_tlbie = !mmu_has_feature(MMU_FTR_LOCKLESS_TLBIE);
 
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 0d233763441f..5c9d2fccacc7 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -65,7 +65,7 @@ static void tlbiel_all_isa300(unsigned int num_sets, unsigned int is)
 	for (set = 1; set < num_sets; set++)
 		tlbiel_radix_set_isa300(set, is, 0, RIC_FLUSH_TLB, 1);
 
-	asm volatile("ptesync": : :"memory");
+	ppc_after_tlbiel_barrier();
 }
 
 void radix__tlbiel_all(unsigned int action)
@@ -296,7 +296,7 @@ static __always_inline void _tlbiel_pid(unsigned long pid, unsigned long ric)
 
 	/* For PWC, only one flush is needed */
 	if (ric == RIC_FLUSH_PWC) {
-		asm volatile("ptesync": : :"memory");
+		ppc_after_tlbiel_barrier();
 		return;
 	}
 
@@ -304,7 +304,7 @@ static __always_inline void _tlbiel_pid(unsigned long pid, unsigned long ric)
 	for (set = 1; set < POWER9_TLB_SETS_RADIX ; set++)
 		__tlbiel_pid(pid, set, RIC_FLUSH_TLB);
 
-	asm volatile("ptesync": : :"memory");
+	ppc_after_tlbiel_barrier();
 	asm volatile(PPC_RADIX_INVALIDATE_ERAT_USER "; isync" : : :"memory");
 }
 
@@ -431,7 +431,7 @@ static __always_inline void _tlbiel_va(unsigned long va, unsigned long pid,
 
 	asm volatile("ptesync": : :"memory");
 	__tlbiel_va(va, pid, ap, ric);
-	asm volatile("ptesync": : :"memory");
+	ppc_after_tlbiel_barrier();
 }
 
 static inline void _tlbiel_va_range(unsigned long start, unsigned long end,
@@ -442,7 +442,7 @@ static inline void _tlbiel_va_range(unsigned long start, unsigned long end,
 	if (also_pwc)
 		__tlbiel_pid(pid, 0, RIC_FLUSH_PWC);
 	__tlbiel_va_range(start, end, pid, page_size, psize);
-	asm volatile("ptesync": : :"memory");
+	ppc_after_tlbiel_barrier();
 }
 
 static inline void __tlbie_va_range(unsigned long start, unsigned long end,
@@ -940,7 +940,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 			if (hflush)
 				__tlbiel_va_range(hstart, hend, pid,
 						PMD_SIZE, MMU_PAGE_2M);
-			asm volatile("ptesync": : :"memory");
+			ppc_after_tlbiel_barrier();
 		} else if (cputlb_use_tlbie()) {
 			asm volatile("ptesync": : :"memory");
 			__tlbie_va_range(start, end, pid, page_size, mmu_virtual_psize);
-- 
2.23.0


^ permalink raw reply related

* [PATCH v2 1/2] powerpc: untangle cputable mce include
From: Nicholas Piggin @ 2020-09-16  3:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin

Having cputable.h include mce.h means it pulls in a bunch of low level
headers (e.g., synch.h) which then can't use CPU_FTR_ definitions.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/cputable.h | 5 -----
 arch/powerpc/kernel/cputable.c      | 1 +
 arch/powerpc/kernel/dt_cpu_ftrs.c   | 1 +
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index 32a15dc49e8c..f89205eff691 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -9,11 +9,6 @@
 
 #ifndef __ASSEMBLY__
 
-/*
- * Added to include __machine_check_early_realmode_* functions
- */
-#include <asm/mce.h>
-
 /* This structure can grow, it's real size is used by head.S code
  * via the mkdefs mechanism.
  */
diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index 2aa89c6b2896..b5bc2edef440 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -16,6 +16,7 @@
 #include <asm/oprofile_impl.h>
 #include <asm/cputable.h>
 #include <asm/prom.h>		/* for PTRRELOC on ARCH=ppc */
+#include <asm/mce.h>
 #include <asm/mmu.h>
 #include <asm/setup.h>
 
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index f204ad79b6b5..1098863e17ee 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -17,6 +17,7 @@
 
 #include <asm/cputable.h>
 #include <asm/dt_cpu_ftrs.h>
+#include <asm/mce.h>
 #include <asm/mmu.h>
 #include <asm/oprofile_impl.h>
 #include <asm/prom.h>
-- 
2.23.0


^ permalink raw reply related

* [PATCH -next] powerpc/pseries: convert to use DEFINE_SEQ_ATTRIBUTE macro
From: Liu Shixin @ 2020-09-16  2:50 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras
  Cc: Liu Shixin, linuxppc-dev, linux-kernel

Use DEFINE_SEQ_ATTRIBUTE macro to simplify the code.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
---
 arch/powerpc/platforms/pseries/hvCall_inst.c | 23 +++-----------------
 1 file changed, 3 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hvCall_inst.c b/arch/powerpc/platforms/pseries/hvCall_inst.c
index c40c62ec432e..2c59b4986ea5 100644
--- a/arch/powerpc/platforms/pseries/hvCall_inst.c
+++ b/arch/powerpc/platforms/pseries/hvCall_inst.c
@@ -70,31 +70,14 @@ static int hc_show(struct seq_file *m, void *p)
 	return 0;
 }
 
-static const struct seq_operations hcall_inst_seq_ops = {
+static const struct seq_operations hcall_inst_sops = {
         .start = hc_start,
         .next  = hc_next,
         .stop  = hc_stop,
         .show  = hc_show
 };
 
-static int hcall_inst_seq_open(struct inode *inode, struct file *file)
-{
-	int rc;
-	struct seq_file *seq;
-
-	rc = seq_open(file, &hcall_inst_seq_ops);
-	seq = file->private_data;
-	seq->private = file_inode(file)->i_private;
-
-	return rc;
-}
-
-static const struct file_operations hcall_inst_seq_fops = {
-	.open = hcall_inst_seq_open,
-	.read = seq_read,
-	.llseek = seq_lseek,
-	.release = seq_release,
-};
+DEFINE_SEQ_ATTRIBUTE(hcall_inst);
 
 #define	HCALL_ROOT_DIR		"hcall_inst"
 #define CPU_NAME_BUF_SIZE	32
@@ -149,7 +132,7 @@ static int __init hcall_inst_init(void)
 		snprintf(cpu_name_buf, CPU_NAME_BUF_SIZE, "cpu%d", cpu);
 		debugfs_create_file(cpu_name_buf, 0444, hcall_root,
 				    per_cpu(hcall_stats, cpu),
-				    &hcall_inst_seq_fops);
+				    &hcall_inst_fops);
 	}
 
 	return 0;
-- 
2.25.1


^ permalink raw reply related

* RE: [PATCH v1] soc: fsl: rcpm: Add ACPI support
From: Ran Wang @ 2020-09-16  1:32 UTC (permalink / raw)
  To: Ard Biesheuvel, kuldip dwivedi, Leo Li,
	linuxppc-dev@lists.ozlabs.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
  Cc: Biwen Li, Samer El-Haj-Mahmoud, Arokia Samy, Paul Yang,
	Varun Sethi, tanveer
In-Reply-To: <4e008f0a-69da-d5c2-4dfc-ef8695e17f47@arm.com>

Hi Ard,

On Tuesday, September 15, 2020 7:10 PM, Ard Biesheuvel wrote:
> Subject: Re: [PATCH v1] soc: fsl: rcpm: Add ACPI support
> 
> On 9/15/20 1:06 PM, kuldip dwivedi wrote:
> > Add ACPI support in fsl RCPM driver. This is required to support ACPI
> > S3 state. S3 is the ACPI sleep state that is known as "sleep" or
> > "suspend to RAM".
> > It essentially turns off most power of the system but keeps memory
> > powered.
> >
> > Signed-off-by: tanveer <tanveer.alam@puresoftware.com>
> > Signed-off-by: kuldip dwivedi <kuldip.dwivedi@puresoftware.com>
> 
> Why does the OS need to program this device? Can't this be done by
> firmware?

This device is use to tell HW which IP (such as USB, SDHC, SATA, etc) should not be
clock gated during system enter low power state (to allow that IP work as a
wakeup source). And user does this configuration in device tree. So implement
this RCPM driver to do it in kernel rather than firmware.

Regards,
Ran

> > ---
> >
> > Notes:
> >      1. Add ACPI match table
> >      2. NXP team members are added for confirming HID changes
> >      3. There is only one node in ACPI so no need to check for
> >         current device explicitly
> >      4. These changes are tested on LX2160A and LS1046A platforms
> >
> >   drivers/soc/fsl/rcpm.c | 22 +++++++++++++++++++---
> >   1 file changed, 19 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/soc/fsl/rcpm.c b/drivers/soc/fsl/rcpm.c index
> > a093dbe6d2cb..e75a436fb159 100644
> > --- a/drivers/soc/fsl/rcpm.c
> > +++ b/drivers/soc/fsl/rcpm.c
> > @@ -2,10 +2,12 @@
> >   //
> >   // rcpm.c - Freescale QorIQ RCPM driver
> >   //
> > -// Copyright 2019 NXP
> > +// Copyright 2019-2020 NXP
> > +// Copyright 2020 Puresoftware Ltd.
> >   //
> >   // Author: Ran Wang <ran.wang_1@nxp.com>
> >
> > +#include <linux/acpi.h>
> >   #include <linux/init.h>
> >   #include <linux/module.h>
> >   #include <linux/platform_device.h>
> > @@ -57,8 +59,13 @@ static int rcpm_pm_prepare(struct device *dev)
> >   				rcpm->wakeup_cells + 1);
> >
> >   		/*  Wakeup source should refer to current rcpm device */
> > -		if (ret || (np->phandle != value[0]))
> > -			continue;
> > +		if (is_acpi_node(dev->fwnode)) {
> > +			if (ret)
> > +				continue;
> > +		} else {
> > +			if (ret || (np->phandle != value[0]))
> > +				continue;
> > +		}
> >
> >   		/* Property "#fsl,rcpm-wakeup-cells" of rcpm node defines the
> >   		 * number of IPPDEXPCR register cells, and "fsl,rcpm-wakeup"
> > @@ -139,10 +146,19 @@ static const struct of_device_id rcpm_of_match[]
> = {
> >   };
> >   MODULE_DEVICE_TABLE(of, rcpm_of_match);
> >
> > +#ifdef CONFIG_ACPI
> > +static const struct acpi_device_id rcpm_acpi_match[] = {
> > +	{ "NXP0015", },
> > +	{ }
> > +};
> > +MODULE_DEVICE_TABLE(acpi, rcpm_acpi_match); #endif
> > +
> >   static struct platform_driver rcpm_driver = {
> >   	.driver = {
> >   		.name = "rcpm",
> >   		.of_match_table = rcpm_of_match,
> > +		.acpi_match_table = ACPI_PTR(rcpm_acpi_match),
> >   		.pm	= &rcpm_pm_ops,
> >   	},
> >   	.probe = rcpm_probe,
> >


^ permalink raw reply

* RE: [PATCH v1] soc: fsl: rcpm: Add ACPI support
From: Ran Wang @ 2020-09-16  1:21 UTC (permalink / raw)
  To: kuldip dwivedi, Leo Li, linuxppc-dev@lists.ozlabs.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
  Cc: Biwen Li, Samer El-Haj-Mahmoud, tanveer, Paul Yang,
	kuldip dwivedi, Varun Sethi, Ard Biesheuvel, Arokia Samy
In-Reply-To: <20200915110647.846-1-kuldip.dwivedi@puresoftware.com>

Hi Kuldip,

On Tuesday, September 15, 2020 7:07 PM, kuldip dwivedi wrote:
> Subject: [PATCH v1] soc: fsl: rcpm: Add ACPI support

Actually I also post a patch for this recently: https://lore.kernel.org/patchwork/patch/1299959/  :)

Regards,
Ran

> Add ACPI support in fsl RCPM driver. This is required to support ACPI S3 state.
> S3 is the ACPI sleep state that is known as "sleep" or "suspend to RAM".
> It essentially turns off most power of the system but keeps memory powered.

Actually the low power mode is to gate clocks rather than power down on Layerscape platforms.

> Signed-off-by: tanveer <tanveer.alam@puresoftware.com>
> Signed-off-by: kuldip dwivedi <kuldip.dwivedi@puresoftware.com>
> ---
> 
> Notes:
>     1. Add ACPI match table
>     2. NXP team members are added for confirming HID changes
>     3. There is only one node in ACPI so no need to check for
>        current device explicitly
>     4. These changes are tested on LX2160A and LS1046A platforms
> 
>  drivers/soc/fsl/rcpm.c | 22 +++++++++++++++++++---
>  1 file changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/soc/fsl/rcpm.c b/drivers/soc/fsl/rcpm.c index
> a093dbe6d2cb..e75a436fb159 100644
> --- a/drivers/soc/fsl/rcpm.c
> +++ b/drivers/soc/fsl/rcpm.c
> @@ -2,10 +2,12 @@
>  //
>  // rcpm.c - Freescale QorIQ RCPM driver  // -// Copyright 2019 NXP
> +// Copyright 2019-2020 NXP
> +// Copyright 2020 Puresoftware Ltd.
>  //
>  // Author: Ran Wang <ran.wang_1@nxp.com>
> 
> +#include <linux/acpi.h>
>  #include <linux/init.h>
>  #include <linux/module.h>
>  #include <linux/platform_device.h>
> @@ -57,8 +59,13 @@ static int rcpm_pm_prepare(struct device *dev)
>  				rcpm->wakeup_cells + 1);
> 
>  		/*  Wakeup source should refer to current rcpm device */
> -		if (ret || (np->phandle != value[0]))
> -			continue;
> +		if (is_acpi_node(dev->fwnode)) {
> +			if (ret)
> +				continue;
> +		} else {
> +			if (ret || (np->phandle != value[0]))
> +				continue;
> +		}
>  		/* Property "#fsl,rcpm-wakeup-cells" of rcpm node defines the
>  		 * number of IPPDEXPCR register cells, and "fsl,rcpm-wakeup"
> @@ -139,10 +146,19 @@ static const struct of_device_id rcpm_of_match[] =
> {  };  MODULE_DEVICE_TABLE(of, rcpm_of_match);
> 
> +#ifdef CONFIG_ACPI
> +static const struct acpi_device_id rcpm_acpi_match[] = {
> +	{ "NXP0015", },
> +	{ }
> +};
> +MODULE_DEVICE_TABLE(acpi, rcpm_acpi_match); #endif
> +
>  static struct platform_driver rcpm_driver = {
>  	.driver = {
>  		.name = "rcpm",
>  		.of_match_table = rcpm_of_match,
> +		.acpi_match_table = ACPI_PTR(rcpm_acpi_match),
>  		.pm	= &rcpm_pm_ops,
>  	},
>  	.probe = rcpm_probe,
> --
> 2.17.1


^ permalink raw reply

* Re: Injecting SLB miltihit crashes kernel 5.9.0-rc5
From: Nicholas Piggin @ 2020-09-16  0:51 UTC (permalink / raw)
  To: linuxppc-dev, mahesh, Michael Ellerman, Michal Suchánek
In-Reply-To: <87bli7p5dx.fsf@mpe.ellerman.id.au>

Excerpts from Michael Ellerman's message of September 15, 2020 10:54 pm:
> Michal Suchánek <msuchanek@suse.de> writes:
>> Hello,
>>
>> Using the SLB mutihit injection test module (which I did not write so I
>> do not want to post it here) to verify updates on my 5.3 frankernekernel
>> I found that the kernel crashes with Oops: kernel bad access.
>>
>> I tested on latest upstream kernel build that I have at hand and the
>> result is te same (minus the message - nothing was logged and the kernel
>> simply rebooted).
> 
> That's disappointing.

It seems to work okay with qemu and mambo injection on upstream
(powernv_defconfig). I wonder why that nmi_enter is crashing.
Can you post the output of a successful test with the patch
reverted?


qemu injection test output - 
[  195.279885][    C0] Disabling lock debugging due to kernel taint
[  195.280891][    C0] MCE: CPU0: machine check (Warning) Host SLB Multihit DAR: 00000000deadbeef [Recovered]
[  195.282117][    C0] MCE: CPU0: NIP: [c00000000003c2b4] isa300_idle_stop_mayloss+0x68/0x6c
[  195.283631][    C0] MCE: CPU0: Initiator CPU
[  195.284432][    C0] MCE: CPU0: Probable Software error (some chance of hardware cause)
[  220.711577][   T90] MCE: CPU0: machine check (Warning) Host SLB Multihit DAR: 00000000deadbeef [Recovered]
[  220.712805][   T90] MCE: CPU0: PID: 90 Comm: yes NIP: [00007fff7fdac2e0]
[  220.713553][   T90] MCE: CPU0: Initiator CPU
[  220.714021][   T90] MCE: CPU0: Probable Software error (some chance of hardware cause)

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH] ibmvfc: Avoid link down on FS9100 canister reboot
From: Martin K. Petersen @ 2020-09-16  0:49 UTC (permalink / raw)
  To: Brian King; +Cc: tyreld, linuxppc-dev, martin.petersen, linux-scsi
In-Reply-To: <1599859706-8505-1-git-send-email-brking@linux.vnet.ibm.com>


Brian,

> When a canister on a FS9100, or similar storage, running in NPIV mode,
> is rebooted, its WWPNs will fail over to another canister.

[...]

Applied to 5.10/scsi-staging, thanks! I fixed a bunch of checkpatch
warnings.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: [PATCH net] ibmvnic: update MAINTAINERS
From: David Miller @ 2020-09-15 20:27 UTC (permalink / raw)
  To: drt; +Cc: netdev, linuxppc-dev
In-Reply-To: <20200915003535.819585-1-drt@linux.ibm.com>

From: Dany Madden <drt@linux.ibm.com>
Date: Mon, 14 Sep 2020 20:35:35 -0400

> Update supporters for IBM Power SRIOV Virtual NIC Device Driver. 
> Thomas Falcon is moving on to other works. Dany Madden, Lijun Pan 
> and Sukadev Bhattiprolu are the current supporters.
> 
> Signed-off-by: Dany Madden <drt@linux.ibm.com>

Applied.

^ permalink raw reply

* Re: [PATCH] scsi: ibmvfc: Fix error return in ibmvfc_probe()
From: Martin K. Petersen @ 2020-09-15 20:16 UTC (permalink / raw)
  To: mpe, benh, Jing Xiangfeng, jejb, paulus, tyreld
  Cc: linuxppc-dev, linux-kernel, Martin K . Petersen, linux-scsi
In-Reply-To: <20200907083949.154251-1-jingxiangfeng@huawei.com>

On Mon, 7 Sep 2020 16:39:49 +0800, Jing Xiangfeng wrote:

> Fix to return error code PTR_ERR() from the error handling case instead
> of 0.

Applied to 5.10/scsi-queue, thanks!

[1/1] scsi: ibmvfc: Fix error return in ibmvfc_probe()
      https://git.kernel.org/mkp/scsi/c/5e48a084f4e8

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* [PATCH v3] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
From: Scott Cheloha @ 2020-09-15 19:46 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Nathan Lynch, Michal Suchanek, Laurent Dufour, David Hildenbrand,
	Rick Lindsley

During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
to determine which node id (nid) to use when later calling __add_memory().

This is wasteful.  On pseries, memory_add_physaddr_to_nid() finds an
appropriate nid for a given address by looking up the LMB containing the
address and then passing that LMB to of_drconf_to_nid_single() to get the
nid.  In dlpar_add_lmb() we get this address from the LMB itself.

In short, we have a pointer to an LMB and then we are searching for
that LMB *again* in order to find its nid.

If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
can skip the redundant lookup.  The only error handling we need to
duplicate from memory_add_physaddr_to_nid() is the fallback to the
default nid when drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or
an invalid nid.

Skipping the extra lookup makes hot-add operations faster, especially
on machines with many LMBs.

Consider an LPAR with 126976 LMBs.  In one test, hot-adding 126000
LMBs on an upatched kernel took ~3.5 hours while a patched kernel
completed the same operation in ~2 hours:

Unpatched (12450 seconds):
Sep  9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
Sep  9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
[...]
Sep  9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added

Patched (7065 seconds):
Sep  8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
Sep  8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
[...]
Sep  8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added

It should be noted that the speedup grows more substantial when
hot-adding LMBs at the end of the drconf range.  This is because we
are skipping a linear LMB search.

To see the distinction, consider smaller hot-add test on the same
LPAR.  A perf-stat run with 10 iterations showed that hot-adding 4096
LMBs completed less than 1 second faster on a patched kernel:

Unpatched:
 Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

        104,753.42 msec task-clock                #    0.992 CPUs utilized            ( +-  0.55% )
             4,708      context-switches          #    0.045 K/sec                    ( +-  0.69% )
             2,444      cpu-migrations            #    0.023 K/sec                    ( +-  1.25% )
               394      page-faults               #    0.004 K/sec                    ( +-  0.22% )
   445,902,503,057      cycles                    #    4.257 GHz                      ( +-  0.55% )  (66.67%)
     8,558,376,740      stalled-cycles-frontend   #    1.92% frontend cycles idle     ( +-  0.88% )  (49.99%)
   300,346,181,651      stalled-cycles-backend    #   67.36% backend cycles idle      ( +-  0.76% )  (50.01%)
   258,091,488,691      instructions              #    0.58  insn per cycle
                                                  #    1.16  stalled cycles per insn  ( +-  0.22% )  (66.67%)
    70,568,169,256      branches                  #  673.660 M/sec                    ( +-  0.17% )  (50.01%)
     3,100,725,426      branch-misses             #    4.39% of all branches          ( +-  0.20% )  (49.99%)

           105.583 +- 0.589 seconds time elapsed  ( +-  0.56% )

Patched:
 Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

        104,055.69 msec task-clock                #    0.993 CPUs utilized            ( +-  0.32% )
             4,606      context-switches          #    0.044 K/sec                    ( +-  0.20% )
             2,463      cpu-migrations            #    0.024 K/sec                    ( +-  0.93% )
               394      page-faults               #    0.004 K/sec                    ( +-  0.25% )
   442,951,129,921      cycles                    #    4.257 GHz                      ( +-  0.32% )  (66.66%)
     8,710,413,329      stalled-cycles-frontend   #    1.97% frontend cycles idle     ( +-  0.47% )  (50.06%)
   299,656,905,836      stalled-cycles-backend    #   67.65% backend cycles idle      ( +-  0.39% )  (50.02%)
   252,731,168,193      instructions              #    0.57  insn per cycle
                                                  #    1.19  stalled cycles per insn  ( +-  0.20% )  (66.66%)
    68,902,851,121      branches                  #  662.173 M/sec                    ( +-  0.13% )  (49.94%)
     3,100,242,882      branch-misses             #    4.50% of all branches          ( +-  0.15% )  (49.98%)

           104.829 +- 0.325 seconds time elapsed  ( +-  0.31% )

This is consistent.  An add-by-count hot-add operation adds LMBs
greedily, so LMBs near the start of the drconf range are considered
first.  On an otherwise idle LPAR with so many LMBs we would expect to
find the LMBs we need near the start of the drconf range, hence the
smaller speedup.

Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
---
Changelog:

v1: https://lore.kernel.org/linuxppc-dev/20200910175637.2865160-1-cheloha@linux.ibm.com/

v2:
- Move prototype for of_drconf_to_nid_single() to topology.h.
  Requested by Michael Ellerman.

v3:
- Send the right patch.  v2 is from the wrong branch, my mistake.

 arch/powerpc/include/asm/topology.h             | 3 +++
 arch/powerpc/mm/numa.c                          | 2 +-
 arch/powerpc/platforms/pseries/hotplug-memory.c | 6 ++++--
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f0b6300e7dd3..d15d9999bad6 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -86,6 +86,9 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 
 #endif /* CONFIG_NUMA */
 
+struct drmem_lmb;
+extern int of_drconf_to_nid_single(struct drmem_lmb *);
+
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
 extern int find_and_online_cpu_nid(int cpu);
 #else
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 1f61fa2148b5..63507b47164d 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -430,7 +430,7 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
  * This is like of_node_to_nid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
  */
-static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
+int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 {
 	struct assoc_arrays aa = { .arrays = NULL };
 	int default_nid = NUMA_NO_NODE;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 0ea976d1cac4..9a533acf8ad0 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -611,8 +611,10 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
 	block_sz = memory_block_size_bytes();
 
-	/* Find the node id for this address. */
-	nid = memory_add_physaddr_to_nid(lmb->base_addr);
+	/* Find the node id for this LMB.  Fake one if necessary. */
+	nid = of_drconf_to_nid_single(lmb);
+	if (nid < 0 || !node_possible(nid))
+		nid = first_online_node;
 
 	/* Add the memory */
 	rc = __add_memory(nid, lmb->base_addr, block_sz);
-- 
2.24.1


^ permalink raw reply related

* Re: [PATCH v2 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
From: David Miller @ 2020-09-15 19:42 UTC (permalink / raw)
  To: npiggin
  Cc: axboe, linux-arch, peterz, aneesh.kumar, linux-kernel, linux-mm,
	sparclinux, akpm, linuxppc-dev
In-Reply-To: <1600139445.qwycwjuwdq.astroid@bobo.none>

From: Nicholas Piggin <npiggin@gmail.com>
Date: Tue, 15 Sep 2020 13:24:07 +1000

> Excerpts from David Miller's message of September 15, 2020 5:59 am:
>> From: Nicholas Piggin <npiggin@gmail.com>
>> Date: Mon, 14 Sep 2020 14:52:18 +1000
>> 
>>  ...
>>> The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
>>> optimisation could be effectively restored by sending IPIs to mm_cpumask
>>> members and having them remove themselves from mm_cpumask. This is more
>>> tricky so I leave it as an exercise for someone with a sparc64 SMP.
>>> powerpc has a (currently similarly broken) example.
>>> 
>>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> 
>> Sad to see this optimization go away, but what can I do:
>> 
>> Acked-by: David S. Miller <davem@davemloft.net>
>> 
> 
> Thanks Dave, any objection if we merge this via the powerpc tree
> to keep the commits together?

No objection.

^ permalink raw reply

* Re: [PATCH] Revert "powerpc/64s: machine check interrupt update NMI accounting"
From: peterz @ 2020-09-15 18:16 UTC (permalink / raw)
  To: Michal Suchanek
  Cc: Aneesh Kumar K.V, Santosh Sivaraj, Alistair Popple,
	Greg Kroah-Hartman, Mahesh Salgaonkar, Nicholas Piggin,
	linux-kernel, Jordan Niethe, Paul Mackerras, Ganesh Goudar,
	stable, linuxppc-dev, Mike Rapoport
In-Reply-To: <20200915180659.12503-1-msuchanek@suse.de>

On Tue, Sep 15, 2020 at 08:06:59PM +0200, Michal Suchanek wrote:
> This reverts commit 116ac378bb3ff844df333e7609e7604651a0db9d.
> 
> This commit causes the kernel to oops and reboot when injecting a SLB
> multihit which causes a MCE.
> 
> Before this commit a SLB multihit was corrected by the kernel and the
> system continued to operate normally.
> 
> cc: stable@vger.kernel.org
> Fixes: 116ac378bb3f ("powerpc/64s: machine check interrupt update NMI accounting")
> Signed-off-by: Michal Suchanek <msuchanek@suse.de>

Ever since 69ea03b56ed2 ("hardirq/nmi: Allow nested nmi_enter()")
nmi_enter() supports nesting natively.

^ permalink raw reply

* [PATCH] Revert "powerpc/64s: machine check interrupt update NMI accounting"
From: Michal Suchanek @ 2020-09-15 18:06 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Aneesh Kumar K.V, Santosh Sivaraj, Alistair Popple,
	Greg Kroah-Hartman, Peter Zijlstra, Mahesh Salgaonkar,
	Nicholas Piggin, linux-kernel, Paul Mackerras, Ganesh Goudar,
	stable, Michal Suchanek, Jordan Niethe, Mike Rapoport
In-Reply-To: <20200915084302.GG29778@kitsune.suse.cz>

This reverts commit 116ac378bb3ff844df333e7609e7604651a0db9d.

This commit causes the kernel to oops and reboot when injecting a SLB
multihit which causes a MCE.

Before this commit a SLB multihit was corrected by the kernel and the
system continued to operate normally.

cc: stable@vger.kernel.org
Fixes: 116ac378bb3f ("powerpc/64s: machine check interrupt update NMI accounting")
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
---
 arch/powerpc/kernel/mce.c   |  7 -------
 arch/powerpc/kernel/traps.c | 18 +++---------------
 2 files changed, 3 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index ada59f6c4298..2e13528dcc92 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -591,14 +591,10 @@ EXPORT_SYMBOL_GPL(machine_check_print_event_info);
 long notrace machine_check_early(struct pt_regs *regs)
 {
 	long handled = 0;
-	bool nested = in_nmi();
 	u8 ftrace_enabled = this_cpu_get_ftrace_enabled();
 
 	this_cpu_set_ftrace_enabled(0);
 
-	if (!nested)
-		nmi_enter();
-
 	hv_nmi_check_nonrecoverable(regs);
 
 	/*
@@ -607,9 +603,6 @@ long notrace machine_check_early(struct pt_regs *regs)
 	if (ppc_md.machine_check_early)
 		handled = ppc_md.machine_check_early(regs);
 
-	if (!nested)
-		nmi_exit();
-
 	this_cpu_set_ftrace_enabled(ftrace_enabled);
 
 	return handled;
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index d1ebe152f210..7853b770918d 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -827,19 +827,7 @@ void machine_check_exception(struct pt_regs *regs)
 {
 	int recover = 0;
 
-	/*
-	 * BOOK3S_64 does not call this handler as a non-maskable interrupt
-	 * (it uses its own early real-mode handler to handle the MCE proper
-	 * and then raises irq_work to call this handler when interrupts are
-	 * enabled).
-	 *
-	 * This is silly. The BOOK3S_64 should just call a different function
-	 * rather than expecting semantics to magically change. Something
-	 * like 'non_nmi_machine_check_exception()', perhaps?
-	 */
-	const bool nmi = !IS_ENABLED(CONFIG_PPC_BOOK3S_64);
-
-	if (nmi) nmi_enter();
+	nmi_enter();
 
 	__this_cpu_inc(irq_stat.mce_exceptions);
 
@@ -865,7 +853,7 @@ void machine_check_exception(struct pt_regs *regs)
 	if (check_io_access(regs))
 		goto bail;
 
-	if (nmi) nmi_exit();
+	nmi_exit();
 
 	die("Machine check", regs, SIGBUS);
 
@@ -876,7 +864,7 @@ void machine_check_exception(struct pt_regs *regs)
 	return;
 
 bail:
-	if (nmi) nmi_exit();
+	nmi_exit();
 }
 
 void SMIException(struct pt_regs *regs)
-- 
2.28.0


^ permalink raw reply related

* [PATCH v2] pseries/hotplug-memory: hot-add: skip redundant LMB lookup
From: Scott Cheloha @ 2020-09-15 17:55 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Nathan Lynch, Michal Suchanek, Laurent Dufour, David Hildenbrand,
	Rick Lindsley

During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
to determine which node id (nid) to use when later calling __add_memory().

This is wasteful.  On pseries, memory_add_physaddr_to_nid() finds an
appropriate nid for a given address by looking up the LMB containing the
address and then passing that LMB to of_drconf_to_nid_single() to get the
nid.  In dlpar_add_lmb() we get this address from the LMB itself.

In short, we have a pointer to an LMB and then we are searching for
that LMB *again* in order to find its nid.

If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
can skip the redundant lookup.  The only error handling we need to
duplicate from memory_add_physaddr_to_nid() is the fallback to the
default nid when drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or
an invalid nid.

Skipping the extra lookup makes hot-add operations faster, especially
on machines with many LMBs.

Consider an LPAR with 126976 LMBs.  In one test, hot-adding 126000
LMBs on an upatched kernel took ~3.5 hours while a patched kernel
completed the same operation in ~2 hours:

Unpatched (12450 seconds):
Sep  9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
Sep  9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
[...]
Sep  9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added

Patched (7065 seconds):
Sep  8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
Sep  8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
[...]
Sep  8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added

It should be noted that the speedup grows more substantial when
hot-adding LMBs at the end of the drconf range.  This is because we
are skipping a linear LMB search.

To see the distinction, consider smaller hot-add test on the same
LPAR.  A perf-stat run with 10 iterations showed that hot-adding 4096
LMBs completed less than 1 second faster on a patched kernel:

Unpatched:
 Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

        104,753.42 msec task-clock                #    0.992 CPUs utilized            ( +-  0.55% )
             4,708      context-switches          #    0.045 K/sec                    ( +-  0.69% )
             2,444      cpu-migrations            #    0.023 K/sec                    ( +-  1.25% )
               394      page-faults               #    0.004 K/sec                    ( +-  0.22% )
   445,902,503,057      cycles                    #    4.257 GHz                      ( +-  0.55% )  (66.67%)
     8,558,376,740      stalled-cycles-frontend   #    1.92% frontend cycles idle     ( +-  0.88% )  (49.99%)
   300,346,181,651      stalled-cycles-backend    #   67.36% backend cycles idle      ( +-  0.76% )  (50.01%)
   258,091,488,691      instructions              #    0.58  insn per cycle
                                                  #    1.16  stalled cycles per insn  ( +-  0.22% )  (66.67%)
    70,568,169,256      branches                  #  673.660 M/sec                    ( +-  0.17% )  (50.01%)
     3,100,725,426      branch-misses             #    4.39% of all branches          ( +-  0.20% )  (49.99%)

           105.583 +- 0.589 seconds time elapsed  ( +-  0.56% )

Patched:
 Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):

        104,055.69 msec task-clock                #    0.993 CPUs utilized            ( +-  0.32% )
             4,606      context-switches          #    0.044 K/sec                    ( +-  0.20% )
             2,463      cpu-migrations            #    0.024 K/sec                    ( +-  0.93% )
               394      page-faults               #    0.004 K/sec                    ( +-  0.25% )
   442,951,129,921      cycles                    #    4.257 GHz                      ( +-  0.32% )  (66.66%)
     8,710,413,329      stalled-cycles-frontend   #    1.97% frontend cycles idle     ( +-  0.47% )  (50.06%)
   299,656,905,836      stalled-cycles-backend    #   67.65% backend cycles idle      ( +-  0.39% )  (50.02%)
   252,731,168,193      instructions              #    0.57  insn per cycle
                                                  #    1.19  stalled cycles per insn  ( +-  0.20% )  (66.66%)
    68,902,851,121      branches                  #  662.173 M/sec                    ( +-  0.13% )  (49.94%)
     3,100,242,882      branch-misses             #    4.50% of all branches          ( +-  0.15% )  (49.98%)

           104.829 +- 0.325 seconds time elapsed  ( +-  0.31% )

This is consistent.  An add-by-count hot-add operation adds LMBs
greedily, so LMBs near the start of the drconf range are considered
first.  On an otherwise idle LPAR with so many LMBs we would expect to
find the LMBs we need near the start of the drconf range, hence the
smaller speedup.

Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
---
Changelog:

v1: https://lore.kernel.org/linuxppc-dev/20200910175637.2865160-1-cheloha@linux.ibm.com/

v2:
- Move prototype for of_drconf_to_nid_single() to topology.h.
  Requested by Michael Ellerman.

 arch/powerpc/include/asm/topology.h           |   2 +
 arch/powerpc/mm/numa.c                        |   2 +-
 .../platforms/pseries/hotplug-memory.c        | 101 +++++++-----------
 3 files changed, 44 insertions(+), 61 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f0b6300e7dd3..afd7e0513a65 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -96,6 +96,8 @@ static inline int find_and_online_cpu_nid(int cpu)
 
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
+extern int of_drconf_to_nid_single(struct drmem_lmb *);
+
 #include <asm-generic/topology.h>
 
 #ifdef CONFIG_SMP
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 1f61fa2148b5..63507b47164d 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -430,7 +430,7 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
  * This is like of_node_to_nid_single() for memory represented in the
  * ibm,dynamic-reconfiguration-memory node.
  */
-static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
+int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 {
 	struct assoc_arrays aa = { .arrays = NULL };
 	int default_nid = NUMA_NO_NODE;
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 0ea976d1cac4..f4474ef91fe5 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -388,108 +388,87 @@ static int dlpar_remove_lmb(struct drmem_lmb *lmb)
 static int dlpar_memory_remove_by_count(u32 lmbs_to_remove)
 {
 	struct drmem_lmb *lmb;
-	int lmbs_removed = 0;
-	int lmbs_available = 0;
+	u32 lmbs_available, lmbs_removed;
 	int rc;
+	boolean readd;
 
-	pr_info("Attempting to hot-remove %d LMB(s)\n", lmbs_to_remove);
+	lmbs_available = lmbs_removed = 0;
+	readd = false;
 
-	if (lmbs_to_remove == 0)
-		return -EINVAL;
+	pr_info("attempting to hot-remove %u LMB(s)\n", lmbs_to_remove);
 
 	/* Validate that there are enough LMBs to satisfy the request */
 	for_each_drmem_lmb(lmb) {
-		if (lmb_is_removable(lmb))
-			lmbs_available++;
-
 		if (lmbs_available == lmbs_to_remove)
 			break;
+		if (lmb_is_removable(lmb))
+			lmbs_available++;
 	}
 
 	if (lmbs_available < lmbs_to_remove) {
-		pr_info("Not enough LMBs available (%d of %d) to satisfy request\n",
+		pr_info("hot-remove failed: insufficient LMB(s): have %u/%u\n",
 			lmbs_available, lmbs_to_remove);
 		return -EINVAL;
 	}
 
 	for_each_drmem_lmb(lmb) {
-		rc = dlpar_remove_lmb(lmb);
-		if (rc)
+		if (lmbs_removed == lmbs_to_remove)
+			break;
+		if (dlpar_remove_lmb(lmb))
 			continue;
 
-		/* Mark this lmb so we can add it later if all of the
-		 * requested LMBs cannot be removed.
+		/*
+		 * Success!  Mark the LMB so we can readd it later if
+		 * the request fails.
 		 */
 		drmem_mark_lmb_reserved(lmb);
-
 		lmbs_removed++;
-		if (lmbs_removed == lmbs_to_remove)
-			break;
+		pr_debug("hot-removed LMB %u\n", lmb->drc_index);
 	}
 
 	if (lmbs_removed != lmbs_to_remove) {
-		pr_err("Memory hot-remove failed, adding LMB's back\n");
-
-		for_each_drmem_lmb(lmb) {
-			if (!drmem_lmb_reserved(lmb))
-				continue;
+		pr_err("hot-remove failed: readding LMB(s)\n");
+		readd = true;
+	}
 
-			rc = dlpar_add_lmb(lmb);
-			if (rc)
-				pr_err("Failed to add LMB back, drc index %x\n",
+	for_each_drmem_lmb(lmb) {
+		if (!drmem_lmb_reserved(lmb))
+			continue;
+		if (readd) {
+			if (dlpar_add_lmb(lmb)) {
+				pr_err("failed to readd LMB %u\n",
 				       lmb->drc_index);
-
-			drmem_remove_lmb_reservation(lmb);
-		}
-
-		rc = -EINVAL;
-	} else {
-		for_each_drmem_lmb(lmb) {
-			if (!drmem_lmb_reserved(lmb))
-				continue;
-
+			}
+		} else
 			dlpar_release_drc(lmb->drc_index);
-			pr_info("Memory at %llx was hot-removed\n",
-				lmb->base_addr);
-
-			drmem_remove_lmb_reservation(lmb);
-		}
-		rc = 0;
+		drmem_remove_lmb_reservation(lmb);
 	}
 
-	return rc;
+	return (readd) ? -EINVAL : 0;
 }
 
 static int dlpar_memory_remove_by_index(u32 drc_index)
 {
 	struct drmem_lmb *lmb;
-	int lmb_found;
 	int rc;
 
-	pr_info("Attempting to hot-remove LMB, drc index %x\n", drc_index);
-
 	lmb_found = 0;
 	for_each_drmem_lmb(lmb) {
 		if (lmb->drc_index == drc_index) {
-			lmb_found = 1;
 			rc = dlpar_remove_lmb(lmb);
-			if (!rc)
+			if (!rc) {
 				dlpar_release_drc(lmb->drc_index);
-
-			break;
+				pr_info("hot-removed LMB %u\n", drc_index);
+			} else {
+				pr_err("failed to hot-remove LMB %u\n",
+				       drc_index);
+			}
+			return rc;
 		}
 	}
 
-	if (!lmb_found)
-		rc = -EINVAL;
-
-	if (rc)
-		pr_info("Failed to hot-remove memory at %llx\n",
-			lmb->base_addr);
-	else
-		pr_info("Memory at %llx was hot-removed\n", lmb->base_addr);
-
-	return rc;
+	pr_err("failed to hot-remove LMB %u: no such LMB\n", drc_index);
+	return -EINVAL;
 }
 
 static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index)
@@ -611,8 +590,10 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
 
 	block_sz = memory_block_size_bytes();
 
-	/* Find the node id for this address. */
-	nid = memory_add_physaddr_to_nid(lmb->base_addr);
+	/* Find the node id for this LMB.  Fake one if necessary. */
+	nid = of_drconf_to_nid_single(lmb);
+	if (nid < 0 || !node_possible(nid))
+		nid = first_online_node;
 
 	/* Add the memory */
 	rc = __add_memory(nid, lmb->base_addr, block_sz);
-- 
2.24.1


^ permalink raw reply related

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
From: John Hubbard @ 2020-09-15 17:31 UTC (permalink / raw)
  To: Vasily Gorbik, Jason Gunthorpe
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Richard Weinberger, linux-x86,
	Russell King, Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Gerald Schaefer, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, Dave Hansen, linux-power, LKML,
	Andrew Morton, Linus Torvalds, Mike Rapoport
In-Reply-To: <patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours>

On 9/11/20 1:36 PM, Vasily Gorbik wrote:
> Currently to make sure that every page table entry is read just once
> gup_fast walks perform READ_ONCE and pass pXd value down to the next
> gup_pXd_range function by value e.g.:
> 
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                           unsigned int flags, struct page **pages, int *nr)
> ...
>          pudp = pud_offset(&p4d, addr);
> 
> This function passes a reference on that local value copy to pXd_offset,
> and might get the very same pointer in return. This happens when the
> level is folded (on most arches), and that pointer should not be iterated.
> 
> On s390 due to the fact that each task might have different 5,4 or
> 3-level address translation and hence different levels folded the logic
> is more complex and non-iteratable pointer to a local copy leads to
> severe problems.
> 
> Here is an example of what happens with gup_fast on s390, for a task
> with 3-levels paging, crossing a 2 GB pud boundary:
> 
> // addr = 0x1007ffff000, end = 0x10080001000
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                           unsigned int flags, struct page **pages, int *nr)
> {
>          unsigned long next;
>          pud_t *pudp;
> 
>          // pud_offset returns &p4d itself (a pointer to a value on stack)
>          pudp = pud_offset(&p4d, addr);
>          do {
>                  // on second iteratation reading "random" stack value
>                  pud_t pud = READ_ONCE(*pudp);
> 
>                  // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
>                  next = pud_addr_end(addr, end);
>                  ...
>          } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
> 
>          return 1;
> }
> 
> This happens since s390 moved to common gup code with
> commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
> and commit 1a42010cdc26 ("s390/mm: convert to the generic
> get_user_pages_fast code"). s390 tried to mimic static level folding by
> changing pXd_offset primitives to always calculate top level page table
> offset in pgd_offset and just return the value passed when pXd_offset
> has to act as folded.
> 
> What is crucial for gup_fast and what has been overlooked is
> that PxD_SIZE/MASK and thus pXd_addr_end should also change
> correspondingly. And the latter is not possible with dynamic folding.
> 
> To fix the issue in addition to pXd values pass original
> pXdp pointers down to gup_pXd_range functions. And introduce
> pXd_offset_lockless helpers, which take an additional pXd
> entry value parameter. This has already been discussed in
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> 
> Cc: <stable@vger.kernel.org> # 5.2+
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
> ---

Looks cleaner than I'd dared hope for. :)

Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA

> v2: added brackets &pgd -> &(pgd)
> 
>   arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
>   include/linux/pgtable.h         | 10 ++++++++
>   mm/gup.c                        | 18 +++++++-------
>   3 files changed, 49 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..b55561cc8786 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
>   
>   #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
>   
> -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
> +static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
>   {
> -	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> -		return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
> -	return (p4d_t *) pgd;
> +	if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> +		return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
> +	return (p4d_t *) pgdp;
>   }
> +#define p4d_offset_lockless p4d_offset_lockless
>   
> -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
>   {
> -	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> -		return (pud_t *) p4d_deref(*p4d) + pud_index(address);
> -	return (pud_t *) p4d;
> +	return p4d_offset_lockless(pgdp, *pgdp, address);
> +}
> +
> +static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
> +{
> +	if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> +		return (pud_t *) p4d_deref(p4d) + pud_index(address);
> +	return (pud_t *) p4dp;
> +}
> +#define pud_offset_lockless pud_offset_lockless
> +
> +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
> +{
> +	return pud_offset_lockless(p4dp, *p4dp, address);
>   }
>   #define pud_offset pud_offset
>   
> -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
> +static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
> +{
> +	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> +		return (pmd_t *) pud_deref(pud) + pmd_index(address);
> +	return (pmd_t *) pudp;
> +}
> +#define pmd_offset_lockless pmd_offset_lockless
> +
> +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
>   {
> -	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> -		return (pmd_t *) pud_deref(*pud) + pmd_index(address);
> -	return (pmd_t *) pud;
> +	return pmd_offset_lockless(pudp, *pudp, address);
>   }
>   #define pmd_offset pmd_offset
>   
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..90654cb63e9e 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
>   #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
>   #endif
>   
> +#ifndef p4d_offset_lockless
> +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
> +#endif
> +#ifndef pud_offset_lockless
> +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
> +#endif
> +#ifndef pmd_offset_lockless
> +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)
> +#endif
> +
>   /*
>    * p?d_leaf() - true if this entry is a final mapping to a physical address.
>    * This differs from p?d_huge() by the fact that they are always available (if
> diff --git a/mm/gup.c b/mm/gup.c
> index e5739a1974d5..578bf5bd8bf8 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>   	return 1;
>   }
>   
> -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
>   		unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	pmd_t *pmdp;
>   
> -	pmdp = pmd_offset(&pud, addr);
> +	pmdp = pmd_offset_lockless(pudp, pud, addr);
>   	do {
>   		pmd_t pmd = READ_ONCE(*pmdp);
>   
> @@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>   	return 1;
>   }
>   
> -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
> +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
>   			 unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	pud_t *pudp;
>   
> -	pudp = pud_offset(&p4d, addr);
> +	pudp = pud_offset_lockless(p4dp, p4d, addr);
>   	do {
>   		pud_t pud = READ_ONCE(*pudp);
>   
> @@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
>   					 PUD_SHIFT, next, flags, pages, nr))
>   				return 0;
> -		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
> +		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
>   			return 0;
>   	} while (pudp++, addr = next, addr != end);
>   
>   	return 1;
>   }
>   
> -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
> +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
>   			 unsigned int flags, struct page **pages, int *nr)
>   {
>   	unsigned long next;
>   	p4d_t *p4dp;
>   
> -	p4dp = p4d_offset(&pgd, addr);
> +	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
>   	do {
>   		p4d_t p4d = READ_ONCE(*p4dp);
>   
> @@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
>   					 P4D_SHIFT, next, flags, pages, nr))
>   				return 0;
> -		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
> +		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
>   			return 0;
>   	} while (p4dp++, addr = next, addr != end);
>   
> @@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>   			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
>   					 PGDIR_SHIFT, next, flags, pages, nr))
>   				return;
> -		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
> +		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
>   			return;
>   	} while (pgdp++, addr = next, addr != end);
>   }
> 


^ permalink raw reply

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
From: Mike Rapoport @ 2020-09-15 17:18 UTC (permalink / raw)
  To: Vasily Gorbik
  Cc: Peter Zijlstra, Dave Hansen, Dave Hansen, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Christian Borntraeger, Richard Weinberger,
	linux-x86, Russell King, Jason Gunthorpe, Ingo Molnar,
	Catalin Marinas, Andrey Ryabinin, Gerald Schaefer, Heiko Carstens,
	Arnd Bergmann, John Hubbard, Jeff Dike, linux-um, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, linux-arm, linux-mm,
	linux-power, LKML, Andrew Morton, Linus Torvalds
In-Reply-To: <patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours>

On Fri, Sep 11, 2020 at 10:36:43PM +0200, Vasily Gorbik wrote:
> Currently to make sure that every page table entry is read just once
> gup_fast walks perform READ_ONCE and pass pXd value down to the next
> gup_pXd_range function by value e.g.:
> 
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> ...
>         pudp = pud_offset(&p4d, addr);
> 
> This function passes a reference on that local value copy to pXd_offset,
> and might get the very same pointer in return. This happens when the
> level is folded (on most arches), and that pointer should not be iterated.
> 
> On s390 due to the fact that each task might have different 5,4 or
> 3-level address translation and hence different levels folded the logic
> is more complex and non-iteratable pointer to a local copy leads to
> severe problems.
> 
> Here is an example of what happens with gup_fast on s390, for a task
> with 3-levels paging, crossing a 2 GB pud boundary:
> 
> // addr = 0x1007ffff000, end = 0x10080001000
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> {
>         unsigned long next;
>         pud_t *pudp;
> 
>         // pud_offset returns &p4d itself (a pointer to a value on stack)
>         pudp = pud_offset(&p4d, addr);
>         do {
>                 // on second iteratation reading "random" stack value
>                 pud_t pud = READ_ONCE(*pudp);
> 
>                 // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
>                 next = pud_addr_end(addr, end);
>                 ...
>         } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
> 
>         return 1;
> }
> 
> This happens since s390 moved to common gup code with
> commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
> and commit 1a42010cdc26 ("s390/mm: convert to the generic
> get_user_pages_fast code"). s390 tried to mimic static level folding by
> changing pXd_offset primitives to always calculate top level page table
> offset in pgd_offset and just return the value passed when pXd_offset
> has to act as folded.
> 
> What is crucial for gup_fast and what has been overlooked is
> that PxD_SIZE/MASK and thus pXd_addr_end should also change
> correspondingly. And the latter is not possible with dynamic folding.
> 
> To fix the issue in addition to pXd values pass original
> pXdp pointers down to gup_pXd_range functions. And introduce
> pXd_offset_lockless helpers, which take an additional pXd
> entry value parameter. This has already been discussed in
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> 
> Cc: <stable@vger.kernel.org> # 5.2+
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> ---
> v2: added brackets &pgd -> &(pgd)
> 
>  arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
>  include/linux/pgtable.h         | 10 ++++++++
>  mm/gup.c                        | 18 +++++++-------
>  3 files changed, 49 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 7eb01a5459cd..b55561cc8786 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_t *pgd, unsigned long address)
>  
>  #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address)
>  
> -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
> +static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address)
>  {
> -	if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> -		return (p4d_t *) pgd_deref(*pgd) + p4d_index(address);
> -	return (p4d_t *) pgd;
> +	if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1)
> +		return (p4d_t *) pgd_deref(pgd) + p4d_index(address);
> +	return (p4d_t *) pgdp;
>  }
> +#define p4d_offset_lockless p4d_offset_lockless
>  
> -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address)
>  {
> -	if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> -		return (pud_t *) p4d_deref(*p4d) + pud_index(address);
> -	return (pud_t *) p4d;
> +	return p4d_offset_lockless(pgdp, *pgdp, address);
> +}
> +
> +static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address)
> +{
> +	if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2)
> +		return (pud_t *) p4d_deref(p4d) + pud_index(address);
> +	return (pud_t *) p4dp;
> +}
> +#define pud_offset_lockless pud_offset_lockless
> +
> +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address)
> +{
> +	return pud_offset_lockless(p4dp, *p4dp, address);
>  }
>  #define pud_offset pud_offset
>  
> -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
> +static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address)
> +{
> +	if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> +		return (pmd_t *) pud_deref(pud) + pmd_index(address);
> +	return (pmd_t *) pudp;
> +}
> +#define pmd_offset_lockless pmd_offset_lockless
> +
> +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address)
>  {
> -	if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3)
> -		return (pmd_t *) pud_deref(*pud) + pmd_index(address);
> -	return (pmd_t *) pud;
> +	return pmd_offset_lockless(pudp, *pudp, address);
>  }
>  #define pmd_offset pmd_offset
>  
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e8cbc2e795d5..90654cb63e9e 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask;
>  #define mm_pmd_folded(mm)	__is_defined(__PAGETABLE_PMD_FOLDED)
>  #endif
>  
> +#ifndef p4d_offset_lockless
> +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address)
> +#endif
> +#ifndef pud_offset_lockless
> +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
> +#endif
> +#ifndef pmd_offset_lockless
> +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address)
> +#endif
> +
>  /*
>   * p?d_leaf() - true if this entry is a final mapping to a physical address.
>   * This differs from p?d_huge() by the fact that they are always available (if
> diff --git a/mm/gup.c b/mm/gup.c
> index e5739a1974d5..578bf5bd8bf8 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>  	return 1;
>  }
>  
> -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
> +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end,
>  		unsigned int flags, struct page **pages, int *nr)
>  {
>  	unsigned long next;
>  	pmd_t *pmdp;
>  
> -	pmdp = pmd_offset(&pud, addr);
> +	pmdp = pmd_offset_lockless(pudp, pud, addr);
>  	do {
>  		pmd_t pmd = READ_ONCE(*pmdp);
>  
> @@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
>  	return 1;
>  }
>  
> -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
> +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end,
>  			 unsigned int flags, struct page **pages, int *nr)
>  {
>  	unsigned long next;
>  	pud_t *pudp;
>  
> -	pudp = pud_offset(&p4d, addr);
> +	pudp = pud_offset_lockless(p4dp, p4d, addr);
>  	do {
>  		pud_t pud = READ_ONCE(*pudp);
>  
> @@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>  			if (!gup_huge_pd(__hugepd(pud_val(pud)), addr,
>  					 PUD_SHIFT, next, flags, pages, nr))
>  				return 0;
> -		} else if (!gup_pmd_range(pud, addr, next, flags, pages, nr))
> +		} else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr))
>  			return 0;
>  	} while (pudp++, addr = next, addr != end);
>  
>  	return 1;
>  }
>  
> -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
> +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end,
>  			 unsigned int flags, struct page **pages, int *nr)
>  {
>  	unsigned long next;
>  	p4d_t *p4dp;
>  
> -	p4dp = p4d_offset(&pgd, addr);
> +	p4dp = p4d_offset_lockless(pgdp, pgd, addr);
>  	do {
>  		p4d_t p4d = READ_ONCE(*p4dp);
>  
> @@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end,
>  			if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr,
>  					 P4D_SHIFT, next, flags, pages, nr))
>  				return 0;
> -		} else if (!gup_pud_range(p4d, addr, next, flags, pages, nr))
> +		} else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr))
>  			return 0;
>  	} while (p4dp++, addr = next, addr != end);
>  
> @@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long addr, unsigned long end,
>  			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
>  					 PGDIR_SHIFT, next, flags, pages, nr))
>  				return;
> -		} else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr))
> +		} else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr))
>  			return;
>  	} while (pgdp++, addr = next, addr != end);
>  }
> -- 
> ⣿⣿⣿⣿⢋⡀⣀⠹⣿⣿⣿⣿
> ⣿⣿⣿⣿⠠⣶⡦⠀⣿⣿⣿⣿
> ⣿⣿⣿⠏⣴⣮⣴⣧⠈⢿⣿⣿
> ⣿⣿⡏⢰⣿⠖⣠⣿⡆⠈⣿⣿
> ⣿⢛⣵⣄⠙⣶⣶⡟⣅⣠⠹⣿
> ⣿⣜⣛⠻⢎⣉⣉⣀⠿⣫⣵⣿

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
From: Jason Gunthorpe @ 2020-09-15 17:14 UTC (permalink / raw)
  To: Vasily Gorbik
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Richard Weinberger, linux-x86,
	Russell King, Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Gerald Schaefer, Heiko Carstens, Arnd Bergmann,
	John Hubbard, Jeff Dike, linux-um, Borislav Petkov,
	Andy Lutomirski, Thomas Gleixner, linux-arm, Dave Hansen,
	linux-power, LKML, Andrew Morton, Linus Torvalds, Mike Rapoport
In-Reply-To: <patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours>

On Fri, Sep 11, 2020 at 10:36:43PM +0200, Vasily Gorbik wrote:
> Currently to make sure that every page table entry is read just once
> gup_fast walks perform READ_ONCE and pass pXd value down to the next
> gup_pXd_range function by value e.g.:
> 
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> ...
>         pudp = pud_offset(&p4d, addr);
> 
> This function passes a reference on that local value copy to pXd_offset,
> and might get the very same pointer in return. This happens when the
> level is folded (on most arches), and that pointer should not be iterated.
> 
> On s390 due to the fact that each task might have different 5,4 or
> 3-level address translation and hence different levels folded the logic
> is more complex and non-iteratable pointer to a local copy leads to
> severe problems.
> 
> Here is an example of what happens with gup_fast on s390, for a task
> with 3-levels paging, crossing a 2 GB pud boundary:
> 
> // addr = 0x1007ffff000, end = 0x10080001000
> static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
>                          unsigned int flags, struct page **pages, int *nr)
> {
>         unsigned long next;
>         pud_t *pudp;
> 
>         // pud_offset returns &p4d itself (a pointer to a value on stack)
>         pudp = pud_offset(&p4d, addr);
>         do {
>                 // on second iteratation reading "random" stack value
>                 pud_t pud = READ_ONCE(*pudp);
> 
>                 // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
>                 next = pud_addr_end(addr, end);
>                 ...
>         } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
> 
>         return 1;
> }
> 
> This happens since s390 moved to common gup code with
> commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust")
> and commit 1a42010cdc26 ("s390/mm: convert to the generic
> get_user_pages_fast code"). s390 tried to mimic static level folding by
> changing pXd_offset primitives to always calculate top level page table
> offset in pgd_offset and just return the value passed when pXd_offset
> has to act as folded.
> 
> What is crucial for gup_fast and what has been overlooked is
> that PxD_SIZE/MASK and thus pXd_addr_end should also change
> correspondingly. And the latter is not possible with dynamic folding.
> 
> To fix the issue in addition to pXd values pass original
> pXdp pointers down to gup_pXd_range functions. And introduce
> pXd_offset_lockless helpers, which take an additional pXd
> entry value parameter. This has already been discussed in
> https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
> 
> Cc: <stable@vger.kernel.org> # 5.2+
> Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code")
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
> ---
> v2: added brackets &pgd -> &(pgd)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Regards,
Jason

^ permalink raw reply

* Re: [PATCH v2] mm/gup: fix gup_fast with dynamic page table folding
From: Vasily Gorbik @ 2020-09-15 17:09 UTC (permalink / raw)
  To: Andrew Morton, Jason Gunthorpe, John Hubbard
  Cc: Peter Zijlstra, Dave Hansen, linux-mm, Paul Mackerras,
	linux-sparc, Alexander Gordeev, Claudio Imbrenda, Will Deacon,
	linux-arch, linux-s390, Richard Weinberger, linux-x86,
	Russell King, Christian Borntraeger, Ingo Molnar, Catalin Marinas,
	Andrey Ryabinin, Gerald Schaefer, Heiko Carstens, Arnd Bergmann,
	Jeff Dike, linux-um, Borislav Petkov, Andy Lutomirski,
	Thomas Gleixner, linux-arm, Dave Hansen, linux-power, LKML,
	Linus Torvalds, Mike Rapoport
In-Reply-To: <patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours>

On Fri, Sep 11, 2020 at 10:36:43PM +0200, Vasily Gorbik wrote:
> Currently to make sure that every page table entry is read just once
> gup_fast walks perform READ_ONCE and pass pXd value down to the next
> gup_pXd_range function by value e.g.:
...snip...
> ---
> v2: added brackets &pgd -> &(pgd)
> 
>  arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++++----------
>  include/linux/pgtable.h         | 10 ++++++++
>  mm/gup.c                        | 18 +++++++-------
>  3 files changed, 49 insertions(+), 21 deletions(-)

Andrew, any chance you would pick this up?

There is an Ack from Linus. And I haven't seen any objections from Jason or John.
This seems to be as safe for other architectures as possible.

@Jason and John
Any acks/nacks?

Thank you,
Vasily

^ permalink raw reply

* Re: [PATCH v2 2/7] powerpc/prom: Introduce early_reserve_mem_old()
From: Christophe Leroy @ 2020-09-15 16:46 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: Christophe Leroy, linuxppc-dev
In-Reply-To: <20200914211007.2285999-3-clg@kaod.org>

Cédric Le Goater <clg@kaod.org> a écrit :

> and condition its call with IS_ENABLED(CONFIG_PPC32). This fixes a
> compile error with W=1.
>
> arch/powerpc/kernel/prom.c: In function ‘early_reserve_mem’:
> arch/powerpc/kernel/prom.c:625:10: error: variable ‘reserve_map’ set  
> but not used [-Werror=unused-but-set-variable]
>   __be64 *reserve_map;
>           ^~~~~~~~~~~
> cc1: all warnings being treated as errors
>
> Cc: Christophe Leroy <christophe.leroy@c-s.fr>

@csgroup.eu instead of @c-s.fr please

> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/kernel/prom.c | 37 ++++++++++++++++++++-----------------
>  1 file changed, 20 insertions(+), 17 deletions(-)

That's a lot of changes for a tiny warning.

You could make it easy by just replacing the #ifdef by:

         if (!IS_ENABLED(CONFIG_PPC32))
                 return;

>
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index d8a2fb87ba0c..c958b67cf1a5 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -620,27 +620,14 @@ static void __init early_reserve_mem_dt(void)
>  	}
>  }
>
> -static void __init early_reserve_mem(void)
> +static void __init early_reserve_mem_old(void)

Why _old ? Do you mean ppc32 are old ? Modern ADSL boxes like for  
instance the famous French freebox have powerpc32 microcontroller.
Eventually you could name it _ppc32, but I don't think that's the good  
way, see above.

Christophe

>  {
>  	__be64 *reserve_map;
>
>  	reserve_map = (__be64 *)(((unsigned long)initial_boot_params) +
>  			fdt_off_mem_rsvmap(initial_boot_params));
>
> -	/* Look for the new "reserved-regions" property in the DT */
> -	early_reserve_mem_dt();
> -
> -#ifdef CONFIG_BLK_DEV_INITRD
> -	/* Then reserve the initrd, if any */
> -	if (initrd_start && (initrd_end > initrd_start)) {
> -		memblock_reserve(ALIGN_DOWN(__pa(initrd_start), PAGE_SIZE),
> -			ALIGN(initrd_end, PAGE_SIZE) -
> -			ALIGN_DOWN(initrd_start, PAGE_SIZE));
> -	}
> -#endif /* CONFIG_BLK_DEV_INITRD */
> -
> -#ifdef CONFIG_PPC32
> -	/*
> +	/*
>  	 * Handle the case where we might be booting from an old kexec
>  	 * image that setup the mem_rsvmap as pairs of 32-bit values
>  	 */
> @@ -658,9 +645,25 @@ static void __init early_reserve_mem(void)
>  			DBG("reserving: %x -> %x\n", base_32, size_32);
>  			memblock_reserve(base_32, size_32);
>  		}
> -		return;
>  	}
> -#endif
> +}
> +
> +static void __init early_reserve_mem(void)
> +{
> +	/* Look for the new "reserved-regions" property in the DT */
> +	early_reserve_mem_dt();
> +
> +#ifdef CONFIG_BLK_DEV_INITRD
> +	/* Then reserve the initrd, if any */
> +	if (initrd_start && (initrd_end > initrd_start)) {
> +		memblock_reserve(ALIGN_DOWN(__pa(initrd_start), PAGE_SIZE),
> +			ALIGN(initrd_end, PAGE_SIZE) -
> +			ALIGN_DOWN(initrd_start, PAGE_SIZE));
> +	}
> +#endif /* CONFIG_BLK_DEV_INITRD */
> +
> +	if (IS_ENABLED(CONFIG_PPC32))
> +		early_reserve_mem_old();
>  }
>
>  #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> --
> 2.25.4



^ permalink raw reply

* Re: [5.9.0-rc5-20200914] Kernel crash while running LTP(mlock201)
From: Sachin Sant @ 2020-09-15 16:31 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-next, linuxppc-dev
In-Reply-To: <20200915130907.GE5449@casper.infradead.org>


> On 15-Sep-2020, at 6:39 PM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Tue, Sep 15, 2020 at 09:24:38PM +1000, Michael Ellerman wrote:
>> Sachin Sant <sachinp@linux.vnet.ibm.com> writes:
>>> While running LTP tests (specifically mlock201) against next-20200914 tree
>>> on a POWER9 LPAR results in following crash.
>> 
>> Looks the same as:
>> 
>> https://lore.kernel.org/linux-mm/20200914085545.GB28738@shao2-debian/
> 
> https://lore.kernel.org/linux-mm/20200914112738.GM6583@casper.infradead.org/

Thanks. The patch fixes the problem for me.

Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>

thanks
-Sachin


^ permalink raw reply

* Re: [PATCH 14/15] selftests/clone3: Avoid OS-defined clone_args
From: Christian Brauner @ 2020-09-15 16:25 UTC (permalink / raw)
  To: Kees Cook
  Cc: Thadeu Lima de Souza Cascardo, Will Drewry, linux-xtensa,
	linux-kernel, Andy Lutomirski, Max Filippov, linux-arm-kernel,
	linux-kselftest, linux-mips, linuxppc-dev, Christian Brauner
In-Reply-To: <20200912110820.597135-15-keescook@chromium.org>

On Sat, Sep 12, 2020 at 04:08:19AM -0700, Kees Cook wrote:
> As the UAPI headers start to appear in distros, we need to avoid
> outdated versions of struct clone_args to be able to test modern
> features. Additionally pull in the syscall numbers correctly.
> 
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---

Hm, with this patch applied I'm getting:

gcc -g -I../../../../usr/include/    clone3_set_tid.c /home/brauner/src/git/linux/linux/tools/testing/selftests/kselftest_harness.h /home/brauner/src/git/linux/linux/tools/testing/selftests/kselftest.h -lcap -o /home/brauner/src/git/linux/linux/tools/testing/selftests/clone3/clone3_set_tid
In file included from clone3_set_tid.c:24:
clone3_selftests.h:37:8: error: redefinition of ‘struct clone_args’
   37 | struct clone_args {
      |        ^~~~~~~~~~
In file included from clone3_set_tid.c:12:
/usr/include/linux/sched.h:92:8: note: originally defined here
   92 | struct clone_args {
      |        ^~~~~~~~~~
make: *** [../lib.mk:140: /home/brauner/src/git/linux/linux/tools/testing/selftests/clone3/clone3_set_tid] Error 1

One trick to avoid this could be:

#ifndef CLONE_ARGS_SIZE_VER0
#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
#endif

#ifndef CLONE_ARGS_SIZE_VER1
#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
#endif

#ifndef CLONE_ARGS_SIZE_VER2
#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
#endif

struct __clone_args {
	__aligned_u64 flags;
	__aligned_u64 pidfd;
	__aligned_u64 child_tid;
	__aligned_u64 parent_tid;
	__aligned_u64 exit_signal;
	__aligned_u64 stack;
	__aligned_u64 stack_size;
	__aligned_u64 tls;
	__aligned_u64 set_tid;
	__aligned_u64 set_tid_size;
	__aligned_u64 cgroup;
};

static pid_t sys_clone3(struct __clone_args *args, size_t size)
{
	return syscall(__NR_clone3, args, size);
}

Christian

^ permalink raw reply

* Re: [PATCH 15/15] selftests/seccomp: Use __NR_mknodat instead of __NR_mknod
From: Christian Brauner @ 2020-09-15 16:16 UTC (permalink / raw)
  To: Kees Cook
  Cc: Thadeu Lima de Souza Cascardo, Will Drewry, linux-xtensa,
	linux-kernel, Andy Lutomirski, Max Filippov, linux-arm-kernel,
	linux-kselftest, linux-mips, linuxppc-dev, Christian Brauner
In-Reply-To: <20200912110820.597135-16-keescook@chromium.org>

On Sat, Sep 12, 2020 at 04:08:20AM -0700, Kees Cook wrote:
> The __NR_mknod syscall doesn't exist on arm64 (only __NR_mknodat).
> Switch to the modern syscall.
> 
> Fixes: ad5682184a81 ("selftests/seccomp: Check for EPOLLHUP for user_notif")
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---

Thanks! Looks good.
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply

* Re: [PATCH 11/15] selftests/seccomp: Remove SYSCALL_NUM_RET_SHARE_REG in favor of SYSCALL_RET_SET
From: Christian Brauner @ 2020-09-15 16:11 UTC (permalink / raw)
  To: Kees Cook
  Cc: Thadeu Lima de Souza Cascardo, Will Drewry, linux-xtensa,
	linux-kernel, Andy Lutomirski, Max Filippov, linux-arm-kernel,
	linux-kselftest, linux-mips, linuxppc-dev, Christian Brauner
In-Reply-To: <20200912110820.597135-12-keescook@chromium.org>

On Sat, Sep 12, 2020 at 04:08:16AM -0700, Kees Cook wrote:
> Instead of special-casing the specific case of shared registers, create
> a default SYSCALL_RET_SET() macro (mirroring SYSCALL_NUM_SET()), that
> writes to the SYSCALL_RET register. For architectures that can't set the
> return value (for whatever reason), they can define SYSCALL_RET_SET()
> without an associated SYSCALL_RET() macro. This also paves the way for
> architectures that need to do special things to set the return value
> (e.g. powerpc).
> 
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---

Looks good!
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox