* [PATCH v2] powerpc/pseries: Avoid using addr_to_pfn in realmode
From: Ganesh Goudar @ 2020-07-09 16:39 UTC (permalink / raw)
To: mpe, linuxppc-dev; +Cc: mahesh, Ganesh Goudar, npiggin, aneesh.kumar
When an UE or memory error exception is encountered the MCE handler
tries to find the pfn using addr_to_pfn() which takes effective
address as an argument, later pfn is used to poison the page where
memory error occurred, recent rework in this area made addr_to_pfn
to run in realmode, which can be fatal as it may try to access
memory outside RMO region.
To fix this use addr_to_pfn after switching to virtual mode.
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
---
V2: Leave bare metal code and save_mce_event as is.
---
arch/powerpc/platforms/pseries/ras.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index f3736fcd98fc..def875815e92 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -610,16 +610,8 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
if (mce_log->sub_err_type & UE_EFFECTIVE_ADDR_PROVIDED)
eaddr = be64_to_cpu(mce_log->effective_address);
- if (mce_log->sub_err_type & UE_LOGICAL_ADDR_PROVIDED) {
+ if (mce_log->sub_err_type & UE_LOGICAL_ADDR_PROVIDED)
paddr = be64_to_cpu(mce_log->logical_address);
- } else if (mce_log->sub_err_type & UE_EFFECTIVE_ADDR_PROVIDED) {
- unsigned long pfn;
-
- pfn = addr_to_pfn(regs, eaddr);
- if (pfn != ULONG_MAX)
- paddr = pfn << PAGE_SHIFT;
- }
-
break;
case MC_ERROR_TYPE_SLB:
mce_err.error_type = MCE_ERROR_TYPE_SLB;
@@ -725,6 +717,16 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
* SLB multihit is done by now.
*/
mtmsr(mfmsr() | MSR_IR | MSR_DR);
+
+ /* Use addr_to_pfn after switching to virtual mode */
+ if (!paddr && error_type == MC_ERROR_TYPE_UE &&
+ mce_log->sub_err_type & UE_EFFECTIVE_ADDR_PROVIDED) {
+ unsigned long pfn;
+
+ pfn = addr_to_pfn(regs, eaddr);
+ if (pfn != ULONG_MAX)
+ paddr = pfn << PAGE_SHIFT;
+ }
save_mce_event(regs, disposition == RTAS_DISP_FULLY_RECOVERED,
&mce_err, regs->nip, eaddr, paddr);
--
2.17.2
^ permalink raw reply related
* Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
From: Waiman Long @ 2020-07-09 16:06 UTC (permalink / raw)
To: Michael Ellerman, Nicholas Piggin, linuxppc-dev
Cc: linux-arch, Peter Zijlstra, Boqun Feng, linux-kernel, kvm-ppc,
virtualization, Ingo Molnar, Will Deacon
In-Reply-To: <874kqhvu1v.fsf@mpe.ellerman.id.au>
On 7/9/20 6:53 AM, Michael Ellerman wrote:
> Nicholas Piggin <npiggin@gmail.com> writes:
>
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>> arch/powerpc/include/asm/paravirt.h | 28 ++++++++
>> arch/powerpc/include/asm/qspinlock.h | 66 +++++++++++++++++++
>> arch/powerpc/include/asm/qspinlock_paravirt.h | 7 ++
>> arch/powerpc/platforms/pseries/Kconfig | 5 ++
>> arch/powerpc/platforms/pseries/setup.c | 6 +-
>> include/asm-generic/qspinlock.h | 2 +
> Another ack?
>
I am OK with adding the #ifdef around queued_spin_lock().
Acked-by: Waiman Long <longman@redhat.com>
>> diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h
>> index 7a8546660a63..f2d51f929cf5 100644
>> --- a/arch/powerpc/include/asm/paravirt.h
>> +++ b/arch/powerpc/include/asm/paravirt.h
>> @@ -45,6 +55,19 @@ static inline void yield_to_preempted(int cpu, u32 yield_count)
>> {
>> ___bad_yield_to_preempted(); /* This would be a bug */
>> }
>> +
>> +extern void ___bad_yield_to_any(void);
>> +static inline void yield_to_any(void)
>> +{
>> + ___bad_yield_to_any(); /* This would be a bug */
>> +}
> Why do we do that rather than just not defining yield_to_any() at all
> and letting the build fail on that?
>
> There's a condition somewhere that we know will false at compile time
> and drop the call before linking?
>
>> diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h b/arch/powerpc/include/asm/qspinlock_paravirt.h
>> new file mode 100644
>> index 000000000000..750d1b5e0202
>> --- /dev/null
>> +++ b/arch/powerpc/include/asm/qspinlock_paravirt.h
>> @@ -0,0 +1,7 @@
>> +/* SPDX-License-Identifier: GPL-2.0-or-later */
>> +#ifndef __ASM_QSPINLOCK_PARAVIRT_H
>> +#define __ASM_QSPINLOCK_PARAVIRT_H
> _ASM_POWERPC_QSPINLOCK_PARAVIRT_H please.
>
>> +
>> +EXPORT_SYMBOL(__pv_queued_spin_unlock);
> Why's that in a header? Should that (eventually) go with the generic implementation?
The PV qspinlock implementation is not that generic at the moment. Even
though native qspinlock is used by a number of archs, PV qspinlock is
only currently used in x86. This is certainly an area that needs
improvement.
>> diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
>> index 24c18362e5ea..756e727b383f 100644
>> --- a/arch/powerpc/platforms/pseries/Kconfig
>> +++ b/arch/powerpc/platforms/pseries/Kconfig
>> @@ -25,9 +25,14 @@ config PPC_PSERIES
>> select SWIOTLB
>> default y
>>
>> +config PARAVIRT_SPINLOCKS
>> + bool
>> + default n
> default n is the default.
>
>> diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
>> index 2db8469e475f..747a203d9453 100644
>> --- a/arch/powerpc/platforms/pseries/setup.c
>> +++ b/arch/powerpc/platforms/pseries/setup.c
>> @@ -771,8 +771,12 @@ static void __init pSeries_setup_arch(void)
>> if (firmware_has_feature(FW_FEATURE_LPAR)) {
>> vpa_init(boot_cpuid);
>>
>> - if (lppaca_shared_proc(get_lppaca()))
>> + if (lppaca_shared_proc(get_lppaca())) {
>> static_branch_enable(&shared_processor);
>> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
>> + pv_spinlocks_init();
>> +#endif
>> + }
> We could avoid the ifdef with this I think?
>
> diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
> index 434615f1d761..6ec72282888d 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -10,5 +10,9 @@
> #include <asm/simple_spinlock.h>
> #endif
>
> +#ifndef CONFIG_PARAVIRT_SPINLOCKS
> +static inline void pv_spinlocks_init(void) { }
> +#endif
> +
> #endif /* __KERNEL__ */
> #endif /* __ASM_SPINLOCK_H */
>
>
> cheers
>
We don't really need to do a pv_spinlocks_init() if pv_kick() isn't
supported.
Cheers,
Longman
^ permalink raw reply
* Re: [PATCH 10/20] Documentation: kbuild/kconfig-language: eliminate duplicated word
From: Masahiro Yamada @ 2020-07-09 15:31 UTC (permalink / raw)
To: Randy Dunlap
Cc: kvm, open list:DOCUMENTATION, David Airlie, kgdb-bugreport,
linux-fpga, Liviu Dudau, dri-devel, Douglas Anderson,
Paul Cercueil, keyrings, Paul Mackerras, linux-i2c, Pavel Machek,
Srinivas Pandruvada, Mihail Atanassov, linux-leds, linux-s390,
Daniel Thompson, linux-scsi, Jonathan Corbet, Matthew Wilcox,
Halil Pasic, Jarkko Sakkinen, James Wang, linux-input,
Mali DP Maintainers, Derek Kiernan, linux-mips, Dragan Cvetic,
Wu Hao, Tony Krowiak, Linux Kbuild mailing list,
James E.J. Bottomley, Jiri Kosina, Hannes Reinecke, linux-block,
Thomas Bogendoerfer, Jacek Anaszewski, linux-mm, Dan Williams,
Andrew Morton, Mimi Zohar, Jens Axboe, Michal Marek,
Martin K. Petersen, Pierre Morel, Linux Kernel Mailing List,
Wolfram Sang, Daniel Vetter, Jason Wessel, Paolo Bonzini,
linux-integrity, linuxppc-dev, Mike Rapoport, Dan Murphy
In-Reply-To: <20200707180414.10467-11-rdunlap@infradead.org>
On Wed, Jul 8, 2020 at 3:06 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>
> Drop the doubled word "the".
>
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: linux-doc@vger.kernel.org
> Cc: Masahiro Yamada <masahiroy@kernel.org>
I guess this series will go in via the doc sub-system.
If so, please feel free to add:
Acked-by: Masahiro Yamada <masahiroy@kernel.org>
> Cc: Michal Marek <michal.lkml@markovi.net>
> Cc: linux-kbuild@vger.kernel.org
> ---
> Documentation/kbuild/kconfig-language.rst | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- linux-next-20200701.orig/Documentation/kbuild/kconfig-language.rst
> +++ linux-next-20200701/Documentation/kbuild/kconfig-language.rst
> @@ -681,7 +681,7 @@ translate Kconfig logic into boolean for
> find dead code / features (always inactive), 114 dead features were found in
> Linux using this methodology [1]_ (Section 8: Threats to validity).
>
> -Confirming this could prove useful as Kconfig stands as one of the the leading
> +Confirming this could prove useful as Kconfig stands as one of the leading
> industrial variability modeling languages [1]_ [2]_. Its study would help
> evaluate practical uses of such languages, their use was only theoretical
> and real world requirements were not well understood. As it stands though
--
Best Regards
Masahiro Yamada
^ permalink raw reply
* /sys/kernel/debug/kmemleak empty despite kmemleak reports
From: Paul Menzel @ 2020-07-09 14:37 UTC (permalink / raw)
To: Catalin Marinas, Michael Ellerman; +Cc: linuxppc-dev
Dear Linux folks,
Despite Linux 5.8-rc4 reporting memory leaks on the IBM POWER 8 S822LC,
the file does not contain more information.
> $ dmesg
> […] > [48662.953323] perf: interrupt took too long (2570 > 2500), lowering
kernel.perf_event_max_sample_rate to 77750
> [48854.810636] perf: interrupt took too long (3216 > 3212), lowering kernel.perf_event_max_sample_rate to 62000
> [52300.044518] perf: interrupt took too long (4244 > 4020), lowering kernel.perf_event_max_sample_rate to 47000
> [52751.373083] perf: interrupt took too long (5373 > 5305), lowering kernel.perf_event_max_sample_rate to 37000
> [53354.000363] perf: interrupt took too long (6793 > 6716), lowering kernel.perf_event_max_sample_rate to 29250
> [53850.215606] perf: interrupt took too long (8672 > 8491), lowering kernel.perf_event_max_sample_rate to 23000
> [57542.266099] perf: interrupt took too long (10940 > 10840), lowering kernel.perf_event_max_sample_rate to 18250
> [57559.645404] perf: interrupt took too long (13714 > 13675), lowering kernel.perf_event_max_sample_rate to 14500
> [61608.697728] Can't find PMC that caused IRQ
> [71774.463111] kmemleak: 12 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> [92372.044785] process '@/usr/bin/gnatmake-5' started with executable stack
> [92849.380672] FS-Cache: Loaded
> [92849.417269] FS-Cache: Netfs 'nfs' registered for caching
> [92849.595974] NFS: Registering the id_resolver key type
> [92849.596000] Key type id_resolver registered
> [92849.596000] Key type id_legacy registered
> [101808.079143] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> [106904.323471] Can't find PMC that caused IRQ
> [129416.391456] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> [158171.604221] kmemleak: 34 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
> $ sudo cat /sys/kernel/debug/kmemleak
> $
Kind regards,
Paul
^ permalink raw reply
* Re: [PATCH v2 2/2] papr/scm: Add bad memory ranges to nvdimm bad ranges
From: Christophe Leroy @ 2020-07-09 13:55 UTC (permalink / raw)
To: Santosh Sivaraj, linuxppc-dev
Cc: Ganesh Goudar, Vaibhav Jain, Oliver, Mahesh Salgaonkar,
Aneesh Kumar K.V
In-Reply-To: <20200709135142.721504-2-santosh@fossix.org>
Le 09/07/2020 à 15:51, Santosh Sivaraj a écrit :
> Subscribe to the MCE notification and add the physical address which
> generated a memory error to nvdimm bad range.
>
> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
> Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
> arch/powerpc/platforms/pseries/papr_scm.c | 96 ++++++++++++++++++++++-
> 1 file changed, 95 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index 9c569078a09fd..90729029ca010 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -13,9 +13,11 @@
> #include <linux/platform_device.h>
> #include <linux/delay.h>
> #include <linux/seq_buf.h>
> +#include <linux/nd.h>
>
> #include <asm/plpar_wrappers.h>
> #include <asm/papr_pdsm.h>
> +#include <asm/mce.h>
>
> #define BIND_ANY_ADDR (~0ul)
>
> @@ -80,6 +82,7 @@ struct papr_scm_priv {
> struct resource res;
> struct nd_region *region;
> struct nd_interleave_set nd_set;
> + struct list_head region_list;
>
> /* Protect dimm health data from concurrent read/writes */
> struct mutex health_mutex;
> @@ -91,6 +94,9 @@ struct papr_scm_priv {
> u64 health_bitmap;
> };
>
> +LIST_HEAD(papr_nd_regions);
> +DEFINE_MUTEX(papr_ndr_lock);
> +
> static int drc_pmem_bind(struct papr_scm_priv *p)
> {
> unsigned long ret[PLPAR_HCALL_BUFSIZE];
> @@ -759,6 +765,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
> dev_info(dev, "Region registered with target node %d and online node %d",
> target_nid, online_nid);
>
> + mutex_lock(&papr_ndr_lock);
> + list_add_tail(&p->region_list, &papr_nd_regions);
> + mutex_unlock(&papr_ndr_lock);
> +
> return 0;
>
> err: nvdimm_bus_unregister(p->bus);
> @@ -766,6 +776,68 @@ err: nvdimm_bus_unregister(p->bus);
> return -ENXIO;
> }
>
> +static void papr_scm_add_badblock(struct nd_region *region,
> + struct nvdimm_bus *bus, u64 phys_addr)
> +{
> + u64 aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
> +
> + if (nvdimm_bus_add_badrange(bus, aligned_addr, L1_CACHE_BYTES)) {
> + pr_err("Bad block registration for 0x%llx failed\n", phys_addr);
> + return;
> + }
> +
> + pr_debug("Add memory range (0x%llx - 0x%llx) as bad range\n",
> + aligned_addr, aligned_addr + L1_CACHE_BYTES);
> +
> + nvdimm_region_notify(region, NVDIMM_REVALIDATE_POISON);
> +}
> +
> +static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
> + void *data)
> +{
> + struct machine_check_event *evt = data;
> + struct papr_scm_priv *p;
> + u64 phys_addr;
> + bool found = false;
> +
> + if (evt->error_type != MCE_ERROR_TYPE_UE)
> + return NOTIFY_DONE;
> +
> + if (list_empty(&papr_nd_regions))
> + return NOTIFY_DONE;
> +
> + /*
> + * The physical address obtained here is PAGE_SIZE aligned, so get the
> + * exact address from the effective address
> + */
> + phys_addr = evt->u.ue_error.physical_address +
> + (evt->u.ue_error.effective_address & ~PAGE_MASK);
> +
> + if (!evt->u.ue_error.physical_address_provided ||
> + !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
> + return NOTIFY_DONE;
> +
> + /* mce notifier is called from a process context, so mutex is safe */
> + mutex_lock(&papr_ndr_lock);
> + list_for_each_entry(p, &papr_nd_regions, region_list) {
> + if (phys_addr >= p->res.start && phys_addr <= p->res.end) {
> + found = true;
> + break;
> + }
> + }
> +
> + if (found)
> + papr_scm_add_badblock(p->region, p->bus, phys_addr);
> +
> + mutex_unlock(&papr_ndr_lock);
> +
> + return found ? NOTIFY_OK : NOTIFY_DONE;
> +}
> +
> +static struct notifier_block mce_ue_nb = {
> + .notifier_call = handle_mce_ue
> +};
> +
> static int papr_scm_probe(struct platform_device *pdev)
> {
> struct device_node *dn = pdev->dev.of_node;
> @@ -866,6 +938,10 @@ static int papr_scm_remove(struct platform_device *pdev)
> {
> struct papr_scm_priv *p = platform_get_drvdata(pdev);
>
> + mutex_lock(&papr_ndr_lock);
> + list_del(&p->region_list);
> + mutex_unlock(&papr_ndr_lock);
> +
> nvdimm_bus_unregister(p->bus);
> drc_pmem_unbind(p);
> kfree(p->bus_desc.provider_name);
> @@ -888,7 +964,25 @@ static struct platform_driver papr_scm_driver = {
> },
> };
>
> -module_platform_driver(papr_scm_driver);
> +static int __init papr_scm_init(void)
> +{
> + int ret;
> +
> + ret = platform_driver_register(&papr_scm_driver);
> + if (!ret)
> + mce_register_notifier(&mce_ue_nb);
> +
> + return ret;
> +}
> +module_init(papr_scm_init);
> +
> +static void __exit papr_scm_exit(void)
> +{
> + mce_unregister_notifier(&mce_ue_nb);
> + platform_driver_unregister(&papr_scm_driver);
> +}
> +module_exit(papr_scm_exit);
> +
> MODULE_DEVICE_TABLE(of, papr_scm_match);
> MODULE_LICENSE("GPL");
> MODULE_AUTHOR("IBM Corporation");
>
^ permalink raw reply
* Re: [PATCH v2 1/2] powerpc/mce: Add MCE notification chain
From: Christophe Leroy @ 2020-07-09 13:54 UTC (permalink / raw)
To: Santosh Sivaraj, linuxppc-dev
Cc: Ganesh Goudar, Vaibhav Jain, Oliver, Mahesh Salgaonkar,
Aneesh Kumar K.V
In-Reply-To: <20200709135142.721504-1-santosh@fossix.org>
Le 09/07/2020 à 15:51, Santosh Sivaraj a écrit :
> Introduce notification chain which lets us know about uncorrected memory
> errors(UE). This would help prospective users in pmem or nvdimm subsystem
> to track bad blocks for better handling of persistent memory allocations.
>
> Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
> Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
> arch/powerpc/include/asm/mce.h | 2 ++
> arch/powerpc/kernel/mce.c | 15 +++++++++++++++
> 2 files changed, 17 insertions(+)
>
> v2: Address comments from Christophe.
>
> RESEND: Sending the two patches together so the dependencies are clear. The
> earlier patch reviews are here [1]; rebase the patches on top on 5.8-rc4
>
> [1]: https://lore.kernel.org/linuxppc-dev/20200330071219.12284-1-ganeshgr@linux.ibm.com/
>
> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
> index 376a395daf329..7bdd0cd4f2de0 100644
> --- a/arch/powerpc/include/asm/mce.h
> +++ b/arch/powerpc/include/asm/mce.h
> @@ -220,6 +220,8 @@ extern void machine_check_print_event_info(struct machine_check_event *evt,
> unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
> extern void mce_common_process_ue(struct pt_regs *regs,
> struct mce_error_info *mce_err);
> +int mce_register_notifier(struct notifier_block *nb);
> +int mce_unregister_notifier(struct notifier_block *nb);
> #ifdef CONFIG_PPC_BOOK3S_64
> void flush_and_reload_slb(void);
> #endif /* CONFIG_PPC_BOOK3S_64 */
> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
> index fd90c0eda2290..b7b3ed4e61937 100644
> --- a/arch/powerpc/kernel/mce.c
> +++ b/arch/powerpc/kernel/mce.c
> @@ -49,6 +49,20 @@ static struct irq_work mce_ue_event_irq_work = {
>
> DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
>
> +static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);
> +
> +int mce_register_notifier(struct notifier_block *nb)
> +{
> + return blocking_notifier_chain_register(&mce_notifier_list, nb);
> +}
> +EXPORT_SYMBOL_GPL(mce_register_notifier);
> +
> +int mce_unregister_notifier(struct notifier_block *nb)
> +{
> + return blocking_notifier_chain_unregister(&mce_notifier_list, nb);
> +}
> +EXPORT_SYMBOL_GPL(mce_unregister_notifier);
> +
> static void mce_set_error_info(struct machine_check_event *mce,
> struct mce_error_info *mce_err)
> {
> @@ -278,6 +292,7 @@ static void machine_process_ue_event(struct work_struct *work)
> while (__this_cpu_read(mce_ue_count) > 0) {
> index = __this_cpu_read(mce_ue_count) - 1;
> evt = this_cpu_ptr(&mce_ue_event_queue[index]);
> + blocking_notifier_call_chain(&mce_notifier_list, 0, evt);
> #ifdef CONFIG_MEMORY_FAILURE
> /*
> * This should probably queued elsewhere, but
>
^ permalink raw reply
* [PATCH v2 2/2] papr/scm: Add bad memory ranges to nvdimm bad ranges
From: Santosh Sivaraj @ 2020-07-09 13:51 UTC (permalink / raw)
To: linuxppc-dev
Cc: Oliver, Aneesh Kumar K.V, Santosh Sivaraj, Mahesh Salgaonkar,
Ganesh Goudar, Vaibhav Jain
In-Reply-To: <20200709135142.721504-1-santosh@fossix.org>
Subscribe to the MCE notification and add the physical address which
generated a memory error to nvdimm bad range.
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
---
arch/powerpc/platforms/pseries/papr_scm.c | 96 ++++++++++++++++++++++-
1 file changed, 95 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 9c569078a09fd..90729029ca010 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -13,9 +13,11 @@
#include <linux/platform_device.h>
#include <linux/delay.h>
#include <linux/seq_buf.h>
+#include <linux/nd.h>
#include <asm/plpar_wrappers.h>
#include <asm/papr_pdsm.h>
+#include <asm/mce.h>
#define BIND_ANY_ADDR (~0ul)
@@ -80,6 +82,7 @@ struct papr_scm_priv {
struct resource res;
struct nd_region *region;
struct nd_interleave_set nd_set;
+ struct list_head region_list;
/* Protect dimm health data from concurrent read/writes */
struct mutex health_mutex;
@@ -91,6 +94,9 @@ struct papr_scm_priv {
u64 health_bitmap;
};
+LIST_HEAD(papr_nd_regions);
+DEFINE_MUTEX(papr_ndr_lock);
+
static int drc_pmem_bind(struct papr_scm_priv *p)
{
unsigned long ret[PLPAR_HCALL_BUFSIZE];
@@ -759,6 +765,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
dev_info(dev, "Region registered with target node %d and online node %d",
target_nid, online_nid);
+ mutex_lock(&papr_ndr_lock);
+ list_add_tail(&p->region_list, &papr_nd_regions);
+ mutex_unlock(&papr_ndr_lock);
+
return 0;
err: nvdimm_bus_unregister(p->bus);
@@ -766,6 +776,68 @@ err: nvdimm_bus_unregister(p->bus);
return -ENXIO;
}
+static void papr_scm_add_badblock(struct nd_region *region,
+ struct nvdimm_bus *bus, u64 phys_addr)
+{
+ u64 aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
+
+ if (nvdimm_bus_add_badrange(bus, aligned_addr, L1_CACHE_BYTES)) {
+ pr_err("Bad block registration for 0x%llx failed\n", phys_addr);
+ return;
+ }
+
+ pr_debug("Add memory range (0x%llx - 0x%llx) as bad range\n",
+ aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+ nvdimm_region_notify(region, NVDIMM_REVALIDATE_POISON);
+}
+
+static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
+ void *data)
+{
+ struct machine_check_event *evt = data;
+ struct papr_scm_priv *p;
+ u64 phys_addr;
+ bool found = false;
+
+ if (evt->error_type != MCE_ERROR_TYPE_UE)
+ return NOTIFY_DONE;
+
+ if (list_empty(&papr_nd_regions))
+ return NOTIFY_DONE;
+
+ /*
+ * The physical address obtained here is PAGE_SIZE aligned, so get the
+ * exact address from the effective address
+ */
+ phys_addr = evt->u.ue_error.physical_address +
+ (evt->u.ue_error.effective_address & ~PAGE_MASK);
+
+ if (!evt->u.ue_error.physical_address_provided ||
+ !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
+ return NOTIFY_DONE;
+
+ /* mce notifier is called from a process context, so mutex is safe */
+ mutex_lock(&papr_ndr_lock);
+ list_for_each_entry(p, &papr_nd_regions, region_list) {
+ if (phys_addr >= p->res.start && phys_addr <= p->res.end) {
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ papr_scm_add_badblock(p->region, p->bus, phys_addr);
+
+ mutex_unlock(&papr_ndr_lock);
+
+ return found ? NOTIFY_OK : NOTIFY_DONE;
+}
+
+static struct notifier_block mce_ue_nb = {
+ .notifier_call = handle_mce_ue
+};
+
static int papr_scm_probe(struct platform_device *pdev)
{
struct device_node *dn = pdev->dev.of_node;
@@ -866,6 +938,10 @@ static int papr_scm_remove(struct platform_device *pdev)
{
struct papr_scm_priv *p = platform_get_drvdata(pdev);
+ mutex_lock(&papr_ndr_lock);
+ list_del(&p->region_list);
+ mutex_unlock(&papr_ndr_lock);
+
nvdimm_bus_unregister(p->bus);
drc_pmem_unbind(p);
kfree(p->bus_desc.provider_name);
@@ -888,7 +964,25 @@ static struct platform_driver papr_scm_driver = {
},
};
-module_platform_driver(papr_scm_driver);
+static int __init papr_scm_init(void)
+{
+ int ret;
+
+ ret = platform_driver_register(&papr_scm_driver);
+ if (!ret)
+ mce_register_notifier(&mce_ue_nb);
+
+ return ret;
+}
+module_init(papr_scm_init);
+
+static void __exit papr_scm_exit(void)
+{
+ mce_unregister_notifier(&mce_ue_nb);
+ platform_driver_unregister(&papr_scm_driver);
+}
+module_exit(papr_scm_exit);
+
MODULE_DEVICE_TABLE(of, papr_scm_match);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("IBM Corporation");
--
2.26.2
^ permalink raw reply related
* [PATCH v2 1/2] powerpc/mce: Add MCE notification chain
From: Santosh Sivaraj @ 2020-07-09 13:51 UTC (permalink / raw)
To: linuxppc-dev
Cc: Oliver, Aneesh Kumar K.V, Santosh Sivaraj, Mahesh Salgaonkar,
Ganesh Goudar, Vaibhav Jain
Introduce notification chain which lets us know about uncorrected memory
errors(UE). This would help prospective users in pmem or nvdimm subsystem
to track bad blocks for better handling of persistent memory allocations.
Signed-off-by: Santosh Sivaraj <santosh@fossix.org>
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
---
arch/powerpc/include/asm/mce.h | 2 ++
arch/powerpc/kernel/mce.c | 15 +++++++++++++++
2 files changed, 17 insertions(+)
v2: Address comments from Christophe.
RESEND: Sending the two patches together so the dependencies are clear. The
earlier patch reviews are here [1]; rebase the patches on top on 5.8-rc4
[1]: https://lore.kernel.org/linuxppc-dev/20200330071219.12284-1-ganeshgr@linux.ibm.com/
diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 376a395daf329..7bdd0cd4f2de0 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -220,6 +220,8 @@ extern void machine_check_print_event_info(struct machine_check_event *evt,
unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
extern void mce_common_process_ue(struct pt_regs *regs,
struct mce_error_info *mce_err);
+int mce_register_notifier(struct notifier_block *nb);
+int mce_unregister_notifier(struct notifier_block *nb);
#ifdef CONFIG_PPC_BOOK3S_64
void flush_and_reload_slb(void);
#endif /* CONFIG_PPC_BOOK3S_64 */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index fd90c0eda2290..b7b3ed4e61937 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -49,6 +49,20 @@ static struct irq_work mce_ue_event_irq_work = {
DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
+static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);
+
+int mce_register_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_register(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_register_notifier);
+
+int mce_unregister_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_unregister(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_unregister_notifier);
+
static void mce_set_error_info(struct machine_check_event *mce,
struct mce_error_info *mce_err)
{
@@ -278,6 +292,7 @@ static void machine_process_ue_event(struct work_struct *work)
while (__this_cpu_read(mce_ue_count) > 0) {
index = __this_cpu_read(mce_ue_count) - 1;
evt = this_cpu_ptr(&mce_ue_event_queue[index]);
+ blocking_notifier_call_chain(&mce_notifier_list, 0, evt);
#ifdef CONFIG_MEMORY_FAILURE
/*
* This should probably queued elsewhere, but
--
2.26.2
^ permalink raw reply related
* Re: Failure to build librseq on ppc
From: Mathieu Desnoyers @ 2020-07-09 13:43 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: Boqun Feng, linuxppc-dev, Michael Jeanson
In-Reply-To: <20200709001837.GD3598@gate.crashing.org>
----- On Jul 8, 2020, at 8:18 PM, Segher Boessenkool segher@kernel.crashing.org wrote:
> On Wed, Jul 08, 2020 at 08:01:23PM -0400, Mathieu Desnoyers wrote:
>> > > #define RSEQ_ASM_OP_CMPEQ(var, expect, label)
>> > > \
>> > > LOAD_WORD "%%r17, %[" __rseq_str(var) "]\n\t" \
>> >
>> > The way this hardcodes r17 *will* break, btw. The compiler will not
>> > likely want to use r17 as long as your code (after inlining etc.!) stays
>> > small, but there is Murphy's law.
>>
>> r17 is in the clobber list, so it should be ok.
>
> What protects r17 *after* this asm statement?
As discussed in the other leg of the thread (with the code example),
r17 is in the clobber list of all asm statements using this macro, and
is used as a temporary register within each inline asm.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
^ permalink raw reply
* Re: Failure to build librseq on ppc
From: Mathieu Desnoyers @ 2020-07-09 13:33 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: Boqun Feng, linuxppc-dev, Michael Jeanson
In-Reply-To: <20200709001003.GB3598@gate.crashing.org>
----- On Jul 8, 2020, at 8:10 PM, Segher Boessenkool segher@kernel.crashing.org wrote:
> Hi!
>
> On Wed, Jul 08, 2020 at 10:00:01AM -0400, Mathieu Desnoyers wrote:
[...]
>
>> -#define STORE_WORD "std "
>> -#define LOAD_WORD "ld "
>> -#define LOADX_WORD "ldx "
>> +#define STORE_WORD(arg) "std%U[" __rseq_str(arg) "]%X[" __rseq_str(arg)
>> "] " /* To memory ("m" constraint) */
>> +#define LOAD_WORD(arg) "lwd%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] "
>> /* From memory ("m" constraint) */
>
> That cannot work (you typoed "ld" here).
Indeed, I noticed it before pushing to master (lwd -> ld).
>
> Some more advice about this code, pretty generic stuff:
Let's take an example to support the discussion here. I'm taking it from
master branch (after a cleanup changing e.g. LOAD_WORD into RSEQ_LOAD_LONG).
So for powerpc32 we have (code edited to remove testing instrumentation):
#define __rseq_str_1(x) #x
#define __rseq_str(x) __rseq_str_1(x)
#define RSEQ_STORE_LONG(arg) "stw%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] " /* To memory ("m" constraint) */
#define RSEQ_STORE_INT(arg) RSEQ_STORE_LONG(arg) /* To memory ("m" constraint) */
#define RSEQ_LOAD_LONG(arg) "lwz%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] " /* From memory ("m" constraint) */
#define RSEQ_LOAD_INT(arg) RSEQ_LOAD_LONG(arg) /* From memory ("m" constraint) */
#define RSEQ_LOADX_LONG "lwzx " /* From base register ("b" constraint) */
#define RSEQ_CMP_LONG "cmpw "
#define __RSEQ_ASM_DEFINE_TABLE(label, version, flags, \
start_ip, post_commit_offset, abort_ip) \
".pushsection __rseq_cs, \"aw\"\n\t" \
".balign 32\n\t" \
__rseq_str(label) ":\n\t" \
".long " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \
/* 32-bit only supported on BE */ \
".long 0x0, " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) "\n\t" \
".popsection\n\t" \
".pushsection __rseq_cs_ptr_array, \"aw\"\n\t" \
".long 0x0, " __rseq_str(label) "b\n\t" \
".popsection\n\t"
/*
* Exit points of a rseq critical section consist of all instructions outside
* of the critical section where a critical section can either branch to or
* reach through the normal course of its execution. The abort IP and the
* post-commit IP are already part of the __rseq_cs section and should not be
* explicitly defined as additional exit points. Knowing all exit points is
* useful to assist debuggers stepping over the critical section.
*/
#define RSEQ_ASM_DEFINE_EXIT_POINT(start_ip, exit_ip) \
".pushsection __rseq_exit_point_array, \"aw\"\n\t" \
/* 32-bit only supported on BE */ \
".long 0x0, " __rseq_str(start_ip) ", 0x0, " __rseq_str(exit_ip) "\n\t" \
".popsection\n\t"
#define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs) \
"lis %%r17, (" __rseq_str(cs_label) ")@ha\n\t" \
"addi %%r17, %%r17, (" __rseq_str(cs_label) ")@l\n\t" \
RSEQ_STORE_INT(rseq_cs) "%%r17, %[" __rseq_str(rseq_cs) "]\n\t" \
__rseq_str(label) ":\n\t"
#define RSEQ_ASM_DEFINE_TABLE(label, start_ip, post_commit_ip, abort_ip) \
__RSEQ_ASM_DEFINE_TABLE(label, 0x0, 0x0, start_ip, \
(post_commit_ip - start_ip), abort_ip)
#define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label) \
RSEQ_LOAD_INT(current_cpu_id) "%%r17, %[" __rseq_str(current_cpu_id) "]\n\t" \
"cmpw cr7, %[" __rseq_str(cpu_id) "], %%r17\n\t" \
"bne- cr7, " __rseq_str(label) "\n\t"
#define RSEQ_ASM_DEFINE_ABORT(label, abort_label) \
".pushsection __rseq_failure, \"ax\"\n\t" \
".long " __rseq_str(RSEQ_SIG) "\n\t" \
__rseq_str(label) ":\n\t" \
"b %l[" __rseq_str(abort_label) "]\n\t" \
".popsection\n\t"
#define RSEQ_ASM_OP_CMPEQ(var, expect, label) \
RSEQ_LOAD_LONG(var) "%%r17, %[" __rseq_str(var) "]\n\t" \
RSEQ_CMP_LONG "cr7, %%r17, %[" __rseq_str(expect) "]\n\t" \
"bne- cr7, " __rseq_str(label) "\n\t"
#define RSEQ_ASM_OP_FINAL_STORE(value, var, post_commit_label) \
RSEQ_STORE_LONG(var) "%[" __rseq_str(value) "], %[" __rseq_str(var) "]\n\t" \
__rseq_str(post_commit_label) ":\n\t"
static inline __attribute__((always_inline))
int rseq_cmpeqv_storev(intptr_t *v, intptr_t expect, intptr_t newv, int cpu)
{
__asm__ __volatile__ goto (
RSEQ_ASM_DEFINE_TABLE(3, 1f, 2f, 4f) /* start, commit, abort */
RSEQ_ASM_DEFINE_EXIT_POINT(1f, %l[cmpfail])
/* Start rseq by storing table entry pointer into rseq_cs. */
RSEQ_ASM_STORE_RSEQ_CS(1, 3b, rseq_cs)
/* cmp cpuid */
RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, 4f)
/* cmp @v equal to @expect */
RSEQ_ASM_OP_CMPEQ(v, expect, %l[cmpfail])
/* final store */
RSEQ_ASM_OP_FINAL_STORE(newv, v, 2)
RSEQ_ASM_DEFINE_ABORT(4, abort)
: /* gcc asm goto does not allow outputs */
: [cpu_id] "r" (cpu),
[current_cpu_id] "m" (__rseq_abi.cpu_id),
[rseq_cs] "m" (__rseq_abi.rseq_cs),
[v] "m" (*v),
[expect] "r" (expect),
[newv] "r" (newv)
: "memory", "cc", "r17"
: abort, cmpfail
);
return 0;
abort:
RSEQ_INJECT_FAILED
return -1;
cmpfail:
return 1;
}
>
> The way this all uses r17 will likely not work reliably.
r17 is only used as a temporary register within the inline assembler, and it is
in the clobber list. In which scenario would it not work reliably ?
> The way multiple asm statements are used seems to have missing
> dependencies between the statements.
I'm not sure I follow here. Note that we are injecting the CPP macros into
a single inline asm statement as strings.
>
> Don't try to work *against* the compiler. You will not win.
>
> Alternatively, write assembler code, if that is what you actually want
> to do? Not C code.
>
> And done macro-mess this, you want to be able to debug it, and you need
> other people to be able to read it!
I understand that looking at macros can be cumbersome from the perspective
of a reviewer only interested in a single architecture,
However, from my perspective, as a maintainer who must maintain similar code
for x86 32/64, powerpc 32/64, arm, aarch64, s390, s390x, mips 32/64, and likely
other architectures in the future, the macros abstracting 32-bit and 64-bit
allow to eliminate code duplication for each architecture with 32-bit and 64-bit
variants, which is better for maintainability.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
^ permalink raw reply
* [PATCH v3 4/4] powerpc/mm/radix: Create separate mappings for hot-plugged memory
From: Aneesh Kumar K.V @ 2020-07-09 13:19 UTC (permalink / raw)
To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V, Bharata B Rao
In-Reply-To: <20200709131925.922266-1-aneesh.kumar@linux.ibm.com>
To enable memory unplug without splitting kernel page table
mapping, we force the max mapping size to the LMB size. LMB
size is the unit in which hypervisor will do memory add/remove
operation.
Pseries systems supports max LMB size of 256MB. Hence on pseries,
we now end up mapping memory with 2M page size instead of 1G. To improve
that we want hypervisor to hint the kernel about the hotplug
memory range. That was added that as part of
commit b6eca183e23e ("powerpc/kernel: Enables memory
hot-remove after reboot on pseries guests")
But PowerVM doesn't provide that hint yet. Once we get PowerVM
updated, we can then force the 2M mapping only to hot-pluggable
memory region using memblock_is_hotpluggable(). Till then
let's depend on LMB size for finding the mapping page size
for linear range.
With this change KVM guest will also be doing linear mapping with
2M page size.
The actual TLB benefit of mapping guest page table entries with
hugepage size can only be materialized if the partition scoped
entries are also using the same or higher page size. A guest using
1G hugetlbfs backing guest memory can have a performance impact with
the above change.
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
arch/powerpc/include/asm/book3s/64/mmu.h | 5 ++
arch/powerpc/mm/book3s64/radix_pgtable.c | 81 ++++++++++++++++++++----
arch/powerpc/platforms/powernv/setup.c | 10 ++-
3 files changed, 84 insertions(+), 12 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index 5393a535240c..15aae924f41c 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -82,6 +82,11 @@ extern unsigned int mmu_pid_bits;
/* Base PID to allocate from */
extern unsigned int mmu_base_pid;
+/*
+ * memory block size used with radix translation.
+ */
+extern unsigned int __ro_after_init radix_mem_block_size;
+
#define PRTB_SIZE_SHIFT (mmu_pid_bits + 4)
#define PRTB_ENTRIES (1ul << mmu_pid_bits)
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index d5a01b9aadc9..bba45fc0b7b2 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -15,6 +15,7 @@
#include <linux/mm.h>
#include <linux/hugetlb.h>
#include <linux/string_helpers.h>
+#include <linux/memory.h>
#include <asm/pgalloc.h>
#include <asm/mmu_context.h>
@@ -33,6 +34,7 @@
unsigned int mmu_pid_bits;
unsigned int mmu_base_pid;
+unsigned int radix_mem_block_size __ro_after_init;
static __ref void *early_alloc_pgtable(unsigned long size, int nid,
unsigned long region_start, unsigned long region_end)
@@ -265,6 +267,7 @@ static unsigned long next_boundary(unsigned long addr, unsigned long end)
static int __meminit create_physical_mapping(unsigned long start,
unsigned long end,
+ unsigned long max_mapping_size,
int nid, pgprot_t _prot)
{
unsigned long vaddr, addr, mapping_size = 0;
@@ -278,6 +281,8 @@ static int __meminit create_physical_mapping(unsigned long start,
int rc;
gap = next_boundary(addr, end) - addr;
+ if (gap > max_mapping_size)
+ gap = max_mapping_size;
previous_size = mapping_size;
prev_exec = exec;
@@ -328,8 +333,9 @@ static void __init radix_init_pgtable(void)
/* We don't support slb for radix */
mmu_slb_size = 0;
+
/*
- * Create the linear mapping, using standard page size for now
+ * Create the linear mapping
*/
for_each_memblock(memory, reg) {
/*
@@ -345,6 +351,7 @@ static void __init radix_init_pgtable(void)
WARN_ON(create_physical_mapping(reg->base,
reg->base + reg->size,
+ radix_mem_block_size,
-1, PAGE_KERNEL));
}
@@ -485,6 +492,47 @@ static int __init radix_dt_scan_page_sizes(unsigned long node,
return 1;
}
+static int __init probe_memory_block_size(unsigned long node, const char *uname, int
+ depth, void *data)
+{
+ unsigned long *mem_block_size = (unsigned long *)data;
+ const __be64 *prop;
+ int len;
+
+ if (depth != 1)
+ return 0;
+
+ if (strcmp(uname, "ibm,dynamic-reconfiguration-memory"))
+ return 0;
+
+ prop = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
+ if (!prop || len < sizeof(__be64))
+ /*
+ * Nothing in the device tree
+ */
+ *mem_block_size = MIN_MEMORY_BLOCK_SIZE;
+ else
+ *mem_block_size = be64_to_cpup(prop);
+ return 1;
+}
+
+static unsigned long radix_memory_block_size(void)
+{
+ unsigned long mem_block_size = MIN_MEMORY_BLOCK_SIZE;
+
+ /*
+ * OPAL firmware feature is set by now. Hence we are ok
+ * to test OPAL feature.
+ */
+ if (firmware_has_feature(FW_FEATURE_OPAL))
+ mem_block_size = 1UL * 1024 * 1024 * 1024;
+ else
+ of_scan_flat_dt(probe_memory_block_size, &mem_block_size);
+
+ return mem_block_size;
+}
+
+
void __init radix__early_init_devtree(void)
{
int rc;
@@ -493,17 +541,27 @@ void __init radix__early_init_devtree(void)
* Try to find the available page sizes in the device-tree
*/
rc = of_scan_flat_dt(radix_dt_scan_page_sizes, NULL);
- if (rc != 0) /* Found */
- goto found;
+ if (!rc) {
+ /*
+ * No page size details found in device tree.
+ * Let's assume we have page 4k and 64k support
+ */
+ mmu_psize_defs[MMU_PAGE_4K].shift = 12;
+ mmu_psize_defs[MMU_PAGE_4K].ap = 0x0;
+
+ mmu_psize_defs[MMU_PAGE_64K].shift = 16;
+ mmu_psize_defs[MMU_PAGE_64K].ap = 0x5;
+ }
+
/*
- * let's assume we have page 4k and 64k support
+ * Max mapping size used when mapping pages. We don't use
+ * ppc_md.memory_block_size() here because this get called
+ * early and we don't have machine probe called yet. Also
+ * the pseries implementation only check for ibm,lmb-size.
+ * All hypervisor supporting radix do expose that device
+ * tree node.
*/
- mmu_psize_defs[MMU_PAGE_4K].shift = 12;
- mmu_psize_defs[MMU_PAGE_4K].ap = 0x0;
-
- mmu_psize_defs[MMU_PAGE_64K].shift = 16;
- mmu_psize_defs[MMU_PAGE_64K].ap = 0x5;
-found:
+ radix_mem_block_size = radix_memory_block_size();
return;
}
@@ -855,7 +913,8 @@ int __meminit radix__create_section_mapping(unsigned long start,
return -1;
}
- return create_physical_mapping(__pa(start), __pa(end), nid, prot);
+ return create_physical_mapping(__pa(start), __pa(end),
+ radix_mem_block_size, nid, prot);
}
int __meminit radix__remove_section_mapping(unsigned long start, unsigned long end)
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 3bc188da82ba..7fcb88623081 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -399,7 +399,15 @@ static void pnv_kexec_cpu_down(int crash_shutdown, int secondary)
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
static unsigned long pnv_memory_block_size(void)
{
- return 256UL * 1024 * 1024;
+ /*
+ * We map the kernel linear region with 1GB large pages on radix. For
+ * memory hot unplug to work our memory block size must be at least
+ * this size.
+ */
+ if (radix_enabled())
+ return radix_mem_block_size;
+ else
+ return 256UL * 1024 * 1024;
}
#endif
--
2.26.2
^ permalink raw reply related
* [PATCH v3 3/4] powerpc/mm/radix: Remove split_kernel_mapping()
From: Aneesh Kumar K.V @ 2020-07-09 13:19 UTC (permalink / raw)
To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K . V, Bharata B Rao
In-Reply-To: <20200709131925.922266-1-aneesh.kumar@linux.ibm.com>
From: Bharata B Rao <bharata@linux.ibm.com>
We split the page table mapping on memory unplug if the
linear range was mapped with huge page mapping (for ex: 1G)
The page table splitting code has a few issues:
1. Recursive locking
--------------------
Memory unplug path takes cpu_hotplug_lock and calls stop_machine()
for splitting the mappings. However stop_machine() takes
cpu_hotplug_lock again causing deadlock.
2. BUG: sleeping function called from in_atomic() context
---------------------------------------------------------
Memory unplug path (remove_pagetable) takes init_mm.page_table_lock
spinlock and later calls stop_machine() which does wait_for_completion()
3. Bad unlock unbalance
-----------------------
Memory unplug path takes init_mm.page_table_lock spinlock and calls
stop_machine(). The stop_machine thread function runs in a different
thread context (migration thread) which tries to release and reaquire
ptl. Releasing ptl from a different thread than which acquired it
causes bad unlock unbalance.
These problems can be avoided if we avoid mapping hot-plugged memory
with 1G mapping, thereby removing the need for splitting them during
unplug. The kernel always make sure the minimum unplug request is
SUBSECTION_SIZE for device memory and SECTION_SIZE for regular memory.
In preparation for such a change remove page table splitting support.
This essentially is a revert of
commit 4dd5f8a99e791 ("powerpc/mm/radix: Split linear mapping on hot-unplug")
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
arch/powerpc/mm/book3s64/radix_pgtable.c | 95 +++++-------------------
1 file changed, 19 insertions(+), 76 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 46ad2da3087a..d5a01b9aadc9 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -15,7 +15,6 @@
#include <linux/mm.h>
#include <linux/hugetlb.h>
#include <linux/string_helpers.h>
-#include <linux/stop_machine.h>
#include <asm/pgalloc.h>
#include <asm/mmu_context.h>
@@ -722,32 +721,6 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
p4d_clear(p4d);
}
-struct change_mapping_params {
- pte_t *pte;
- unsigned long start;
- unsigned long end;
- unsigned long aligned_start;
- unsigned long aligned_end;
-};
-
-static int __meminit stop_machine_change_mapping(void *data)
-{
- struct change_mapping_params *params =
- (struct change_mapping_params *)data;
-
- if (!data)
- return -1;
-
- spin_unlock(&init_mm.page_table_lock);
- pte_clear(&init_mm, params->aligned_start, params->pte);
- create_physical_mapping(__pa(params->aligned_start),
- __pa(params->start), -1, PAGE_KERNEL);
- create_physical_mapping(__pa(params->end), __pa(params->aligned_end),
- -1, PAGE_KERNEL);
- spin_lock(&init_mm.page_table_lock);
- return 0;
-}
-
static void remove_pte_table(pte_t *pte_start, unsigned long addr,
unsigned long end)
{
@@ -776,52 +749,6 @@ static void remove_pte_table(pte_t *pte_start, unsigned long addr,
}
}
-/*
- * clear the pte and potentially split the mapping helper
- */
-static void __meminit split_kernel_mapping(unsigned long addr, unsigned long end,
- unsigned long size, pte_t *pte)
-{
- unsigned long mask = ~(size - 1);
- unsigned long aligned_start = addr & mask;
- unsigned long aligned_end = addr + size;
- struct change_mapping_params params;
- bool split_region = false;
-
- if ((end - addr) < size) {
- /*
- * We're going to clear the PTE, but not flushed
- * the mapping, time to remap and flush. The
- * effects if visible outside the processor or
- * if we are running in code close to the
- * mapping we cleared, we are in trouble.
- */
- if (overlaps_kernel_text(aligned_start, addr) ||
- overlaps_kernel_text(end, aligned_end)) {
- /*
- * Hack, just return, don't pte_clear
- */
- WARN_ONCE(1, "Linear mapping %lx->%lx overlaps kernel "
- "text, not splitting\n", addr, end);
- return;
- }
- split_region = true;
- }
-
- if (split_region) {
- params.pte = pte;
- params.start = addr;
- params.end = end;
- params.aligned_start = addr & ~(size - 1);
- params.aligned_end = min_t(unsigned long, aligned_end,
- (unsigned long)__va(memblock_end_of_DRAM()));
- stop_machine(stop_machine_change_mapping, ¶ms, NULL);
- return;
- }
-
- pte_clear(&init_mm, addr, pte);
-}
-
static void remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
unsigned long end)
{
@@ -837,7 +764,12 @@ static void remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
continue;
if (pmd_is_leaf(*pmd)) {
- split_kernel_mapping(addr, end, PMD_SIZE, (pte_t *)pmd);
+ if (!IS_ALIGNED(addr, PMD_SIZE) ||
+ !IS_ALIGNED(next, PMD_SIZE)) {
+ WARN_ONCE(1, "%s: unaligned range\n", __func__);
+ continue;
+ }
+ pte_clear(&init_mm, addr, (pte_t *)pmd);
continue;
}
@@ -862,7 +794,12 @@ static void remove_pud_table(pud_t *pud_start, unsigned long addr,
continue;
if (pud_is_leaf(*pud)) {
- split_kernel_mapping(addr, end, PUD_SIZE, (pte_t *)pud);
+ if (!IS_ALIGNED(addr, PUD_SIZE) ||
+ !IS_ALIGNED(next, PUD_SIZE)) {
+ WARN_ONCE(1, "%s: unaligned range\n", __func__);
+ continue;
+ }
+ pte_clear(&init_mm, addr, (pte_t *)pud);
continue;
}
@@ -890,7 +827,13 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
continue;
if (p4d_is_leaf(*p4d)) {
- split_kernel_mapping(addr, end, P4D_SIZE, (pte_t *)p4d);
+ if (!IS_ALIGNED(addr, P4D_SIZE) ||
+ !IS_ALIGNED(next, P4D_SIZE)) {
+ WARN_ONCE(1, "%s: unaligned range\n", __func__);
+ continue;
+ }
+
+ pte_clear(&init_mm, addr, (pte_t *)pgd);
continue;
}
--
2.26.2
^ permalink raw reply related
* [PATCH v3 2/4] powerpc/mm/radix: Free PUD table when freeing pagetable
From: Aneesh Kumar K.V @ 2020-07-09 13:19 UTC (permalink / raw)
To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K . V, Bharata B Rao
In-Reply-To: <20200709131925.922266-1-aneesh.kumar@linux.ibm.com>
From: Bharata B Rao <bharata@linux.ibm.com>
remove_pagetable() isn't freeing PUD table. This causes memory
leak during memory unplug. Fix this.
Fixes: 4b5d62ca17a1 ("powerpc/mm: add radix__remove_section_mapping()")
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
arch/powerpc/mm/book3s64/radix_pgtable.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 85806a6bed4d..46ad2da3087a 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -707,6 +707,21 @@ static void free_pmd_table(pmd_t *pmd_start, pud_t *pud)
pud_clear(pud);
}
+static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+ pud_t *pud;
+ int i;
+
+ for (i = 0; i < PTRS_PER_PUD; i++) {
+ pud = pud_start + i;
+ if (!pud_none(*pud))
+ return;
+ }
+
+ pud_free(&init_mm, pud_start);
+ p4d_clear(p4d);
+}
+
struct change_mapping_params {
pte_t *pte;
unsigned long start;
@@ -881,6 +896,7 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
pud_base = (pud_t *)p4d_page_vaddr(*p4d);
remove_pud_table(pud_base, addr, next);
+ free_pud_table(pud_base, p4d);
}
spin_unlock(&init_mm.page_table_lock);
--
2.26.2
^ permalink raw reply related
* [PATCH v3 1/4] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings
From: Aneesh Kumar K.V @ 2020-07-09 13:19 UTC (permalink / raw)
To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V, Bharata B Rao
In-Reply-To: <20200709131925.922266-1-aneesh.kumar@linux.ibm.com>
We can hit the following BUG_ON during memory unplug:
kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:342!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
NIP [c000000000093308] pmd_fragment_free+0x48/0xc0
LR [c00000000147bfec] remove_pagetable+0x578/0x60c
Call Trace:
0xc000008050000000 (unreliable)
remove_pagetable+0x384/0x60c
radix__remove_section_mapping+0x18/0x2c
remove_section_mapping+0x1c/0x3c
arch_remove_memory+0x11c/0x180
try_remove_memory+0x120/0x1b0
__remove_memory+0x20/0x40
dlpar_remove_lmb+0xc0/0x114
dlpar_memory+0x8b0/0xb20
handle_dlpar_errorlog+0xc0/0x190
pseries_hp_work_fn+0x2c/0x60
process_one_work+0x30c/0x810
worker_thread+0x98/0x540
kthread+0x1c4/0x1d0
ret_from_kernel_thread+0x5c/0x74
This occurs when unplug is attempted for such memory which has
been mapped using memblock pages as part of early kernel page
table setup. We wouldn't have initialized the PMD or PTE fragment
count for those PMD or PTE pages.
This can be fixed by allocating memory in PAGE_SIZE granularity
during early page table allocation. This makes sure a specific
page is not shared for another memblock allocation and we can
free them correctly on removing page-table pages.
Since we now do PAGE_SIZE allocations for both PUD table and
PMD table (Note that PTE table allocation is already of PAGE_SIZE),
we end up allocating more memory for the same amount of system RAM.
Here is a comparision of how much more we need for a 64T and 2G
system after this patch:
1. 64T system
-------------
64T RAM would need 64G for vmemmap with struct page size being 64B.
128 PUD tables for 64T memory (1G mappings)
1 PUD table and 64 PMD tables for 64G vmemmap (2M mappings)
With default PUD[PMD]_TABLE_SIZE(4K), (128+1+64)*4K=772K
With PAGE_SIZE(64K) table allocations, (128+1+64)*64K=12352K
2. 2G system
------------
2G RAM would need 2M for vmemmap with struct page size being 64B.
1 PUD table for 2G memory (1G mapping)
1 PUD table and 1 PMD table for 2M vmemmap (2M mappings)
With default PUD[PMD]_TABLE_SIZE(4K), (1+1+1)*4K=12K
With new PAGE_SIZE(64K) table allocations, (1+1+1)*64K=192K
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
arch/powerpc/include/asm/book3s/64/pgalloc.h | 16 +++++++++++++++-
arch/powerpc/mm/book3s64/pgtable.c | 5 ++++-
arch/powerpc/mm/book3s64/radix_pgtable.c | 15 +++++++++++----
arch/powerpc/mm/pgtable-frag.c | 3 +++
4 files changed, 33 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index 69c5b051734f..e1af0b394ceb 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -107,9 +107,23 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
return pud;
}
+static inline void __pud_free(pud_t *pud)
+{
+ struct page *page = virt_to_page(pud);
+
+ /*
+ * Early pud pages allocated via memblock allocator
+ * can't be directly freed to slab
+ */
+ if (PageReserved(page))
+ free_reserved_page(page);
+ else
+ kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
+}
+
static inline void pud_free(struct mm_struct *mm, pud_t *pud)
{
- kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
+ return __pud_free(pud);
}
static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index c58ad1049909..85de5b574dd7 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -339,6 +339,9 @@ void pmd_fragment_free(unsigned long *pmd)
{
struct page *page = virt_to_page(pmd);
+ if (PageReserved(page))
+ return free_reserved_page(page);
+
BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
if (atomic_dec_and_test(&page->pt_frag_refcount)) {
pgtable_pmd_page_dtor(page);
@@ -356,7 +359,7 @@ static inline void pgtable_free(void *table, int index)
pmd_fragment_free(table);
break;
case PUD_INDEX:
- kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), table);
+ __pud_free(table);
break;
#if defined(CONFIG_PPC_4K_PAGES) && defined(CONFIG_HUGETLB_PAGE)
/* 16M hugepd directory at pud level */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index bb00e0cba119..85806a6bed4d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -56,6 +56,13 @@ static __ref void *early_alloc_pgtable(unsigned long size, int nid,
return ptr;
}
+/*
+ * When allocating pud or pmd pointers, we allocate a complete page
+ * of PAGE_SIZE rather than PUD_TABLE_SIZE or PMD_TABLE_SIZE. This
+ * is to ensure that the page obtained from the memblock allocator
+ * can be completely used as page table page and can be freed
+ * correctly when the page table entries are removed.
+ */
static int early_map_kernel_page(unsigned long ea, unsigned long pa,
pgprot_t flags,
unsigned int map_page_size,
@@ -72,8 +79,8 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
pgdp = pgd_offset_k(ea);
p4dp = p4d_offset(pgdp, ea);
if (p4d_none(*p4dp)) {
- pudp = early_alloc_pgtable(PUD_TABLE_SIZE, nid,
- region_start, region_end);
+ pudp = early_alloc_pgtable(PAGE_SIZE, nid,
+ region_start, region_end);
p4d_populate(&init_mm, p4dp, pudp);
}
pudp = pud_offset(p4dp, ea);
@@ -82,8 +89,8 @@ static int early_map_kernel_page(unsigned long ea, unsigned long pa,
goto set_the_pte;
}
if (pud_none(*pudp)) {
- pmdp = early_alloc_pgtable(PMD_TABLE_SIZE, nid,
- region_start, region_end);
+ pmdp = early_alloc_pgtable(PAGE_SIZE, nid, region_start,
+ region_end);
pud_populate(&init_mm, pudp, pmdp);
}
pmdp = pmd_offset(pudp, ea);
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index ee4bd6d38602..97ae4935da79 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -110,6 +110,9 @@ void pte_fragment_free(unsigned long *table, int kernel)
{
struct page *page = virt_to_page(table);
+ if (PageReserved(page))
+ return free_reserved_page(page);
+
BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
if (atomic_dec_and_test(&page->pt_frag_refcount)) {
if (!kernel)
--
2.26.2
^ permalink raw reply related
* [PATCH v3 0/4] powerpc/mm/radix: Memory unplug fixes
From: Aneesh Kumar K.V @ 2020-07-09 13:19 UTC (permalink / raw)
To: linuxppc-dev, mpe; +Cc: Aneesh Kumar K.V, Bharata B Rao
This is the next version of the fixes for memory unplug on radix.
The issues and the fix are described in the actual patches.
Changes from v2:
- Address review feedback
Changes from v1:
- Added back patch to drop split_kernel_mapping
- Most of the split_kernel_mapping related issues are now described
in the removal patch
- drop pte fragment change
- use lmb size as the max mapping size.
- Radix baremetal now use memory block size of 1G.
Changes from v0:
- Rebased to latest kernel.
- Took care of p4d changes.
- Addressed Aneesh's review feedback:
- Added comments.
- Indentation fixed.
- Dropped the 1st patch (setting DRCONF_MEM_HOTREMOVABLE lmb flags) as
it is debatable if this flag should be set in the device tree by OS
and not by platform in case of hotplug. This can be looked at separately.
(The fixes in this patchset remain valid without the dropped patch)
- Dropped the last patch that removed split_kernel_mapping() to ensure
that spilitting code is available for any radix guest running on
platforms that don't set DRCONF_MEM_HOTREMOVABLE.
Aneesh Kumar K.V (2):
powerpc/mm/radix: Fix PTE/PMD fragment count for early page table
mappings
powerpc/mm/radix: Create separate mappings for hot-plugged memory
Bharata B Rao (2):
powerpc/mm/radix: Free PUD table when freeing pagetable
powerpc/mm/radix: Remove split_kernel_mapping()
arch/powerpc/include/asm/book3s/64/mmu.h | 5 +
arch/powerpc/include/asm/book3s/64/pgalloc.h | 16 +-
arch/powerpc/mm/book3s64/pgtable.c | 5 +-
arch/powerpc/mm/book3s64/radix_pgtable.c | 197 +++++++++++--------
arch/powerpc/mm/pgtable-frag.c | 3 +
arch/powerpc/platforms/powernv/setup.c | 10 +-
6 files changed, 147 insertions(+), 89 deletions(-)
--
2.26.2
^ permalink raw reply
* [PATCH] powerpc/watchpoint/ptrace: Introduce PPC_DEBUG_FEATURE_DATA_BP_DAWR_ARCH_31
From: Ravi Bangoria @ 2020-07-09 12:29 UTC (permalink / raw)
To: mpe, mikey
Cc: christophe.leroy, ravi.bangoria, linux-kernel, paulus, pedromfc,
naveen.n.rao, linuxppc-dev
PPC_DEBUG_FEATURE_DATA_BP_DAWR_ARCH_31 can be used to determine
whether we are running on an ISA 3.1 compliant machine. Which is
needed to determine DAR behaviour, 512 byte boundary limit etc.
This was requested by Pedro Miraglia Franco de Carvalho for
extending watchpoint features in gdb. Note that availability of
2nd DAWR is independent of this flag and should be checked using
ppc_debug_info->num_data_bps.
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
---
arch/powerpc/include/uapi/asm/ptrace.h | 1 +
arch/powerpc/kernel/ptrace/ptrace-noadv.c | 5 ++++-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/include/uapi/asm/ptrace.h b/arch/powerpc/include/uapi/asm/ptrace.h
index f5f1ccc740fc..0a87bcd4300a 100644
--- a/arch/powerpc/include/uapi/asm/ptrace.h
+++ b/arch/powerpc/include/uapi/asm/ptrace.h
@@ -222,6 +222,7 @@ struct ppc_debug_info {
#define PPC_DEBUG_FEATURE_DATA_BP_RANGE 0x0000000000000004
#define PPC_DEBUG_FEATURE_DATA_BP_MASK 0x0000000000000008
#define PPC_DEBUG_FEATURE_DATA_BP_DAWR 0x0000000000000010
+#define PPC_DEBUG_FEATURE_DATA_BP_DAWR_ARCH_31 0x0000000000000020
#ifndef __ASSEMBLY__
diff --git a/arch/powerpc/kernel/ptrace/ptrace-noadv.c b/arch/powerpc/kernel/ptrace/ptrace-noadv.c
index 697c7e4b5877..b2de874d650b 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-noadv.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-noadv.c
@@ -52,8 +52,11 @@ void ppc_gethwdinfo(struct ppc_debug_info *dbginfo)
dbginfo->sizeof_condition = 0;
if (IS_ENABLED(CONFIG_HAVE_HW_BREAKPOINT)) {
dbginfo->features = PPC_DEBUG_FEATURE_DATA_BP_RANGE;
- if (dawr_enabled())
+ if (dawr_enabled()) {
dbginfo->features |= PPC_DEBUG_FEATURE_DATA_BP_DAWR;
+ if (cpu_has_feature(CPU_FTR_ARCH_31))
+ dbginfo->features |= PPC_DEBUG_FEATURE_DATA_BP_DAWR_ARCH_31;
+ }
} else {
dbginfo->features = 0;
}
--
2.26.2
^ permalink raw reply related
* RE: [PATCH 14/20] Documentation: misc/xilinx_sdfec: eliminate duplicated word
From: Dragan Cvetic @ 2020-07-09 11:38 UTC (permalink / raw)
To: Randy Dunlap, linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org, linux-doc@vger.kernel.org, David Airlie,
kgdb-bugreport@lists.sourceforge.net, linux-fpga@vger.kernel.org,
Liviu Dudau, dri-devel@lists.freedesktop.org,
linux-mips@vger.kernel.org, Paul Cercueil,
keyrings@vger.kernel.org, Paul Mackerras,
linux-i2c@vger.kernel.org, Pavel Machek, Srinivas Pandruvada,
Mihail Atanassov, linux-leds@vger.kernel.org,
linux-s390@vger.kernel.org, Daniel Thompson,
linux-scsi@vger.kernel.org, Jonathan Corbet, Masahiro Yamada,
Matthew Wilcox, Halil Pasic, Jarkko Sakkinen, James Wang,
linux-input@vger.kernel.org, Mali DP Maintainers, Wu Hao,
Tony Krowiak, linux-kbuild@vger.kernel.org, James E.J. Bottomley,
Jiri Kosina, Hannes Reinecke, linux-block@vger.kernel.org,
Thomas Bogendoerfer, Jacek Anaszewski, linux-mm@vger.kernel.org,
Dan Williams, Andrew Morton, Mimi Zohar, Jens Axboe, Michal Marek,
Martin K. Petersen, Pierre Morel, Douglas Anderson, Wolfram Sang,
Derek Kiernan, Daniel Vetter, Jason Wessel, Paolo Bonzini,
linux-integrity@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
Mike Rapoport, Dan Murphy
In-Reply-To: <20200707180414.10467-15-rdunlap@infradead.org>
> -----Original Message-----
> From: Randy Dunlap <rdunlap@infradead.org>
> Sent: Tuesday 7 July 2020 19:04
> To: linux-kernel@vger.kernel.org
> Cc: Randy Dunlap <rdunlap@infradead.org>; Jonathan Corbet <corbet@lwn.net>; linux-doc@vger.kernel.org; linux-
> mm@vger.kernel.org; Mike Rapoport <rppt@kernel.org>; Jens Axboe <axboe@kernel.dk>; linux-block@vger.kernel.org; Jason
> Wessel <jason.wessel@windriver.com>; Daniel Thompson <daniel.thompson@linaro.org>; Douglas Anderson
> <dianders@chromium.org>; kgdb-bugreport@lists.sourceforge.net; Wu Hao <hao.wu@intel.com>; linux-fpga@vger.kernel.org;
> James Wang <james.qian.wang@arm.com>; Liviu Dudau <liviu.dudau@arm.com>; Mihail Atanassov <mihail.atanassov@arm.com>;
> Mali DP Maintainers <malidp@foss.arm.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; dri-
> devel@lists.freedesktop.org; Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>; Jiri Kosina <jikos@kernel.org>; linux-
> input@vger.kernel.org; Wolfram Sang <wsa@kernel.org>; linux-i2c@vger.kernel.org; Masahiro Yamada <masahiroy@kernel.org>;
> Michal Marek <michal.lkml@markovi.net>; linux-kbuild@vger.kernel.org; Jacek Anaszewski <jacek.anaszewski@gmail.com>; Pavel
> Machek <pavel@ucw.cz>; Dan Murphy <dmurphy@ti.com>; linux-leds@vger.kernel.org; Dan Williams <dan.j.williams@intel.com>;
> Paul Cercueil <paul@crapouillou.net>; Thomas Bogendoerfer <tsbogend@alpha.franken.de>; linux-mips@vger.kernel.org; Derek
> Kiernan <dkiernan@xilinx.com>; Dragan Cvetic <draganc@xilinx.com>; Michael Ellerman <mpe@ellerman.id.au>; Benjamin
> Herrenschmidt <benh@kernel.crashing.org>; Paul Mackerras <paulus@samba.org>; linuxppc-dev@lists.ozlabs.org; Tony Krowiak
> <akrowiak@linux.ibm.com>; Pierre Morel <pmorel@linux.ibm.com>; Halil Pasic <pasic@linux.ibm.com>; linux-s390@vger.kernel.org;
> Matthew Wilcox <willy@infradead.org>; Hannes Reinecke <hare@suse.com>; linux-scsi@vger.kernel.org; James E.J. Bottomley
> <jejb@linux.ibm.com>; Martin K. Petersen <martin.petersen@oracle.com>; Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>;
> Mimi Zohar <zohar@linux.ibm.com>; linux-integrity@vger.kernel.org; keyrings@vger.kernel.org; Paolo Bonzini
> <pbonzini@redhat.com>; kvm@vger.kernel.org; Andrew Morton <akpm@linux-foundation.org>
> Subject: [PATCH 14/20] Documentation: misc/xilinx_sdfec: eliminate duplicated word
>
> Drop the doubled word "the".
>
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: linux-doc@vger.kernel.org
> Cc: Derek Kiernan <derek.kiernan@xilinx.com>
> Cc: Dragan Cvetic <dragan.cvetic@xilinx.com>
> ---
> Documentation/misc-devices/xilinx_sdfec.rst | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- linux-next-20200701.orig/Documentation/misc-devices/xilinx_sdfec.rst
> +++ linux-next-20200701/Documentation/misc-devices/xilinx_sdfec.rst
> @@ -78,7 +78,7 @@ application interfaces:
> - open: Implements restriction that only a single file descriptor can be open per SD-FEC instance at any time
> - release: Allows another file descriptor to be open, that is after current file descriptor is closed
> - poll: Provides a method to monitor for SD-FEC Error events
> - - unlocked_ioctl: Provides the the following ioctl commands that allows the application configure the SD-FEC core:
> + - unlocked_ioctl: Provides the following ioctl commands that allows the application configure the SD-FEC core:
>
> - :c:macro:`XSDFEC_START_DEV`
> - :c:macro:`XSDFEC_STOP_DEV`
Acked-by: Dragan Cvetic <dragan.cvetic@xilinx.com>
Thanks Randy
Dragan
^ permalink raw reply
* Re: [PATCH v2 04/10] powerpc/perf: Add power10_feat to dt_cpu_ftrs
From: Athira Rajeev @ 2020-07-09 11:07 UTC (permalink / raw)
To: Michael Ellerman; +Cc: Michael Neuling, maddy, linuxppc-dev
In-Reply-To: <87y2nu2r7f.fsf@mpe.ellerman.id.au>
[-- Attachment #1: Type: text/plain, Size: 1829 bytes --]
> On 08-Jul-2020, at 4:45 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> Athira Rajeev <atrajeev@linux.vnet.ibm.com <mailto:atrajeev@linux.vnet.ibm.com>> writes:
>> From: Madhavan Srinivasan <maddy@linux.ibm.com>
>>
>> Add power10 feature function to dt_cpu_ftrs.c along
>> with a power10 specific init() to initialize pmu sprs.
>>
>> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
>> ---
>> arch/powerpc/include/asm/reg.h | 3 +++
>> arch/powerpc/kernel/cpu_setup_power.S | 7 +++++++
>> arch/powerpc/kernel/dt_cpu_ftrs.c | 26 ++++++++++++++++++++++++++
>> 3 files changed, 36 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
>> index 21a1b2d..900ada1 100644
>> --- a/arch/powerpc/include/asm/reg.h
>> +++ b/arch/powerpc/include/asm/reg.h
>> @@ -1068,6 +1068,9 @@
>> #define MMCR0_PMC2_LOADMISSTIME 0x5
>> #endif
>>
>> +/* BHRB disable bit for PowerISA v3.10 */
>> +#define MMCRA_BHRB_DISABLE 0x0000002000000000
>> +
>> /*
>> * SPRG usage:
>> *
>> diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
>> index efdcfa7..e8b3370c 100644
>> --- a/arch/powerpc/kernel/cpu_setup_power.S
>> +++ b/arch/powerpc/kernel/cpu_setup_power.S
>> @@ -233,3 +233,10 @@ __init_PMU_ISA207:
>> li r5,0
>> mtspr SPRN_MMCRS,r5
>> blr
>> +
>> +__init_PMU_ISA31:
>> + li r5,0
>> + mtspr SPRN_MMCR3,r5
>> + LOAD_REG_IMMEDIATE(r5, MMCRA_BHRB_DISABLE)
>> + mtspr SPRN_MMCRA,r5
>> + blr
>
> This doesn't seem like it belongs in this patch. It's not called?
Yes, you are right, this needs to be called from `__setup_cpu_power10`.
Since we didn’t had setup part for power10 in the tree initially, missed it.
I will include this update in V3
Thanks
Athira
>
> cheers
[-- Attachment #2: Type: text/html, Size: 7476 bytes --]
^ permalink raw reply
* Re: [PATCH 2/2] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
From: Pratik Sampat @ 2020-07-09 11:21 UTC (permalink / raw)
To: ego; +Cc: Ravi Bangoria, pratik.r.sampat, linux-kernel, paulus,
linuxppc-dev
In-Reply-To: <20200709090948.GB24354@in.ibm.com>
On 09/07/20 2:39 pm, Gautham R Shenoy wrote:
> On Fri, Jul 03, 2020 at 06:16:40PM +0530, Pratik Rajesh Sampat wrote:
>> Additional registers DAWR0, DAWRX0 may be lost on Power 10 for
>> stop levels < 4.
> Adding Ravi Bangoria <ravi.bangoria@linux.ibm.com> to the cc.
>
>> Therefore save the values of these SPRs before entering a "stop"
>> state and restore their values on wakeup.
>>
>> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
>
> The saving and restoration looks good to me.
>> ---
>> arch/powerpc/platforms/powernv/idle.c | 10 ++++++++++
>> 1 file changed, 10 insertions(+)
>>
>> diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
>> index 19d94d021357..471d4a65b1fa 100644
>> --- a/arch/powerpc/platforms/powernv/idle.c
>> +++ b/arch/powerpc/platforms/powernv/idle.c
>> @@ -600,6 +600,8 @@ struct p9_sprs {
>> u64 iamr;
>> u64 amor;
>> u64 uamor;
>> + u64 dawr0;
>> + u64 dawrx0;
>> };
>>
>> static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
>> @@ -677,6 +679,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
>> sprs.tscr = mfspr(SPRN_TSCR);
>> if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
>> sprs.ldbar = mfspr(SPRN_LDBAR);
>> + if (cpu_has_feature(CPU_FTR_ARCH_31)) {
>> + sprs.dawr0 = mfspr(SPRN_DAWR0);
>> + sprs.dawrx0 = mfspr(SPRN_DAWRX0);
>> + }
>>
>
> But this is within the if condition which says
>
> if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level)
>
> This if condition is meant for stop4 and stop5 since these are stop
> levels that have OPAL_PM_LOSE_HYP_CONTEXT set.
>
> Since we can lose DAWR*, on states that lose limited hypervisor
> context, such as stop0-2, we need to unconditionally save them
> like AMR, IAMR etc.
>
Right, shallow states too loose DAWR/X. Thanks for pointing it out.
I'll fix this and resend.
>> sprs_saved = true;
>>
>> @@ -792,6 +798,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
>> mtspr(SPRN_MMCR2, sprs.mmcr2);
>> if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
>> mtspr(SPRN_LDBAR, sprs.ldbar);
>> + if (cpu_has_feature(CPU_FTR_ARCH_31)) {
>> + mtspr(SPRN_DAWR0, sprs.dawr0);
>> + mtspr(SPRN_DAWRX0, sprs.dawrx0);
>> + }
>
> Likewise, we need to unconditionally restore these SPRs.
>
>
>> mtspr(SPRN_SPRG3, local_paca->sprg_vdso);
>>
>> --
>> 2.25.4
>>
Thanks
Pratik
^ permalink raw reply
* Re: [PATCH 1/2] powerpc/vas: Report proper error for address translation failure
From: Michael Ellerman @ 2020-07-09 11:22 UTC (permalink / raw)
To: Haren Myneni; +Cc: tulioqm, abali, linuxppc-dev, rzinsly
In-Reply-To: <f8af60fd4167c9c04ee5ab47147b9e95bcb3b9ff.camel@linux.ibm.com>
Haren Myneni <haren@linux.ibm.com> writes:
> DMA controller uses CC=5 internally for translation fault handling. So
> OS should be using CC=250 and should report this error to the user space
> when NX encounters address translation failure on the request buffer.
That doesn't really explain *why* the OS must use CC=250.
Is it documented somewhere that 5 is for hardware use, and 250 is for
software?
> This patch defines CSB_CC_ADDRESS_TRANSLATION(250) and updates
> CSB.CC with this proper error code for user space.
We still have:
#define CSB_CC_TRANSLATION (5)
And it's very unclear where one or the other should be used.
Can one or the other get a name that makes the distinction clear.
cheers
> diff --git a/Documentation/powerpc/vas-api.rst b/Documentation/powerpc/vas-api.rst
> index 1217c2f..78627cc 100644
> --- a/Documentation/powerpc/vas-api.rst
> +++ b/Documentation/powerpc/vas-api.rst
> @@ -213,7 +213,7 @@ request buffers are not in memory. The operating system handles the fault by
> updating CSB with the following data:
>
> csb.flags = CSB_V;
> - csb.cc = CSB_CC_TRANSLATION;
> + csb.cc = CSB_CC_ADDRESS_TRANSLATION;
> csb.ce = CSB_CE_TERMINATION;
> csb.address = fault_address;
>
> diff --git a/arch/powerpc/include/asm/icswx.h b/arch/powerpc/include/asm/icswx.h
> index 965b1f3..b1c9a57 100644
> --- a/arch/powerpc/include/asm/icswx.h
> +++ b/arch/powerpc/include/asm/icswx.h
> @@ -77,6 +77,8 @@ struct coprocessor_completion_block {
> #define CSB_CC_CHAIN (37)
> #define CSB_CC_SEQUENCE (38)
> #define CSB_CC_HW (39)
> +/* User space address traslation failure */
> +#define CSB_CC_ADDRESS_TRANSLATION (250)
>
> #define CSB_SIZE (0x10)
> #define CSB_ALIGN CSB_SIZE
> diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/powerpc/platforms/powernv/vas-fault.c
> index 266a6ca..33e89d4 100644
> --- a/arch/powerpc/platforms/powernv/vas-fault.c
> +++ b/arch/powerpc/platforms/powernv/vas-fault.c
> @@ -79,7 +79,7 @@ static void update_csb(struct vas_window *window,
> csb_addr = (void __user *)be64_to_cpu(crb->csb_addr);
>
> memset(&csb, 0, sizeof(csb));
> - csb.cc = CSB_CC_TRANSLATION;
> + csb.cc = CSB_CC_ADDRESS_TRANSLATION;
> csb.ce = CSB_CE_TERMINATION;
> csb.cs = 0;
> csb.count = 0;
> --
> 1.8.3.1
^ permalink raw reply
* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Alex Ghiti @ 2020-07-09 11:11 UTC (permalink / raw)
To: Palmer Dabbelt
Cc: aou, Anup Patel, linux-kernel, Atish Patra, paulus, zong.li,
Paul Walmsley, linux-riscv, linuxppc-dev
In-Reply-To: <mhng-831c4073-aefa-4aa0-a583-6a17f9aff9b7@palmerdabbelt-glaptop1>
Hi Palmer,
Le 7/9/20 à 1:05 AM, Palmer Dabbelt a écrit :
> On Sun, 07 Jun 2020 00:59:46 PDT (-0700), alex@ghiti.fr wrote:
>> This is a preparatory patch for relocatable kernel.
>>
>> The kernel used to be linked at PAGE_OFFSET address and used to be loaded
>> physically at the beginning of the main memory. Therefore, we could use
>> the linear mapping for the kernel mapping.
>>
>> But the relocated kernel base address will be different from PAGE_OFFSET
>> and since in the linear mapping, two different virtual addresses cannot
>> point to the same physical address, the kernel mapping needs to lie
>> outside
>> the linear mapping.
>
> I know it's been a while, but I keep opening this up to review it and just
> can't get over how ugly it is to put the kernel's linear map in the vmalloc
> region.
>
> I guess I don't understand why this is necessary at all. Specifically: why
> can't we just relocate the kernel within the linear map? That would let
> the
> bootloader put the kernel wherever it wants, modulo the physical memory
> size we
> support. We'd need to handle the regions that are coupled to the kernel's
> execution address, but we could just put them in an explicit memory region
> which is what we should probably be doing anyway.
Virtual relocation in the linear mapping requires to move the kernel
physically too. Zong implemented this physical move in its KASLR RFC
patchset, which is cumbersome since finding an available physical spot
is harder than just selecting a virtual range in the vmalloc range.
In addition, having the kernel mapping in the linear mapping prevents
the use of hugepage for the linear mapping resulting in performance loss
(at least for the GB that encompasses the kernel).
Why do you find this "ugly" ? The vmalloc region is just a bunch of
available virtual addresses to whatever purpose we want, and as noted by
Zong, arm64 uses the same scheme.
>
>> In addition, because modules and BPF must be close to the kernel (inside
>> +-2GB window), the kernel is placed at the end of the vmalloc zone minus
>> 2GB, which leaves room for modules and BPF. The kernel could not be
>> placed at the beginning of the vmalloc zone since other vmalloc
>> allocations from the kernel could get all the +-2GB window around the
>> kernel which would prevent new modules and BPF programs to be loaded.
>
> Well, that's not enough to make sure this doesn't happen -- it's just
> enough to
> make sure it doesn't happen very quickily. That's the same boat we're
> already
> in, though, so it's not like it's worse.
Indeed, that's not worse, I haven't found a way to reserve vmalloc area
without actually allocating it.
>
>> Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
>> Reviewed-by: Zong Li <zong.li@sifive.com>
>> ---
>> arch/riscv/boot/loader.lds.S | 3 +-
>> arch/riscv/include/asm/page.h | 10 +++++-
>> arch/riscv/include/asm/pgtable.h | 38 ++++++++++++++-------
>> arch/riscv/kernel/head.S | 3 +-
>> arch/riscv/kernel/module.c | 4 +--
>> arch/riscv/kernel/vmlinux.lds.S | 3 +-
>> arch/riscv/mm/init.c | 58 +++++++++++++++++++++++++-------
>> arch/riscv/mm/physaddr.c | 2 +-
>> 8 files changed, 88 insertions(+), 33 deletions(-)
>>
>> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
>> index 47a5003c2e28..62d94696a19c 100644
>> --- a/arch/riscv/boot/loader.lds.S
>> +++ b/arch/riscv/boot/loader.lds.S
>> @@ -1,13 +1,14 @@
>> /* SPDX-License-Identifier: GPL-2.0 */
>>
>> #include <asm/page.h>
>> +#include <asm/pgtable.h>
>>
>> OUTPUT_ARCH(riscv)
>> ENTRY(_start)
>>
>> SECTIONS
>> {
>> - . = PAGE_OFFSET;
>> + . = KERNEL_LINK_ADDR;
>>
>> .payload : {
>> *(.payload)
>> diff --git a/arch/riscv/include/asm/page.h
>> b/arch/riscv/include/asm/page.h
>> index 2d50f76efe48..48bb09b6a9b7 100644
>> --- a/arch/riscv/include/asm/page.h
>> +++ b/arch/riscv/include/asm/page.h
>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>
>> #ifdef CONFIG_MMU
>> extern unsigned long va_pa_offset;
>> +extern unsigned long va_kernel_pa_offset;
>> extern unsigned long pfn_base;
>> #define ARCH_PFN_OFFSET (pfn_base)
>> #else
>> #define va_pa_offset 0
>> +#define va_kernel_pa_offset 0
>> #define ARCH_PFN_OFFSET (PAGE_OFFSET >> PAGE_SHIFT)
>> #endif /* CONFIG_MMU */
>>
>> extern unsigned long max_low_pfn;
>> extern unsigned long min_low_pfn;
>> +extern unsigned long kernel_virt_addr;
>>
>> #define __pa_to_va_nodebug(x) ((void *)((unsigned long) (x) +
>> va_pa_offset))
>> -#define __va_to_pa_nodebug(x) ((unsigned long)(x) - va_pa_offset)
>> +#define linear_mapping_va_to_pa(x) ((unsigned long)(x) -
>> va_pa_offset)
>> +#define kernel_mapping_va_to_pa(x) \
>> + ((unsigned long)(x) - va_kernel_pa_offset)
>> +#define __va_to_pa_nodebug(x) \
>> + (((x) >= PAGE_OFFSET) ? \
>> + linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>>
>> #ifdef CONFIG_DEBUG_VIRTUAL
>> extern phys_addr_t __virt_to_phys(unsigned long x);
>> diff --git a/arch/riscv/include/asm/pgtable.h
>> b/arch/riscv/include/asm/pgtable.h
>> index 35b60035b6b0..94ef3b49dfb6 100644
>> --- a/arch/riscv/include/asm/pgtable.h
>> +++ b/arch/riscv/include/asm/pgtable.h
>> @@ -11,23 +11,29 @@
>>
>> #include <asm/pgtable-bits.h>
>>
>> -#ifndef __ASSEMBLY__
>> -
>> -/* Page Upper Directory not used in RISC-V */
>> -#include <asm-generic/pgtable-nopud.h>
>> -#include <asm/page.h>
>> -#include <asm/tlbflush.h>
>> -#include <linux/mm_types.h>
>> -
>> -#ifdef CONFIG_MMU
>> +#ifndef CONFIG_MMU
>> +#define KERNEL_VIRT_ADDR PAGE_OFFSET
>> +#define KERNEL_LINK_ADDR PAGE_OFFSET
>> +#else
>> +/*
>> + * Leave 2GB for modules and BPF that must lie within a 2GB range around
>> + * the kernel.
>> + */
>> +#define KERNEL_VIRT_ADDR (VMALLOC_END - SZ_2G + 1)
>> +#define KERNEL_LINK_ADDR KERNEL_VIRT_ADDR
>
> At a bare minimum this is going to make a mess of the 32-bit port, as
> non-relocatable kernels are now going to get linked at 1GiB which is
> where user
> code is supposed to live. That's an easy fix, though, as the 32-bit stuff
> doesn't need any module address restrictions.
Indeed, I will take a look at that.
>
>> #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
>> #define VMALLOC_END (PAGE_OFFSET - 1)
>> #define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)
>>
>> #define BPF_JIT_REGION_SIZE (SZ_128M)
>> -#define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>> -#define BPF_JIT_REGION_END (VMALLOC_END)
>> +#define BPF_JIT_REGION_START PFN_ALIGN((unsigned long)&_end)
>> +#define BPF_JIT_REGION_END (BPF_JIT_REGION_START +
>> BPF_JIT_REGION_SIZE)
>> +
>> +#ifdef CONFIG_64BIT
>> +#define VMALLOC_MODULE_START BPF_JIT_REGION_END
>> +#define VMALLOC_MODULE_END (((unsigned long)&_start & PAGE_MASK) +
>> SZ_2G)
>> +#endif
>>
>> /*
>> * Roughly size the vmemmap space to be large enough to fit enough
>> @@ -57,9 +63,16 @@
>> #define FIXADDR_SIZE PGDIR_SIZE
>> #endif
>> #define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE)
>> -
>> #endif
>>
>> +#ifndef __ASSEMBLY__
>> +
>> +/* Page Upper Directory not used in RISC-V */
>> +#include <asm-generic/pgtable-nopud.h>
>> +#include <asm/page.h>
>> +#include <asm/tlbflush.h>
>> +#include <linux/mm_types.h>
>> +
>> #ifdef CONFIG_64BIT
>> #include <asm/pgtable-64.h>
>> #else
>> @@ -483,6 +496,7 @@ static inline void __kernel_map_pages(struct page
>> *page, int numpages, int enabl
>>
>> #define kern_addr_valid(addr) (1) /* FIXME */
>>
>> +extern char _start[];
>> extern void *dtb_early_va;
>> void setup_bootmem(void);
>> void paging_init(void);
>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>> index 98a406474e7d..8f5bb7731327 100644
>> --- a/arch/riscv/kernel/head.S
>> +++ b/arch/riscv/kernel/head.S
>> @@ -49,7 +49,8 @@ ENTRY(_start)
>> #ifdef CONFIG_MMU
>> relocate:
>> /* Relocate return address */
>> - li a1, PAGE_OFFSET
>> + la a1, kernel_virt_addr
>> + REG_L a1, 0(a1)
>> la a2, _start
>> sub a1, a1, a2
>> add ra, ra, a1
>> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
>> index 8bbe5dbe1341..1a8fbe05accf 100644
>> --- a/arch/riscv/kernel/module.c
>> +++ b/arch/riscv/kernel/module.c
>> @@ -392,12 +392,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const
>> char *strtab,
>> }
>>
>> #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
>> -#define VMALLOC_MODULE_START \
>> - max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>> void *module_alloc(unsigned long size)
>> {
>> return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
>> - VMALLOC_END, GFP_KERNEL,
>> + VMALLOC_MODULE_END, GFP_KERNEL,
>> PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
>> __builtin_return_address(0));
>> }
>> diff --git a/arch/riscv/kernel/vmlinux.lds.S
>> b/arch/riscv/kernel/vmlinux.lds.S
>> index 0339b6bbe11a..a9abde62909f 100644
>> --- a/arch/riscv/kernel/vmlinux.lds.S
>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>> @@ -4,7 +4,8 @@
>> * Copyright (C) 2017 SiFive
>> */
>>
>> -#define LOAD_OFFSET PAGE_OFFSET
>> +#include <asm/pgtable.h>
>> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>> #include <asm/vmlinux.lds.h>
>> #include <asm/page.h>
>> #include <asm/cache.h>
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 736de6c8739f..71da78914645 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -22,6 +22,9 @@
>>
>> #include "../kernel/head.h"
>>
>> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
>> +EXPORT_SYMBOL(kernel_virt_addr);
>> +
>> unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>> __page_aligned_bss;
>> EXPORT_SYMBOL(empty_zero_page);
>> @@ -178,8 +181,12 @@ void __init setup_bootmem(void)
>> }
>>
>> #ifdef CONFIG_MMU
>> +/* Offset between linear mapping virtual address and kernel load
>> address */
>> unsigned long va_pa_offset;
>> EXPORT_SYMBOL(va_pa_offset);
>> +/* Offset between kernel mapping virtual address and kernel load
>> address */
>> +unsigned long va_kernel_pa_offset;
>> +EXPORT_SYMBOL(va_kernel_pa_offset);
>> unsigned long pfn_base;
>> EXPORT_SYMBOL(pfn_base);
>>
>> @@ -271,7 +278,7 @@ static phys_addr_t __init alloc_pmd(uintptr_t va)
>> if (mmu_enabled)
>> return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>
>> - pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>> + pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>> BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>> return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>> }
>> @@ -372,14 +379,30 @@ static uintptr_t __init
>> best_map_size(phys_addr_t base, phys_addr_t size)
>> #error "setup_vm() is called from head.S before relocate so it should
>> not use absolute addressing."
>> #endif
>>
>> +static uintptr_t load_pa, load_sz;
>> +
>> +static void __init create_kernel_page_table(pgd_t *pgdir, uintptr_t
>> map_size)
>> +{
>> + uintptr_t va, end_va;
>> +
>> + end_va = kernel_virt_addr + load_sz;
>> + for (va = kernel_virt_addr; va < end_va; va += map_size)
>> + create_pgd_mapping(pgdir, va,
>> + load_pa + (va - kernel_virt_addr),
>> + map_size, PAGE_KERNEL_EXEC);
>> +}
>> +
>> asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>> {
>> uintptr_t va, end_va;
>> - uintptr_t load_pa = (uintptr_t)(&_start);
>> - uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>> uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
>>
>> + load_pa = (uintptr_t)(&_start);
>> + load_sz = (uintptr_t)(&_end) - load_pa;
>> +
>> va_pa_offset = PAGE_OFFSET - load_pa;
>> + va_kernel_pa_offset = kernel_virt_addr - load_pa;
>> +
>> pfn_base = PFN_DOWN(load_pa);
>>
>> /*
>> @@ -402,26 +425,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>> create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>> (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>> /* Setup trampoline PGD and PMD */
>> - create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>> + create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>> (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>> - create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>> + create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>> load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>> #else
>> /* Setup trampoline PGD */
>> - create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>> + create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>> load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>> #endif
>>
>> /*
>> - * Setup early PGD covering entire kernel which will allows
>> + * Setup early PGD covering entire kernel which will allow
>> * us to reach paging_init(). We map all memory banks later
>> * in setup_vm_final() below.
>> */
>> - end_va = PAGE_OFFSET + load_sz;
>> - for (va = PAGE_OFFSET; va < end_va; va += map_size)
>> - create_pgd_mapping(early_pg_dir, va,
>> - load_pa + (va - PAGE_OFFSET),
>> - map_size, PAGE_KERNEL_EXEC);
>> + create_kernel_page_table(early_pg_dir, map_size);
>>
>> /* Create fixed mapping for early FDT parsing */
>> end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
>> @@ -441,6 +460,7 @@ static void __init setup_vm_final(void)
>> uintptr_t va, map_size;
>> phys_addr_t pa, start, end;
>> struct memblock_region *reg;
>> + static struct vm_struct vm_kernel = { 0 };
>>
>> /* Set mmu_enabled flag */
>> mmu_enabled = true;
>> @@ -467,10 +487,22 @@ static void __init setup_vm_final(void)
>> for (pa = start; pa < end; pa += map_size) {
>> va = (uintptr_t)__va(pa);
>> create_pgd_mapping(swapper_pg_dir, va, pa,
>> - map_size, PAGE_KERNEL_EXEC);
>> + map_size, PAGE_KERNEL);
>> }
>> }
>>
>> + /* Map the kernel */
>> + create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
>> +
>> + /* Reserve the vmalloc area occupied by the kernel */
>> + vm_kernel.addr = (void *)kernel_virt_addr;
>> + vm_kernel.phys_addr = load_pa;
>> + vm_kernel.size = (load_sz + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
>> + vm_kernel.flags = VM_MAP | VM_NO_GUARD;
>> + vm_kernel.caller = __builtin_return_address(0);
>> +
>> + vm_area_add_early(&vm_kernel);
>> +
>> /* Clear fixmap PTE and PMD mappings */
>> clear_fixmap(FIX_PTE);
>> clear_fixmap(FIX_PMD);
>> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
>> index e8e4dcd39fed..35703d5ef5fd 100644
>> --- a/arch/riscv/mm/physaddr.c
>> +++ b/arch/riscv/mm/physaddr.c
>> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>>
>> phys_addr_t __phys_addr_symbol(unsigned long x)
>> {
>> - unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
>> + unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>> unsigned long kernel_end = (unsigned long)_end;
>>
>> /*
Alex
^ permalink raw reply
* Re: [PATCH 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping
From: Michael Ellerman @ 2020-07-09 11:13 UTC (permalink / raw)
To: Laurent Dufour, bharata
Cc: Ram Pai, linux-kernel, kvm-ppc, paulus, sathnaga, sukadev,
linuxppc-dev, bauerman
In-Reply-To: <0588d16a-8548-0f55-1132-400807a390a1@linux.ibm.com>
Laurent Dufour <ldufour@linux.ibm.com> writes:
> Le 08/07/2020 à 13:25, Bharata B Rao a écrit :
>> On Fri, Jul 03, 2020 at 05:59:14PM +0200, Laurent Dufour wrote:
>>> When a secure memslot is dropped, all the pages backed in the secure device
>>> (aka really backed by secure memory by the Ultravisor) should be paged out
>>> to a normal page. Previously, this was achieved by triggering the page
>>> fault mechanism which is calling kvmppc_svm_page_out() on each pages.
>>>
>>> This can't work when hot unplugging a memory slot because the memory slot
>>> is flagged as invalid and gfn_to_pfn() is then not trying to access the
>>> page, so the page fault mechanism is not triggered.
>>>
>>> Since the final goal is to make a call to kvmppc_svm_page_out() it seems
>>> simpler to directly calling it instead of triggering such a mechanism. This
>>> way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
>>> memslot.
>>
>> Yes, this appears much simpler.
>
> Thanks Bharata for reviewing this.
>
>>
>>>
>>> Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
>>> the call to __kvmppc_svm_page_out() is made.
>>> As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
>>> VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
>>> addition, the mmap_sem is help in read mode during that time, not in write
>>> mode since the virual memory layout is not impacted, and
>>> kvm->arch.uvmem_lock prevents concurrent operation on the secure device.
>>>
>>> Cc: Ram Pai <linuxram@us.ibm.com>
>>> Cc: Bharata B Rao <bharata@linux.ibm.com>
>>> Cc: Paul Mackerras <paulus@ozlabs.org>
>>> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
>>> ---
>>> arch/powerpc/kvm/book3s_hv_uvmem.c | 54 ++++++++++++++++++++----------
>>> 1 file changed, 37 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
>>> index 852cc9ae6a0b..479ddf16d18c 100644
>>> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
>>> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
>>> @@ -533,35 +533,55 @@ static inline int kvmppc_svm_page_out(struct vm_area_struct *vma,
>>> * fault on them, do fault time migration to replace the device PTEs in
>>> * QEMU page table with normal PTEs from newly allocated pages.
>>> */
>>> -void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
>>> +void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
>>> struct kvm *kvm, bool skip_page_out)
>>> {
>>> int i;
>>> struct kvmppc_uvmem_page_pvt *pvt;
>>> - unsigned long pfn, uvmem_pfn;
>>> - unsigned long gfn = free->base_gfn;
>>> + struct page *uvmem_page;
>>> + struct vm_area_struct *vma = NULL;
>>> + unsigned long uvmem_pfn, gfn;
>>> + unsigned long addr, end;
>>> +
>>> + down_read(&kvm->mm->mmap_sem);
>>
>> You should be using mmap_read_lock(kvm->mm) with recent kernels.
>
> Absolutely, shame on me, I reviewed Michel's series about that!
>
> Paul, Michael, could you fix that when pulling this patch or should I sent a
> whole new series?
Paul will take this series, so up to him.
cheers
^ permalink raw reply
* Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
From: Peter Zijlstra @ 2020-07-09 11:03 UTC (permalink / raw)
To: Michael Ellerman
Cc: linux-arch, linuxppc-dev, Boqun Feng, linux-kernel,
Nicholas Piggin, virtualization, Ingo Molnar, kvm-ppc,
Waiman Long, Will Deacon
In-Reply-To: <874kqhvu1v.fsf@mpe.ellerman.id.au>
On Thu, Jul 09, 2020 at 08:53:16PM +1000, Michael Ellerman wrote:
> Nicholas Piggin <npiggin@gmail.com> writes:
>
> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> > ---
> > arch/powerpc/include/asm/paravirt.h | 28 ++++++++
> > arch/powerpc/include/asm/qspinlock.h | 66 +++++++++++++++++++
> > arch/powerpc/include/asm/qspinlock_paravirt.h | 7 ++
> > arch/powerpc/platforms/pseries/Kconfig | 5 ++
> > arch/powerpc/platforms/pseries/setup.c | 6 +-
> > include/asm-generic/qspinlock.h | 2 +
>
> Another ack?
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
^ permalink raw reply
* Re: [PATCH 2/2] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
From: Gautham R Shenoy @ 2020-07-09 9:09 UTC (permalink / raw)
To: Pratik Rajesh Sampat
Cc: ego, pratik.r.sampat, linux-kernel, paulus, linuxppc-dev,
Ravi Bangoria
In-Reply-To: <20200703124640.42820-2-psampat@linux.ibm.com>
On Fri, Jul 03, 2020 at 06:16:40PM +0530, Pratik Rajesh Sampat wrote:
> Additional registers DAWR0, DAWRX0 may be lost on Power 10 for
> stop levels < 4.
Adding Ravi Bangoria <ravi.bangoria@linux.ibm.com> to the cc.
> Therefore save the values of these SPRs before entering a "stop"
> state and restore their values on wakeup.
>
> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
The saving and restoration looks good to me.
> ---
> arch/powerpc/platforms/powernv/idle.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
> index 19d94d021357..471d4a65b1fa 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -600,6 +600,8 @@ struct p9_sprs {
> u64 iamr;
> u64 amor;
> u64 uamor;
> + u64 dawr0;
> + u64 dawrx0;
> };
>
> static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
> @@ -677,6 +679,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
> sprs.tscr = mfspr(SPRN_TSCR);
> if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
> sprs.ldbar = mfspr(SPRN_LDBAR);
> + if (cpu_has_feature(CPU_FTR_ARCH_31)) {
> + sprs.dawr0 = mfspr(SPRN_DAWR0);
> + sprs.dawrx0 = mfspr(SPRN_DAWRX0);
> + }
>
But this is within the if condition which says
if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level)
This if condition is meant for stop4 and stop5 since these are stop
levels that have OPAL_PM_LOSE_HYP_CONTEXT set.
Since we can lose DAWR*, on states that lose limited hypervisor
context, such as stop0-2, we need to unconditionally save them
like AMR, IAMR etc.
> sprs_saved = true;
>
> @@ -792,6 +798,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
> mtspr(SPRN_MMCR2, sprs.mmcr2);
> if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
> mtspr(SPRN_LDBAR, sprs.ldbar);
> + if (cpu_has_feature(CPU_FTR_ARCH_31)) {
> + mtspr(SPRN_DAWR0, sprs.dawr0);
> + mtspr(SPRN_DAWRX0, sprs.dawrx0);
> + }
Likewise, we need to unconditionally restore these SPRs.
>
> mtspr(SPRN_SPRG3, local_paca->sprg_vdso);
>
> --
> 2.25.4
>
^ permalink raw reply
* Re: [PATCH 1/2] powerpc/powernv/idle: Exclude mfspr on HID1,4,5 on P9 and above
From: Gautham R Shenoy @ 2020-07-09 9:01 UTC (permalink / raw)
To: Pratik Rajesh Sampat
Cc: ego, pratik.r.sampat, linux-kernel, paulus, linuxppc-dev
In-Reply-To: <20200703124640.42820-1-psampat@linux.ibm.com>
On Fri, Jul 03, 2020 at 06:16:39PM +0530, Pratik Rajesh Sampat wrote:
> POWER9 onwards the support for the registers HID1, HID4, HID5 has been
> receded.
> Although mfspr on the above registers worked in Power9, In Power10
> simulator is unrecognized. Moving their assignment under the
> check for machines lower than Power9
>
> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
Nice catch.
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
> arch/powerpc/platforms/powernv/idle.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
> index 2dd467383a88..19d94d021357 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -73,9 +73,6 @@ static int pnv_save_sprs_for_deep_states(void)
> */
> uint64_t lpcr_val = mfspr(SPRN_LPCR);
> uint64_t hid0_val = mfspr(SPRN_HID0);
> - uint64_t hid1_val = mfspr(SPRN_HID1);
> - uint64_t hid4_val = mfspr(SPRN_HID4);
> - uint64_t hid5_val = mfspr(SPRN_HID5);
> uint64_t hmeer_val = mfspr(SPRN_HMEER);
> uint64_t msr_val = MSR_IDLE;
> uint64_t psscr_val = pnv_deepest_stop_psscr_val;
> @@ -117,6 +114,9 @@ static int pnv_save_sprs_for_deep_states(void)
>
> /* Only p8 needs to set extra HID regiters */
> if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
> + uint64_t hid1_val = mfspr(SPRN_HID1);
> + uint64_t hid4_val = mfspr(SPRN_HID4);
> + uint64_t hid5_val = mfspr(SPRN_HID5);
>
> rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
> if (rc != 0)
> --
> 2.25.4
>
--
Thanks and Regards
gautham.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox