LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v3] powerpc/pseries: Avoid using addr_to_pfn in realmode
From: Ganesh @ 2020-07-22 10:37 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev, mpe; +Cc: aneesh.kumar, mahesh
In-Reply-To: <1595325962.zikec6nkw7.astroid@bobo.none>

[-- Attachment #1: Type: text/plain, Size: 9104 bytes --]



On 7/21/20 3:38 PM, Nicholas Piggin wrote:
> Excerpts from Ganesh Goudar's message of July 20, 2020 6:03 pm:
>> When an UE or memory error exception is encountered the MCE handler
>> tries to find the pfn using addr_to_pfn() which takes effective
>> address as an argument, later pfn is used to poison the page where
>> memory error occurred, recent rework in this area made addr_to_pfn
>> to run in realmode, which can be fatal as it may try to access
>> memory outside RMO region.
>>
>> To fix this have separate functions for realmode and virtual mode
>> handling and let addr_to_pfn to run in virtual mode.
> You didn't really explain what you moved around. You added some
> helper functions, but what does it actually do differently now? Can you
> explain that in the changelog?

Sure, ill rephrase the changelog, here I have moved all that we can and we must
do in virtual mode to new helper function which runs in virtual mode, like filling
mce error info, using addr_to_pfn and calling save_mce_event().

>
> Thanks,
> Nick
>
>> Without this fix following kernel crash is seen on hitting UE.
>>
>> [  485.128036] Oops: Kernel access of bad area, sig: 11 [#1]
>> [  485.128040] LE SMP NR_CPUS=2048 NUMA pSeries
>> [  485.128047] Modules linked in:
>> [  485.128067] CPU: 15 PID: 6536 Comm: insmod Kdump: loaded Tainted: G OE 5.7.0 #22
>> [  485.128074] NIP:  c00000000009b24c LR: c0000000000398d8 CTR: c000000000cd57c0
>> [  485.128078] REGS: c000000003f1f970 TRAP: 0300   Tainted: G OE (5.7.0)
>> [  485.128082] MSR:  8000000000001003 <SF,ME,RI,LE>  CR: 28008284  XER: 00000001
>> [  485.128088] CFAR: c00000000009b190 DAR: c0000001fab00000 DSISR: 40000000 IRQMASK: 1
>> [  485.128088] GPR00: 0000000000000001 c000000003f1fbf0 c000000001634300 0000b0fa01000000
>> [  485.128088] GPR04: d000000002220000 0000000000000000 00000000fab00000 0000000000000022
>> [  485.128088] GPR08: c0000001fab00000 0000000000000000 c0000001fab00000 c000000003f1fc14
>> [  485.128088] GPR12: 0000000000000008 c000000003ff5880 d000000002100008 0000000000000000
>> [  485.128088] GPR16: 000000000000ff20 000000000000fff1 000000000000fff2 d0000000021a1100
>> [  485.128088] GPR20: d000000002200000 c00000015c893c50 c000000000d49b28 c00000015c893c50
>> [  485.128088] GPR24: d0000000021a0d08 c0000000014e5da8 d0000000021a0818 000000000000000a
>> [  485.128088] GPR28: 0000000000000008 000000000000000a c0000000017e2970 000000000000000a
>> [  485.128125] NIP [c00000000009b24c] __find_linux_pte+0x11c/0x310
>> [  485.128130] LR [c0000000000398d8] addr_to_pfn+0x138/0x170
>> [  485.128133] Call Trace:
>> [  485.128135] Instruction dump:
>> [  485.128138] 3929ffff 7d4a3378 7c883c36 7d2907b4 794a1564 7d294038 794af082 3900ffff
>> [  485.128144] 79291f24 790af00e 78e70020 7d095214 <7c69502a> 2fa30000 419e011c 70690040
>> [  485.128152] ---[ end trace d34b27e29ae0e340 ]---
>>
>> Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
>> ---
>> V2: Leave bare metal code and save_mce_event as is.
>>
>> V3: Have separate functions for realmode and virtual mode handling.
>> ---
>>   arch/powerpc/platforms/pseries/ras.c | 119 ++++++++++++++++-----------
>>   1 file changed, 70 insertions(+), 49 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
>> index f3736fcd98fc..32fe3fad86b8 100644
>> --- a/arch/powerpc/platforms/pseries/ras.c
>> +++ b/arch/powerpc/platforms/pseries/ras.c
>> @@ -522,18 +522,55 @@ int pSeries_system_reset_exception(struct pt_regs *regs)
>>   	return 0; /* need to perform reset */
>>   }
>>   
>> +static int mce_handle_err_realmode(int disposition, u8 error_type)
>> +{
>> +#ifdef CONFIG_PPC_BOOK3S_64
>> +	if (disposition == RTAS_DISP_NOT_RECOVERED) {
>> +		switch (error_type) {
>> +		case	MC_ERROR_TYPE_SLB:
>> +		case	MC_ERROR_TYPE_ERAT:
>> +			/*
>> +			 * Store the old slb content in paca before flushing.
>> +			 * Print this when we go to virtual mode.
>> +			 * There are chances that we may hit MCE again if there
>> +			 * is a parity error on the SLB entry we trying to read
>> +			 * for saving. Hence limit the slb saving to single
>> +			 * level of recursion.
>> +			 */
>> +			if (local_paca->in_mce == 1)
>> +				slb_save_contents(local_paca->mce_faulty_slbs);
>> +			flush_and_reload_slb();
>> +			disposition = RTAS_DISP_FULLY_RECOVERED;
>> +			break;
>> +		default:
>> +			break;
>> +		}
>> +	} else if (disposition == RTAS_DISP_LIMITED_RECOVERY) {
>> +		/* Platform corrected itself but could be degraded */
>> +		pr_err("MCE: limited recovery, system may be degraded\n");
>> +		disposition = RTAS_DISP_FULLY_RECOVERED;
>> +	}
>> +#endif
>> +	return disposition;
>> +}
>>   
>> -static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
>> +static int mce_handle_err_virtmode(struct pt_regs *regs,
>> +				   struct rtas_error_log *errp,
>> +				   struct pseries_mc_errorlog *mce_log,
>> +				   int disposition)
>>   {
>>   	struct mce_error_info mce_err = { 0 };
>> -	unsigned long eaddr = 0, paddr = 0;
>> -	struct pseries_errorlog *pseries_log;
>> -	struct pseries_mc_errorlog *mce_log;
>> -	int disposition = rtas_error_disposition(errp);
>>   	int initiator = rtas_error_initiator(errp);
>>   	int severity = rtas_error_severity(errp);
>> +	unsigned long eaddr = 0, paddr = 0;
>>   	u8 error_type, err_sub_type;
>>   
>> +	if (!mce_log)
>> +		goto out;
>> +
>> +	error_type = mce_log->error_type;
>> +	err_sub_type = rtas_mc_error_sub_type(mce_log);
>> +
>>   	if (initiator == RTAS_INITIATOR_UNKNOWN)
>>   		mce_err.initiator = MCE_INITIATOR_UNKNOWN;
>>   	else if (initiator == RTAS_INITIATOR_CPU)
>> @@ -572,18 +609,7 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
>>   	mce_err.error_type = MCE_ERROR_TYPE_UNKNOWN;
>>   	mce_err.error_class = MCE_ECLASS_UNKNOWN;
>>   
>> -	if (!rtas_error_extended(errp))
>> -		goto out;
>> -
>> -	pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
>> -	if (pseries_log == NULL)
>> -		goto out;
>> -
>> -	mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
>> -	error_type = mce_log->error_type;
>> -	err_sub_type = rtas_mc_error_sub_type(mce_log);
>> -
>> -	switch (mce_log->error_type) {
>> +	switch (error_type) {
>>   	case MC_ERROR_TYPE_UE:
>>   		mce_err.error_type = MCE_ERROR_TYPE_UE;
>>   		mce_common_process_ue(regs, &mce_err);
>> @@ -683,37 +709,32 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
>>   		mce_err.error_type = MCE_ERROR_TYPE_UNKNOWN;
>>   		break;
>>   	}
>> +out:
>> +	save_mce_event(regs, disposition == RTAS_DISP_FULLY_RECOVERED,
>> +		       &mce_err, regs->nip, eaddr, paddr);
>> +	return disposition;
>> +}
>>   
>> -#ifdef CONFIG_PPC_BOOK3S_64
>> -	if (disposition == RTAS_DISP_NOT_RECOVERED) {
>> -		switch (error_type) {
>> -		case	MC_ERROR_TYPE_SLB:
>> -		case	MC_ERROR_TYPE_ERAT:
>> -			/*
>> -			 * Store the old slb content in paca before flushing.
>> -			 * Print this when we go to virtual mode.
>> -			 * There are chances that we may hit MCE again if there
>> -			 * is a parity error on the SLB entry we trying to read
>> -			 * for saving. Hence limit the slb saving to single
>> -			 * level of recursion.
>> -			 */
>> -			if (local_paca->in_mce == 1)
>> -				slb_save_contents(local_paca->mce_faulty_slbs);
>> -			flush_and_reload_slb();
>> -			disposition = RTAS_DISP_FULLY_RECOVERED;
>> -			break;
>> -		default:
>> -			break;
>> -		}
>> -	} else if (disposition == RTAS_DISP_LIMITED_RECOVERY) {
>> -		/* Platform corrected itself but could be degraded */
>> -		printk(KERN_ERR "MCE: limited recovery, system may "
>> -		       "be degraded\n");
>> -		disposition = RTAS_DISP_FULLY_RECOVERED;
>> -	}
>> -#endif
>> +static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
>> +{
>> +	struct pseries_errorlog *pseries_log;
>> +	struct pseries_mc_errorlog *mce_log = NULL;
>> +	int disposition = rtas_error_disposition(errp);
>> +	u8 error_type, err_sub_type;
>> +
>> +	if (!rtas_error_extended(errp))
>> +		goto out;
>> +
>> +	pseries_log = get_pseries_errorlog(errp, PSERIES_ELOG_SECT_ID_MCE);
>> +	if (!pseries_log)
>> +		goto out;
>> +
>> +	mce_log = (struct pseries_mc_errorlog *)pseries_log->data;
>> +	error_type = mce_log->error_type;
>> +	err_sub_type = rtas_mc_error_sub_type(mce_log);
>> +
>> +	disposition = mce_handle_err_realmode(disposition, error_type);
>>   
>> -out:
>>   	/*
>>   	 * Enable translation as we will be accessing per-cpu variables
>>   	 * in save_mce_event() which may fall outside RMO region, also
>> @@ -724,10 +745,10 @@ static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
>>   	 * Note: All the realmode handling like flushing SLB entries for
>>   	 *       SLB multihit is done by now.
>>   	 */
>> +out:
>>   	mtmsr(mfmsr() | MSR_IR | MSR_DR);
>> -	save_mce_event(regs, disposition == RTAS_DISP_FULLY_RECOVERED,
>> -			&mce_err, regs->nip, eaddr, paddr);
>> -
>> +	disposition = mce_handle_err_virtmode(regs, errp, mce_log,
>> +					      disposition);
>>   	return disposition;
>>   }
>>   
>> -- 
>> 2.17.2
>>
>>


[-- Attachment #2: Type: text/html, Size: 9381 bytes --]

^ permalink raw reply

* Re: [v3 04/15] powerpc/perf: Add support for ISA3.1 PMU SPRs
From: Athira Rajeev @ 2020-07-22  8:07 UTC (permalink / raw)
  To: Jordan Niethe
  Cc: Gautham R Shenoy, Michael Neuling, maddy, kvm, kvm-ppc, svaidyan,
	acme, jolsa, linuxppc-dev
In-Reply-To: <CACzsE9r9fy22hScRm7yz5OeZH9jXA+97hEfAOo-Nk_EPwW-_Dw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 9478 bytes --]



> On 22-Jul-2020, at 9:48 AM, Jordan Niethe <jniethe5@gmail.com> wrote:
> 
> On Sat, Jul 18, 2020 at 1:02 AM Athira Rajeev
> <atrajeev@linux.vnet.ibm.com <mailto:atrajeev@linux.vnet.ibm.com>> wrote:
>> 
>> From: Madhavan Srinivasan <maddy@linux.ibm.com>
>> 
>> PowerISA v3.1 includes new performance monitoring unit(PMU)
>> special purpose registers (SPRs). They are
>> 
>> Monitor Mode Control Register 3 (MMCR3)
>> Sampled Instruction Event Register 2 (SIER2)
>> Sampled Instruction Event Register 3 (SIER3)
>> 
>> MMCR3 is added for further sampling related configuration
>> control. SIER2/SIER3 are added to provide additional
>> information about the sampled instruction.
>> 
>> Patch adds new PPMU flag called "PPMU_ARCH_310S" to support
>> handling of these new SPRs, updates the struct thread_struct
>> to include these new SPRs, include MMCR3 in struct mmcr_regs.
>> This is needed to support programming of MMCR3 SPR during
>> event_[enable/disable]. Patch also adds the sysfs support
>> for the MMCR3 SPR along with SPRN_ macros for these new pmu sprs.
>> 
>> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
>> ---
>> arch/powerpc/include/asm/perf_event_server.h |  2 ++
>> arch/powerpc/include/asm/processor.h         |  4 ++++
>> arch/powerpc/include/asm/reg.h               |  6 ++++++
>> arch/powerpc/kernel/sysfs.c                  |  8 ++++++++
>> arch/powerpc/perf/core-book3s.c              | 29 ++++++++++++++++++++++++++++
>> 5 files changed, 49 insertions(+)
>> 
>> diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
>> index 14b8dc1..832450a 100644
>> --- a/arch/powerpc/include/asm/perf_event_server.h
>> +++ b/arch/powerpc/include/asm/perf_event_server.h
>> @@ -22,6 +22,7 @@ struct mmcr_regs {
>>        unsigned long mmcr1;
>>        unsigned long mmcr2;
>>        unsigned long mmcra;
>> +       unsigned long mmcr3;
>> };
>> /*
>>  * This struct provides the constants and functions needed to
>> @@ -75,6 +76,7 @@ struct power_pmu {
>> #define PPMU_HAS_SIER          0x00000040 /* Has SIER */
>> #define PPMU_ARCH_207S         0x00000080 /* PMC is architecture v2.07S */
>> #define PPMU_NO_SIAR           0x00000100 /* Do not use SIAR */
>> +#define PPMU_ARCH_310S         0x00000200 /* Has MMCR3, SIER2 and SIER3 */
> We elsewhere have CPU_FTR_ARCH_31, so should this be PPMU_ARCH_31S to
> be consistent.
>> 

Ok,
This change will need to be done in all places which are currently using PPMU_ARCH_310S

>> /*
>>  * Values for flags to get_alternatives()
>> diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
>> index 52a6783..a466e94 100644
>> --- a/arch/powerpc/include/asm/processor.h
>> +++ b/arch/powerpc/include/asm/processor.h
>> @@ -272,6 +272,10 @@ struct thread_struct {
>>        unsigned        mmcr0;
>> 
>>        unsigned        used_ebb;
>> +       unsigned long   mmcr3;
>> +       unsigned long   sier2;
>> +       unsigned long   sier3;
>> +
>> #endif
>> };
>> 
>> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
>> index 88e6c78..21a1b2d 100644
>> --- a/arch/powerpc/include/asm/reg.h
>> +++ b/arch/powerpc/include/asm/reg.h
>> @@ -876,7 +876,9 @@
>> #define   MMCR0_FCHV   0x00000001UL /* freeze conditions in hypervisor mode */
>> #define SPRN_MMCR1     798
>> #define SPRN_MMCR2     785
>> +#define SPRN_MMCR3     754
>> #define SPRN_UMMCR2    769
>> +#define SPRN_UMMCR3    738
>> #define SPRN_MMCRA     0x312
>> #define   MMCRA_SDSYNC 0x80000000UL /* SDAR synced with SIAR */
>> #define   MMCRA_SDAR_DCACHE_MISS 0x40000000UL
>> @@ -918,6 +920,10 @@
>> #define   SIER_SIHV            0x1000000       /* Sampled MSR_HV */
>> #define   SIER_SIAR_VALID      0x0400000       /* SIAR contents valid */
>> #define   SIER_SDAR_VALID      0x0200000       /* SDAR contents valid */
>> +#define SPRN_SIER2     752
>> +#define SPRN_SIER3     753
>> +#define SPRN_USIER2    736
>> +#define SPRN_USIER3    737
>> #define SPRN_SIAR      796
>> #define SPRN_SDAR      797
>> #define SPRN_TACR      888
>> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
>> index 571b325..46b4ebc 100644
>> --- a/arch/powerpc/kernel/sysfs.c
>> +++ b/arch/powerpc/kernel/sysfs.c
>> @@ -622,8 +622,10 @@ void ppc_enable_pmcs(void)
>> SYSFS_PMCSETUP(pmc8, SPRN_PMC8);
>> 
>> SYSFS_PMCSETUP(mmcra, SPRN_MMCRA);
>> +SYSFS_PMCSETUP(mmcr3, SPRN_MMCR3);
>> 
>> static DEVICE_ATTR(mmcra, 0600, show_mmcra, store_mmcra);
>> +static DEVICE_ATTR(mmcr3, 0600, show_mmcr3, store_mmcr3);
>> #endif /* HAS_PPC_PMC56 */
>> 
>> 
>> @@ -886,6 +888,9 @@ static int register_cpu_online(unsigned int cpu)
>> #ifdef CONFIG_PMU_SYSFS
>>        if (cpu_has_feature(CPU_FTR_MMCRA))
>>                device_create_file(s, &dev_attr_mmcra);
>> +
>> +       if (cpu_has_feature(CPU_FTR_ARCH_31))
>> +               device_create_file(s, &dev_attr_mmcr3);
>> #endif /* CONFIG_PMU_SYSFS */
>> 
>>        if (cpu_has_feature(CPU_FTR_PURR)) {
>> @@ -980,6 +985,9 @@ static int unregister_cpu_online(unsigned int cpu)
>> #ifdef CONFIG_PMU_SYSFS
>>        if (cpu_has_feature(CPU_FTR_MMCRA))
>>                device_remove_file(s, &dev_attr_mmcra);
>> +
>> +       if (cpu_has_feature(CPU_FTR_ARCH_31))
>> +               device_remove_file(s, &dev_attr_mmcr3);
>> #endif /* CONFIG_PMU_SYSFS */
>> 
>>        if (cpu_has_feature(CPU_FTR_PURR)) {
>> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
>> index f4d07b5..ca32fc0 100644
>> --- a/arch/powerpc/perf/core-book3s.c
>> +++ b/arch/powerpc/perf/core-book3s.c
>> @@ -72,6 +72,11 @@ struct cpu_hw_events {
>> /*
>>  * 32-bit doesn't have MMCRA but does have an MMCR2,
>>  * and a few other names are different.
>> + * Also 32-bit doesn't have MMCR3, SIER2 and SIER3.
>> + * Define them as zero knowing that any code path accessing
>> + * these registers (via mtspr/mfspr) are done under ppmu flag
>> + * check for PPMU_ARCH_310S and we will not enter that code path
>> + * for 32-bit.
>>  */
>> #ifdef CONFIG_PPC32
>> 
>> @@ -85,6 +90,9 @@ struct cpu_hw_events {
>> #define MMCR0_PMCC_U6          0
>> 
>> #define SPRN_MMCRA             SPRN_MMCR2
>> +#define SPRN_MMCR3             0
>> +#define SPRN_SIER2             0
>> +#define SPRN_SIER3             0
>> #define MMCRA_SAMPLE_ENABLE    0
>> 
>> static inline unsigned long perf_ip_adjust(struct pt_regs *regs)
>> @@ -581,6 +589,11 @@ static void ebb_switch_out(unsigned long mmcr0)
>>        current->thread.sdar  = mfspr(SPRN_SDAR);
>>        current->thread.mmcr0 = mmcr0 & MMCR0_USER_MASK;
>>        current->thread.mmcr2 = mfspr(SPRN_MMCR2) & MMCR2_USER_MASK;
>> +       if (ppmu->flags & PPMU_ARCH_310S) {
>> +               current->thread.mmcr3 = mfspr(SPRN_MMCR3);
> Like MMCR0_USER_MASK and MMCR2_USER_MASK do we need a MMCR3_USER_MASK
> here, or is there no need?

Jordan

We don’t need user mask for MMCR3 and other new SPRs ( SIER2/3) . Incase of MMCR3, we dont have any Freeze control bits and incase of SIER2/3, it is similar to SIER (where HW handles the masking of the bits), hence we didn't add any user_mask for these SPRs

Thanks
Athira

>> +               current->thread.sier2 = mfspr(SPRN_SIER2);
>> +               current->thread.sier3 = mfspr(SPRN_SIER3);
>> +       }
>> }
>> 
>> static unsigned long ebb_switch_in(bool ebb, struct cpu_hw_events *cpuhw)
>> @@ -620,6 +633,12 @@ static unsigned long ebb_switch_in(bool ebb, struct cpu_hw_events *cpuhw)
>>         * instead manage the MMCR2 entirely by itself.
>>         */
>>        mtspr(SPRN_MMCR2, cpuhw->mmcr.mmcr2 | current->thread.mmcr2);
>> +
>> +       if (ppmu->flags & PPMU_ARCH_310S) {
>> +               mtspr(SPRN_MMCR3, current->thread.mmcr3);
>> +               mtspr(SPRN_SIER2, current->thread.sier2);
>> +               mtspr(SPRN_SIER3, current->thread.sier3);
>> +       }
>> out:
>>        return mmcr0;
>> }
>> @@ -840,6 +859,11 @@ void perf_event_print_debug(void)
>>                pr_info("EBBRR: %016lx BESCR: %016lx\n",
>>                        mfspr(SPRN_EBBRR), mfspr(SPRN_BESCR));
>>        }
>> +
>> +       if (ppmu->flags & PPMU_ARCH_310S) {
>> +               pr_info("MMCR3: %016lx SIER2: %016lx SIER3: %016lx\n",
>> +                       mfspr(SPRN_MMCR3), mfspr(SPRN_SIER2), mfspr(SPRN_SIER3));
>> +       }
>> #endif
>>        pr_info("SIAR:  %016lx SDAR:  %016lx SIER:  %016lx\n",
>>                mfspr(SPRN_SIAR), sdar, sier);
>> @@ -1305,6 +1329,8 @@ static void power_pmu_enable(struct pmu *pmu)
>>        if (!cpuhw->n_added) {
>>                mtspr(SPRN_MMCRA, cpuhw->mmcr.mmcra & ~MMCRA_SAMPLE_ENABLE);
>>                mtspr(SPRN_MMCR1, cpuhw->mmcr.mmcr1);
>> +               if (ppmu->flags & PPMU_ARCH_310S)
>> +                       mtspr(SPRN_MMCR3, cpuhw->mmcr.mmcr3);
>>                goto out_enable;
>>        }
>> 
>> @@ -1348,6 +1374,9 @@ static void power_pmu_enable(struct pmu *pmu)
>>        if (ppmu->flags & PPMU_ARCH_207S)
>>                mtspr(SPRN_MMCR2, cpuhw->mmcr.mmcr2);
>> 
>> +       if (ppmu->flags & PPMU_ARCH_310S)
>> +               mtspr(SPRN_MMCR3, cpuhw->mmcr.mmcr3);
>> +
>>        /*
>>         * Read off any pre-existing events that need to move
>>         * to another PMC.
>> --
>> 1.8.3.1


[-- Attachment #2: Type: text/html, Size: 23566 bytes --]

^ permalink raw reply

* Re: [v3 07/15] powerpc/perf: Add power10_feat to dt_cpu_ftrs
From: Athira Rajeev @ 2020-07-22  7:55 UTC (permalink / raw)
  To: Jordan Niethe
  Cc: Gautham R Shenoy, Michael Neuling, maddy, kvm, kvm-ppc, svaidyan,
	acme, jolsa, linuxppc-dev
In-Reply-To: <CACzsE9oBw1ZrJLqOAg1QqPrQgSoVbEdPh_ax7mU_kcWNyfyAcg@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 7014 bytes --]



> On 22-Jul-2020, at 10:11 AM, Jordan Niethe <jniethe5@gmail.com> wrote:
> 
> On Sat, Jul 18, 2020 at 1:13 AM Athira Rajeev
> <atrajeev@linux.vnet.ibm.com <mailto:atrajeev@linux.vnet.ibm.com>> wrote:
>> 
>> From: Madhavan Srinivasan <maddy@linux.ibm.com>
>> 
>> Add power10 feature function to dt_cpu_ftrs.c along
>> with a power10 specific init() to initialize pmu sprs,
>> sets the oprofile_cpu_type and cpu_features. This will
>> enable performance monitoring unit(PMU) for Power10
>> in CPU features with "performance-monitor-power10".
>> 
>> For PowerISA v3.1, BHRB disable is controlled via Monitor Mode
>> Control Register A (MMCRA) bit, namely "BHRB Recording Disable
>> (BHRBRD)". This patch initializes MMCRA BHRBRD to disable BHRB
>> feature at boot for power10.
>> 
>> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
>> ---
>> arch/powerpc/include/asm/reg.h        |  3 +++
>> arch/powerpc/kernel/cpu_setup_power.S |  8 ++++++++
>> arch/powerpc/kernel/dt_cpu_ftrs.c     | 26 ++++++++++++++++++++++++++
>> 3 files changed, 37 insertions(+)
>> 
>> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
>> index 21a1b2d..900ada1 100644
>> --- a/arch/powerpc/include/asm/reg.h
>> +++ b/arch/powerpc/include/asm/reg.h
>> @@ -1068,6 +1068,9 @@
>> #define MMCR0_PMC2_LOADMISSTIME        0x5
>> #endif
>> 
>> +/* BHRB disable bit for PowerISA v3.10 */
>> +#define MMCRA_BHRB_DISABLE     0x0000002000000000
> Shouldn't this go under SPRN_MMCRA with the other MMCRA_*.


Hi Jordan

Ok, the definition of MMCRA is under #ifdef for 64 bit .  if I move definition of MMCRA_BHRB_DISABLE along with other SPR's, I also
need to define this for 32-bit to satisfy core-book3s to compile as below:

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 900ada10762c..7e271657b412 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -888,6 +888,8 @@
 #define   MMCRA_SLOT   0x07000000UL /* SLOT bits (37-39) */
 #define   MMCRA_SLOT_SHIFT     24
 #define   MMCRA_SAMPLE_ENABLE 0x00000001UL /* enable sampling */
+/* BHRB disable bit for PowerISA v3.10 */
+#define   MMCRA_BHRB_DISABLE  0x0000002000000000
 #define   POWER6_MMCRA_SDSYNC 0x0000080000000000ULL    /* SDAR/SIAR synced */
 #define   POWER6_MMCRA_SIHV   0x0000040000000000ULL
 #define   POWER6_MMCRA_SIPR   0x0000020000000000ULL
@@ -1068,9 +1070,6 @@
 #define MMCR0_PMC2_LOADMISSTIME        0x5
 #endif
 
-/* BHRB disable bit for PowerISA v3.10 */
-#define MMCRA_BHRB_DISABLE     0x0000002000000000
-
 /*
  * SPRG usage:
  *
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 36baae666387..88068f20827c 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -94,6 +94,7 @@ static unsigned int freeze_events_kernel = MMCR0_FCS;
 #define SPRN_SIER2             0
 #define SPRN_SIER3             0
 #define MMCRA_SAMPLE_ENABLE    0
+#define MMCRA_BHRB_DISABLE     0
 
 static inline unsigned long perf_ip_adjust(struct pt_regs *regs)
 {



>> +
>> /*
>>  * SPRG usage:
>>  *
>> diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
>> index efdcfa7..b8e0d1e 100644
>> --- a/arch/powerpc/kernel/cpu_setup_power.S
>> +++ b/arch/powerpc/kernel/cpu_setup_power.S
>> @@ -94,6 +94,7 @@ _GLOBAL(__restore_cpu_power8)
>> _GLOBAL(__setup_cpu_power10)
>>        mflr    r11
>>        bl      __init_FSCR_power10
>> +       bl      __init_PMU_ISA31
> So we set MMCRA here but then aren't we still going to call __init_PMU
> which will overwrite that?
> Would this setting MMCRA also need to be handled in __restore_cpu_power10?

Thanks for this nice catch !  When I rebased code initial phase, we didn’t had power10 part filled in.
It was a miss from my side in adding PMu init functions and thanks for pointing this out. 
Below patch will call __init_PMU functions in setup and restore. Please check if this looks good

--
diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
index efdcfa714106..e672a6c5fd7c 100644
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -94,6 +94,9 @@ _GLOBAL(__restore_cpu_power8)
 _GLOBAL(__setup_cpu_power10)
 	mflr	r11
 	bl	__init_FSCR_power10
+	bl	__init_PMU
+	bl	__init_PMU_ISA31
+	bl	__init_PMU_HV
 	b	1f
 
 _GLOBAL(__setup_cpu_power9)
@@ -124,6 +127,9 @@ _GLOBAL(__setup_cpu_power9)
 _GLOBAL(__restore_cpu_power10)
 	mflr	r11
 	bl	__init_FSCR_power10
+	bl	__init_PMU
+	bl	__init_PMU_ISA31
+	bl	__init_PMU_HV
 	b	1f
 
 _GLOBAL(__restore_cpu_power9)
@@ -233,3 +239,10 @@ __init_PMU_ISA207:
 	li	r5,0
 	mtspr	SPRN_MMCRS,r5
 	blr
+
+__init_PMU_ISA31:
+	li	r5,0
+	mtspr	SPRN_MMCR3,r5
+	LOAD_REG_IMMEDIATE(r5, MMCRA_BHRB_DISABLE)
+	mtspr	SPRN_MMCRA,r5
+	blr
>>        b       1f
>> 
>> _GLOBAL(__setup_cpu_power9)
>> @@ -233,3 +234,10 @@ __init_PMU_ISA207:
>>        li      r5,0
>>        mtspr   SPRN_MMCRS,r5
>>        blr
>> +
>> +__init_PMU_ISA31:
>> +       li      r5,0
>> +       mtspr   SPRN_MMCR3,r5
>> +       LOAD_REG_IMMEDIATE(r5, MMCRA_BHRB_DISABLE)
>> +       mtspr   SPRN_MMCRA,r5
>> +       blr
>> diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
>> index 3a40951..f482286 100644
>> --- a/arch/powerpc/kernel/dt_cpu_ftrs.c
>> +++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
>> @@ -450,6 +450,31 @@ static int __init feat_enable_pmu_power9(struct dt_cpu_feature *f)
>>        return 1;
>> }
>> 
>> +static void init_pmu_power10(void)
>> +{
>> +       init_pmu_power9();
>> +
>> +       mtspr(SPRN_MMCR3, 0);
>> +       mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
>> +}
>> +
>> +static int __init feat_enable_pmu_power10(struct dt_cpu_feature *f)
>> +{
>> +       hfscr_pmu_enable();
>> +
>> +       init_pmu_power10();
>> +       init_pmu_registers = init_pmu_power10;
>> +
>> +       cur_cpu_spec->cpu_features |= CPU_FTR_MMCRA;
>> +       cur_cpu_spec->cpu_user_features |= PPC_FEATURE_PSERIES_PERFMON_COMPAT;
>> +
>> +       cur_cpu_spec->num_pmcs          = 6;
>> +       cur_cpu_spec->pmc_type          = PPC_PMC_IBM;
>> +       cur_cpu_spec->oprofile_cpu_type = "ppc64/power10";
>> +
>> +       return 1;
>> +}
>> +
>> static int __init feat_enable_tm(struct dt_cpu_feature *f)
>> {
>> #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
>> @@ -639,6 +664,7 @@ struct dt_cpu_feature_match {
>>        {"pc-relative-addressing", feat_enable, 0},
>>        {"machine-check-power9", feat_enable_mce_power9, 0},
>>        {"performance-monitor-power9", feat_enable_pmu_power9, 0},
>> +       {"performance-monitor-power10", feat_enable_pmu_power10, 0},
>>        {"event-based-branch-v3", feat_enable, 0},
>>        {"random-number-generator", feat_enable, 0},
>>        {"system-call-vectored", feat_disable, 0},
>> --
>> 1.8.3.1


[-- Attachment #2: Type: text/html, Size: 23836 bytes --]

^ permalink raw reply related

* Re: [v3 04/15] powerpc/perf: Add support for ISA3.1 PMU SPRs
From: Michael Ellerman @ 2020-07-22 12:03 UTC (permalink / raw)
  To: Jordan Niethe, Athira Rajeev
  Cc: Gautham R Shenoy, mikey, maddy, kvm, kvm-ppc, svaidyan, acme,
	jolsa, linuxppc-dev
In-Reply-To: <CACzsE9r9fy22hScRm7yz5OeZH9jXA+97hEfAOo-Nk_EPwW-_Dw@mail.gmail.com>

Jordan Niethe <jniethe5@gmail.com> writes:
> On Sat, Jul 18, 2020 at 1:02 AM Athira Rajeev <atrajeev@linux.vnet.ibm.com> wrote:
>> From: Madhavan Srinivasan <maddy@linux.ibm.com>
>>
>> PowerISA v3.1 includes new performance monitoring unit(PMU)
>> special purpose registers (SPRs). They are
...
>>
>> diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
>> index 14b8dc1..832450a 100644
>> --- a/arch/powerpc/include/asm/perf_event_server.h
>> +++ b/arch/powerpc/include/asm/perf_event_server.h
>> @@ -75,6 +76,7 @@ struct power_pmu {
>>  #define PPMU_HAS_SIER          0x00000040 /* Has SIER */
>>  #define PPMU_ARCH_207S         0x00000080 /* PMC is architecture v2.07S */
>>  #define PPMU_NO_SIAR           0x00000100 /* Do not use SIAR */
>> +#define PPMU_ARCH_310S         0x00000200 /* Has MMCR3, SIER2 and SIER3 */

> We elsewhere have CPU_FTR_ARCH_31, so should this be PPMU_ARCH_31S to
> be consistent.

The "S" is no longer needed as there's no Book S vs Book E distinction
in ISA v3.1.

So I changed it to PPMU_ARCH_31.

>> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
>> index f4d07b5..ca32fc0 100644
>> --- a/arch/powerpc/perf/core-book3s.c
>> +++ b/arch/powerpc/perf/core-book3s.c
>> @@ -581,6 +589,11 @@ static void ebb_switch_out(unsigned long mmcr0)
>>         current->thread.sdar  = mfspr(SPRN_SDAR);
>>         current->thread.mmcr0 = mmcr0 & MMCR0_USER_MASK;
>>         current->thread.mmcr2 = mfspr(SPRN_MMCR2) & MMCR2_USER_MASK;
>> +       if (ppmu->flags & PPMU_ARCH_310S) {
>> +               current->thread.mmcr3 = mfspr(SPRN_MMCR3);

> Like MMCR0_USER_MASK and MMCR2_USER_MASK do we need a MMCR3_USER_MASK
> here, or is there no need?

mmcr0 and mmcr2 are visible via ptrace, so masking them here means we
don't expose any bits to userspace via ptrace that aren't also visible
by reading the register.

So at least while mmcr3 is not exposed via ptrace it's safe to not mask
it, if there are even any sensitive bits in it.

cheers

^ permalink raw reply

* Re: [v3 04/15] powerpc/perf: Add support for ISA3.1 PMU SPRs
From: Jordan Niethe @ 2020-07-22 10:52 UTC (permalink / raw)
  To: Athira Rajeev
  Cc: Gautham R Shenoy, Michael Neuling, maddy, kvm, kvm-ppc, svaidyan,
	acme, jolsa, linuxppc-dev
In-Reply-To: <F54F7D50-8255-428A-B79E-EE1E2D70074B@linux.vnet.ibm.com>

On Wed, Jul 22, 2020 at 6:07 PM Athira Rajeev
<atrajeev@linux.vnet.ibm.com> wrote:
>
>
>
> On 22-Jul-2020, at 9:48 AM, Jordan Niethe <jniethe5@gmail.com> wrote:
>
> On Sat, Jul 18, 2020 at 1:02 AM Athira Rajeev
> <atrajeev@linux.vnet.ibm.com> wrote:
>
>
> From: Madhavan Srinivasan <maddy@linux.ibm.com>
>
> PowerISA v3.1 includes new performance monitoring unit(PMU)
> special purpose registers (SPRs). They are
>
> Monitor Mode Control Register 3 (MMCR3)
> Sampled Instruction Event Register 2 (SIER2)
> Sampled Instruction Event Register 3 (SIER3)
>
> MMCR3 is added for further sampling related configuration
> control. SIER2/SIER3 are added to provide additional
> information about the sampled instruction.
>
> Patch adds new PPMU flag called "PPMU_ARCH_310S" to support
> handling of these new SPRs, updates the struct thread_struct
> to include these new SPRs, include MMCR3 in struct mmcr_regs.
> This is needed to support programming of MMCR3 SPR during
> event_[enable/disable]. Patch also adds the sysfs support
> for the MMCR3 SPR along with SPRN_ macros for these new pmu sprs.
>
> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
> ---
> arch/powerpc/include/asm/perf_event_server.h |  2 ++
> arch/powerpc/include/asm/processor.h         |  4 ++++
> arch/powerpc/include/asm/reg.h               |  6 ++++++
> arch/powerpc/kernel/sysfs.c                  |  8 ++++++++
> arch/powerpc/perf/core-book3s.c              | 29 ++++++++++++++++++++++++++++
> 5 files changed, 49 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
> index 14b8dc1..832450a 100644
> --- a/arch/powerpc/include/asm/perf_event_server.h
> +++ b/arch/powerpc/include/asm/perf_event_server.h
> @@ -22,6 +22,7 @@ struct mmcr_regs {
>        unsigned long mmcr1;
>        unsigned long mmcr2;
>        unsigned long mmcra;
> +       unsigned long mmcr3;
> };
> /*
>  * This struct provides the constants and functions needed to
> @@ -75,6 +76,7 @@ struct power_pmu {
> #define PPMU_HAS_SIER          0x00000040 /* Has SIER */
> #define PPMU_ARCH_207S         0x00000080 /* PMC is architecture v2.07S */
> #define PPMU_NO_SIAR           0x00000100 /* Do not use SIAR */
> +#define PPMU_ARCH_310S         0x00000200 /* Has MMCR3, SIER2 and SIER3 */
>
> We elsewhere have CPU_FTR_ARCH_31, so should this be PPMU_ARCH_31S to
> be consistent.
>
>
>
> Ok,
> This change will need to be done in all places which are currently using PPMU_ARCH_310S
>
> /*
>  * Values for flags to get_alternatives()
> diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
> index 52a6783..a466e94 100644
> --- a/arch/powerpc/include/asm/processor.h
> +++ b/arch/powerpc/include/asm/processor.h
> @@ -272,6 +272,10 @@ struct thread_struct {
>        unsigned        mmcr0;
>
>        unsigned        used_ebb;
> +       unsigned long   mmcr3;
> +       unsigned long   sier2;
> +       unsigned long   sier3;
> +
> #endif
> };
>
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 88e6c78..21a1b2d 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -876,7 +876,9 @@
> #define   MMCR0_FCHV   0x00000001UL /* freeze conditions in hypervisor mode */
> #define SPRN_MMCR1     798
> #define SPRN_MMCR2     785
> +#define SPRN_MMCR3     754
> #define SPRN_UMMCR2    769
> +#define SPRN_UMMCR3    738
> #define SPRN_MMCRA     0x312
> #define   MMCRA_SDSYNC 0x80000000UL /* SDAR synced with SIAR */
> #define   MMCRA_SDAR_DCACHE_MISS 0x40000000UL
> @@ -918,6 +920,10 @@
> #define   SIER_SIHV            0x1000000       /* Sampled MSR_HV */
> #define   SIER_SIAR_VALID      0x0400000       /* SIAR contents valid */
> #define   SIER_SDAR_VALID      0x0200000       /* SDAR contents valid */
> +#define SPRN_SIER2     752
> +#define SPRN_SIER3     753
> +#define SPRN_USIER2    736
> +#define SPRN_USIER3    737
> #define SPRN_SIAR      796
> #define SPRN_SDAR      797
> #define SPRN_TACR      888
> diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
> index 571b325..46b4ebc 100644
> --- a/arch/powerpc/kernel/sysfs.c
> +++ b/arch/powerpc/kernel/sysfs.c
> @@ -622,8 +622,10 @@ void ppc_enable_pmcs(void)
> SYSFS_PMCSETUP(pmc8, SPRN_PMC8);
>
> SYSFS_PMCSETUP(mmcra, SPRN_MMCRA);
> +SYSFS_PMCSETUP(mmcr3, SPRN_MMCR3);
>
> static DEVICE_ATTR(mmcra, 0600, show_mmcra, store_mmcra);
> +static DEVICE_ATTR(mmcr3, 0600, show_mmcr3, store_mmcr3);
> #endif /* HAS_PPC_PMC56 */
>
>
> @@ -886,6 +888,9 @@ static int register_cpu_online(unsigned int cpu)
> #ifdef CONFIG_PMU_SYSFS
>        if (cpu_has_feature(CPU_FTR_MMCRA))
>                device_create_file(s, &dev_attr_mmcra);
> +
> +       if (cpu_has_feature(CPU_FTR_ARCH_31))
> +               device_create_file(s, &dev_attr_mmcr3);
> #endif /* CONFIG_PMU_SYSFS */
>
>        if (cpu_has_feature(CPU_FTR_PURR)) {
> @@ -980,6 +985,9 @@ static int unregister_cpu_online(unsigned int cpu)
> #ifdef CONFIG_PMU_SYSFS
>        if (cpu_has_feature(CPU_FTR_MMCRA))
>                device_remove_file(s, &dev_attr_mmcra);
> +
> +       if (cpu_has_feature(CPU_FTR_ARCH_31))
> +               device_remove_file(s, &dev_attr_mmcr3);
> #endif /* CONFIG_PMU_SYSFS */
>
>        if (cpu_has_feature(CPU_FTR_PURR)) {
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index f4d07b5..ca32fc0 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -72,6 +72,11 @@ struct cpu_hw_events {
> /*
>  * 32-bit doesn't have MMCRA but does have an MMCR2,
>  * and a few other names are different.
> + * Also 32-bit doesn't have MMCR3, SIER2 and SIER3.
> + * Define them as zero knowing that any code path accessing
> + * these registers (via mtspr/mfspr) are done under ppmu flag
> + * check for PPMU_ARCH_310S and we will not enter that code path
> + * for 32-bit.
>  */
> #ifdef CONFIG_PPC32
>
> @@ -85,6 +90,9 @@ struct cpu_hw_events {
> #define MMCR0_PMCC_U6          0
>
> #define SPRN_MMCRA             SPRN_MMCR2
> +#define SPRN_MMCR3             0
> +#define SPRN_SIER2             0
> +#define SPRN_SIER3             0
> #define MMCRA_SAMPLE_ENABLE    0
>
> static inline unsigned long perf_ip_adjust(struct pt_regs *regs)
> @@ -581,6 +589,11 @@ static void ebb_switch_out(unsigned long mmcr0)
>        current->thread.sdar  = mfspr(SPRN_SDAR);
>        current->thread.mmcr0 = mmcr0 & MMCR0_USER_MASK;
>        current->thread.mmcr2 = mfspr(SPRN_MMCR2) & MMCR2_USER_MASK;
> +       if (ppmu->flags & PPMU_ARCH_310S) {
> +               current->thread.mmcr3 = mfspr(SPRN_MMCR3);
>
> Like MMCR0_USER_MASK and MMCR2_USER_MASK do we need a MMCR3_USER_MASK
> here, or is there no need?
>
>
> Jordan
>
> We don’t need user mask for MMCR3 and other new SPRs ( SIER2/3) . Incase of MMCR3, we dont have any Freeze control bits and incase of SIER2/3, it is similar to SIER (where HW handles the masking of the bits), hence we didn't add any user_mask for these SPRs
Okay thank you for explaining that.
>
> Thanks
> Athira
>
> +               current->thread.sier2 = mfspr(SPRN_SIER2);
> +               current->thread.sier3 = mfspr(SPRN_SIER3);
> +       }
> }
>
> static unsigned long ebb_switch_in(bool ebb, struct cpu_hw_events *cpuhw)
> @@ -620,6 +633,12 @@ static unsigned long ebb_switch_in(bool ebb, struct cpu_hw_events *cpuhw)
>         * instead manage the MMCR2 entirely by itself.
>         */
>        mtspr(SPRN_MMCR2, cpuhw->mmcr.mmcr2 | current->thread.mmcr2);
> +
> +       if (ppmu->flags & PPMU_ARCH_310S) {
> +               mtspr(SPRN_MMCR3, current->thread.mmcr3);
> +               mtspr(SPRN_SIER2, current->thread.sier2);
> +               mtspr(SPRN_SIER3, current->thread.sier3);
> +       }
> out:
>        return mmcr0;
> }
> @@ -840,6 +859,11 @@ void perf_event_print_debug(void)
>                pr_info("EBBRR: %016lx BESCR: %016lx\n",
>                        mfspr(SPRN_EBBRR), mfspr(SPRN_BESCR));
>        }
> +
> +       if (ppmu->flags & PPMU_ARCH_310S) {
> +               pr_info("MMCR3: %016lx SIER2: %016lx SIER3: %016lx\n",
> +                       mfspr(SPRN_MMCR3), mfspr(SPRN_SIER2), mfspr(SPRN_SIER3));
> +       }
> #endif
>        pr_info("SIAR:  %016lx SDAR:  %016lx SIER:  %016lx\n",
>                mfspr(SPRN_SIAR), sdar, sier);
> @@ -1305,6 +1329,8 @@ static void power_pmu_enable(struct pmu *pmu)
>        if (!cpuhw->n_added) {
>                mtspr(SPRN_MMCRA, cpuhw->mmcr.mmcra & ~MMCRA_SAMPLE_ENABLE);
>                mtspr(SPRN_MMCR1, cpuhw->mmcr.mmcr1);
> +               if (ppmu->flags & PPMU_ARCH_310S)
> +                       mtspr(SPRN_MMCR3, cpuhw->mmcr.mmcr3);
>                goto out_enable;
>        }
>
> @@ -1348,6 +1374,9 @@ static void power_pmu_enable(struct pmu *pmu)
>        if (ppmu->flags & PPMU_ARCH_207S)
>                mtspr(SPRN_MMCR2, cpuhw->mmcr.mmcr2);
>
> +       if (ppmu->flags & PPMU_ARCH_310S)
> +               mtspr(SPRN_MMCR3, cpuhw->mmcr.mmcr3);
> +
>        /*
>         * Read off any pre-existing events that need to move
>         * to another PMC.
> --
> 1.8.3.1
>
>

^ permalink raw reply

* Re: [v3 07/15] powerpc/perf: Add power10_feat to dt_cpu_ftrs
From: Jordan Niethe @ 2020-07-22 10:49 UTC (permalink / raw)
  To: Athira Rajeev
  Cc: Gautham R Shenoy, Michael Neuling, maddy, kvm, kvm-ppc, svaidyan,
	acme, jolsa, linuxppc-dev
In-Reply-To: <9A4E06A2-5686-4C85-B2F7-0904F195B58A@linux.vnet.ibm.com>

On Wed, Jul 22, 2020 at 5:55 PM Athira Rajeev
<atrajeev@linux.vnet.ibm.com> wrote:
>
>
>
> On 22-Jul-2020, at 10:11 AM, Jordan Niethe <jniethe5@gmail.com> wrote:
>
> On Sat, Jul 18, 2020 at 1:13 AM Athira Rajeev
> <atrajeev@linux.vnet.ibm.com> wrote:
>
>
> From: Madhavan Srinivasan <maddy@linux.ibm.com>
>
> Add power10 feature function to dt_cpu_ftrs.c along
> with a power10 specific init() to initialize pmu sprs,
> sets the oprofile_cpu_type and cpu_features. This will
> enable performance monitoring unit(PMU) for Power10
> in CPU features with "performance-monitor-power10".
>
> For PowerISA v3.1, BHRB disable is controlled via Monitor Mode
> Control Register A (MMCRA) bit, namely "BHRB Recording Disable
> (BHRBRD)". This patch initializes MMCRA BHRBRD to disable BHRB
> feature at boot for power10.
>
> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
> ---
> arch/powerpc/include/asm/reg.h        |  3 +++
> arch/powerpc/kernel/cpu_setup_power.S |  8 ++++++++
> arch/powerpc/kernel/dt_cpu_ftrs.c     | 26 ++++++++++++++++++++++++++
> 3 files changed, 37 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 21a1b2d..900ada1 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -1068,6 +1068,9 @@
> #define MMCR0_PMC2_LOADMISSTIME        0x5
> #endif
>
> +/* BHRB disable bit for PowerISA v3.10 */
> +#define MMCRA_BHRB_DISABLE     0x0000002000000000
>
> Shouldn't this go under SPRN_MMCRA with the other MMCRA_*.
>
>
>
> Hi Jordan
>
> Ok, the definition of MMCRA is under #ifdef for 64 bit .  if I move definition of MMCRA_BHRB_DISABLE along with other SPR's, I also
> need to define this for 32-bit to satisfy core-book3s to compile as below:
>
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 900ada10762c..7e271657b412 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -888,6 +888,8 @@
>  #define   MMCRA_SLOT   0x07000000UL /* SLOT bits (37-39) */
>  #define   MMCRA_SLOT_SHIFT     24
>  #define   MMCRA_SAMPLE_ENABLE 0x00000001UL /* enable sampling */
> +/* BHRB disable bit for PowerISA v3.10 */
> +#define   MMCRA_BHRB_DISABLE  0x0000002000000000
>  #define   POWER6_MMCRA_SDSYNC 0x0000080000000000ULL    /* SDAR/SIAR synced */
>  #define   POWER6_MMCRA_SIHV   0x0000040000000000ULL
>  #define   POWER6_MMCRA_SIPR   0x0000020000000000ULL
> @@ -1068,9 +1070,6 @@
>  #define MMCR0_PMC2_LOADMISSTIME        0x5
>  #endif
>
>
>
> -/* BHRB disable bit for PowerISA v3.10 */
> -#define MMCRA_BHRB_DISABLE     0x0000002000000000
> -
>  /*
>   * SPRG usage:
>   *
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 36baae666387..88068f20827c 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -94,6 +94,7 @@ static unsigned int freeze_events_kernel = MMCR0_FCS;
>  #define SPRN_SIER2             0
>  #define SPRN_SIER3             0
>  #define MMCRA_SAMPLE_ENABLE    0
> +#define MMCRA_BHRB_DISABLE     0
>
>
>
>  static inline unsigned long perf_ip_adjust(struct pt_regs *regs)
>  {
>
>
>
> +
> /*
>  * SPRG usage:
>  *
> diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
> index efdcfa7..b8e0d1e 100644
> --- a/arch/powerpc/kernel/cpu_setup_power.S
> +++ b/arch/powerpc/kernel/cpu_setup_power.S
> @@ -94,6 +94,7 @@ _GLOBAL(__restore_cpu_power8)
> _GLOBAL(__setup_cpu_power10)
>        mflr    r11
>        bl      __init_FSCR_power10
> +       bl      __init_PMU_ISA31
>
> So we set MMCRA here but then aren't we still going to call __init_PMU
> which will overwrite that?
> Would this setting MMCRA also need to be handled in __restore_cpu_power10?
>
>
> Thanks for this nice catch !  When I rebased code initial phase, we didn’t had power10 part filled in.
> It was a miss from my side in adding PMu init functions and thanks for pointing this out.
> Below patch will call __init_PMU functions in setup and restore. Please check if this looks good
>
> --
> diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
> index efdcfa714106..e672a6c5fd7c 100644
> --- a/arch/powerpc/kernel/cpu_setup_power.S
> +++ b/arch/powerpc/kernel/cpu_setup_power.S
> @@ -94,6 +94,9 @@ _GLOBAL(__restore_cpu_power8)
>  _GLOBAL(__setup_cpu_power10)
>   mflr r11
>   bl __init_FSCR_power10
> + bl __init_PMU
> + bl __init_PMU_ISA31
> + bl __init_PMU_HV
>   b 1f
>
>  _GLOBAL(__setup_cpu_power9)

Won't you also need to change where the label 1 is:
--- a/arch/powerpc/kernel/cpu_setup_power.S
+++ b/arch/powerpc/kernel/cpu_setup_power.S
@@ -100,8 +100,8 @@ _GLOBAL(__setup_cpu_power10)
 _GLOBAL(__setup_cpu_power9)
        mflr    r11
        bl      __init_FSCR
-1:     bl      __init_PMU
-       bl      __init_hvmode_206
+       bl      __init_PMU
+1:     bl      __init_hvmode_206
        mtlr    r11
        beqlr
        li      r0,0

> @@ -124,6 +127,9 @@ _GLOBAL(__setup_cpu_power9)
>  _GLOBAL(__restore_cpu_power10)
>   mflr r11
>   bl __init_FSCR_power10
> + bl __init_PMU
> + bl __init_PMU_ISA31
> + bl __init_PMU_HV
>   b 1f
>
>  _GLOBAL(__restore_cpu_power9)
> @@ -233,3 +239,10 @@ __init_PMU_ISA207:
>   li r5,0
>   mtspr SPRN_MMCRS,r5
>   blr
> +
> +__init_PMU_ISA31:
> + li r5,0
> + mtspr SPRN_MMCR3,r5
> + LOAD_REG_IMMEDIATE(r5, MMCRA_BHRB_DISABLE)
> + mtspr SPRN_MMCRA,r5
> + blr
>
> —
>
>        b       1f
>
> _GLOBAL(__setup_cpu_power9)
> @@ -233,3 +234,10 @@ __init_PMU_ISA207:
>        li      r5,0
>        mtspr   SPRN_MMCRS,r5
>        blr
> +
> +__init_PMU_ISA31:
> +       li      r5,0
> +       mtspr   SPRN_MMCR3,r5
> +       LOAD_REG_IMMEDIATE(r5, MMCRA_BHRB_DISABLE)
> +       mtspr   SPRN_MMCRA,r5
> +       blr
> diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
> index 3a40951..f482286 100644
> --- a/arch/powerpc/kernel/dt_cpu_ftrs.c
> +++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
> @@ -450,6 +450,31 @@ static int __init feat_enable_pmu_power9(struct dt_cpu_feature *f)
>        return 1;
> }
>
> +static void init_pmu_power10(void)
> +{
> +       init_pmu_power9();
> +
> +       mtspr(SPRN_MMCR3, 0);
> +       mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
> +}
> +
> +static int __init feat_enable_pmu_power10(struct dt_cpu_feature *f)
> +{
> +       hfscr_pmu_enable();
> +
> +       init_pmu_power10();
> +       init_pmu_registers = init_pmu_power10;
> +
> +       cur_cpu_spec->cpu_features |= CPU_FTR_MMCRA;
> +       cur_cpu_spec->cpu_user_features |= PPC_FEATURE_PSERIES_PERFMON_COMPAT;
> +
> +       cur_cpu_spec->num_pmcs          = 6;
> +       cur_cpu_spec->pmc_type          = PPC_PMC_IBM;
> +       cur_cpu_spec->oprofile_cpu_type = "ppc64/power10";
> +
> +       return 1;
> +}
> +
> static int __init feat_enable_tm(struct dt_cpu_feature *f)
> {
> #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> @@ -639,6 +664,7 @@ struct dt_cpu_feature_match {
>        {"pc-relative-addressing", feat_enable, 0},
>        {"machine-check-power9", feat_enable_mce_power9, 0},
>        {"performance-monitor-power9", feat_enable_pmu_power9, 0},
> +       {"performance-monitor-power10", feat_enable_pmu_power10, 0},
>        {"event-based-branch-v3", feat_enable, 0},
>        {"random-number-generator", feat_enable, 0},
>        {"system-call-vectored", feat_disable, 0},
> --
> 1.8.3.1
>
>

^ permalink raw reply

* Re: [PATCH] powerpc/64s: Fix irq tracing corruption in interrupt/syscall return caused by perf interrupts
From: Alexey Kardashevskiy @ 2020-07-22 10:50 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev
In-Reply-To: <20200722073437.930521-1-npiggin@gmail.com>



On 22/07/2020 17:34, Nicholas Piggin wrote:
> Alexey reports lockdep_assert_irqs_enabled() warnings when stress testing perf, e.g.,
> 
> WARNING: CPU: 0 PID: 1556 at kernel/softirq.c:169 __local_bh_enable_ip+0x258/0x270
> CPU: 0 PID: 1556 Comm: syz-executor
> NIP:  c0000000001ec888 LR: c0000000001ec884 CTR: c000000000ef0610
> REGS: c000000022d4f8a0 TRAP: 0700   Not tainted  (5.8.0-rc3-x)
> MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28008844  XER: 20040000
> CFAR: c0000000001dc1d0 IRQMASK: 0
> 
> The interesting thing is MSR[EE] and IRQMASK shows interrupts are enabled,
> suggesting the current->hardirqs_enabled irq tracing state is going out of sync
> with the actual interrupt enable state.
> 
> The cause is a window in interrupt/syscall return where irq tracing state is being
> adjusted for an irqs-enabled return while MSR[EE] is still enabled. A perf
> interrupt hits and ends up calling trace_hardirqs_off() when restoring
> interrupt flags to a disable state.
> 
> Fix this by disabling perf interrupts as well while adjusting irq tracing state.
> 
> Add a debug check that catches the condition sooner.
> 
> Fixes: 68b34588e202 ("powerpc/64/sycall: Implement syscall entry/exit logic in C")
> Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> 
> I can reproduce similar symptoms and this patch fixes my test case,
> still trying to confirm Alexey's test case or whether there's another
> similar bug causing it.


This does not fix my testcase. I applied this on top of 4fa640dc5230
("Merge tag 'vfio-v5.8-rc7' of git://github.com/awilliam/linux-vfio into
master")  without any of my testing code, just to be clear. Sorry...


> 
>  arch/powerpc/kernel/syscall_64.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/syscall_64.c b/arch/powerpc/kernel/syscall_64.c
> index 79edba3ab312..6c6f88eff915 100644
> --- a/arch/powerpc/kernel/syscall_64.c
> +++ b/arch/powerpc/kernel/syscall_64.c
> @@ -107,8 +107,13 @@ notrace long system_call_exception(long r3, long r4, long r5,
>   */
>  static notrace inline bool prep_irq_for_enabled_exit(void)
>  {
> -	/* This must be done with RI=1 because tracing may touch vmaps */
> -	trace_hardirqs_on();
> +	if (IS_ENABLED(CONFIG_TRACE_IRQFLAGS)) {
> +		/* Prevent perf interrupts hitting and messing up the trace_hardirqs state */
> +		irq_soft_mask_set(IRQS_ALL_DISABLED);
> +
> +		/* This must be done with RI=1 because tracing may touch vmaps */
> +		trace_hardirqs_on();
> +	}
>  
>  	/* This pattern matches prep_irq_for_idle */
>  	__hard_EE_RI_disable();
> @@ -123,6 +128,8 @@ static notrace inline bool prep_irq_for_enabled_exit(void)
>  	local_paca->irq_happened = 0;
>  	irq_soft_mask_set(IRQS_ENABLED);
>  
> +	lockdep_assert_irqs_enabled();
> +
>  	return true;
>  }
>  
> 

-- 
Alexey

^ permalink raw reply

* Re: [PATCH 09/20] Documentation: i2c: eliminate duplicated word
From: Wolfram Sang @ 2020-07-22 10:49 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: kvm, linux-doc, David Airlie, kgdb-bugreport, linux-fpga,
	Liviu Dudau, dri-devel, Douglas Anderson, Paul Cercueil, keyrings,
	Paul Mackerras, linux-i2c, Pavel Machek, Srinivas Pandruvada,
	Mihail Atanassov, linux-leds, linux-s390, Daniel Thompson,
	linux-scsi, Jonathan Corbet, Masahiro Yamada, Matthew Wilcox,
	Halil Pasic, Jarkko Sakkinen, James Wang, linux-input,
	Mali DP Maintainers, Derek Kiernan, linux-mips, Dragan Cvetic,
	Wu Hao, Tony Krowiak, linux-kbuild, James E.J. Bottomley,
	Jiri Kosina, Hannes Reinecke, linux-block, Thomas Bogendoerfer,
	Jacek Anaszewski, linux-mm, Dan Williams, Andrew Morton,
	Mimi Zohar, Jens Axboe, Michal Marek, Martin K. Petersen,
	Pierre Morel, linux-kernel, Daniel Vetter, Jason Wessel,
	Paolo Bonzini, linux-integrity, linuxppc-dev, Mike Rapoport,
	Dan Murphy
In-Reply-To: <20200707180414.10467-10-rdunlap@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 214 bytes --]

On Tue, Jul 07, 2020 at 11:04:03AM -0700, Randy Dunlap wrote:
> Drop doubled word "new".
> 
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>

For the record:

Acked-by: Wolfram Sang <wsa@kernel.org>


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [v3 07/15] powerpc/perf: Add power10_feat to dt_cpu_ftrs
From: Michael Ellerman @ 2020-07-22 10:39 UTC (permalink / raw)
  To: Athira Rajeev, Jordan Niethe
  Cc: Gautham R Shenoy, Michael Neuling, maddy, kvm, kvm-ppc, svaidyan,
	acme, jolsa, linuxppc-dev
In-Reply-To: <9A4E06A2-5686-4C85-B2F7-0904F195B58A@linux.vnet.ibm.com>

Athira Rajeev <atrajeev@linux.vnet.ibm.com> writes:
>> On 22-Jul-2020, at 10:11 AM, Jordan Niethe <jniethe5@gmail.com> wrote:
>> 
>> On Sat, Jul 18, 2020 at 1:13 AM Athira Rajeev
>> <atrajeev@linux.vnet.ibm.com <mailto:atrajeev@linux.vnet.ibm.com>> wrote:
>>> 
>>> From: Madhavan Srinivasan <maddy@linux.ibm.com>
>>> 
>>> Add power10 feature function to dt_cpu_ftrs.c along
>>> with a power10 specific init() to initialize pmu sprs,
>>> sets the oprofile_cpu_type and cpu_features. This will
>>> enable performance monitoring unit(PMU) for Power10
>>> in CPU features with "performance-monitor-power10".
>>> 
>>> For PowerISA v3.1, BHRB disable is controlled via Monitor Mode
>>> Control Register A (MMCRA) bit, namely "BHRB Recording Disable
>>> (BHRBRD)". This patch initializes MMCRA BHRBRD to disable BHRB
>>> feature at boot for power10.
>>> 
>>> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
>>> ---
>>> arch/powerpc/include/asm/reg.h        |  3 +++
>>> arch/powerpc/kernel/cpu_setup_power.S |  8 ++++++++
>>> arch/powerpc/kernel/dt_cpu_ftrs.c     | 26 ++++++++++++++++++++++++++
>>> 3 files changed, 37 insertions(+)
>>> 
>>> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
>>> index 21a1b2d..900ada1 100644
>>> --- a/arch/powerpc/include/asm/reg.h
>>> +++ b/arch/powerpc/include/asm/reg.h
>>> @@ -1068,6 +1068,9 @@
>>> #define MMCR0_PMC2_LOADMISSTIME        0x5
>>> #endif
>>> 
>>> +/* BHRB disable bit for PowerISA v3.10 */
>>> +#define MMCRA_BHRB_DISABLE     0x0000002000000000
>> Shouldn't this go under SPRN_MMCRA with the other MMCRA_*.
>
>
> Hi Jordan
>
> Ok, the definition of MMCRA is under #ifdef for 64 bit .  if I move definition of MMCRA_BHRB_DISABLE along with other SPR's, I also
> need to define this for 32-bit to satisfy core-book3s to compile as below:
>
> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
> index 900ada10762c..7e271657b412 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -888,6 +888,8 @@
>  #define   MMCRA_SLOT   0x07000000UL /* SLOT bits (37-39) */
>  #define   MMCRA_SLOT_SHIFT     24
>  #define   MMCRA_SAMPLE_ENABLE 0x00000001UL /* enable sampling */
> +/* BHRB disable bit for PowerISA v3.10 */
> +#define   MMCRA_BHRB_DISABLE  0x0000002000000000

I changed it to:

#define   MMCRA_BHRB_DISABLE  0x2000000000UL // BHRB disable bit for ISA v3.1

>>> diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
>>> index efdcfa7..b8e0d1e 100644
>>> --- a/arch/powerpc/kernel/cpu_setup_power.S
>>> +++ b/arch/powerpc/kernel/cpu_setup_power.S
>>> @@ -94,6 +94,7 @@ _GLOBAL(__restore_cpu_power8)
>>> _GLOBAL(__setup_cpu_power10)
>>>        mflr    r11
>>>        bl      __init_FSCR_power10
>>> +       bl      __init_PMU_ISA31
>> So we set MMCRA here but then aren't we still going to call __init_PMU
>> which will overwrite that?
>> Would this setting MMCRA also need to be handled in __restore_cpu_power10?
>
> Thanks for this nice catch !  When I rebased code initial phase, we didn’t had power10 part filled in.
> It was a miss from my side in adding PMu init functions and thanks for pointing this out. 
> Below patch will call __init_PMU functions in setup and restore. Please check if this looks good

Actually those changes should be in a separate patch.

This one is wiring up DT CPU features, the cpu setup routines are not
used by DT CPU features.

So please send a new patch I can insert into the series that adds the
cpu_setup_power.S changes.

cheers

> --
> diff --git a/arch/powerpc/kernel/cpu_setup_power.S b/arch/powerpc/kernel/cpu_setup_power.S
> index efdcfa714106..e672a6c5fd7c 100644
> --- a/arch/powerpc/kernel/cpu_setup_power.S
> +++ b/arch/powerpc/kernel/cpu_setup_power.S
> @@ -94,6 +94,9 @@ _GLOBAL(__restore_cpu_power8)
>  _GLOBAL(__setup_cpu_power10)
>  	mflr	r11
>  	bl	__init_FSCR_power10
> +	bl	__init_PMU
> +	bl	__init_PMU_ISA31
> +	bl	__init_PMU_HV
>  	b	1f
>  
>  _GLOBAL(__setup_cpu_power9)
> @@ -124,6 +127,9 @@ _GLOBAL(__setup_cpu_power9)
>  _GLOBAL(__restore_cpu_power10)
>  	mflr	r11
>  	bl	__init_FSCR_power10
> +	bl	__init_PMU
> +	bl	__init_PMU_ISA31
> +	bl	__init_PMU_HV
>  	b	1f
>  
>  _GLOBAL(__restore_cpu_power9)
> @@ -233,3 +239,10 @@ __init_PMU_ISA207:
>  	li	r5,0
>  	mtspr	SPRN_MMCRS,r5
>  	blr
> +
> +__init_PMU_ISA31:
> +	li	r5,0
> +	mtspr	SPRN_MMCR3,r5
> +	LOAD_REG_IMMEDIATE(r5, MMCRA_BHRB_DISABLE)
> +	mtspr	SPRN_MMCRA,r5
> +	blr
>

^ permalink raw reply

* Re: [PATCH 15/15] powerpc/powernv/sriov: Make single PE mode a per-BAR setting
From: Alexey Kardashevskiy @ 2020-07-22 10:06 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev@lists.ozlabs.org
In-Reply-To: <CAOSf1CF1_Ga1KDhqLGxTgg+ugj6AfrzXbouZq1MiMa0faHZeeg@mail.gmail.com>



On 22/07/2020 15:39, Oliver O'Halloran wrote:
> On Wed, Jul 15, 2020 at 6:00 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>
>>>>>                *
>>>>> -              * Generally, one M64 BAR maps one IOV BAR. To avoid conflict
>>>>> -              * with other devices, IOV BAR size is expanded to be
>>>>> -              * (total_pe * VF_BAR_size).  When VF_BAR_size is half of M64
>>>>> -              * segment size , the expanded size would equal to half of the
>>>>> -              * whole M64 space size, which will exhaust the M64 Space and
>>>>> -              * limit the system flexibility.  This is a design decision to
>>>>> -              * set the boundary to quarter of the M64 segment size.
>>>>> +              * The 1/4 limit is arbitrary and can be tweaked.
>>>>>                */
>>>>> -             if (total_vf_bar_sz > gate) {
>>>>> -                     mul = roundup_pow_of_two(total_vfs);
>>>>> -                     dev_info(&pdev->dev,
>>>>> -                             "VF BAR Total IOV size %llx > %llx, roundup to %d VFs\n",
>>>>> -                             total_vf_bar_sz, gate, mul);
>>>>> -                     iov->m64_single_mode = true;
>>>>> -                     break;
>>>>> -             }
>>>>> -     }
>>>>> +             if (vf_bar_sz > (phb->ioda.m64_segsize >> 2)) {
>>>>> +                     /*
>>>>> +                      * On PHB3, the minimum size alignment of M64 BAR in
>>>>> +                      * single mode is 32MB. If this VF BAR is smaller than
>>>>> +                      * 32MB, but still too large for a segmented window
>>>>> +                      * then we can't map it and need to disable SR-IOV for
>>>>> +                      * this device.
>>>>
>>>>
>>>> Why not use single PE mode for such BAR? Better than nothing.
>>>
>>> Suppose you could, but I figured VFs were mainly interesting since you
>>> could give each VF to a separate guest. If there's multiple VFs under
>>> the same single PE BAR then they'd have to be assigned to the same
>>
>> True. But with one PE per VF we can still have 15 (or 14?) isolated VFs
>> which is not hundreds but better than 0.
> 
> We can only use single PE BARs if the per-VF size is >= 32MB due to
> the alignment requirements on P8. If the per-VF size is smaller then
> we're stuck with multiple VFs inside the same BAR which is bad due to
> the PAPR requirements mentioned below. Sure we could look at doing
> something else, but considering this matches the current behaviour
> it's a bit hard to care...
>
>>> guest in order to retain the freeze/unfreeze behaviour that PAPR
>>> requires. I guess that's how it used to work, but it seems better just
>>> to disable them rather than having VFs which sort of work.
>>
>> Well, realistically the segment size should be 8MB to make this matter
>> (or the whole window 2GB) which does not seem to happen so it does not
>> matter.
> 
> I'm not sure what you mean.

I mean how can we possibly hit this case, what m64_segsize would the
platform have to trigger this. The whole check seems useless but whatever.



-- 
Alexey

^ permalink raw reply

* Re: [v4 5/5] KVM: PPC: Book3S HV: migrate hot plugged memory
From: Bharata B Rao @ 2020-07-22 10:01 UTC (permalink / raw)
  To: Ram Pai
  Cc: ldufour, cclaudio, kvm-ppc, sathnaga, aneesh.kumar, sukadev,
	linuxppc-dev, bauerman, david
In-Reply-To: <1594972827-13928-6-git-send-email-linuxram@us.ibm.com>

On Fri, Jul 17, 2020 at 01:00:27AM -0700, Ram Pai wrote:
> From: Laurent Dufour <ldufour@linux.ibm.com>
> 
> When a memory slot is hot plugged to a SVM, PFNs associated with the
> GFNs in that slot must be migrated to secure-PFNs, aka device-PFNs.
> 
> Call kvmppc_uv_migrate_mem_slot() to accomplish this.
> Disable page-merge for all pages in the memory slot.
> 
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> [rearranged the code, and modified the commit log]
> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/kvm_book3s_uvmem.h | 10 ++++++++++
>  arch/powerpc/kvm/book3s_hv.c                | 10 ++--------
>  arch/powerpc/kvm/book3s_hv_uvmem.c          | 22 ++++++++++++++++++++++
>  3 files changed, 34 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> index f229ab5..6f7da00 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> @@ -25,6 +25,9 @@ void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
>  			     struct kvm *kvm, bool skip_page_out);
>  int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
>  			const struct kvm_memory_slot *memslot);
> +void kvmppc_memslot_create(struct kvm *kvm, const struct kvm_memory_slot *new);
> +void kvmppc_memslot_delete(struct kvm *kvm, const struct kvm_memory_slot *old);

The names look a bit generic, but these functions are specific
to secure guests. May be rename them to kvmppc_uvmem_memslot_[create/delele]?

> +
>  #else
>  static inline int kvmppc_uvmem_init(void)
>  {
> @@ -84,5 +87,12 @@ static inline int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gfn)
>  static inline void
>  kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
>  			struct kvm *kvm, bool skip_page_out) { }
> +
> +static inline void  kvmppc_memslot_create(struct kvm *kvm,
> +		const struct kvm_memory_slot *new) { }
> +
> +static inline void  kvmppc_memslot_delete(struct kvm *kvm,
> +		const struct kvm_memory_slot *old) { }
> +
>  #endif /* CONFIG_PPC_UV */
>  #endif /* __ASM_KVM_BOOK3S_UVMEM_H__ */
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index d331b46..bf3be3b 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -4515,16 +4515,10 @@ static void kvmppc_core_commit_memory_region_hv(struct kvm *kvm,
>  
>  	switch (change) {
>  	case KVM_MR_CREATE:
> -		if (kvmppc_uvmem_slot_init(kvm, new))
> -			return;
> -		uv_register_mem_slot(kvm->arch.lpid,
> -				     new->base_gfn << PAGE_SHIFT,
> -				     new->npages * PAGE_SIZE,
> -				     0, new->id);
> +		kvmppc_memslot_create(kvm, new);
>  		break;
>  	case KVM_MR_DELETE:
> -		uv_unregister_mem_slot(kvm->arch.lpid, old->id);
> -		kvmppc_uvmem_slot_free(kvm, old);
> +		kvmppc_memslot_delete(kvm, old);
>  		break;
>  	default:
>  		/* TODO: Handle KVM_MR_MOVE */
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index a206984..a2b4d25 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -1089,6 +1089,28 @@ int kvmppc_send_page_to_uv(struct kvm *kvm, unsigned long gfn)
>  	return (ret == U_SUCCESS) ? RESUME_GUEST : -EFAULT;
>  }
>  
> +void kvmppc_memslot_create(struct kvm *kvm, const struct kvm_memory_slot *new)
> +{
> +	if (kvmppc_uvmem_slot_init(kvm, new))
> +		return;
> +
> +	if (kvmppc_memslot_page_merge(kvm, new, false))
> +		return;
> +
> +	if (uv_register_mem_slot(kvm->arch.lpid, new->base_gfn << PAGE_SHIFT,
> +			new->npages * PAGE_SIZE, 0, new->id))
> +		return;
> +
> +	kvmppc_uv_migrate_mem_slot(kvm, new);

Quite a few things can return failure here including
kvmppc_uv_migrate_mem_slot() and we are ignoring all of those.
I am wondering if this should be called from prepare_memory_region callback
instead of commit_memory_region. In the prepare phase, we have a way
to back out in case of error. Can you check if moving this call to
prepare callback is feasible?

In the other case in 1/5, the code issues ksm unmerge request on error,
but not here.

Also check if the code for 1st three calls can be shared with similar
code in 1/5.

Regards,
Bharata.

^ permalink raw reply

* Re: [PATCH 05/15] powerpc/powernv/sriov: Move SR-IOV into a seperate file
From: Alexey Kardashevskiy @ 2020-07-22  9:53 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev
In-Reply-To: <CAOSf1CF0jv_cq5xgVz+7fzf155MjHT72p+VN1EY6HjjW1Nza-w@mail.gmail.com>



On 22/07/2020 15:01, Oliver O'Halloran wrote:
> On Tue, Jul 14, 2020 at 7:16 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>
>> On 10/07/2020 15:23, Oliver O'Halloran wrote:
>>> +     align = pci_iov_resource_size(pdev, resno);
>>> +
>>> +     /*
>>> +      * iov can be null if we have an SR-IOV device with IOV BAR that can't
>>> +      * be placed in the m64 space (i.e. The BAR is 32bit or non-prefetch).
>>> +      * In that case we don't allow VFs to be enabled so just return the
>>> +      * default alignment.
>>> +      */
>>> +     if (!iov)
>>> +             return align;
>>
>>
>> This is the new chunk. What would happen before? Non-prefetch BAR would
>> still go to m64 space?
> 
> I don't think there's any real change. Currently if the setup in
> pnv_pci_ioda_fixup_iov_resources() fails then pdn->vfs_expanded will
> be zero. The !iov check here fills the same role, but it's more
> explicit. vfs_expanded has some other behaviour too so we can't get
> rid of it entirely (yet).

The check is fine, you have to have one as @iov can be NULL (unlike
pci_dn). The comment is what bothered me. It would make more sense
somewhere in pnv_pci_ioda_fixup_iov_resources() near
"dev_warn(&pdev->dev, "Don't support SR-IOV with"" as now it suggests
there is one reason for the failed iov configuration only while there
are two reasons.


-- 
Alexey

^ permalink raw reply

* Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone
From: Arnd Bergmann @ 2020-07-22  9:43 UTC (permalink / raw)
  To: Palmer Dabbelt
  Cc: Albert Ou, Alexandre Ghiti, Atish Patra, Anup Patel,
	linux-kernel@vger.kernel.org, Paul Walmsley, Linux-MM,
	Paul Mackerras, Zong Li, linux-riscv, linuxppc-dev
In-Reply-To: <mhng-08bff01a-ca15-4bbc-8454-2ca3e823fef8@palmerdabbelt-glaptop1>

On Tue, Jul 21, 2020 at 9:06 PM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
> On Tue, 21 Jul 2020 11:36:10 PDT (-0700), alex@ghiti.fr wrote:
> > Let's try to make progress here: I add linux-mm in CC to get feedback on
> > this patch as it blocks sv48 support too.
>
> Sorry for being slow here.  I haven't replied because I hadn't really fleshed
> out the design yet, but just so everyone's on the same page my problems with
> this are:
>
> * We waste vmalloc space on 32-bit systems, where there isn't a lot of it.

There is actually an ongoing work to make 32-bit Arm kernels move
vmlinux into the vmalloc space, as part of the move to avoid highmem.

Overall, a 32-bit system would waste about 0.1% of its virtual address space
by having the kernel be located in both the linear map and the vmalloc area.
It's not zero, but not that bad either. With the typical split of 3072 MB user,
768MB linear and 256MB vmalloc, it's also around 1.5% of the available
vmalloc area (assuming a 4MB vmlinux in a typical 32-bit kernel), but the
boundaries can be changed arbitrarily if needed.

The eventual goal is to have a split of 3840MB for either user or linear map
plus and 256MB for vmalloc, including the kernel. Switching between linear
and user has a noticeable runtime overhead, but it relaxes both the limits
for user memory and lowmem, and it provides a somewhat stronger
address space isolation.

Another potential idea would be to completely randomize the physical
addresses underneath the kernel by using a random permutation of the
pages in the kernel image. This adds even more overhead (virt_to_phys
may need to call vmalloc_to_page or similar) and may cause problems
with DMA into kernel .data across page boundaries,

> * Sort out how to maintain a linear map as the canonical hole moves around
>   between the VA widths without adding a bunch of overhead to the virt2phys and
>   friends.  This is probably going to be the trickiest part, but I think if we
>   just change the page table code to essentially lie about VAs when an sv39
>   system runs an sv48+sv39 kernel we could make it work -- there'd be some
>   logical complexity involved, but it would remain fast.

I assume you can't use the trick that x86 has where all kernel addresses
are at the top of the 64-bit address space and user addresses are at the
bottom, regardless of the size of the page tables?

      Arnd

^ permalink raw reply

* Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain
From: peterz @ 2020-07-22  8:54 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Nathan Lynch, Oliver OHalloran, Michael Neuling,
	Michael Ellerman, Gautham R Shenoy, Jordan Niethe,
	Anton Blanchard, LKML, Valentin Schneider, linuxppc-dev,
	Nick Piggin
In-Reply-To: <20200722084854.GL10769@hirez.programming.kicks-ass.net>

On Wed, Jul 22, 2020 at 10:48:54AM +0200, Peter Zijlstra wrote:
> But reading your explanation, it looks like the Linux topology setup
> could actually unravel the fused-faux-SMT8 into two independent SMT4
> parts, negating that firmware option.

Ah, it looks like that's exactly what you're doing. Nice!

^ permalink raw reply

* Re: [v4 1/5] KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START
From: Bharata B Rao @ 2020-07-22  8:52 UTC (permalink / raw)
  To: Ram Pai
  Cc: ldufour, cclaudio, kvm-ppc, sathnaga, aneesh.kumar, sukadev,
	linuxppc-dev, bauerman, david
In-Reply-To: <1594972827-13928-2-git-send-email-linuxram@us.ibm.com>

On Fri, Jul 17, 2020 at 01:00:23AM -0700, Ram Pai wrote:
> Page-merging of pages in memory-slots associated with a Secure VM,
> is disabled in H_SVM_PAGE_IN handler.
> 
> This operation should have been done much earlier; the moment the VM
> is initiated for secure-transition. Delaying this operation, increases
> the probability for those pages to acquire new references , making it
> impossible to migrate those pages.
> 
> Disable page-migration in H_SVM_INIT_START handling.
> 
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>

Reviewed-by: Bharata B Rao <bharata@linux.ibm.com>

with a few observations below...

> ---
>  Documentation/powerpc/ultravisor.rst |  1 +
>  arch/powerpc/kvm/book3s_hv_uvmem.c   | 98 +++++++++++++++++++++++++++---------
>  2 files changed, 76 insertions(+), 23 deletions(-)
> 
> diff --git a/Documentation/powerpc/ultravisor.rst b/Documentation/powerpc/ultravisor.rst
> index df136c8..a1c8c37 100644
> --- a/Documentation/powerpc/ultravisor.rst
> +++ b/Documentation/powerpc/ultravisor.rst
> @@ -895,6 +895,7 @@ Return values
>      One of the following values:
>  
>  	* H_SUCCESS	 on success.
> +        * H_STATE        if the VM is not in a position to switch to secure.
>  
>  Description
>  ~~~~~~~~~~~
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index e6f76bc..0baa293 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -211,6 +211,65 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, struct kvm *kvm,
>  	return false;
>  }
>  
> +static int kvmppc_memslot_page_merge(struct kvm *kvm,
> +		struct kvm_memory_slot *memslot, bool merge)
> +{
> +	unsigned long gfn = memslot->base_gfn;
> +	unsigned long end, start = gfn_to_hva(kvm, gfn);
> +	int ret = 0;
> +	struct vm_area_struct *vma;
> +	int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE;
> +
> +	if (kvm_is_error_hva(start))
> +		return H_STATE;
> +
> +	end = start + (memslot->npages << PAGE_SHIFT);
> +
> +	mmap_write_lock(kvm->mm);
> +	do {
> +		vma = find_vma_intersection(kvm->mm, start, end);
> +		if (!vma) {
> +			ret = H_STATE;
> +			break;
> +		}
> +		ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> +			  merge_flag, &vma->vm_flags);
> +		if (ret) {
> +			ret = H_STATE;
> +			break;
> +		}
> +		start = vma->vm_end + 1;

This should be start = vma->vm_end I believe.

> +	} while (end > vma->vm_end);
> +
> +	mmap_write_unlock(kvm->mm);
> +	return ret;
> +}
> +
> +static int __kvmppc_page_merge(struct kvm *kvm, bool merge)
> +{
> +	struct kvm_memslots *slots;
> +	struct kvm_memory_slot *memslot;
> +	int ret = 0;
> +
> +	slots = kvm_memslots(kvm);
> +	kvm_for_each_memslot(memslot, slots) {
> +		ret = kvmppc_memslot_page_merge(kvm, memslot, merge);
> +		if (ret)
> +			break;
> +	}
> +	return ret;
> +}

You walk through all the slots here to issue kvm_madvise, but...

> +
> +static inline int kvmppc_disable_page_merge(struct kvm *kvm)
> +{
> +	return __kvmppc_page_merge(kvm, false);
> +}
> +
> +static inline int kvmppc_enable_page_merge(struct kvm *kvm)
> +{
> +	return __kvmppc_page_merge(kvm, true);
> +}
> +
>  unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>  {
>  	struct kvm_memslots *slots;
> @@ -232,11 +291,18 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>  		return H_AUTHORITY;
>  
>  	srcu_idx = srcu_read_lock(&kvm->srcu);
> +
> +	/* disable page-merging for all memslot */
> +	ret = kvmppc_disable_page_merge(kvm);
> +	if (ret)
> +		goto out;
> +
> +	/* register the memslot */
>  	slots = kvm_memslots(kvm);
>  	kvm_for_each_memslot(memslot, slots) {

... you are walking thro' the same set of slots here anyway. I think
it makes sense to issue merge advices from here itself. That will
help you to share code with kvmppc_memslot_create() in 5/5.

All the below 3 calls are common to both the code paths, I think
they can be carved out into a separate function if you prefer.

kvmppc_uvmem_slot_init
kvmppc_memslot_page_merge
uv_register_mem_slot

Regards,
Bharata.

^ permalink raw reply

* Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain
From: Peter Zijlstra @ 2020-07-22  8:48 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Nathan Lynch, Oliver OHalloran, Michael Neuling,
	Michael Ellerman, Gautham R Shenoy, Jordan Niethe,
	Anton Blanchard, LKML, Valentin Schneider, linuxppc-dev,
	Nick Piggin
In-Reply-To: <20200722081822.GG9290@linux.vnet.ibm.com>

On Wed, Jul 22, 2020 at 01:48:22PM +0530, Srikar Dronamraju wrote:
> * peterz@infradead.org <peterz@infradead.org> [2020-07-22 09:46:24]:
> 
> > On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> > > Currently "CACHE" domain happens to be the 2nd sched domain as per
> > > powerpc_topology. This domain will collapse if cpumask of l2-cache is
> > > same as SMT domain. However we could generalize this domain such that it
> > > could mean either be a "CACHE" domain or a "BIGCORE" domain.
> > 
> > What's the whole smallcore vs bigcore thing?
> > 
> > Would it make sense to have an actual overview of the various topology
> > layers somewhere? Because I'm getting lost and can't really make sense
> > of the code due to that.
> 
> To quote with an example: using (Power 9)
> 
> Power 9 is an SMT 8 core by design. However this 8 thread core can run as 2
> independent core with threads 0,2,4 and 6 acting like a core, while threads
> 1,3,5,7 acting as another core.  
> 
> The firmware can decide to showcase them as 2 independent small cores or as
> one big core. However the LLC (i.e last level of cache) is shared between
> all the threads of the SMT 8 core. Future power chips, LLC might change, it
> may be expanded to share with another SMT 8 core or it could be made
> specific to SMT 4 core.
> 
> The smt 8 core is also known as fused core/ Big core.
> The smaller smt 4 core is known as small core.
> 
> So on a Power9 Baremetal, the firmware would show up as SMT4 core.
> and we have a CACHE level at SMT 8. CACHE level would be very very similar
> to MC domain in X86.
> 
> Hope this is clear.

Ooh, that thing. I thought P9 was in actual fact an SMT4 hardware part,
but to be compatible with P8 it has this fused option where two cores
act like a single SMT8 part.

The interleaving enumeration is done due to P7 asymmetric cores,
resuting in schedulers having the preference to use the lower threads.

Combined that results in a P9-fused configuration using two independent
full cores when there's only 2 runnable threads.

Which is a subtly different story from yours.

But reading your explanation, it looks like the Linux topology setup
could actually unravel the fused-faux-SMT8 into two independent SMT4
parts, negating that firmware option.

Anyway, a few comments just around there might be helpfull.


Thanks!

^ permalink raw reply

* Re: [PATCH v3 0/4] powerpc/mm/radix: Memory unplug fixes
From: David Gibson @ 2020-07-22  7:51 UTC (permalink / raw)
  To: Bharata B Rao; +Cc: Nathan Lynch, linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <20200722060506.GO7902@in.ibm.com>

[-- Attachment #1: Type: text/plain, Size: 3314 bytes --]

On Wed, Jul 22, 2020 at 11:35:06AM +0530, Bharata B Rao wrote:
> On Tue, Jul 21, 2020 at 10:25:58PM +1000, Michael Ellerman wrote:
> > Bharata B Rao <bharata@linux.ibm.com> writes:
> > > On Tue, Jul 21, 2020 at 11:45:20AM +1000, Michael Ellerman wrote:
> > >> Nathan Lynch <nathanl@linux.ibm.com> writes:
> > >> > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
> > >> >> This is the next version of the fixes for memory unplug on radix.
> > >> >> The issues and the fix are described in the actual patches.
> > >> >
> > >> > I guess this isn't actually causing problems at runtime right now, but I
> > >> > notice calls to resize_hpt_for_hotplug() from arch_add_memory() and
> > >> > arch_remove_memory(), which ought to be mmu-agnostic:
> > >> >
> > >> > int __ref arch_add_memory(int nid, u64 start, u64 size,
> > >> > 			  struct mhp_params *params)
> > >> > {
> > >> > 	unsigned long start_pfn = start >> PAGE_SHIFT;
> > >> > 	unsigned long nr_pages = size >> PAGE_SHIFT;
> > >> > 	int rc;
> > >> >
> > >> > 	resize_hpt_for_hotplug(memblock_phys_mem_size());
> > >> >
> > >> > 	start = (unsigned long)__va(start);
> > >> > 	rc = create_section_mapping(start, start + size, nid,
> > >> > 				    params->pgprot);
> > >> > ...
> > >> 
> > >> Hmm well spotted.
> > >> 
> > >> That does return early if the ops are not setup:
> > >> 
> > >> int resize_hpt_for_hotplug(unsigned long new_mem_size)
> > >> {
> > >> 	unsigned target_hpt_shift;
> > >> 
> > >> 	if (!mmu_hash_ops.resize_hpt)
> > >> 		return 0;
> > >> 
> > >> 
> > >> And:
> > >> 
> > >> void __init hpte_init_pseries(void)
> > >> {
> > >> 	...
> > >> 	if (firmware_has_feature(FW_FEATURE_HPT_RESIZE))
> > >> 		mmu_hash_ops.resize_hpt = pseries_lpar_resize_hpt;
> > >> 
> > >> And that comes in via ibm,hypertas-functions:
> > >> 
> > >> 	{FW_FEATURE_HPT_RESIZE,		"hcall-hpt-resize"},
> > >> 
> > >> 
> > >> But firmware is not necessarily going to add/remove that call based on
> > >> whether we're using hash/radix.
> > >
> > > Correct but hpte_init_pseries() will not be called for radix guests.
> > 
> > Yeah, duh. You'd think the function name would have been a sufficient
> > clue for me :)
> > 
> > >> So I think a follow-up patch is needed to make this more robust.
> > >> 
> > >> Aneesh/Bharata what platform did you test this series on? I'm curious
> > >> how this didn't break.
> > >
> > > I have tested memory hotplug/unplug for radix guest on zz platform and
> > > sanity-tested this for hash guest on P8.
> > >
> > > As noted above, mmu_hash_ops.resize_hpt will not be set for radix
> > > guest and hence we won't see any breakage.
> > 
> > OK.
> > 
> > That's probably fine as it is then. Or maybe just a comment in
> > resize_hpt_for_hotplug() pointing out that resize_hpt will be NULL if
> > we're using radix.
> 
> Or we could move these calls to hpt-only routines like below?
> 
> David - Do you remember if there was any particular reason to have
> these two hpt-resize calls within powerpc-generic memory hotplug code?

I don't remember, sorry.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain
From: Srikar Dronamraju @ 2020-07-22  8:18 UTC (permalink / raw)
  To: peterz
  Cc: Ingo Molnar, Nathan Lynch, Oliver OHalloran, Michael Neuling,
	Michael Ellerman, Gautham R Shenoy, Jordan Niethe,
	Anton Blanchard, LKML, Valentin Schneider, linuxppc-dev,
	Nick Piggin
In-Reply-To: <20200722074624.GP119549@hirez.programming.kicks-ass.net>

* peterz@infradead.org <peterz@infradead.org> [2020-07-22 09:46:24]:

> On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> > Currently "CACHE" domain happens to be the 2nd sched domain as per
> > powerpc_topology. This domain will collapse if cpumask of l2-cache is
> > same as SMT domain. However we could generalize this domain such that it
> > could mean either be a "CACHE" domain or a "BIGCORE" domain.
> 
> What's the whole smallcore vs bigcore thing?
> 
> Would it make sense to have an actual overview of the various topology
> layers somewhere? Because I'm getting lost and can't really make sense
> of the code due to that.

To quote with an example: using (Power 9)

Power 9 is an SMT 8 core by design. However this 8 thread core can run as 2
independent core with threads 0,2,4 and 6 acting like a core, while threads
1,3,5,7 acting as another core.  

The firmware can decide to showcase them as 2 independent small cores or as
one big core. However the LLC (i.e last level of cache) is shared between
all the threads of the SMT 8 core. Future power chips, LLC might change, it
may be expanded to share with another SMT 8 core or it could be made
specific to SMT 4 core.

The smt 8 core is also known as fused core/ Big core.
The smaller smt 4 core is known as small core.

So on a Power9 Baremetal, the firmware would show up as SMT4 core.
and we have a CACHE level at SMT 8. CACHE level would be very very similar
to MC domain in X86.

Hope this is clear.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* Re: [PATCH v2 01/10] powerpc/smp: Cache node for reuse
From: Srikar Dronamraju @ 2020-07-22  8:04 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Ingo Molnar, Nathan Lynch, Oliver OHalloran, Michael Neuling,
	Gautham R Shenoy, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
	LKML, Nick Piggin, linuxppc-dev, Valentin Schneider
In-Reply-To: <87imegq9my.fsf@mpe.ellerman.id.au>

* Michael Ellerman <michaele@au1.ibm.com> [2020-07-22 17:41:41]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > While cpu_to_node is inline function with access to per_cpu variable.
> > However when using repeatedly, it may be cleaner to cache it in a local
> > variable.
> 
> It's not clear what "cleaner" is supposed to mean. Are you saying it
> makes the source clearer, or the generated code?
> 
> I'm not sure it will make any difference to the latter.

I meant the source code, I am okay dropping the hunks that try to cache
cpu_to_node.

> 
> > Also fix a build error in a some weird config.
> > "error: _numa_cpu_lookup_table_ undeclared"
> 
> Separate patch please.

Okay, will do.

> 
> > No functional change
> 
> The ifdef change means that's not true.

Okay

> > @@ -854,20 +854,24 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> >  	cpu_callin_map[boot_cpuid] = 1;
> >  
> >  	for_each_possible_cpu(cpu) {
> > +		int node = cpu_to_node(cpu);
> > +
> 
> Does cpu_to_node() even work here?

Except in the case where NUMA is not enabled, (when cpu_to_node would return
-1), It should work here since numa initialization would have happened by
now. It cpu_to_node(cpu) should work once numa_setup_cpu() /
map_cpu_to_node() gets called.  And those are being called before this.

> 
> Doesn't look like it to me.
> 
> More fallout from 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID") ?
> 
> >  	}
> >  
> >  	/* Init the cpumasks so the boot CPU is related to itself */

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* RE: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register
From: Ram Pai @ 2020-07-22  7:49 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: ldufour, aik, kvm-ppc, bharata, sathnaga, sukadev, linuxppc-dev,
	bauerman, david
In-Reply-To: <875zags3qp.fsf@mpe.ellerman.id.au>

On Wed, Jul 22, 2020 at 12:06:06PM +1000, Michael Ellerman wrote:
> Ram Pai <linuxram@us.ibm.com> writes:
> > An instruction accessing a mmio address, generates a HDSI fault.  This fault is
> > appropriately handled by the Hypervisor.  However in the case of secureVMs, the
> > fault is delivered to the ultravisor.
> >
> > Unfortunately the Ultravisor has no correct-way to fetch the faulting
> > instruction. The PEF architecture does not allow Ultravisor to enable MMU
> > translation. Walking the two level page table to read the instruction can race
> > with other vcpus modifying the SVM's process scoped page table.
> 
> You're trying to read the guest's kernel text IIUC, that mapping should
> be stable. Possibly permissions on it could change over time, but the
> virtual -> real mapping should not.

Actually the code does not capture the address of the instruction in the
sprg0 register. It captures the instruction itself. So should the mapping
matter?

> 
> > This problem can be correctly solved with some help from the kernel.
> >
> > Capture the faulting instruction in SPRG0 register, before executing the
> > faulting instruction. This enables the ultravisor to easily procure the
> > faulting instruction and emulate it.
> 
> This is not something I'm going to merge. Sorry.

Ok. Will consider other approaches.

RP

^ permalink raw reply

* Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain
From: peterz @ 2020-07-22  7:46 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Nathan Lynch, Oliver OHalloran, Michael Neuling,
	Michael Ellerman, Gautham R Shenoy, Jordan Niethe,
	Anton Blanchard, LKML, Valentin Schneider, linuxppc-dev,
	Nick Piggin
In-Reply-To: <20200721113814.32284-7-srikar@linux.vnet.ibm.com>

On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> Currently "CACHE" domain happens to be the 2nd sched domain as per
> powerpc_topology. This domain will collapse if cpumask of l2-cache is
> same as SMT domain. However we could generalize this domain such that it
> could mean either be a "CACHE" domain or a "BIGCORE" domain.

What's the whole smallcore vs bigcore thing?

Would it make sense to have an actual overview of the various topology
layers somewhere? Because I'm getting lost and can't really make sense
of the code due to that.

^ permalink raw reply

* RE: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register
From: Ram Pai @ 2020-07-22  7:45 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: ldufour, aik, kvm-ppc, bharata, sathnaga, sukadev, linuxppc-dev,
	bauerman, david
In-Reply-To: <20200722074205.GH7339@oc0525413822.ibm.com>

On Wed, Jul 22, 2020 at 12:42:05AM -0700, Ram Pai wrote:
> On Wed, Jul 22, 2020 at 03:02:32PM +1000, Paul Mackerras wrote:
> > On Thu, Jul 16, 2020 at 01:32:13AM -0700, Ram Pai wrote:
> > > An instruction accessing a mmio address, generates a HDSI fault.  This fault is
> > > appropriately handled by the Hypervisor.  However in the case of secureVMs, the
> > > fault is delivered to the ultravisor.
> > > 
> > > Unfortunately the Ultravisor has no correct-way to fetch the faulting
> > > instruction. The PEF architecture does not allow Ultravisor to enable MMU
> > > translation. Walking the two level page table to read the instruction can race
> > > with other vcpus modifying the SVM's process scoped page table.
> > > 
> > > This problem can be correctly solved with some help from the kernel.
> > > 
> > > Capture the faulting instruction in SPRG0 register, before executing the
> > > faulting instruction. This enables the ultravisor to easily procure the
> > > faulting instruction and emulate it.
> > 
> > Just a comment on the approach of putting the instruction in SPRG0:
> > these I/O accessors can be used in interrupt routines, which means
> > that if these accessors are ever used with interrupts enabled, there
> > is the possibility of an external interrupt occurring between the
> > instruction that sets SPRG0 and the load/store instruction that
> > faults.  If the handler for that interrupt itself does an I/O access,
> > it will overwrite SPRG0, corrupting the value set by the interrupted
> > code.
> 
> Acutally my proposed code restores the value of SPRG0 before returning back to
> the interrupted instruction. So here is the sequence. I think it works.
> 
>  (1) store sprg0 in register Rx (lets say srpg0 had 0xc. Rx now contains 0xc)
> 
>  (2) save faulting instruction address in sprg0 (lets say the value is 0xa.
> 		 			sprg0 will contain 0xa).

Small correction. sprg0 does not store the address of the faulting
instruction. It stores the isntruction itself. Regardless, the code
should work, I think.

RP

^ permalink raw reply

* Re: Re: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register
From: Ram Pai @ 2020-07-22  7:42 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: ldufour, aik, kvm-ppc, bharata, sathnaga, sukadev, linuxppc-dev,
	bauerman, david
In-Reply-To: <20200722050232.GD3878639@thinks.paulus.ozlabs.org>

On Wed, Jul 22, 2020 at 03:02:32PM +1000, Paul Mackerras wrote:
> On Thu, Jul 16, 2020 at 01:32:13AM -0700, Ram Pai wrote:
> > An instruction accessing a mmio address, generates a HDSI fault.  This fault is
> > appropriately handled by the Hypervisor.  However in the case of secureVMs, the
> > fault is delivered to the ultravisor.
> > 
> > Unfortunately the Ultravisor has no correct-way to fetch the faulting
> > instruction. The PEF architecture does not allow Ultravisor to enable MMU
> > translation. Walking the two level page table to read the instruction can race
> > with other vcpus modifying the SVM's process scoped page table.
> > 
> > This problem can be correctly solved with some help from the kernel.
> > 
> > Capture the faulting instruction in SPRG0 register, before executing the
> > faulting instruction. This enables the ultravisor to easily procure the
> > faulting instruction and emulate it.
> 
> Just a comment on the approach of putting the instruction in SPRG0:
> these I/O accessors can be used in interrupt routines, which means
> that if these accessors are ever used with interrupts enabled, there
> is the possibility of an external interrupt occurring between the
> instruction that sets SPRG0 and the load/store instruction that
> faults.  If the handler for that interrupt itself does an I/O access,
> it will overwrite SPRG0, corrupting the value set by the interrupted
> code.

Acutally my proposed code restores the value of SPRG0 before returning back to
the interrupted instruction. So here is the sequence. I think it works.

 (1) store sprg0 in register Rx (lets say srpg0 had 0xc. Rx now contains 0xc)

 (2) save faulting instruction address in sprg0 (lets say the value is 0xa.
		 			sprg0 will contain 0xa).

 (3) 		<----- interrupt arrives

 (4) store sprg0 in register Ry ( Ry now contains 0xa )

 (5) save faulting instruction address in sprg0 (lets say the value is 0xb).

 (6) 0xb: execute faulting instruction

 (7) restore Ry into sprg0 ( sprg0 now contains 0xa )

 (8) 		<-- return from interrupt

 (9)  0xa: execute faulting instruction

 (10) restore Rx into sprg0 (sprg0 now contains 0xc)

> 
> The choices to fix that would seem to be (a) disable interrupts around
> all I/O accesses, (b) have the accessor save and restore SPRG0, or (c)
> solve the problem another way, such as by doing a H_LOGICAL_CI_LOAD
> or H_LOGICAL_CI_STORE hypercall.

Ok. Will explore (c).

Thanks,
RP

-- 
Ram Pai

^ permalink raw reply

* Re: [PATCH v2 01/10] powerpc/smp: Cache node for reuse
From: Michael Ellerman @ 2020-07-22  7:41 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, Anton Blanchard,
	LKML, Valentin Schneider, Nick Piggin, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200721113814.32284-2-srikar@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> While cpu_to_node is inline function with access to per_cpu variable.
> However when using repeatedly, it may be cleaner to cache it in a local
> variable.

It's not clear what "cleaner" is supposed to mean. Are you saying it
makes the source clearer, or the generated code?

I'm not sure it will make any difference to the latter.

> Also fix a build error in a some weird config.
> "error: _numa_cpu_lookup_table_ undeclared"

Separate patch please.

> No functional change

The ifdef change means that's not true.

> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <michaele@au1.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Nick Piggin <npiggin@au1.ibm.com>
> Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@linux.ibm.com>
> Cc: Anton Blanchard <anton@au1.ibm.com>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> Cc: Jordan Niethe <jniethe5@gmail.com>
> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/smp.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 73199470c265..680c0edcc59d 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -843,7 +843,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>  
>  	DBG("smp_prepare_cpus\n");
>  
> -	/* 
> +	/*
>  	 * setup_cpu may need to be called on the boot cpu. We havent
>  	 * spun any cpus up but lets be paranoid.
>  	 */
> @@ -854,20 +854,24 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>  	cpu_callin_map[boot_cpuid] = 1;
>  
>  	for_each_possible_cpu(cpu) {
> +		int node = cpu_to_node(cpu);
> +

Does cpu_to_node() even work here?

Doesn't look like it to me.

More fallout from 8c272261194d ("powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID") ?

>  		zalloc_cpumask_var_node(&per_cpu(cpu_sibling_map, cpu),
> -					GFP_KERNEL, cpu_to_node(cpu));
> +					GFP_KERNEL, node);
>  		zalloc_cpumask_var_node(&per_cpu(cpu_l2_cache_map, cpu),
> -					GFP_KERNEL, cpu_to_node(cpu));
> +					GFP_KERNEL, node);
>  		zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
> -					GFP_KERNEL, cpu_to_node(cpu));
> +					GFP_KERNEL, node);
> +#ifdef CONFIG_NEED_MULTIPLE_NODES
>  		/*
>  		 * numa_node_id() works after this.
>  		 */
>  		if (cpu_present(cpu)) {
> -			set_cpu_numa_node(cpu, numa_cpu_lookup_table[cpu]);
> -			set_cpu_numa_mem(cpu,
> -				local_memory_node(numa_cpu_lookup_table[cpu]));
> +			node = numa_cpu_lookup_table[cpu];
> +			set_cpu_numa_node(cpu, node);
> +			set_cpu_numa_mem(cpu, local_memory_node(node));
>  		}
> +#endif
>  	}
>  
>  	/* Init the cpumasks so the boot CPU is related to itself */
> -- 
> 2.17.1

cheers

^ permalink raw reply

* Re: [PATCH v2 06/10] powerpc/smp: Generalize 2nd sched domain
From: Srikar Dronamraju @ 2020-07-22  7:39 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Nathan Lynch, Oliver OHalloran, Michael Neuling, Michael Ellerman,
	Peter Zijlstra, Jordan Niethe, Anton Blanchard, LKML, Ingo Molnar,
	Nick Piggin, linuxppc-dev, Valentin Schneider
In-Reply-To: <20200722065640.GE31038@in.ibm.com>

* Gautham R Shenoy <ego@linux.vnet.ibm.com> [2020-07-22 12:26:40]:

> Hello Srikar,
> 
> On Tue, Jul 21, 2020 at 05:08:10PM +0530, Srikar Dronamraju wrote:
> > Currently "CACHE" domain happens to be the 2nd sched domain as per
> > powerpc_topology. This domain will collapse if cpumask of l2-cache is
> > same as SMT domain. However we could generalize this domain such that it
> > could mean either be a "CACHE" domain or a "BIGCORE" domain.
> > 
> > While setting up the "CACHE" domain, check if shared_cache is already
> > set.
> > @@ -1339,14 +1345,20 @@ void start_secondary(void *unused)
> >  	/* Update topology CPU masks */
> >  	add_cpu_to_masks(cpu);
> > 
> > -	if (has_big_cores)
> > -		sibling_mask = cpu_smallcore_mask;
> >  	/*
> >  	 * Check for any shared caches. Note that this must be done on a
> >  	 * per-core basis because one core in the pair might be disabled.
> >  	 */
> > -	if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
> > -		shared_caches = true;
> > +	if (!shared_caches) {
> > +		struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> > +		struct cpumask *mask = cpu_l2_cache_mask(cpu);
> > +
> > +		if (has_big_cores)
> > +			sibling_mask = cpu_smallcore_mask;
> > +
> > +		if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
> > +			shared_caches = true;
> 
> At the risk of repeating my comment to the v1 version of the patch, we
> have shared caches only l2_cache_mask(cpu) is a strict superset of
> sibling_mask(cpu).
> 
> "cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu))" does not
> capture this.

Why would it not? We are setting shared_caches if and only if l2_cache_mask
is a strict superset of sibling/smallcore mask.

> Could we please use
> 
>       if (!cpumask_equal(sibling_mask(cpu), mask) &&
>       	  cpumask_subset(sibling_mask(cpu), mask) {
>       }
> 

Scheduler mandates that two cpumasks for the same CPU would either have to
be equal or one of them has to be a strict superset of the other. If not the
scheduler would mark our domains as broken. That being the case,
cpumask_weight will correctly capture what we are looking for. That said
your condition is also correct but slightly heavier and doesn't provide us
with any more information or correctness.

> 
> Otherwise the patch looks good to me.
> 
> --
> Thanks and Regards
> gautham.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox