LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v2] perf vendor events power10: Add metric events json file for power10 platform
From: Michael Ellerman @ 2021-10-25  3:23 UTC (permalink / raw)
  To: Paul A. Clarke, Kajol Jain
  Cc: maddy, rnsastry, linuxppc-dev, linux-kernel, acme,
	linux-perf-users, atrajeev, jolsa
In-Reply-To: <20211022144910.GC104437@li-24c3614c-2adc-11b2-a85c-85f334518bdb.ibm.com>

"Paul A. Clarke" <pc@us.ibm.com> writes:
> Thanks for the changes!
> More nits below (many left over from prior review)...
>
> On Fri, Oct 22, 2021 at 11:55:05AM +0530, Kajol Jain wrote:
>> Add pmu metric json file for power10 platform.
>> 
>> Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
>> ---
>> Changelog v1 -> v2:
>> - Did some nit changes in BriefDescription field
>>   as suggested by Paul A. Clarke
>> 
>> - Link to the v1 patch: https://lkml.org/lkml/2021/10/6/131
>> 
>>  .../arch/powerpc/power10/metrics.json         | 676 ++++++++++++++++++
>>  1 file changed, 676 insertions(+)
>>  create mode 100644 tools/perf/pmu-events/arch/powerpc/power10/metrics.json
>> 
>> diff --git a/tools/perf/pmu-events/arch/powerpc/power10/metrics.json b/tools/perf/pmu-events/arch/powerpc/power10/metrics.json
>> new file mode 100644
>> index 000000000000..8adab5cd9934
>> --- /dev/null
>> +++ b/tools/perf/pmu-events/arch/powerpc/power10/metrics.json
>> @@ -0,0 +1,676 @@
>> +[
>> +    {
>> +        "BriefDescription": "Percentage of cycles that are run cycles",
>> +        "MetricExpr": "PM_RUN_CYC / PM_CYC * 100",
>> +        "MetricGroup": "General",
>> +        "MetricName": "RUN_CYCLES_RATE",
>> +        "ScaleUnit": "1%"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per completed instruction",
>> +        "MetricExpr": "PM_CYC / PM_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "CYCLES_PER_INSTRUCTION"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled for any reason",
>> +        "MetricExpr": "PM_DISP_STALL_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled because there was a flush",
>> +        "MetricExpr": "PM_DISP_STALL_FLUSH / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_FLUSH_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled because the MMU was handling a translation miss",
>> +        "MetricExpr": "PM_DISP_STALL_TRANSLATION / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_TRANSLATION_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled waiting to resolve an instruction ERAT miss",
>> +        "MetricExpr": "PM_DISP_STALL_IERAT_ONLY_MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_IERAT_ONLY_MISS_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled waiting to resolve an instruction TLB miss",
>> +        "MetricExpr": "PM_DISP_STALL_ITLB_MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_ITLB_MISS_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled due to an icache miss",
>> +        "MetricExpr": "PM_DISP_STALL_IC_MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_IC_MISS_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while the instruction was fetched from the local L2",
>> +        "MetricExpr": "PM_DISP_STALL_IC_L2 / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_IC_L2_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while the instruction was fetched from the local L3",
>> +        "MetricExpr": "PM_DISP_STALL_IC_L3 / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_IC_L3_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while the instruction was fetched from any source beyond the local L3",
>> +        "MetricExpr": "PM_DISP_STALL_IC_L3MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_IC_L3MISS_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled due to an icache miss after a branch mispredict",
>> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_ICMISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_BR_MPRED_ICMISS_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while instruction was fetched from the local L2 after suffering a branch mispredict",
>> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_IC_L2 / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_BR_MPRED_IC_L2_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while instruction was fetched from the local L3 after suffering a branch mispredict",
>> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_IC_L3 / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_BR_MPRED_IC_L3_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while instruction was fetched from any source beyond the local L3 after suffering a branch mispredict",
>> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_IC_L3MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_BR_MPRED_IC_L3MISS_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled due to a branch mispredict",
>> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_BR_MPRED_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch for any reason",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_HELD_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because of a synchronizing instruction that requires the ICT to be empty before dispatch",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_SYNC_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISP_HELD_STALL_SYNC_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch while waiting on the scoreboard",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_SCOREBOARD_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISP_HELD_STALL_SCOREBOARD_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch due to issue queue full",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_ISSQ_FULL_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISP_HELD_STALL_ISSQ_FULL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because the mapper/SRB was full",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_RENAME_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_HELD_RENAME_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because the STF mapper/SRB was full",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_STF_MAPPER_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_HELD_STF_MAPPER_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because the XVFC mapper/SRB was full",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_XVFC_MAPPER_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_HELD_XVFC_MAPPER_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch for any other reason",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_OTHER_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_HELD_OTHER_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction has been dispatched but not issued for any reason",
>> +        "MetricExpr": "PM_ISSUE_STALL / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "ISSUE_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting to be finished in one of the execution units",
>> +        "MetricExpr": "PM_EXEC_STALL / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "EXECUTION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction spent executing an NTC instruction that gets flushed some time after dispatch",
>> +        "MetricExpr": "PM_EXEC_STALL_NTC_FLUSH / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "NTC_FLUSH_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTF instruction finishes at dispatch",
>> +        "MetricExpr": "PM_EXEC_STALL_FIN_AT_DISP / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "FIN_AT_DISP_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing in the branch unit",
>> +        "MetricExpr": "PM_EXEC_STALL_BRU / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "BRU_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a simple fixed point instruction that is executing in the LSU",
>> +        "MetricExpr": "PM_EXEC_STALL_SIMPLE_FX / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "SIMPLE_FX_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing in the VSU",
>> +        "MetricExpr": "PM_EXEC_STALL_VSU / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "VSU_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting to be finished in one of the execution units",
>> +        "MetricExpr": "PM_EXEC_STALL_TRANSLATION / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "TRANSLATION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a load or store that suffered a translation miss",
>> +        "MetricExpr": "PM_EXEC_STALL_DERAT_ONLY_MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DERAT_ONLY_MISS_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is recovering from a TLB miss",
>> +        "MetricExpr": "PM_EXEC_STALL_DERAT_DTLB_MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DERAT_DTLB_MISS_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing in the LSU",
>> +        "MetricExpr": "PM_EXEC_STALL_LSU / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "LSU_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a load that is executing in the LSU",
>> +        "MetricExpr": "PM_EXEC_STALL_LOAD / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "LOAD_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from either the local L2 or local L3",
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L2L3 / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_L2L3_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from either the local L2 or local L3, with an RC dispatch conflict",
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L2L3_CONFLICT / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_L2L3_CONFLICT_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from either the local L2 or local L3, without an RC dispatch conflict",
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L2L3_NOCONFLICT / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_L2L3_NOCONFLICT_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a source beyond the local L2 and local L3",
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L3MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_L3MISS_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a neighbor chiplet's L2 or L3 in the same chip",
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L21_L31 / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_L21_L31_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from L4, local memory or OpenCapp chip",
>
> What is "OpenCapp"?  Is is different from OpenCAPI?
>
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_LMEM / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_LMEM_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a remote chip (cache, L4, memory or OpenCapp) in the same group",
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_OFF_CHIP / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_OFF_CHIP_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a distant chip (cache, L4, memory or OpenCapp chip)",
>> +        "MetricExpr": "PM_EXEC_STALL_DMISS_OFF_NODE / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DMISS_OFF_NODE_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing a TLBIEL instruction",
>> +        "MetricExpr": "PM_EXEC_STALL_TLBIEL / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "TLBIEL_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is finishing a load after its data has been reloaded from a data source beyond the local L1, OR when the LSU is processing an L1-hit, OR when the NTF instruction merged with another load in the LMQ",
>> +        "MetricExpr": "PM_EXEC_STALL_LOAD_FINISH / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "LOAD_FINISH_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a store that is executing in the LSU",
>> +        "MetricExpr": "PM_EXEC_STALL_STORE / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "STORE_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is in the store unit outside of handling store misses or other special store operations",
>
> Is "store unit" not the same as "LSU" ?  Use "LSU" uniformly if appropriate:
> s/store unit/LSU/
>
>> +        "MetricExpr": "PM_EXEC_STALL_STORE_PIPE / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "STORE_PIPE_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a store whose cache line was not resident in the L1 and had to wait for allocation of the missing line into the L1",
>> +        "MetricExpr": "PM_EXEC_STALL_STORE_MISS / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "STORE_MISS_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a TLBIE instruction waiting for a response from the L2",
>> +        "MetricExpr": "PM_EXEC_STALL_TLBIE / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "TLBIE_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing a PTESYNC instruction",
>> +        "MetricExpr": "PM_EXEC_STALL_PTESYNC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "PTESYNC_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction cannot complete because the thread was blocked",
>> +        "MetricExpr": "PM_CMPL_STALL / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "COMPLETION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction cannot complete because it was interrupted by ANY exception",
>> +        "MetricExpr": "PM_CMPL_STALL_EXCEPTION / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "EXCEPTION_COMPLETION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is stuck at finish waiting for the non-speculative finish of either a STCX instruction waiting for its result or a load waiting for non-critical sectors of data and ECC",
>> +        "MetricExpr": "PM_CMPL_STALL_MEM_ECC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "MEM_ECC_COMPLETION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction cannot complete the instruction is a STCX instruction waiting for resolution from the nest",
>
> Need to reword this, I think.  I propose "Average cycles per instruction
> when the NTC instruction is a STCX instruction waiting for resolution
> from the nest", which follows the form used by HWSYNC_COMPLETION_STALL_CPI,
> below.
>
>> +        "MetricExpr": "PM_CMPL_STALL_STCX / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "STCX_COMPLETION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a LWSYNC instruction waiting to complete",
>> +        "MetricExpr": "PM_CMPL_STALL_LWSYNC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "LWSYNC_COMPLETION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a HWSYNC instruction stuck at finish waiting for a response from the L2",
>> +        "MetricExpr": "PM_CMPL_STALL_HWSYNC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "HWSYNC_COMPLETION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction required special handling before completion",
>> +        "MetricExpr": "PM_CMPL_STALL_SPECIAL / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "SPECIAL_COMPLETION_STALL_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled because fetch was being held, so there was nothing in the pipeline for this thread",
>> +        "MetricExpr": "PM_DISP_STALL_FETCH / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_FETCH_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because of power management",
>> +        "MetricExpr": "PM_DISP_STALL_HELD_HALT_CYC / PM_RUN_INST_CMPL",
>> +        "MetricGroup": "CPI",
>> +        "MetricName": "DISPATCHED_HELD_HALT_CPI"
>> +    },
>> +    {
>> +        "BriefDescription": "Percentage of flushes per completed run instruction",
>
> s/per completed run instruction/per instruction/

I'm not sure we want to drop "completed" from this and all the following
descriptions.

There is a meaningful distinction between completed and dispatched
instructions, I think it's useful to be explicit about which the event
is counting.

I agree dropping "run" is a good idea, most people won't understand that
"run" means "non-idle", and I think don't expect idle instructions to be
counted anyway.

...
>
>> +        "MetricExpr": "PM_RUN_INST_CMPL / PM_RUN_CYC",
>> +        "MetricGroup": "General",
>> +        "MetricName": "RUN_IPC"
>> +    },
>> +    {
>> +        "BriefDescription": "Average number of instructions completed per instruction group",
>
> s/completed//

And here the meaning is different if you drop "completed".

cheers

^ permalink raw reply

* [PATCH] mm/migrate.c: Remove MIGRATE_PFN_LOCKED
From: Alistair Popple @ 2021-10-25  4:16 UTC (permalink / raw)
  To: akpm
  Cc: rcampbell, amd-gfx, nouveau, Felix.Kuehling, Alistair Popple,
	linux-kernel, kvm-ppc, linux-mm, jglisse, dri-devel, ziy,
	jhubbard, alexander.deucher, linuxppc-dev, hch, bskeggs

MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a
source page was already locked during migrate_vma_collect(). If it
wasn't then the a second attempt is made to lock the page. However if
the first attempt failed it's unlikely a second attempt will succeed,
and the retry adds complexity. So clean this up by removing the retry
and MIGRATE_PFN_LOCKED flag.

Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag
set, but nothing actually checks that.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 Documentation/vm/hmm.rst                 |   2 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c       |   4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |   2 -
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |   4 +-
 include/linux/migrate.h                  |   1 -
 lib/test_hmm.c                           |   5 +-
 mm/migrate.c                             | 145 +++++------------------
 7 files changed, 35 insertions(+), 128 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index a14c2938e7af..f2a59ed82ed3 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -360,7 +360,7 @@ between device driver specific code and shared common code:
    system memory page, locks the page with ``lock_page()``, and fills in the
    ``dst`` array entry with::
 
-     dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+     dst[i] = migrate_pfn(page_to_pfn(dpage));
 
    Now that the driver knows that this page is being migrated, it can
    invalidate device private MMU mappings and copy device private memory
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index a7061ee3b157..28c436df9935 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -560,7 +560,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma,
 				  gpa, 0, page_shift);
 
 	if (ret == U_SUCCESS)
-		*mig.dst = migrate_pfn(pfn) | MIGRATE_PFN_LOCKED;
+		*mig.dst = migrate_pfn(pfn);
 	else {
 		unlock_page(dpage);
 		__free_page(dpage);
@@ -774,7 +774,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma,
 		}
 	}
 
-	*mig.dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+	*mig.dst = migrate_pfn(page_to_pfn(dpage));
 	migrate_vma_pages(&mig);
 out_finalize:
 	migrate_vma_finalize(&mig);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 4a16e3c257b9..41d9417f182b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -300,7 +300,6 @@ svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct svm_range *prange,
 			migrate->dst[i] = svm_migrate_addr_to_pfn(adev, dst[i]);
 			svm_migrate_get_vram_page(prange, migrate->dst[i]);
 			migrate->dst[i] = migrate_pfn(migrate->dst[i]);
-			migrate->dst[i] |= MIGRATE_PFN_LOCKED;
 			src[i] = dma_map_page(dev, spage, 0, PAGE_SIZE,
 					      DMA_TO_DEVICE);
 			r = dma_mapping_error(dev, src[i]);
@@ -580,7 +579,6 @@ svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange,
 			      dst[i] >> PAGE_SHIFT, page_to_pfn(dpage));
 
 		migrate->dst[i] = migrate_pfn(page_to_pfn(dpage));
-		migrate->dst[i] |= MIGRATE_PFN_LOCKED;
 		j++;
 	}
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..3828aafd3ac4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -166,7 +166,7 @@ static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
 		goto error_dma_unmap;
 	mutex_unlock(&svmm->mutex);
 
-	args->dst[0] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+	args->dst[0] = migrate_pfn(page_to_pfn(dpage));
 	return 0;
 
 error_dma_unmap:
@@ -602,7 +602,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 		((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
 	if (src & MIGRATE_PFN_WRITE)
 		*pfn |= NVIF_VMM_PFNMAP_V0_W;
-	return migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+	return migrate_pfn(page_to_pfn(dpage));
 
 out_dma_unmap:
 	dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index c8077e936691..479b861ae490 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -119,7 +119,6 @@ static inline int migrate_misplaced_page(struct page *page,
  */
 #define MIGRATE_PFN_VALID	(1UL << 0)
 #define MIGRATE_PFN_MIGRATE	(1UL << 1)
-#define MIGRATE_PFN_LOCKED	(1UL << 2)
 #define MIGRATE_PFN_WRITE	(1UL << 3)
 #define MIGRATE_PFN_SHIFT	6
 
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index c259842f6d44..e2ce8f9b7605 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -613,8 +613,7 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 		 */
 		rpage->zone_device_data = dmirror;
 
-		*dst = migrate_pfn(page_to_pfn(dpage)) |
-			    MIGRATE_PFN_LOCKED;
+		*dst = migrate_pfn(page_to_pfn(dpage));
 		if ((*src & MIGRATE_PFN_WRITE) ||
 		    (!spage && args->vma->vm_flags & VM_WRITE))
 			*dst |= MIGRATE_PFN_WRITE;
@@ -1137,7 +1136,7 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
 		lock_page(dpage);
 		xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
 		copy_highpage(dpage, spage);
-		*dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+		*dst = migrate_pfn(page_to_pfn(dpage));
 		if (*src & MIGRATE_PFN_WRITE)
 			*dst |= MIGRATE_PFN_WRITE;
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index a6a7743ee98f..915e969811d0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2369,7 +2369,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		 * can't be dropped from it).
 		 */
 		get_page(page);
-		migrate->cpages++;
 
 		/*
 		 * Optimize for the common case where page is only mapped once
@@ -2379,7 +2378,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		if (trylock_page(page)) {
 			pte_t swp_pte;
 
-			mpfn |= MIGRATE_PFN_LOCKED;
+			migrate->cpages++;
 			ptep_get_and_clear(mm, addr, ptep);
 
 			/* Setup special migration page table entry */
@@ -2413,6 +2412,9 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 
 			if (pte_present(pte))
 				unmapped++;
+		} else {
+			put_page(page);
+			mpfn = 0;
 		}
 
 next:
@@ -2517,15 +2519,17 @@ static bool migrate_vma_check_page(struct page *page)
 }
 
 /*
- * migrate_vma_prepare() - lock pages and isolate them from the lru
+ * migrate_vma_unmap() - replace page mapping with special migration pte entry
  * @migrate: migrate struct containing all migration information
  *
- * This locks pages that have been collected by migrate_vma_collect(). Once each
- * page is locked it is isolated from the lru (for non-device pages). Finally,
- * the ref taken by migrate_vma_collect() is dropped, as locked pages cannot be
- * migrated by concurrent kernel threads.
+ * Isolate pages from the LRU and replace mappings (CPU page table pte) with a
+ * special migration pte entry and check if it has been pinned. Pinned pages are
+ * restored because we cannot migrate them.
+ *
+ * This is the last step before we call the device driver callback to allocate
+ * destination memory and copy contents of original page over to new page.
  */
-static void migrate_vma_prepare(struct migrate_vma *migrate)
+static void migrate_vma_unmap(struct migrate_vma *migrate)
 {
 	const unsigned long npages = migrate->npages;
 	const unsigned long start = migrate->start;
@@ -2534,32 +2538,12 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 
 	lru_add_drain();
 
-	for (i = 0; (i < npages) && migrate->cpages; i++) {
+	for (i = 0; i < npages; i++) {
 		struct page *page = migrate_pfn_to_page(migrate->src[i]);
-		bool remap = true;
 
 		if (!page)
 			continue;
 
-		if (!(migrate->src[i] & MIGRATE_PFN_LOCKED)) {
-			/*
-			 * Because we are migrating several pages there can be
-			 * a deadlock between 2 concurrent migration where each
-			 * are waiting on each other page lock.
-			 *
-			 * Make migrate_vma() a best effort thing and backoff
-			 * for any page we can not lock right away.
-			 */
-			if (!trylock_page(page)) {
-				migrate->src[i] = 0;
-				migrate->cpages--;
-				put_page(page);
-				continue;
-			}
-			remap = false;
-			migrate->src[i] |= MIGRATE_PFN_LOCKED;
-		}
-
 		/* ZONE_DEVICE pages are not on LRU */
 		if (!is_zone_device_page(page)) {
 			if (!PageLRU(page) && allow_drain) {
@@ -2569,16 +2553,9 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 			}
 
 			if (isolate_lru_page(page)) {
-				if (remap) {
-					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-					migrate->cpages--;
-					restore++;
-				} else {
-					migrate->src[i] = 0;
-					unlock_page(page);
-					migrate->cpages--;
-					put_page(page);
-				}
+				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+				migrate->cpages--;
+				restore++;
 				continue;
 			}
 
@@ -2586,80 +2563,20 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 			put_page(page);
 		}
 
-		if (!migrate_vma_check_page(page)) {
-			if (remap) {
-				migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-				migrate->cpages--;
-				restore++;
-
-				if (!is_zone_device_page(page)) {
-					get_page(page);
-					putback_lru_page(page);
-				}
-			} else {
-				migrate->src[i] = 0;
-				unlock_page(page);
-				migrate->cpages--;
+		if (page_mapped(page))
+			try_to_migrate(page, 0);
 
-				if (!is_zone_device_page(page))
-					putback_lru_page(page);
-				else
-					put_page(page);
+		if (page_mapped(page) || !migrate_vma_check_page(page)) {
+			if (!is_zone_device_page(page)) {
+				get_page(page);
+				putback_lru_page(page);
 			}
-		}
-	}
-
-	for (i = 0, addr = start; i < npages && restore; i++, addr += PAGE_SIZE) {
-		struct page *page = migrate_pfn_to_page(migrate->src[i]);
-
-		if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE))
-			continue;
 
-		remove_migration_pte(page, migrate->vma, addr, page);
-
-		migrate->src[i] = 0;
-		unlock_page(page);
-		put_page(page);
-		restore--;
-	}
-}
-
-/*
- * migrate_vma_unmap() - replace page mapping with special migration pte entry
- * @migrate: migrate struct containing all migration information
- *
- * Replace page mapping (CPU page table pte) with a special migration pte entry
- * and check again if it has been pinned. Pinned pages are restored because we
- * cannot migrate them.
- *
- * This is the last step before we call the device driver callback to allocate
- * destination memory and copy contents of original page over to new page.
- */
-static void migrate_vma_unmap(struct migrate_vma *migrate)
-{
-	const unsigned long npages = migrate->npages;
-	const unsigned long start = migrate->start;
-	unsigned long addr, i, restore = 0;
-
-	for (i = 0; i < npages; i++) {
-		struct page *page = migrate_pfn_to_page(migrate->src[i]);
-
-		if (!page || !(migrate->src[i] & MIGRATE_PFN_MIGRATE))
+			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+			migrate->cpages--;
+			restore++;
 			continue;
-
-		if (page_mapped(page)) {
-			try_to_migrate(page, 0);
-			if (page_mapped(page))
-				goto restore;
 		}
-
-		if (migrate_vma_check_page(page))
-			continue;
-
-restore:
-		migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-		migrate->cpages--;
-		restore++;
 	}
 
 	for (addr = start, i = 0; i < npages && restore; addr += PAGE_SIZE, i++) {
@@ -2672,12 +2589,8 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
 
 		migrate->src[i] = 0;
 		unlock_page(page);
+		put_page(page);
 		restore--;
-
-		if (is_zone_device_page(page))
-			put_page(page);
-		else
-			putback_lru_page(page);
 	}
 }
 
@@ -2700,8 +2613,8 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
  * it for all those entries (ie with MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE
  * flag set).  Once these are allocated and copied, the caller must update each
  * corresponding entry in the dst array with the pfn value of the destination
- * page and with the MIGRATE_PFN_VALID and MIGRATE_PFN_LOCKED flags set
- * (destination pages must have their struct pages locked, via lock_page()).
+ * page and with MIGRATE_PFN_VALID. Destination pages must be locked via
+ * lock_page().
  *
  * Note that the caller does not have to migrate all the pages that are marked
  * with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration from
@@ -2770,8 +2683,6 @@ int migrate_vma_setup(struct migrate_vma *args)
 
 	migrate_vma_collect(args);
 
-	if (args->cpages)
-		migrate_vma_prepare(args);
 	if (args->cpages)
 		migrate_vma_unmap(args);
 
-- 
2.30.2


^ permalink raw reply related

* [PATCH] powerpc/bpf: fix write protecting JIT code
From: Hari Bathini @ 2021-10-25  5:56 UTC (permalink / raw)
  To: naveen.n.rao, jniethe5, christophe.leroy, mpe, ast, daniel
  Cc: songliubraving, netdev, john.fastabend, andrii, stable, kpsingh,
	paulus, yhs, bpf, linuxppc-dev, kafai, Hari Bathini

Running program with bpf-to-bpf function calls results in data access
exception (0x300) with the below call trace:

    [c000000000113f28] bpf_int_jit_compile+0x238/0x750 (unreliable)
    [c00000000037d2f8] bpf_check+0x2008/0x2710
    [c000000000360050] bpf_prog_load+0xb00/0x13a0
    [c000000000361d94] __sys_bpf+0x6f4/0x27c0
    [c000000000363f0c] sys_bpf+0x2c/0x40
    [c000000000032434] system_call_exception+0x164/0x330
    [c00000000000c1e8] system_call_vectored_common+0xe8/0x278

as bpf_int_jit_compile() tries writing to write protected JIT code
location during the extra pass.

Fix it by holding off write protection of JIT code until the extra
pass, where branch target addresses fixup happens.

Cc: stable@vger.kernel.org
Fixes: 62e3d4210ac9 ("powerpc/bpf: Write protect JIT code")
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
---
 arch/powerpc/net/bpf_jit_comp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index fcbf7a917c56..90ce75f0f1e2 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -241,8 +241,8 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *fp)
 	fp->jited_len = alloclen;
 
 	bpf_flush_icache(bpf_hdr, (u8 *)bpf_hdr + (bpf_hdr->pages * PAGE_SIZE));
-	bpf_jit_binary_lock_ro(bpf_hdr);
 	if (!fp->is_func || extra_pass) {
+		bpf_jit_binary_lock_ro(bpf_hdr);
 		bpf_prog_fill_jited_linfo(fp, addrs);
 out_addrs:
 		kfree(addrs);
-- 
2.31.1


^ permalink raw reply related

* Re: [PATCH] powerpc/bpf: fix write protecting JIT code
From: Naveen N. Rao @ 2021-10-25  6:15 UTC (permalink / raw)
  To: ast, christophe.leroy, daniel, Hari Bathini, jniethe5, mpe
  Cc: songliubraving, netdev, john.fastabend, kpsingh, stable, andrii,
	paulus, yhs, bpf, linuxppc-dev, kafai
In-Reply-To: <20211025055649.114728-1-hbathini@linux.ibm.com>

Hari Bathini wrote:
> Running program with bpf-to-bpf function calls results in data access
> exception (0x300) with the below call trace:
> 
>     [c000000000113f28] bpf_int_jit_compile+0x238/0x750 (unreliable)
>     [c00000000037d2f8] bpf_check+0x2008/0x2710
>     [c000000000360050] bpf_prog_load+0xb00/0x13a0
>     [c000000000361d94] __sys_bpf+0x6f4/0x27c0
>     [c000000000363f0c] sys_bpf+0x2c/0x40
>     [c000000000032434] system_call_exception+0x164/0x330
>     [c00000000000c1e8] system_call_vectored_common+0xe8/0x278
> 
> as bpf_int_jit_compile() tries writing to write protected JIT code
> location during the extra pass.
> 
> Fix it by holding off write protection of JIT code until the extra
> pass, where branch target addresses fixup happens.
> 
> Cc: stable@vger.kernel.org
> Fixes: 62e3d4210ac9 ("powerpc/bpf: Write protect JIT code")
> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
> ---
>  arch/powerpc/net/bpf_jit_comp.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Thanks for the fix!

Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>


^ permalink raw reply

* [PATCH] macintosh/via-pmu-led: make disk activity usage a parameter.
From: Hill Ma @ 2021-10-25  7:25 UTC (permalink / raw)
  To: benh, linuxppc-dev; +Cc: Hill Ma, linux-kernel, linux-doc

Whether to use the LED as a disk activity is a user preference.
Some like this usage while others find the LED too bright. So it
might be a good idea to make this choice a runtime parameter rather
than compile-time config.

The default is set to disabled as OS X does not use the LED as a
disk activity indicator.

Signed-off-by: Hill Ma <maahiuzeon@gmail.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  6 ++++++
 drivers/macintosh/Kconfig                       | 10 ----------
 drivers/macintosh/via-pmu-led.c                 | 11 ++++++++---
 3 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 43dc35fe5bc0..a656a51ba0a8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -250,6 +250,12 @@
 			Use timer override. For some broken Nvidia NF5 boards
 			that require a timer override, but don't have HPET
 
+	adb_pmu_led_disk [PPC]
+			Use front LED as disk LED by default. Only applies to
+			PowerBook, iBook, PowerMac 7,2/7,3.
+			Format: <bool>  (1/Y/y=enable, 0/N/n=disable)
+			Default: disabled
+
 	add_efi_memmap	[EFI; X86] Include EFI memory map in
 			kernel's map of available physical RAM.
 
diff --git a/drivers/macintosh/Kconfig b/drivers/macintosh/Kconfig
index 5cdc361da37c..243215de563c 100644
--- a/drivers/macintosh/Kconfig
+++ b/drivers/macintosh/Kconfig
@@ -78,16 +78,6 @@ config ADB_PMU_LED
 	  behaviour of the old CONFIG_BLK_DEV_IDE_PMAC_BLINK, select this
 	  and the disk LED trigger and configure appropriately through sysfs.
 
-config ADB_PMU_LED_DISK
-	bool "Use front LED as DISK LED by default"
-	depends on ADB_PMU_LED
-	depends on LEDS_CLASS
-	select LEDS_TRIGGERS
-	select LEDS_TRIGGER_DISK
-	help
-	  This option makes the front LED default to the disk trigger
-	  so that it blinks on disk activity.
-
 config PMAC_SMU
 	bool "Support for SMU  based PowerMacs"
 	depends on PPC_PMAC64
diff --git a/drivers/macintosh/via-pmu-led.c b/drivers/macintosh/via-pmu-led.c
index ae067ab2373d..838dcf98f82e 100644
--- a/drivers/macintosh/via-pmu-led.c
+++ b/drivers/macintosh/via-pmu-led.c
@@ -25,6 +25,7 @@
 #include <linux/leds.h>
 #include <linux/adb.h>
 #include <linux/pmu.h>
+#include <linux/moduleparam.h>
 #include <asm/prom.h>
 
 static spinlock_t pmu_blink_lock;
@@ -71,11 +72,10 @@ static void pmu_led_set(struct led_classdev *led_cdev,
  	spin_unlock_irqrestore(&pmu_blink_lock, flags);
 }
 
+int adb_pmu_led_disk;
+
 static struct led_classdev pmu_led = {
 	.name = "pmu-led::front",
-#ifdef CONFIG_ADB_PMU_LED_DISK
-	.default_trigger = "disk-activity",
-#endif
 	.brightness_set = pmu_led_set,
 };
 
@@ -106,6 +106,9 @@ static int __init via_pmu_led_init(void)
 	}
 	of_node_put(dt);
 
+	if (adb_pmu_led_disk)
+		pmu_led.default_trigger = "disk-activity";
+
 	spin_lock_init(&pmu_blink_lock);
 	/* no outstanding req */
 	pmu_blink_req.complete = 1;
@@ -114,4 +117,6 @@ static int __init via_pmu_led_init(void)
 	return led_classdev_register(NULL, &pmu_led);
 }
 
+core_param(adb_pmu_led_disk, adb_pmu_led_disk, bool, 0644);
+
 late_initcall(via_pmu_led_init);
-- 
2.33.1


^ permalink raw reply related

* [Bug 206669] Little-endian kernel crashing on POWER8 on heavy big-endian PowerKVM load
From: bugzilla-daemon @ 2021-10-25  8:45 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <bug-206669-206035@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=206669

--- Comment #18 from John Paul Adrian Glaubitz (glaubitz@physik.fu-berlin.de) ---
There seems to be a related discussion:

> https://yhbt.net/lore/all/20200831091523.GC29521@kitsune.suse.cz/T/

This suspects 10d91611f426d4bafd2a83d966c36da811b2f7ad to be the cause:

>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10d91611f426d4bafd2a83d966c36da811b2f7ad

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply

* Re: [PATCH] locking: remove spin_lock_flags() etc
From: Peter Zijlstra @ 2021-10-25  9:57 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-ia64, James E.J. Bottomley, Paul Mackerras,
	Alexander Gordeev, Will Deacon, Jonas Bonn, linux-s390,
	Arnd Bergmann, Helge Deller, Christian Borntraeger, Ingo Molnar,
	Waiman Long, Heiko Carstens, Vasily Gorbik, Boqun Feng,
	Stefan Kristiansson, Openrisc, Stafford Horne, Parisc List,
	Linux Kernel Mailing List, linuxppc-dev
In-Reply-To: <CAK8P3a0YjaRS+aUCOKGjsfkR3TM49PrG6U4ftG_Fz+OFuyCb0w@mail.gmail.com>

On Sat, Oct 23, 2021 at 06:04:57PM +0200, Arnd Bergmann wrote:
> On Sat, Oct 23, 2021 at 3:37 AM Waiman Long <longman@redhat.com> wrote:
> >> On 10/22/21 7:59 AM, Arnd Bergmann wrote:
> > > From: Arnd Bergmann <arnd@arndb.de>
> > >
> > > As this is all dead code, just remove it and the helper functions built
> > > around it. For arch/ia64, the inline asm could be cleaned up, but
> > > it seems safer to leave it untouched.
> > >
> > > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> >
> > Does that mean we can also remove the GENERIC_LOCKBREAK config option
> > from the Kconfig files as well?
> 
>  I couldn't figure this out.
> 
> What I see is that the only architectures setting GENERIC_LOCKBREAK are
> nds32, parisc, powerpc, s390, sh and sparc64, while the only architectures
> implementing arch_spin_is_contended() are arm32, csky and ia64.
> 
> The part I don't understand is whether the option actually does anything
> useful any more after commit d89c70356acf ("locking/core: Remove break_lock
> field when CONFIG_GENERIC_LOCKBREAK=y").

Urgh, what a mess.. AFAICT there's still code in
kernel/locking/spinlock.c that relies on it. Specifically when
GENERIC_LOCKBREAK=y we seem to create _lock*() variants that are
basically TaS locks which drop preempt/irq disable while spinning.

Anybody having this on and not having native TaS locks is in for a rude
surprise I suppose... sparc64 being the obvious candidate there :/




^ permalink raw reply

* Re: [PATCH] locking: remove spin_lock_flags() etc
From: Peter Zijlstra @ 2021-10-25 10:06 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-ia64, James E.J. Bottomley, Paul Mackerras,
	Alexander Gordeev, Will Deacon, Jonas Bonn, linux-s390,
	Arnd Bergmann, Helge Deller, Christian Borntraeger, Ingo Molnar,
	Waiman Long, Heiko Carstens, Vasily Gorbik, Boqun Feng,
	Stefan Kristiansson, Openrisc, Stafford Horne, Parisc List,
	Linux Kernel Mailing List, linuxppc-dev
In-Reply-To: <YXZ/iLB7BvZtzDMp@hirez.programming.kicks-ass.net>

On Mon, Oct 25, 2021 at 11:57:28AM +0200, Peter Zijlstra wrote:
> On Sat, Oct 23, 2021 at 06:04:57PM +0200, Arnd Bergmann wrote:
> > On Sat, Oct 23, 2021 at 3:37 AM Waiman Long <longman@redhat.com> wrote:
> > >> On 10/22/21 7:59 AM, Arnd Bergmann wrote:
> > > > From: Arnd Bergmann <arnd@arndb.de>
> > > >
> > > > As this is all dead code, just remove it and the helper functions built
> > > > around it. For arch/ia64, the inline asm could be cleaned up, but
> > > > it seems safer to leave it untouched.
> > > >
> > > > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > >
> > > Does that mean we can also remove the GENERIC_LOCKBREAK config option
> > > from the Kconfig files as well?
> > 
> >  I couldn't figure this out.
> > 
> > What I see is that the only architectures setting GENERIC_LOCKBREAK are
> > nds32, parisc, powerpc, s390, sh and sparc64, while the only architectures
> > implementing arch_spin_is_contended() are arm32, csky and ia64.
> > 
> > The part I don't understand is whether the option actually does anything
> > useful any more after commit d89c70356acf ("locking/core: Remove break_lock
> > field when CONFIG_GENERIC_LOCKBREAK=y").
> 
> Urgh, what a mess.. AFAICT there's still code in
> kernel/locking/spinlock.c that relies on it. Specifically when
> GENERIC_LOCKBREAK=y we seem to create _lock*() variants that are
> basically TaS locks which drop preempt/irq disable while spinning.
> 
> Anybody having this on and not having native TaS locks is in for a rude
> surprise I suppose... sparc64 being the obvious candidate there :/

Something like the *totally*untested* patch below would rip it all out.

---
 arch/ia64/Kconfig                |  3 --
 arch/nds32/Kconfig               |  4 --
 arch/parisc/Kconfig              |  5 ---
 arch/powerpc/Kconfig             |  5 ---
 arch/s390/Kconfig                |  3 --
 arch/sh/Kconfig                  |  4 --
 arch/sparc/Kconfig               |  6 ---
 include/linux/rwlock_api_smp.h   |  4 +-
 include/linux/spinlock_api_smp.h |  4 +-
 kernel/Kconfig.locks             | 26 ++++++------
 kernel/locking/spinlock.c        | 90 ----------------------------------------
 11 files changed, 17 insertions(+), 137 deletions(-)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 1e33666fa679..5ec3abba3c81 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -81,9 +81,6 @@ config MMU
 config STACKTRACE_SUPPORT
 	def_bool y
 
-config GENERIC_LOCKBREAK
-	def_bool n
-
 config GENERIC_CALIBRATE_DELAY
 	bool
 	default y
diff --git a/arch/nds32/Kconfig b/arch/nds32/Kconfig
index aea26e739543..699008dbd6c2 100644
--- a/arch/nds32/Kconfig
+++ b/arch/nds32/Kconfig
@@ -59,10 +59,6 @@ config GENERIC_CSUM
 config GENERIC_HWEIGHT
 	def_bool y
 
-config GENERIC_LOCKBREAK
-	def_bool y
-	depends on PREEMPTION
-
 config STACKTRACE_SUPPORT
 	def_bool y
 
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 27a8b49af11f..afe70bcdde2c 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -86,11 +86,6 @@ config ARCH_DEFCONFIG
 	default "arch/parisc/configs/generic-32bit_defconfig" if !64BIT
 	default "arch/parisc/configs/generic-64bit_defconfig" if 64BIT
 
-config GENERIC_LOCKBREAK
-	bool
-	default y
-	depends on SMP && PREEMPTION
-
 config ARCH_HAS_ILOG2_U32
 	bool
 	default n
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index ba5b66189358..e782c9ea3f81 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -98,11 +98,6 @@ config LOCKDEP_SUPPORT
 	bool
 	default y
 
-config GENERIC_LOCKBREAK
-	bool
-	default y
-	depends on SMP && PREEMPTION
-
 config GENERIC_HWEIGHT
 	bool
 	default y
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index b86de61b8caa..e4ff05f5393b 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -26,9 +26,6 @@ config GENERIC_BUG
 config GENERIC_BUG_RELATIVE_POINTERS
 	def_bool y
 
-config GENERIC_LOCKBREAK
-	def_bool y if PREEMPTION
-
 config PGSTE
 	def_bool y if KVM
 
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 6904f4bdbf00..26f1cf2c69a3 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -86,10 +86,6 @@ config GENERIC_HWEIGHT
 config GENERIC_CALIBRATE_DELAY
 	bool
 
-config GENERIC_LOCKBREAK
-	def_bool y
-	depends on SMP && PREEMPTION
-
 config ARCH_SUSPEND_POSSIBLE
 	def_bool n
 
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index b120ed947f50..e77e7254eaa0 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -246,12 +246,6 @@ config US3_MC
 
 	  If in doubt, say Y, as this information can be very useful.
 
-# Global things across all Sun machines.
-config GENERIC_LOCKBREAK
-	bool
-	default y
-	depends on SPARC64 && SMP && PREEMPTION
-
 config NUMA
 	bool "NUMA support"
 	depends on SPARC64 && SMP
diff --git a/include/linux/rwlock_api_smp.h b/include/linux/rwlock_api_smp.h
index abfb53ab11be..a281d81ef8ee 100644
--- a/include/linux/rwlock_api_smp.h
+++ b/include/linux/rwlock_api_smp.h
@@ -141,7 +141,7 @@ static inline int __raw_write_trylock(rwlock_t *lock)
  * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
  * not re-enabled during lock-acquire (which the preempt-spin-ops do):
  */
-#if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)
+#if defined(CONFIG_DEBUG_LOCK_ALLOC)
 
 static inline void __raw_read_lock(rwlock_t *lock)
 {
@@ -211,7 +211,7 @@ static inline void __raw_write_lock(rwlock_t *lock)
 	LOCK_CONTENDED(lock, do_raw_write_trylock, do_raw_write_lock);
 }
 
-#endif /* !CONFIG_GENERIC_LOCKBREAK || CONFIG_DEBUG_LOCK_ALLOC */
+#endif /* CONFIG_DEBUG_LOCK_ALLOC */
 
 static inline void __raw_write_unlock(rwlock_t *lock)
 {
diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h
index 6b8e1a0b137b..fb437243eb2e 100644
--- a/include/linux/spinlock_api_smp.h
+++ b/include/linux/spinlock_api_smp.h
@@ -99,7 +99,7 @@ static inline int __raw_spin_trylock(raw_spinlock_t *lock)
  * even on CONFIG_PREEMPTION, because lockdep assumes that interrupts are
  * not re-enabled during lock-acquire (which the preempt-spin-ops do):
  */
-#if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)
+#if defined(CONFIG_DEBUG_LOCK_ALLOC)
 
 static inline unsigned long __raw_spin_lock_irqsave(raw_spinlock_t *lock)
 {
@@ -143,7 +143,7 @@ static inline void __raw_spin_lock(raw_spinlock_t *lock)
 	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
 }
 
-#endif /* !CONFIG_GENERIC_LOCKBREAK || CONFIG_DEBUG_LOCK_ALLOC */
+#endif /* CONFIG_DEBUG_LOCK_ALLOC */
 
 static inline void __raw_spin_unlock(raw_spinlock_t *lock)
 {
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 4198f0273ecd..8e0b501189e8 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -93,7 +93,7 @@ config UNINLINE_SPIN_UNLOCK
 
 #
 # lock_* functions are inlined when:
-#   - DEBUG_SPINLOCK=n and GENERIC_LOCKBREAK=n and ARCH_INLINE_*LOCK=y
+#   - DEBUG_SPINLOCK=n and ARCH_INLINE_*LOCK=y
 #
 # trylock_* functions are inlined when:
 #   - DEBUG_SPINLOCK=n and ARCH_INLINE_*LOCK=y
@@ -119,19 +119,19 @@ config INLINE_SPIN_TRYLOCK_BH
 
 config INLINE_SPIN_LOCK
 	def_bool y
-	depends on !GENERIC_LOCKBREAK && ARCH_INLINE_SPIN_LOCK
+	depends on ARCH_INLINE_SPIN_LOCK
 
 config INLINE_SPIN_LOCK_BH
 	def_bool y
-	depends on !GENERIC_LOCKBREAK && ARCH_INLINE_SPIN_LOCK_BH
+	depends on ARCH_INLINE_SPIN_LOCK_BH
 
 config INLINE_SPIN_LOCK_IRQ
 	def_bool y
-	depends on !GENERIC_LOCKBREAK && ARCH_INLINE_SPIN_LOCK_IRQ
+	depends on ARCH_INLINE_SPIN_LOCK_IRQ
 
 config INLINE_SPIN_LOCK_IRQSAVE
 	def_bool y
-	depends on !GENERIC_LOCKBREAK && ARCH_INLINE_SPIN_LOCK_IRQSAVE
+	depends on ARCH_INLINE_SPIN_LOCK_IRQSAVE
 
 config INLINE_SPIN_UNLOCK_BH
 	def_bool y
@@ -152,19 +152,19 @@ config INLINE_READ_TRYLOCK
 
 config INLINE_READ_LOCK
 	def_bool y
-	depends on !GENERIC_LOCKBREAK && ARCH_INLINE_READ_LOCK
+	depends on ARCH_INLINE_READ_LOCK
 
 config INLINE_READ_LOCK_BH
 	def_bool y
-	depends on !GENERIC_LOCKBREAK && ARCH_INLINE_READ_LOCK_BH
+	depends on ARCH_INLINE_READ_LOCK_BH
 
 config INLINE_READ_LOCK_IRQ
 	def_bool y
-	depends on !GENERIC_LOCKBREAK && ARCH_INLINE_READ_LOCK_IRQ
+	depends on ARCH_INLINE_READ_LOCK_IRQ
 
 config INLINE_READ_LOCK_IRQSAVE
 	def_bool y
-	depends on !GENERIC_LOCKBREAK && ARCH_INLINE_READ_LOCK_IRQSAVE
+	depends on ARCH_INLINE_READ_LOCK_IRQSAVE
 
 config INLINE_READ_UNLOCK
 	def_bool y
@@ -189,19 +189,19 @@ config INLINE_WRITE_TRYLOCK
 
 config INLINE_WRITE_LOCK
 	def_bool y
-	depends on !GENERIC_LOCKBREAK && ARCH_INLINE_WRITE_LOCK
+	depends on ARCH_INLINE_WRITE_LOCK
 
 config INLINE_WRITE_LOCK_BH
 	def_bool y
-	depends on !GENERIC_LOCKBREAK && ARCH_INLINE_WRITE_LOCK_BH
+	depends on ARCH_INLINE_WRITE_LOCK_BH
 
 config INLINE_WRITE_LOCK_IRQ
 	def_bool y
-	depends on !GENERIC_LOCKBREAK && ARCH_INLINE_WRITE_LOCK_IRQ
+	depends on ARCH_INLINE_WRITE_LOCK_IRQ
 
 config INLINE_WRITE_LOCK_IRQSAVE
 	def_bool y
-	depends on !GENERIC_LOCKBREAK && ARCH_INLINE_WRITE_LOCK_IRQSAVE
+	depends on ARCH_INLINE_WRITE_LOCK_IRQSAVE
 
 config INLINE_WRITE_UNLOCK
 	def_bool y
diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c
index c5830cfa379a..d94ee95585fc 100644
--- a/kernel/locking/spinlock.c
+++ b/kernel/locking/spinlock.c
@@ -29,19 +29,6 @@ EXPORT_PER_CPU_SYMBOL(__mmiowb_state);
 #endif
 #endif
 
-/*
- * If lockdep is enabled then we use the non-preemption spin-ops
- * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
- * not re-enabled during lock-acquire (which the preempt-spin-ops do):
- */
-#if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)
-/*
- * The __lock_function inlines are taken from
- * spinlock : include/linux/spinlock_api_smp.h
- * rwlock   : include/linux/rwlock_api_smp.h
- */
-#else
-
 /*
  * Some architectures can relax in favour of the CPU owning the lock.
  */
@@ -55,83 +42,6 @@ EXPORT_PER_CPU_SYMBOL(__mmiowb_state);
 # define arch_spin_relax(l)	cpu_relax()
 #endif
 
-/*
- * We build the __lock_function inlines here. They are too large for
- * inlining all over the place, but here is only one user per function
- * which embeds them into the calling _lock_function below.
- *
- * This could be a long-held lock. We both prepare to spin for a long
- * time (making _this_ CPU preemptible if possible), and we also signal
- * towards that other CPU that it should break the lock ASAP.
- */
-#define BUILD_LOCK_OPS(op, locktype)					\
-void __lockfunc __raw_##op##_lock(locktype##_t *lock)			\
-{									\
-	for (;;) {							\
-		preempt_disable();					\
-		if (likely(do_raw_##op##_trylock(lock)))		\
-			break;						\
-		preempt_enable();					\
-									\
-		arch_##op##_relax(&lock->raw_lock);			\
-	}								\
-}									\
-									\
-unsigned long __lockfunc __raw_##op##_lock_irqsave(locktype##_t *lock)	\
-{									\
-	unsigned long flags;						\
-									\
-	for (;;) {							\
-		preempt_disable();					\
-		local_irq_save(flags);					\
-		if (likely(do_raw_##op##_trylock(lock)))		\
-			break;						\
-		local_irq_restore(flags);				\
-		preempt_enable();					\
-									\
-		arch_##op##_relax(&lock->raw_lock);			\
-	}								\
-									\
-	return flags;							\
-}									\
-									\
-void __lockfunc __raw_##op##_lock_irq(locktype##_t *lock)		\
-{									\
-	_raw_##op##_lock_irqsave(lock);					\
-}									\
-									\
-void __lockfunc __raw_##op##_lock_bh(locktype##_t *lock)		\
-{									\
-	unsigned long flags;						\
-									\
-	/*							*/	\
-	/* Careful: we must exclude softirqs too, hence the	*/	\
-	/* irq-disabling. We use the generic preemption-aware	*/	\
-	/* function:						*/	\
-	/**/								\
-	flags = _raw_##op##_lock_irqsave(lock);				\
-	local_bh_disable();						\
-	local_irq_restore(flags);					\
-}									\
-
-/*
- * Build preemption-friendly versions of the following
- * lock-spinning functions:
- *
- *         __[spin|read|write]_lock()
- *         __[spin|read|write]_lock_irq()
- *         __[spin|read|write]_lock_irqsave()
- *         __[spin|read|write]_lock_bh()
- */
-BUILD_LOCK_OPS(spin, raw_spinlock);
-
-#ifndef CONFIG_PREEMPT_RT
-BUILD_LOCK_OPS(read, rwlock);
-BUILD_LOCK_OPS(write, rwlock);
-#endif
-
-#endif
-
 #ifndef CONFIG_INLINE_SPIN_TRYLOCK
 int __lockfunc _raw_spin_trylock(raw_spinlock_t *lock)
 {

^ permalink raw reply related

* [PATCH] selftests/powerpc: Use date instead of EPOCHSECONDS in mitigation-patching.sh
From: Russell Currey @ 2021-10-25 10:24 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Russell Currey

The EPOCHSECONDS environment variable was added in bash 5.0 (released
2019).  Some distributions of the "stable" and "long-term" variety ship
older versions of bash than this, so swap to using the date command
instead.

"%s" was added to coreutils `date` in 1993 so we should be good, but who
knows, it is a GNU extension and not part of the POSIX spec for `date`.

Signed-off-by: Russell Currey <ruscur@russell.cc>
---
 .../testing/selftests/powerpc/security/mitigation-patching.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/powerpc/security/mitigation-patching.sh b/tools/testing/selftests/powerpc/security/mitigation-patching.sh
index 00197acb7ff1..b0b20e0b4e30 100755
--- a/tools/testing/selftests/powerpc/security/mitigation-patching.sh
+++ b/tools/testing/selftests/powerpc/security/mitigation-patching.sh
@@ -13,7 +13,7 @@ function do_one
 
     orig=$(cat "$mitigation")
 
-    start=$EPOCHSECONDS
+    start=$(date +%s)
     now=$start
 
     while [[ $((now-start)) -lt "$TIMEOUT" ]]
@@ -21,7 +21,7 @@ function do_one
         echo 0 > "$mitigation"
         echo 1 > "$mitigation"
 
-        now=$EPOCHSECONDS
+        now=$(date +%s)
     done
 
     echo "$orig" > "$mitigation"
-- 
2.33.1


^ permalink raw reply related

* Linux kernel: powerpc: KVM guest can trigger host crash on Power8
From: Michael Ellerman @ 2021-10-25 11:18 UTC (permalink / raw)
  To: oss-security; +Cc: linuxppc-dev

The Linux kernel for powerpc since v5.2 has a bug which allows a
malicious KVM guest to crash the host, when the host is running on
Power8.

Only machines using Linux as the hypervisor, aka. KVM, powernv or bare
metal, are affected by the bug. Machines running PowerVM are not
affected.

The bug was introduced in:

    10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

Which was first released in v5.2.

The upstream fix is:

  cdeb5d7d890e ("KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest")
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cdeb5d7d890e14f3b70e8087e745c4a6a7d9f337

Which will be included in the v5.16 release.

Note to backporters, the following commits are required:

  73287caa9210ded6066833195f4335f7f688a46b
  ("powerpc64/idle: Fix SP offsets when saving GPRs")

  9b4416c5095c20e110c82ae602c254099b83b72f
  ("KVM: PPC: Book3S HV: Fix stack handling in idle_kvm_start_guest()")

  cdeb5d7d890e14f3b70e8087e745c4a6a7d9f337
  ("KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest")

  496c5fe25c377ddb7815c4ce8ecfb676f051e9b6
  ("powerpc/idle: Don't corrupt back chain when going idle")

I have a test case to trigger the bug, which I can share privately with
anyone who would like to test the fix.

cheers

^ permalink raw reply

* Re: [PATCH v2] perf vendor events power10: Add metric events json file for power10 platform
From: Paul A. Clarke @ 2021-10-25 12:06 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: maddy, rnsastry, Kajol Jain, linuxppc-dev, linux-kernel, acme,
	linux-perf-users, atrajeev, jolsa
In-Reply-To: <87wnm1bxek.fsf@mpe.ellerman.id.au>

On Mon, Oct 25, 2021 at 02:23:15PM +1100, Michael Ellerman wrote:
> "Paul A. Clarke" <pc@us.ibm.com> writes:
> > Thanks for the changes!
> > More nits below (many left over from prior review)...
> >
> > On Fri, Oct 22, 2021 at 11:55:05AM +0530, Kajol Jain wrote:
> >> Add pmu metric json file for power10 platform.
> >> 
> >> Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
> >> ---
> >> Changelog v1 -> v2:
> >> - Did some nit changes in BriefDescription field
> >>   as suggested by Paul A. Clarke
> >> 
> >> - Link to the v1 patch: https://lkml.org/lkml/2021/10/6/131
> >> 
> >>  .../arch/powerpc/power10/metrics.json         | 676 ++++++++++++++++++
> >>  1 file changed, 676 insertions(+)
> >>  create mode 100644 tools/perf/pmu-events/arch/powerpc/power10/metrics.json
> >> 
> >> diff --git a/tools/perf/pmu-events/arch/powerpc/power10/metrics.json b/tools/perf/pmu-events/arch/powerpc/power10/metrics.json
> >> new file mode 100644
> >> index 000000000000..8adab5cd9934
> >> --- /dev/null
> >> +++ b/tools/perf/pmu-events/arch/powerpc/power10/metrics.json
> >> @@ -0,0 +1,676 @@
> >> +[
> >> +    {
> >> +        "BriefDescription": "Percentage of cycles that are run cycles",
> >> +        "MetricExpr": "PM_RUN_CYC / PM_CYC * 100",
> >> +        "MetricGroup": "General",
> >> +        "MetricName": "RUN_CYCLES_RATE",
> >> +        "ScaleUnit": "1%"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per completed instruction",
> >> +        "MetricExpr": "PM_CYC / PM_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "CYCLES_PER_INSTRUCTION"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled for any reason",
> >> +        "MetricExpr": "PM_DISP_STALL_CYC / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled because there was a flush",
> >> +        "MetricExpr": "PM_DISP_STALL_FLUSH / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_FLUSH_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled because the MMU was handling a translation miss",
> >> +        "MetricExpr": "PM_DISP_STALL_TRANSLATION / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_TRANSLATION_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled waiting to resolve an instruction ERAT miss",
> >> +        "MetricExpr": "PM_DISP_STALL_IERAT_ONLY_MISS / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_IERAT_ONLY_MISS_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled waiting to resolve an instruction TLB miss",
> >> +        "MetricExpr": "PM_DISP_STALL_ITLB_MISS / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_ITLB_MISS_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled due to an icache miss",
> >> +        "MetricExpr": "PM_DISP_STALL_IC_MISS / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_IC_MISS_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while the instruction was fetched from the local L2",
> >> +        "MetricExpr": "PM_DISP_STALL_IC_L2 / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_IC_L2_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while the instruction was fetched from the local L3",
> >> +        "MetricExpr": "PM_DISP_STALL_IC_L3 / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_IC_L3_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while the instruction was fetched from any source beyond the local L3",
> >> +        "MetricExpr": "PM_DISP_STALL_IC_L3MISS / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_IC_L3MISS_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled due to an icache miss after a branch mispredict",
> >> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_ICMISS / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_BR_MPRED_ICMISS_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while instruction was fetched from the local L2 after suffering a branch mispredict",
> >> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_IC_L2 / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_BR_MPRED_IC_L2_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while instruction was fetched from the local L3 after suffering a branch mispredict",
> >> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_IC_L3 / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_BR_MPRED_IC_L3_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled while instruction was fetched from any source beyond the local L3 after suffering a branch mispredict",
> >> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED_IC_L3MISS / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_BR_MPRED_IC_L3MISS_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled due to a branch mispredict",
> >> +        "MetricExpr": "PM_DISP_STALL_BR_MPRED / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_BR_MPRED_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch for any reason",
> >> +        "MetricExpr": "PM_DISP_STALL_HELD_CYC / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_HELD_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because of a synchronizing instruction that requires the ICT to be empty before dispatch",
> >> +        "MetricExpr": "PM_DISP_STALL_HELD_SYNC_CYC / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISP_HELD_STALL_SYNC_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch while waiting on the scoreboard",
> >> +        "MetricExpr": "PM_DISP_STALL_HELD_SCOREBOARD_CYC / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISP_HELD_STALL_SCOREBOARD_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch due to issue queue full",
> >> +        "MetricExpr": "PM_DISP_STALL_HELD_ISSQ_FULL_CYC / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISP_HELD_STALL_ISSQ_FULL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because the mapper/SRB was full",
> >> +        "MetricExpr": "PM_DISP_STALL_HELD_RENAME_CYC / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_HELD_RENAME_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because the STF mapper/SRB was full",
> >> +        "MetricExpr": "PM_DISP_STALL_HELD_STF_MAPPER_CYC / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_HELD_STF_MAPPER_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because the XVFC mapper/SRB was full",
> >> +        "MetricExpr": "PM_DISP_STALL_HELD_XVFC_MAPPER_CYC / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_HELD_XVFC_MAPPER_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch for any other reason",
> >> +        "MetricExpr": "PM_DISP_STALL_HELD_OTHER_CYC / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_HELD_OTHER_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction has been dispatched but not issued for any reason",
> >> +        "MetricExpr": "PM_ISSUE_STALL / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "ISSUE_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting to be finished in one of the execution units",
> >> +        "MetricExpr": "PM_EXEC_STALL / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "EXECUTION_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction spent executing an NTC instruction that gets flushed some time after dispatch",
> >> +        "MetricExpr": "PM_EXEC_STALL_NTC_FLUSH / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "NTC_FLUSH_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTF instruction finishes at dispatch",
> >> +        "MetricExpr": "PM_EXEC_STALL_FIN_AT_DISP / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "FIN_AT_DISP_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing in the branch unit",
> >> +        "MetricExpr": "PM_EXEC_STALL_BRU / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "BRU_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a simple fixed point instruction that is executing in the LSU",
> >> +        "MetricExpr": "PM_EXEC_STALL_SIMPLE_FX / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "SIMPLE_FX_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing in the VSU",
> >> +        "MetricExpr": "PM_EXEC_STALL_VSU / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "VSU_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting to be finished in one of the execution units",
> >> +        "MetricExpr": "PM_EXEC_STALL_TRANSLATION / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "TRANSLATION_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a load or store that suffered a translation miss",
> >> +        "MetricExpr": "PM_EXEC_STALL_DERAT_ONLY_MISS / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DERAT_ONLY_MISS_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is recovering from a TLB miss",
> >> +        "MetricExpr": "PM_EXEC_STALL_DERAT_DTLB_MISS / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DERAT_DTLB_MISS_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing in the LSU",
> >> +        "MetricExpr": "PM_EXEC_STALL_LSU / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "LSU_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a load that is executing in the LSU",
> >> +        "MetricExpr": "PM_EXEC_STALL_LOAD / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "LOAD_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from either the local L2 or local L3",
> >> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L2L3 / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DMISS_L2L3_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from either the local L2 or local L3, with an RC dispatch conflict",
> >> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L2L3_CONFLICT / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DMISS_L2L3_CONFLICT_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from either the local L2 or local L3, without an RC dispatch conflict",
> >> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L2L3_NOCONFLICT / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DMISS_L2L3_NOCONFLICT_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a source beyond the local L2 and local L3",
> >> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L3MISS / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DMISS_L3MISS_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a neighbor chiplet's L2 or L3 in the same chip",
> >> +        "MetricExpr": "PM_EXEC_STALL_DMISS_L21_L31 / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DMISS_L21_L31_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from L4, local memory or OpenCapp chip",
> >
> > What is "OpenCapp"?  Is is different from OpenCAPI?
> >
> >> +        "MetricExpr": "PM_EXEC_STALL_DMISS_LMEM / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DMISS_LMEM_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a remote chip (cache, L4, memory or OpenCapp) in the same group",
> >> +        "MetricExpr": "PM_EXEC_STALL_DMISS_OFF_CHIP / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DMISS_OFF_CHIP_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is waiting for a load miss to resolve from a distant chip (cache, L4, memory or OpenCapp chip)",
> >> +        "MetricExpr": "PM_EXEC_STALL_DMISS_OFF_NODE / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DMISS_OFF_NODE_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing a TLBIEL instruction",
> >> +        "MetricExpr": "PM_EXEC_STALL_TLBIEL / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "TLBIEL_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is finishing a load after its data has been reloaded from a data source beyond the local L1, OR when the LSU is processing an L1-hit, OR when the NTF instruction merged with another load in the LMQ",
> >> +        "MetricExpr": "PM_EXEC_STALL_LOAD_FINISH / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "LOAD_FINISH_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a store that is executing in the LSU",
> >> +        "MetricExpr": "PM_EXEC_STALL_STORE / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "STORE_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is in the store unit outside of handling store misses or other special store operations",
> >
> > Is "store unit" not the same as "LSU" ?  Use "LSU" uniformly if appropriate:
> > s/store unit/LSU/
> >
> >> +        "MetricExpr": "PM_EXEC_STALL_STORE_PIPE / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "STORE_PIPE_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a store whose cache line was not resident in the L1 and had to wait for allocation of the missing line into the L1",
> >> +        "MetricExpr": "PM_EXEC_STALL_STORE_MISS / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "STORE_MISS_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a TLBIE instruction waiting for a response from the L2",
> >> +        "MetricExpr": "PM_EXEC_STALL_TLBIE / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "TLBIE_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is executing a PTESYNC instruction",
> >> +        "MetricExpr": "PM_EXEC_STALL_PTESYNC / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "PTESYNC_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction cannot complete because the thread was blocked",
> >> +        "MetricExpr": "PM_CMPL_STALL / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "COMPLETION_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction cannot complete because it was interrupted by ANY exception",
> >> +        "MetricExpr": "PM_CMPL_STALL_EXCEPTION / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "EXCEPTION_COMPLETION_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is stuck at finish waiting for the non-speculative finish of either a STCX instruction waiting for its result or a load waiting for non-critical sectors of data and ECC",
> >> +        "MetricExpr": "PM_CMPL_STALL_MEM_ECC / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "MEM_ECC_COMPLETION_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction cannot complete the instruction is a STCX instruction waiting for resolution from the nest",
> >
> > Need to reword this, I think.  I propose "Average cycles per instruction
> > when the NTC instruction is a STCX instruction waiting for resolution
> > from the nest", which follows the form used by HWSYNC_COMPLETION_STALL_CPI,
> > below.
> >
> >> +        "MetricExpr": "PM_CMPL_STALL_STCX / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "STCX_COMPLETION_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a LWSYNC instruction waiting to complete",
> >> +        "MetricExpr": "PM_CMPL_STALL_LWSYNC / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "LWSYNC_COMPLETION_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction is a HWSYNC instruction stuck at finish waiting for a response from the L2",
> >> +        "MetricExpr": "PM_CMPL_STALL_HWSYNC / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "HWSYNC_COMPLETION_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction required special handling before completion",
> >> +        "MetricExpr": "PM_CMPL_STALL_SPECIAL / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "SPECIAL_COMPLETION_STALL_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when dispatch was stalled because fetch was being held, so there was nothing in the pipeline for this thread",
> >> +        "MetricExpr": "PM_DISP_STALL_FETCH / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_FETCH_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average cycles per instruction when the NTC instruction was held at dispatch because of power management",
> >> +        "MetricExpr": "PM_DISP_STALL_HELD_HALT_CYC / PM_RUN_INST_CMPL",
> >> +        "MetricGroup": "CPI",
> >> +        "MetricName": "DISPATCHED_HELD_HALT_CPI"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Percentage of flushes per completed run instruction",
> >
> > s/per completed run instruction/per instruction/
> 
> I'm not sure we want to drop "completed" from this and all the following
> descriptions.
> 
> There is a meaningful distinction between completed and dispatched
> instructions, I think it's useful to be explicit about which the event
> is counting.
> 
> I agree dropping "run" is a good idea, most people won't understand that
> "run" means "non-idle", and I think don't expect idle instructions to be
> counted anyway.
> 
> ...
> >
> >> +        "MetricExpr": "PM_RUN_INST_CMPL / PM_RUN_CYC",
> >> +        "MetricGroup": "General",
> >> +        "MetricName": "RUN_IPC"
> >> +    },
> >> +    {
> >> +        "BriefDescription": "Average number of instructions completed per instruction group",
> >
> > s/completed//
> 
> And here the meaning is different if you drop "completed".

All fair comments.  I am looking for consistency, but correctness trumps.

Regarding consistency, though, there are lots of occurences like:
|        "BriefDescription": "Average cycles per instruction when dispatch was stalled for any reason",
|        "MetricExpr": "PM_DISP_STALL_CYC / PM_RUN_INST_CMPL",

Can we pick one phrase for all metrics where PM_RUN_INST_CMPL is used,
perhaps?  "completed instructions" ?

PC

^ permalink raw reply

* Re: [PATCH] locking: remove spin_lock_flags() etc
From: Arnd Bergmann @ 2021-10-25 13:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-ia64, James E.J. Bottomley, Paul Mackerras,
	Alexander Gordeev, Will Deacon, Jonas Bonn, linux-s390,
	Arnd Bergmann, Helge Deller, Christian Borntraeger, Ingo Molnar,
	Waiman Long, Heiko Carstens, Vasily Gorbik, Boqun Feng,
	Stefan Kristiansson, Openrisc, Stafford Horne, Parisc List,
	Linux Kernel Mailing List, linuxppc-dev
In-Reply-To: <YXZ/iLB7BvZtzDMp@hirez.programming.kicks-ass.net>

On Mon, Oct 25, 2021 at 11:57 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, Oct 23, 2021 at 06:04:57PM +0200, Arnd Bergmann wrote:
> > On Sat, Oct 23, 2021 at 3:37 AM Waiman Long <longman@redhat.com> wrote:
> > >> On 10/22/21 7:59 AM, Arnd Bergmann wrote:
> > > > From: Arnd Bergmann <arnd@arndb.de>
> > > >
> > > > As this is all dead code, just remove it and the helper functions built
> > > > around it. For arch/ia64, the inline asm could be cleaned up, but
> > > > it seems safer to leave it untouched.
> > > >
> > > > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > >
> > > Does that mean we can also remove the GENERIC_LOCKBREAK config option
> > > from the Kconfig files as well?
> >
> >  I couldn't figure this out.
> >
> > What I see is that the only architectures setting GENERIC_LOCKBREAK are
> > nds32, parisc, powerpc, s390, sh and sparc64, while the only architectures
> > implementing arch_spin_is_contended() are arm32, csky and ia64.
> >
> > The part I don't understand is whether the option actually does anything
> > useful any more after commit d89c70356acf ("locking/core: Remove break_lock
> > field when CONFIG_GENERIC_LOCKBREAK=y").
>
> Urgh, what a mess.. AFAICT there's still code in
> kernel/locking/spinlock.c that relies on it. Specifically when
> GENERIC_LOCKBREAK=y we seem to create _lock*() variants that are
> basically TaS locks which drop preempt/irq disable while spinning.
>
> Anybody having this on and not having native TaS locks is in for a rude
> surprise I suppose... sparc64 being the obvious candidate there :/

Is this a problem on s390 and powerpc, those two being the ones
that matter in practice?

On s390, we pick between the cmpxchg() based directed-yield when
running on virtualized CPUs, and a normal qspinlock when running on a
dedicated CPU.

On PowerPC, we pick at compile-time between either the qspinlock
(default-enabled on Book3S-64, i.e. all server chips) or a ll/sc based
spinlock plus vm_yield() (default on embedded and 32-bit mac).

       Arnd

^ permalink raw reply

* Re: [PATCH] locking: remove spin_lock_flags() etc
From: Peter Zijlstra @ 2021-10-25 14:33 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-ia64, James E.J. Bottomley, Paul Mackerras,
	Alexander Gordeev, Will Deacon, Jonas Bonn, linux-s390,
	Arnd Bergmann, Helge Deller, Christian Borntraeger, Ingo Molnar,
	Waiman Long, Heiko Carstens, Vasily Gorbik, Boqun Feng,
	Stefan Kristiansson, Openrisc, Stafford Horne, Parisc List,
	Linux Kernel Mailing List, linuxppc-dev
In-Reply-To: <CAK8P3a2Luz7sd5cM1OdZhYCs_UPzo+2qVQYSZPfR2QN+0DkyRg@mail.gmail.com>

On Mon, Oct 25, 2021 at 03:06:24PM +0200, Arnd Bergmann wrote:
> On Mon, Oct 25, 2021 at 11:57 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > On Sat, Oct 23, 2021 at 06:04:57PM +0200, Arnd Bergmann wrote:
> > > On Sat, Oct 23, 2021 at 3:37 AM Waiman Long <longman@redhat.com> wrote:
> > > >> On 10/22/21 7:59 AM, Arnd Bergmann wrote:
> > > > > From: Arnd Bergmann <arnd@arndb.de>
> > > > >
> > > > > As this is all dead code, just remove it and the helper functions built
> > > > > around it. For arch/ia64, the inline asm could be cleaned up, but
> > > > > it seems safer to leave it untouched.
> > > > >
> > > > > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > > >
> > > > Does that mean we can also remove the GENERIC_LOCKBREAK config option
> > > > from the Kconfig files as well?
> > >
> > >  I couldn't figure this out.
> > >
> > > What I see is that the only architectures setting GENERIC_LOCKBREAK are
> > > nds32, parisc, powerpc, s390, sh and sparc64, while the only architectures
> > > implementing arch_spin_is_contended() are arm32, csky and ia64.
> > >
> > > The part I don't understand is whether the option actually does anything
> > > useful any more after commit d89c70356acf ("locking/core: Remove break_lock
> > > field when CONFIG_GENERIC_LOCKBREAK=y").
> >
> > Urgh, what a mess.. AFAICT there's still code in
> > kernel/locking/spinlock.c that relies on it. Specifically when
> > GENERIC_LOCKBREAK=y we seem to create _lock*() variants that are
> > basically TaS locks which drop preempt/irq disable while spinning.
> >
> > Anybody having this on and not having native TaS locks is in for a rude
> > surprise I suppose... sparc64 being the obvious candidate there :/
> 
> Is this a problem on s390 and powerpc, those two being the ones
> that matter in practice?
> 
> On s390, we pick between the cmpxchg() based directed-yield when
> running on virtualized CPUs, and a normal qspinlock when running on a
> dedicated CPU.
> 
> On PowerPC, we pick at compile-time between either the qspinlock
> (default-enabled on Book3S-64, i.e. all server chips) or a ll/sc based
> spinlock plus vm_yield() (default on embedded and 32-bit mac).

Urgh, yeah, so this crud undermines the whole point of having a fair
lock. I'm thinking s390 and Power want to have this fixed.

^ permalink raw reply

* Re: [PATCH] powerpc: Enhance pmem DMA bypass handling
From: Brian King @ 2021-10-25 14:40 UTC (permalink / raw)
  To: Alexey Kardashevskiy, linuxppc-dev
In-Reply-To: <3d1dcfa3-23a7-cb7f-8671-2198861987e6@ozlabs.ru>

On 10/23/21 7:18 AM, Alexey Kardashevskiy wrote:
> 
> 
> On 23/10/2021 07:18, Brian King wrote:
>> On 10/22/21 7:24 AM, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 22/10/2021 04:44, Brian King wrote:
>>>> If ibm,pmemory is installed in the system, it can appear anywhere
>>>> in the address space. This patch enhances how we handle DMA for devices when
>>>> ibm,pmemory is present. In the case where we have enough DMA space to
>>>> direct map all of RAM, but not ibm,pmemory, we use direct DMA for
>>>> I/O to RAM and use the default window to dynamically map ibm,pmemory.
>>>> In the case where we only have a single DMA window, this won't work, > so if the window is not big enough to map the entire address range,
>>>> we cannot direct map.
>>>
>>> but we want the pmem range to be mapped into the huge DMA window too if we can, why skip it?
>>
>> This patch should simply do what the comment in this commit mentioned below suggests, which says that
>> ibm,pmemory can appear anywhere in the address space. If the DMA window is large enough
>> to map all of MAX_PHYSMEM_BITS, we will indeed simply do direct DMA for everything,
>> including the pmem. If we do not have a big enough window to do that, we will do
>> direct DMA for DRAM and dynamic mapping for pmem.
> 
> 
> Right, and this is what we do already, do not we? I missing something here.

The upstream code does not work correctly that I can see. If I boot an upstream kernel
with an nvme device and vpmem assigned to the LPAR, and enable dev_dbg in arch/powerpc/platforms/pseries/iommu.c,
I see the following in the logs:

[    2.157549] nvme 0121:50:00.0: ibm,query-pe-dma-windows(53) 500000 8000000 20000121 returned 0
[    2.157561] nvme 0121:50:00.0: Skipping ibm,pmemory
[    2.157567] nvme 0121:50:00.0: can't map partition max 0x8000000000000 with 16777216 65536-sized pages
[    2.170150] nvme 0121:50:00.0: ibm,create-pe-dma-window(54) 500000 8000000 20000121 10 28 returned 0 (liobn = 0x70000121 starting addr = 8000000 0)
[    2.170170] nvme 0121:50:00.0: created tce table LIOBN 0x70000121 for /pci@800000020000121/pci1014,683@0
[    2.356260] nvme 0121:50:00.0: node is /pci@800000020000121/pci1014,683@0

This means we are heading down the leg in enable_ddw where we do not set direct_mapping to true. We use
create the DDW window, but don't do any direct DMA. This is because the window is not large enough to
map 2PB of memory, which is what ddw_memory_hotplug_max returns without my patch. 

With my patch applied, I get this in the logs:

[    2.204866] nvme 0121:50:00.0: ibm,query-pe-dma-windows(53) 500000 8000000 20000121 returned 0
[    2.204875] nvme 0121:50:00.0: Skipping ibm,pmemory
[    2.205058] nvme 0121:50:00.0: ibm,create-pe-dma-window(54) 500000 8000000 20000121 10 21 returned 0 (liobn = 0x70000121 starting addr = 8000000 0)
[    2.205068] nvme 0121:50:00.0: created tce table LIOBN 0x70000121 for /pci@800000020000121/pci1014,683@0
[    2.215898] nvme 0121:50:00.0: iommu: 64-bit OK but direct DMA is limited by 800000200000000


Thanks,

Brian


> 
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/powerpc/platforms/pseries/iommu.c?id=bf6e2d562bbc4d115cf322b0bca57fe5bbd26f48
>>
>>
>> Thanks,
>>
>> Brian
>>
>>
>>>
>>>
>>>>
>>>> Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
>>>> ---
>>>>    arch/powerpc/platforms/pseries/iommu.c | 19 ++++++++++---------
>>>>    1 file changed, 10 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>>>> index 269f61d519c2..d9ae985d10a4 100644
>>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>>> @@ -1092,15 +1092,6 @@ static phys_addr_t ddw_memory_hotplug_max(void)
>>>>        phys_addr_t max_addr = memory_hotplug_max();
>>>>        struct device_node *memory;
>>>>    -    /*
>>>> -     * The "ibm,pmemory" can appear anywhere in the address space.
>>>> -     * Assuming it is still backed by page structs, set the upper limit
>>>> -     * for the huge DMA window as MAX_PHYSMEM_BITS.
>>>> -     */
>>>> -    if (of_find_node_by_type(NULL, "ibm,pmemory"))
>>>> -        return (sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
>>>> -            (phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS);
>>>> -
>>>>        for_each_node_by_type(memory, "memory") {
>>>>            unsigned long start, size;
>>>>            int n_mem_addr_cells, n_mem_size_cells, len;
>>>> @@ -1341,6 +1332,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>>         */
>>>>        len = max_ram_len;
>>>>        if (pmem_present) {
>>>> +        if (default_win_removed) {
>>>> +            /*
>>>> +             * If we only have one DMA window and have pmem present,
>>>> +             * then we need to be able to map the entire address
>>>> +             * range in order to be able to do direct DMA to RAM.
>>>> +             */
>>>> +            len = order_base_2((sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
>>>> +                    (phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS));
>>>> +        }
>>>> +
>>>>            if (query.largest_available_block >=
>>>>                (1ULL << (MAX_PHYSMEM_BITS - page_shift)))
>>>>                len = MAX_PHYSMEM_BITS;
>>>>
>>>
>>
>>
> 


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center


^ permalink raw reply

* Re: [PATCH] locking: remove spin_lock_flags() etc
From: Waiman Long @ 2021-10-25 15:28 UTC (permalink / raw)
  To: Arnd Bergmann, Peter Zijlstra
  Cc: linux-ia64, James E.J. Bottomley, Paul Mackerras,
	Alexander Gordeev, Will Deacon, Jonas Bonn, linux-s390,
	Arnd Bergmann, Helge Deller, Christian Borntraeger, Ingo Molnar,
	Heiko Carstens, Vasily Gorbik, Boqun Feng, Stefan Kristiansson,
	Openrisc, Stafford Horne, Parisc List, Linux Kernel Mailing List,
	linuxppc-dev
In-Reply-To: <CAK8P3a2Luz7sd5cM1OdZhYCs_UPzo+2qVQYSZPfR2QN+0DkyRg@mail.gmail.com>


On 10/25/21 9:06 AM, Arnd Bergmann wrote:
> On Mon, Oct 25, 2021 at 11:57 AM Peter Zijlstra <peterz@infradead.org> wrote:
>> On Sat, Oct 23, 2021 at 06:04:57PM +0200, Arnd Bergmann wrote:
>>> On Sat, Oct 23, 2021 at 3:37 AM Waiman Long <longman@redhat.com> wrote:
>>>>> On 10/22/21 7:59 AM, Arnd Bergmann wrote:
>>>>> From: Arnd Bergmann <arnd@arndb.de>
>>>>>
>>>>> As this is all dead code, just remove it and the helper functions built
>>>>> around it. For arch/ia64, the inline asm could be cleaned up, but
>>>>> it seems safer to leave it untouched.
>>>>>
>>>>> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
>>>> Does that mean we can also remove the GENERIC_LOCKBREAK config option
>>>> from the Kconfig files as well?
>>>   I couldn't figure this out.
>>>
>>> What I see is that the only architectures setting GENERIC_LOCKBREAK are
>>> nds32, parisc, powerpc, s390, sh and sparc64, while the only architectures
>>> implementing arch_spin_is_contended() are arm32, csky and ia64.
>>>
>>> The part I don't understand is whether the option actually does anything
>>> useful any more after commit d89c70356acf ("locking/core: Remove break_lock
>>> field when CONFIG_GENERIC_LOCKBREAK=y").
>> Urgh, what a mess.. AFAICT there's still code in
>> kernel/locking/spinlock.c that relies on it. Specifically when
>> GENERIC_LOCKBREAK=y we seem to create _lock*() variants that are
>> basically TaS locks which drop preempt/irq disable while spinning.
>>
>> Anybody having this on and not having native TaS locks is in for a rude
>> surprise I suppose... sparc64 being the obvious candidate there :/
> Is this a problem on s390 and powerpc, those two being the ones
> that matter in practice?
>
> On s390, we pick between the cmpxchg() based directed-yield when
> running on virtualized CPUs, and a normal qspinlock when running on a
> dedicated CPU.

I am not aware that s390 is using qspinlocks at all as I don't see 
ARCH_USE_QUEUED_SPINLOCKS being set anywhere under arch/s390. I only see 
that it uses a cmpxchg based spinlock.

Cheers,
Longman




^ permalink raw reply

* Re: [PATCH] locking: remove spin_lock_flags() etc
From: Arnd Bergmann @ 2021-10-25 15:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: linux-ia64, Peter Zijlstra, James E.J. Bottomley, Paul Mackerras,
	Alexander Gordeev, Will Deacon, Jonas Bonn, linux-s390,
	Arnd Bergmann, Helge Deller, Christian Borntraeger, Ingo Molnar,
	Heiko Carstens, Vasily Gorbik, Boqun Feng, Stefan Kristiansson,
	Openrisc, Stafford Horne, Parisc List, Linux Kernel Mailing List,
	linuxppc-dev
In-Reply-To: <2413f412-a390-bbc0-e848-e2a77d1f0ab3@redhat.com>

On Mon, Oct 25, 2021 at 5:28 PM Waiman Long <longman@redhat.com> wrote:
> On 10/25/21 9:06 AM, Arnd Bergmann wrote:
> >
> > On s390, we pick between the cmpxchg() based directed-yield when
> > running on virtualized CPUs, and a normal qspinlock when running on a
> > dedicated CPU.
>
> I am not aware that s390 is using qspinlocks at all as I don't see
> ARCH_USE_QUEUED_SPINLOCKS being set anywhere under arch/s390. I only see
> that it uses a cmpxchg based spinlock.

Sorry, I should not have said "normal" here. See arch/s390/lib/spinlock.c
for their custom queued spinlocks as implemented in arch_spin_lock_queued().
I don't know if that code actually does the same thing as the generic qspinlock,
but it seems at least similar.

       Arnd

^ permalink raw reply

* Re: [PATCH 00/13] block: add_disk() error handling stragglers
From: Luis Chamberlain @ 2021-10-25 15:58 UTC (permalink / raw)
  To: Geoff Levand
  Cc: nvdimm, vigneshr, linux-nvme, paulus, miquel.raynal, ira.weiny,
	hch, dave.jiang, sagi, minchan, vishal.l.verma, ngupta,
	linux-block, kbusch, dan.j.williams, axboe, linux-kernel, jim,
	senozhatsky, richard, linux-mtd, linuxppc-dev
In-Reply-To: <24bc86d0-9d8d-8c8a-7f74-a87f9089342b@infradead.org>

On Thu, Oct 21, 2021 at 08:10:49PM -0700, Geoff Levand wrote:
> Hi Luis,
> 
> On 10/18/21 9:15 AM, Luis Chamberlain wrote:
> > On Sun, Oct 17, 2021 at 08:26:33AM -0700, Geoff Levand wrote:
> >> Hi Luis,
> >>
> >> On 10/15/21 4:52 PM, Luis Chamberlain wrote:
> >>> This patch set consists of al the straggler drivers for which we have
> >>> have no patch reviews done for yet. I'd like to ask for folks to please
> >>> consider chiming in, specially if you're the maintainer for the driver.
> >>> Additionally if you can specify if you'll take the patch in yourself or
> >>> if you want Jens to take it, that'd be great too.
> >>
> >> Do you have a git repo with the patch set applied that I can use to test with?
> > 
> > Sure, although the second to last patch is in a state of flux given
> > the ataflop driver currently is broken and so we're seeing how to fix
> > that first:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20211011-for-axboe-add-disk-error-handling
> 
> That branch has so many changes applied on top of the base v5.15-rc4
> that the patches I need to apply to test on PS3 with don't apply.
> 
> Do you have something closer to say v5.15-rc5?  Preferred would be
> just your add_disk() error handling patches plus what they depend
> on.

If you just want to test the ps3 changes, I've put this branch together
just for yo, its based on v5.15-rc6:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20211025-ps3-add-disk

  Luis

^ permalink raw reply

* Re: [PATCH 08/13] zram: add error handling support for add_disk()
From: Minchan Kim @ 2021-10-25 16:55 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: nvdimm, vigneshr, linux-nvme, paulus, miquel.raynal, ira.weiny,
	hch, dave.jiang, sagi, vishal.l.verma, ngupta, linux-block,
	kbusch, dan.j.williams, axboe, geoff, linux-kernel, jim,
	senozhatsky, richard, linux-mtd, linuxppc-dev
In-Reply-To: <20211015235219.2191207-9-mcgrof@kernel.org>

On Fri, Oct 15, 2021 at 04:52:14PM -0700, Luis Chamberlain wrote:
> We never checked for errors on add_disk() as this function
> returned void. Now that this is fixed, use the shiny new
> error handling.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>

^ permalink raw reply

* [PATCH v5 0/5] Fix long standing AER Error Handling Issues
From: Naveen Naidu @ 2021-10-25 17:00 UTC (permalink / raw)
  To: bhelgaas, ruscur, oohall
  Cc: Oza Pawandeep, linux-pci, Sinan Kaya, linux-kernel, Naveen Naidu,
	skhan, Keith Busch, linuxppc-dev, linux-kernel-mentees

This patch series aims at fixing some of the AER error handling issues
we have.

Currently we have the following issues: 
  
  1. Confusing message in aer_print_error()
  2. aer_err_info not being initialized completely in DPC path before 
     we print the AER logs
  3. A bug [1] in clearing of AER registers in the native AER path

[1] https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/

The patch series fixes the above things.

PATCH 1: 
  - Fixes the first issue
  - This patch is independent of other patches and can be applied
    seperately

PATCH 2 - 3:
  - Fixes the second issue
  - Patch 3 is depended on Patch 2 in the series

PATCH 4
  - Fixes the bug in clearing of AER registers which leades to
    AER message spew [1]

PATCH 5:
  - Adds extra information (devctl register) in AER error logs.
  - Patch 5 depends on Patch 4 of the series

Thanks,
Naveen Naidu

Changelog
=========
v5:
    - Edit the commit message of Patch 1 and Patch 5 to include how to
      test the AER messages using aer-inject.
    - Edit Patch 3 to initialize info.id depending on the trigger
      reason.
    - Drop few patches (v4 4/8, 5/8 7/8) since they were wrong.

v4:
    - Fix logical error in 6/8, in the previous version of the patch set
      there was a bug, in how I added the devices to the queue.

v3:
    - Edit the commit messages to be in imperative style and split the
      commits to be more atomic.

v2:
    - Add [PATCH 7] which includes the device control register 
      information in AER error logs.

Naveen Naidu (5):
  [PATCH v5 1/5] PCI/AER: Remove ID from aer_agent_string[]
  [PATCH v5 2/5] PCI: Cleanup struct aer_err_info
  [PATCH v5 3/5] PCI/DPC: Initialize info.id in dpc_process_error()
  [PATCH v5 4/5] PCI/AER: Clear error device AER registers in aer_irq()
  [PATCH v5 5/5] PCI/AER: Include DEVCTL in aer_print_error()

 drivers/pci/pci.h      |  23 +++-
 drivers/pci/pcie/aer.c | 269 ++++++++++++++++++++++++++++-------------
 drivers/pci/pcie/dpc.c |  16 ++-
 3 files changed, 214 insertions(+), 94 deletions(-)

-- 
2.25.1


^ permalink raw reply

* [PATCH v5 1/5] PCI/AER: Remove ID from aer_agent_string[]
From: Naveen Naidu @ 2021-10-25 17:01 UTC (permalink / raw)
  To: bhelgaas, ruscur, oohall
  Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
	linux-kernel-mentees
In-Reply-To: <cover.1635179600.git.naveennaidu479@gmail.com>

Currently, we do not print the "id" field in the AER error logs. Yet the
aer_agent_string[] has the word "id" in it. The AER error log looks
like:

  pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)

Without the "id" field in the error log, The aer_agent_string[]
(eg: "Receiver ID") does not make sense. A user reading the
aer_agent_string[] in the log, might inadvertently look for an "id"
field and not finding it might lead to confusion.

Remove the "ID" from the aer_agent_string[].

It is easy to reproduce this by using aer-inject:

  $ aer-inject -s 00:03:0 corr-err-file

The content of the corr-err-file file is as below:

  AER
  COR_STATUS BAD_TLP
  HEADER_LOG 0 1 2 3

The following are sample dummy errors inject via aer-inject.

Before
=======

In 010caed4ccb6 ("PCI/AER: Decode Error Source Requester ID"),
the "id" field was removed from the AER error logs, so currently AER
logs look like:

  pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03:0
  pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID) <--- no id field
  pcieport 0000:00:03.0:   device [1b36:000c] error status/mask=00000040/0000e000
  pcieport 0000:00:03.0:    [ 6] BadTLP

After
======

  pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
  pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver)
  pcieport 0000:00:03.0:   device [1b36:000c] error status/mask=00000040/0000e000
  pcieport 0000:00:03.0:    [ 6] BadTLP

Link: https://lore.kernel.org/linux-pci/20211021170317.GA2700910@bhelgaas/T/#m618bda4e54042d95a1a83fccc01cdb423f7590dc
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
 drivers/pci/pcie/aer.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9784fdcf3006..241ff361b43c 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -516,10 +516,10 @@ static const char *aer_uncorrectable_error_string[] = {
 };
 
 static const char *aer_agent_string[] = {
-	"Receiver ID",
-	"Requester ID",
-	"Completer ID",
-	"Transmitter ID"
+	"Receiver",
+	"Requester",
+	"Completer",
+	"Transmitter"
 };
 
 #define aer_stats_dev_attr(name, stats_array, strings_array,		\
@@ -703,7 +703,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	const char *level;
 
 	if (!info->status) {
-		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
+		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent)\n",
 			aer_error_severity_string[info->severity]);
 		goto out;
 	}
-- 
2.25.1


^ permalink raw reply related

* [PATCH v5 2/5] PCI: Cleanup struct aer_err_info
From: Naveen Naidu @ 2021-10-25 17:01 UTC (permalink / raw)
  To: bhelgaas, ruscur, oohall
  Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
	linux-kernel-mentees
In-Reply-To: <cover.1635179600.git.naveennaidu479@gmail.com>

The id, status and the mask fields of the struct aer_err_info comes
directly from the registers, hence their sizes should be explicit.

The length of these registers are:
  - id: 16 bits - Represents the Error Source Requester ID
  - status: 32 bits - COR/UNCOR Error Status
  - mask: 32 bits - COR/UNCOR Error Mask

Since the length of the above registers are even, use u16 and u32
to represent their values.

Also remove the __pad fields.

"pahole" was run on the modified struct aer_err_info and the size
remains unchanged.

Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
 drivers/pci/pci.h | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 1cce56c2aea0..9be7a966fda7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -427,18 +427,16 @@ struct aer_err_info {
 	struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
 	int error_dev_num;
 
-	unsigned int id:16;
+	u16	id;
 
 	unsigned int severity:2;	/* 0:NONFATAL | 1:FATAL | 2:COR */
-	unsigned int __pad1:5;
 	unsigned int multi_error_valid:1;
 
 	unsigned int first_error:5;
-	unsigned int __pad2:2;
 	unsigned int tlp_header_valid:1;
 
-	unsigned int status;		/* COR/UNCOR Error Status */
-	unsigned int mask;		/* COR/UNCOR Error Mask */
+	u32 status;		/* COR/UNCOR Error Status */
+	u32 mask;		/* COR/UNCOR Error Mask */
 	struct aer_header_log_regs tlp;	/* TLP Header */
 };
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH v5 3/5] PCI/DPC: Initialize info.id in dpc_process_error()
From: Naveen Naidu @ 2021-10-25 17:01 UTC (permalink / raw)
  To: bhelgaas, ruscur, oohall
  Cc: Oza Pawandeep, linux-pci, Sinan Kaya, linux-kernel, Naveen Naidu,
	skhan, Keith Busch, linuxppc-dev, linux-kernel-mentees
In-Reply-To: <cover.1635179600.git.naveennaidu479@gmail.com>

In the dpc_process_error() path, info.id isn't initialized before being
passed to aer_print_error(). In the corresponding AER path, it is
initialized in aer_isr_one_error().

The error message shown during Coverity Scan is:

  Coverity #1461602
  CID 1461602 (#1 of 1): Uninitialized scalar variable (UNINIT)
  8. uninit_use_in_call: Using uninitialized value info.id when calling aer_print_error.

Also Per PCIe r5.0, sec 7.9.15.5, the Source ID is defined only when the
Trigger Reason indicates ERR_NONFATAL or ERR_FATAL. Initialize the
"info.id" based on the trigger reason before passing it to
aer_print_error()

Fixes: 8aefa9b0d910 ("PCI/DPC: Print AER status in DPC event handling")
Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
 drivers/pci/pcie/dpc.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index c556e7beafe3..6fa1b1eb4671 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -262,16 +262,24 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
 
 void dpc_process_error(struct pci_dev *pdev)
 {
-	u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
+	u16 cap = pdev->dpc_cap, status, reason, ext_reason;
 	struct aer_err_info info;
 
 	pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
-	pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
+	reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
+
+	/*
+	 * Per PCIe r5.0, sec 7.9.15.5, the Source ID is defined only when the
+	 * Trigger Reason indicates ERR_NONFATAL or ERR_FATAL.
+	 */
+	if (reason == 1 || reason == 2)
+		pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &info.id);
+	else
+		info.id = 0;
 
 	pci_info(pdev, "containment event, status:%#06x source:%#06x\n",
-		 status, source);
+		 status, info.id);
 
-	reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
 	ext_reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5;
 	pci_warn(pdev, "%s detected\n",
 		 (reason == 0) ? "unmasked uncorrectable error" :
-- 
2.25.1


^ permalink raw reply related

* [PATCH v5 4/5] PCI/AER: Clear error device AER registers in aer_irq()
From: Naveen Naidu @ 2021-10-25 17:01 UTC (permalink / raw)
  To: bhelgaas, ruscur, oohall
  Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
	linux-kernel-mentees
In-Reply-To: <cover.1635179600.git.naveennaidu479@gmail.com>

Converge the APEI path and native AER path of clearing the AER registers
of the error device.

In APEI path, the system firmware clears the AER registers before
handing off the record to OS. But in "native AER" path, the execution
path of clearing the AER register is as follows:

  aer_isr_one_error
    aer_print_port_info
      if (find_source_device())
        aer_process_err_devices
          handle_error_source
            pci_write_config_dword(dev, PCI_ERR_COR_STATUS, ...)

The above path has a bug, if the find_source_device() fails, AER
registers are not cleared from the error device. This means, the error
device will keep reporting the error again and again and would lead
to message spew.

Related Bug Report:
  https://lore.kernel.org/linux-pci/20151229155822.GA17321@localhost/
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173

The above bug could be avoided, if the AER registers are cleared during
the AER IRQ handler aer_irq(), which would provide guarantee that the AER
error registers are always cleared. This is similar to how APEI handles
these errors.

The main aim is that:

  When an interrupt handler deals with a interrupt, it must *always*
  clear the source of the interrupt.

Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
 drivers/pci/pci.h      |  13 ++-
 drivers/pci/pcie/aer.c | 249 ++++++++++++++++++++++++++++-------------
 2 files changed, 184 insertions(+), 78 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9be7a966fda7..eb88d8bfeaf7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -424,7 +424,6 @@ static inline bool pci_dev_is_added(const struct pci_dev *dev)
 #define AER_MAX_MULTI_ERR_DEVICES	5	/* Not likely to have more */
 
 struct aer_err_info {
-	struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
 	int error_dev_num;
 
 	u16	id;
@@ -440,6 +439,18 @@ struct aer_err_info {
 	struct aer_header_log_regs tlp;	/* TLP Header */
 };
 
+/* Preliminary AER error information processed from Root port */
+struct aer_devices_err_info {
+	struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
+	struct aer_err_info err_info;
+};
+
+/* AER information associated with each error device */
+struct aer_dev_err_info {
+	struct pci_dev *dev;
+	struct aer_err_info err_info;
+};
+
 int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
 #endif	/* CONFIG_PCIEAER */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 241ff361b43c..d3937f5384e4 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -36,6 +36,18 @@
 
 #define AER_ERROR_SOURCES_MAX		128
 
+/*
+ * There can be 128 maximum error sources (AER_ERROR_SOURCES_MAX) and each
+ * error source can have maximum of 5 error devices (AER_MAX_MULTI_ERR_DEVICES)
+ * so the maximum error devices we can report is:
+ *
+ * AER_ERROR_DEVICES_MAX = AER_ERROR_SOURCES_MAX * AER_MAX_MULTI_ERR_DEVICES == (128 * 5) == 640
+ *
+ * But since, the size in KFIFO should be a power of two, the closest value
+ * to 640 is 1024
+ */
+# define AER_ERROR_DEVICES_MAX		1024
+
 #define AER_MAX_TYPEOF_COR_ERRS		16	/* as per PCI_ERR_COR_STATUS */
 #define AER_MAX_TYPEOF_UNCOR_ERRS	27	/* as per PCI_ERR_UNCOR_STATUS*/
 
@@ -46,7 +58,7 @@ struct aer_err_source {
 
 struct aer_rpc {
 	struct pci_dev *rpd;		/* Root Port device */
-	DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
+	DECLARE_KFIFO(aer_fifo, struct aer_dev_err_info, AER_ERROR_DEVICES_MAX);
 };
 
 /* AER stats for the device */
@@ -803,14 +815,14 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
 
 /**
  * add_error_device - list device to be handled
- * @e_info: pointer to error info
+ * @e_dev: pointer to error info
  * @dev: pointer to pci_dev to be added
  */
-static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
+static int add_error_device(struct aer_devices_err_info *e_dev, struct pci_dev *dev)
 {
-	if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
-		e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
-		e_info->error_dev_num++;
+	if (e_dev->err_info.error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
+		e_dev->dev[e_dev->err_info.error_dev_num] = pci_dev_get(dev);
+		e_dev->err_info.error_dev_num++;
 		return 0;
 	}
 	return -ENOSPC;
@@ -877,18 +889,18 @@ static bool is_error_source(struct pci_dev *dev, struct aer_err_info *e_info)
 
 static int find_device_iter(struct pci_dev *dev, void *data)
 {
-	struct aer_err_info *e_info = (struct aer_err_info *)data;
+	struct aer_devices_err_info *e_dev = (struct aer_devices_err_info *)data;
 
-	if (is_error_source(dev, e_info)) {
+	if (is_error_source(dev, &e_dev->err_info)) {
 		/* List this device */
-		if (add_error_device(e_info, dev)) {
+		if (add_error_device(e_dev, dev)) {
 			/* We cannot handle more... Stop iteration */
 			/* TODO: Should print error message here? */
 			return 1;
 		}
 
 		/* If there is only a single error, stop iteration */
-		if (!e_info->multi_error_valid)
+		if (!e_dev->err_info.multi_error_valid)
 			return 1;
 	}
 	return 0;
@@ -897,7 +909,7 @@ static int find_device_iter(struct pci_dev *dev, void *data)
 /**
  * find_source_device - search through device hierarchy for source device
  * @parent: pointer to Root Port pci_dev data structure
- * @e_info: including detailed error information such like id
+ * @e_dev: including detailed error information such like id
  *
  * Return true if found.
  *
@@ -907,26 +919,26 @@ static int find_device_iter(struct pci_dev *dev, void *data)
  * e_info->error_dev_num and e_info->dev[], based on the given information.
  */
 static bool find_source_device(struct pci_dev *parent,
-		struct aer_err_info *e_info)
+		struct aer_devices_err_info *e_dev)
 {
 	struct pci_dev *dev = parent;
 	int result;
 
 	/* Must reset in this function */
-	e_info->error_dev_num = 0;
+	e_dev->err_info.error_dev_num = 0;
 
 	/* Is Root Port an agent that sends error message? */
-	result = find_device_iter(dev, e_info);
+	result = find_device_iter(dev, e_dev);
 	if (result)
 		return true;
 
 	if (pci_pcie_type(parent) == PCI_EXP_TYPE_RC_EC)
-		pcie_walk_rcec(parent, find_device_iter, e_info);
+		pcie_walk_rcec(parent, find_device_iter, e_dev);
 	else
-		pci_walk_bus(parent->subordinate, find_device_iter, e_info);
+		pci_walk_bus(parent->subordinate, find_device_iter, e_dev);
 
-	if (!e_info->error_dev_num) {
-		pci_info(parent, "can't find device of ID%04x\n", e_info->id);
+	if (!e_dev->err_info.error_dev_num) {
+		pci_info(parent, "can't find device of ID%04x\n", e_dev->err_info.id);
 		return false;
 	}
 	return true;
@@ -940,24 +952,42 @@ static bool find_source_device(struct pci_dev *parent,
  * Invoked when an error being detected by Root Port.
  */
 static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
+{
+	/*
+	 * Correctable error does not need software intervention.
+	 * No need to go through error recovery process.
+	 */
+	if (info->severity == AER_NONFATAL)
+		pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
+	else if (info->severity == AER_FATAL)
+		pcie_do_recovery(dev, pci_channel_io_frozen, aer_root_reset);
+	pci_dev_put(dev);
+}
+
+/**
+ * clear_error_source_aer_registers - clear AER registers of the error source device
+ * @dev: pointer to pci_dev data structure of error source device
+ * @info: comprehensive error information
+ *
+ * Invoked when an error being detected by Root Port but before we handle the
+ * error.
+ */
+static void clear_error_source_aer_registers(struct pci_dev *dev, struct aer_err_info info)
 {
 	int aer = dev->aer_cap;
 
-	if (info->severity == AER_CORRECTABLE) {
-		/*
-		 * Correctable error does not need software intervention.
-		 * No need to go through error recovery process.
-		 */
+	if (info.severity == AER_CORRECTABLE) {
 		if (aer)
 			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
-					info->status);
+					info.status);
 		if (pcie_aer_is_native(dev))
 			pcie_clear_device_status(dev);
-	} else if (info->severity == AER_NONFATAL)
-		pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
-	else if (info->severity == AER_FATAL)
-		pcie_do_recovery(dev, pci_channel_io_frozen, aer_root_reset);
-	pci_dev_put(dev);
+	} else if (info.severity == AER_NONFATAL) {
+		pci_aer_clear_nonfatal_status(dev);
+	} else if (info.severity == AER_FATAL) {
+		pci_aer_clear_fatal_status(dev);
+	}
+
 }
 
 #ifdef CONFIG_ACPI_APEI_PCIEAER
@@ -1093,70 +1123,112 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 	return 1;
 }
 
-static inline void aer_process_err_devices(struct aer_err_info *e_info)
-{
-	int i;
-
-	/* Report all before handle them, not to lost records by reset etc. */
-	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
-		if (aer_get_device_error_info(e_info->dev[i], e_info))
-			aer_print_error(e_info->dev[i], e_info);
-	}
-	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
-		if (aer_get_device_error_info(e_info->dev[i], e_info))
-			handle_error_source(e_info->dev[i], e_info);
-	}
-}
-
 /**
- * aer_isr_one_error - consume an error detected by root port
- * @rpc: pointer to the root port which holds an error
+ * aer_find_corr_error_source_device - find the error source which detected the corrected error
+ * @rp: pointer to Root Port pci_dev data structure
  * @e_src: pointer to an error source
+ * @e_info: including detailed error information such like id
+ *
+ * Return true if found.
+ *
+ * Process the error information received at the Root Port, set these values
+ * in the aer_devices_err_info and find all the devices that are related to
+ * the error.
  */
-static void aer_isr_one_error(struct aer_rpc *rpc,
-		struct aer_err_source *e_src)
+static bool aer_find_corr_error_source_device(struct pci_dev *rp,
+					struct aer_err_source *e_src,
+					struct aer_devices_err_info *e_info)
 {
-	struct pci_dev *pdev = rpc->rpd;
-	struct aer_err_info e_info;
-
-	pci_rootport_aer_stats_incr(pdev, e_src);
-
-	/*
-	 * There is a possibility that both correctable error and
-	 * uncorrectable error being logged. Report correctable error first.
-	 */
 	if (e_src->status & PCI_ERR_ROOT_COR_RCV) {
-		e_info.id = ERR_COR_ID(e_src->id);
-		e_info.severity = AER_CORRECTABLE;
+		e_info->err_info.id = ERR_COR_ID(e_src->id);
+		e_info->err_info.severity = AER_CORRECTABLE;
 
 		if (e_src->status & PCI_ERR_ROOT_MULTI_COR_RCV)
-			e_info.multi_error_valid = 1;
+			e_info->err_info.multi_error_valid = 1;
 		else
-			e_info.multi_error_valid = 0;
-		aer_print_port_info(pdev, &e_info);
+			e_info->err_info.multi_error_valid = 0;
 
-		if (find_source_device(pdev, &e_info))
-			aer_process_err_devices(&e_info);
+		if (!find_source_device(rp, e_info))
+			return false;
 	}
+	return true;
+}
 
+/**
+ * aer_find_uncorr_error_source_device - find the error source which detected the uncorrected error
+ * @rp: pointer to Root Port pci_dev data structure
+ * @e_src: pointer to an error source
+ * @e_info: including detailed error information such like id
+ *
+ * Return true if found.
+ *
+ * Process the error information received at the Root Port, set these values
+ * in the aer_devices_err_info and find all the devices that are related to
+ * the error.
+ */
+static bool aer_find_uncorr_error_source_device(struct pci_dev *rp,
+					struct aer_err_source *e_src,
+					struct aer_devices_err_info *e_info)
+{
 	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
-		e_info.id = ERR_UNCOR_ID(e_src->id);
+		e_info->err_info.id = ERR_UNCOR_ID(e_src->id);
 
 		if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
-			e_info.severity = AER_FATAL;
+			e_info->err_info.severity = AER_FATAL;
 		else
-			e_info.severity = AER_NONFATAL;
+			e_info->err_info.severity = AER_NONFATAL;
 
 		if (e_src->status & PCI_ERR_ROOT_MULTI_UNCOR_RCV)
-			e_info.multi_error_valid = 1;
+			e_info->err_info.multi_error_valid = 1;
 		else
-			e_info.multi_error_valid = 0;
+			e_info->err_info.multi_error_valid = 0;
+
+		if (!find_source_device(rp, e_info))
+			return false;
+	}
 
-		aer_print_port_info(pdev, &e_info);
+	return true;
+}
 
-		if (find_source_device(pdev, &e_info))
-			aer_process_err_devices(&e_info);
+/**
+ * aer_isr_one_error - consume an error detected by root port
+ * @rp: pointer to Root Port pci_dev data structure
+ * @e_dev: pointer to an error device
+ */
+static void aer_isr_one_error(struct pci_dev *rp, struct aer_dev_err_info *e_dev)
+{
+	aer_print_port_info(rp, &e_dev->err_info);
+	aer_print_error(e_dev->dev, &e_dev->err_info);
+	handle_error_source(e_dev->dev, &e_dev->err_info);
+}
+
+static bool aer_add_err_devices_to_queue(struct aer_rpc *rpc,
+		struct aer_devices_err_info *e_info)
+{
+	int i;
+	struct aer_dev_err_info *e_dev;
+
+	e_dev = kzalloc(sizeof(*e_dev), GFP_ATOMIC);
+	if (!e_dev)
+		return false;
+
+	for (i = 0; i < e_info->err_info.error_dev_num && e_info->dev[i]; i++) {
+		e_dev->err_info = e_info->err_info;
+		e_dev->dev = e_info->dev[i];
+
+		/*
+		 * Store the AER register information for each error device on
+		 * the queue
+		 */
+		if (aer_get_device_error_info(e_dev->dev, &e_dev->err_info)) {
+			if (!kfifo_put(&rpc->aer_fifo, *e_dev))
+				return false;
+
+			clear_error_source_aer_registers(e_dev->dev, e_dev->err_info);
+		}
 	}
+
+	return true;
 }
 
 /**
@@ -1170,13 +1242,13 @@ static irqreturn_t aer_isr(int irq, void *context)
 {
 	struct pcie_device *dev = (struct pcie_device *)context;
 	struct aer_rpc *rpc = get_service_data(dev);
-	struct aer_err_source e_src;
+	struct aer_dev_err_info e_dev;
 
 	if (kfifo_is_empty(&rpc->aer_fifo))
 		return IRQ_NONE;
 
-	while (kfifo_get(&rpc->aer_fifo, &e_src))
-		aer_isr_one_error(rpc, &e_src);
+	while (kfifo_get(&rpc->aer_fifo, &e_dev))
+		aer_isr_one_error(rpc->rpd, &e_dev);
 	return IRQ_HANDLED;
 }
 
@@ -1194,6 +1266,11 @@ static irqreturn_t aer_irq(int irq, void *context)
 	struct pci_dev *rp = rpc->rpd;
 	int aer = rp->aer_cap;
 	struct aer_err_source e_src = {};
+	struct aer_devices_err_info *e_info;
+
+	e_info = kzalloc(sizeof(*e_info), GFP_ATOMIC);
+	if (!e_info)
+		return IRQ_NONE;
 
 	pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, &e_src.status);
 	if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV)))
@@ -1202,8 +1279,26 @@ static irqreturn_t aer_irq(int irq, void *context)
 	pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, &e_src.id);
 	pci_write_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, e_src.status);
 
-	if (!kfifo_put(&rpc->aer_fifo, e_src))
-		return IRQ_HANDLED;
+	pci_rootport_aer_stats_incr(rp, &e_src);
+
+	/*
+	 * There is a possibility that both correctable error and
+	 * uncorrectable error are being logged. Find the devices which caused
+	 * correctable errors first so that they can be added to the queue first
+	 * and will be reported first.
+	 *
+	 * Before adding the error device to the queue to be handled, clear the
+	 * AER status registers.
+	 */
+	if (aer_find_corr_error_source_device(rp, &e_src, e_info)) {
+		if (!aer_add_err_devices_to_queue(rpc, e_info))
+			return IRQ_NONE;
+	}
+
+	if (aer_find_uncorr_error_source_device(rp, &e_src, e_info)) {
+		if (!aer_add_err_devices_to_queue(rpc, e_info))
+			return IRQ_NONE;
+	}
 
 	return IRQ_WAKE_THREAD;
 }
-- 
2.25.1


^ permalink raw reply related

* [PATCH v5 5/5] PCI/AER: Include DEVCTL in aer_print_error()
From: Naveen Naidu @ 2021-10-25 17:01 UTC (permalink / raw)
  To: bhelgaas, ruscur, oohall
  Cc: linux-pci, linux-kernel, Naveen Naidu, skhan, linuxppc-dev,
	linux-kernel-mentees
In-Reply-To: <cover.1635179600.git.naveennaidu479@gmail.com>

Print the contents of Device Control Register of the device which
detected the error. This might help in faster error diagnosis.

It is easy to test this by using aer-inject:

  $ aer-inject -s 00:03:0 corr-err-file

The content of the corr-err-file is as below:

  AER
  COR_STATUS BAD_TLP
  HEADER_LOG 0 1 2 3

Sample output from dummy error injected by aer-inject:

  pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
  pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver)
  pcieport 0000:00:03.0:   device [1b36:000c] error status/mask=00000040/0000e000, devctl=0x000f <-- devctl added to the error log
  pcieport 0000:00:03.0:    [ 6] BadTLP

Signed-off-by: Naveen Naidu <naveennaidu479@gmail.com>
---
 drivers/pci/pci.h      |  2 ++
 drivers/pci/pcie/aer.c | 10 ++++++++--
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index eb88d8bfeaf7..48ed7f91113b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -437,6 +437,8 @@ struct aer_err_info {
 	u32 status;		/* COR/UNCOR Error Status */
 	u32 mask;		/* COR/UNCOR Error Mask */
 	struct aer_header_log_regs tlp;	/* TLP Header */
+
+	u16 devctl;
 };
 
 /* Preliminary AER error information processed from Root port */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d3937f5384e4..fdeef9deb016 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -729,8 +729,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 		   aer_error_severity_string[info->severity],
 		   aer_error_layer[layer], aer_agent_string[agent]);
 
-	pci_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
-		   dev->vendor, dev->device, info->status, info->mask);
+	pci_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x, devctl=%#06x\n",
+		   dev->vendor, dev->device, info->status, info->mask, info->devctl);
 
 	__aer_print_error(dev, info);
 
@@ -1083,6 +1083,12 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 	if (!aer)
 		return 0;
 
+	/*
+	 * Cache the value of Device Control Register now, because later the
+	 * device might not be available
+	 */
+	pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &info->devctl);
+
 	if (info->severity == AER_CORRECTABLE) {
 		pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS,
 			&info->status);
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH] locking: remove spin_lock_flags() etc
From: Waiman Long @ 2021-10-25 18:25 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-ia64, Peter Zijlstra, James E.J. Bottomley, Paul Mackerras,
	Alexander Gordeev, Will Deacon, Jonas Bonn, linux-s390,
	Arnd Bergmann, Helge Deller, Christian Borntraeger, Ingo Molnar,
	Heiko Carstens, Vasily Gorbik, Boqun Feng, Stefan Kristiansson,
	Openrisc, Stafford Horne, Parisc List, Linux Kernel Mailing List,
	linuxppc-dev
In-Reply-To: <CAK8P3a3JEBF-dEg0iVThMMRNK3CDxY+mRtTeStMusycnethO_g@mail.gmail.com>

On 10/25/21 11:44 AM, Arnd Bergmann wrote:
> On Mon, Oct 25, 2021 at 5:28 PM Waiman Long <longman@redhat.com> wrote:
>> On 10/25/21 9:06 AM, Arnd Bergmann wrote:
>>> On s390, we pick between the cmpxchg() based directed-yield when
>>> running on virtualized CPUs, and a normal qspinlock when running on a
>>> dedicated CPU.
>> I am not aware that s390 is using qspinlocks at all as I don't see
>> ARCH_USE_QUEUED_SPINLOCKS being set anywhere under arch/s390. I only see
>> that it uses a cmpxchg based spinlock.
> Sorry, I should not have said "normal" here. See arch/s390/lib/spinlock.c
> for their custom queued spinlocks as implemented in arch_spin_lock_queued().
> I don't know if that code actually does the same thing as the generic qspinlock,
> but it seems at least similar.

Yes, you are right. Their queued lock code looks like a custom version 
of the pvqspinlock code.

Cheers,
Longman


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox