Date: Fri, 7 Feb 2025 10:51:35 -0800
In-Reply-To: <20250207185154.824433-1-irogers@google.com>
Message-Id: <20250207185154.824433-7-irogers@google.com>
Mime-Version: 1.0
References: <20250207185154.824433-1-irogers@google.com>
X-Mailer: git-send-email 2.48.1.502.g6dc24dfdaf-goog
Subject: [PATCH v7 06/24] perf vendor events: Update BroadwellX events/metrics
From: Ian Rogers
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
 Kan Liang, Andreas Färber, Manivannan Sadhasivam, Weilin Wang,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Perry Taylor, Samantha Alt, Caleb Biggers, Edward Baker, Michael Petlan,
 Thomas Falcon
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Update events from v22 to v23. Update TMA metrics from 4.8 to 5.02.

Bring in the event updates v23:
https://github.com/intel/perfmon/commit/679982113f4bfa16cee19d5408a7f8e309e3ac23

The TMA 5.02 update is from (with subsequent fixes):
https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782

Co-developed-by: Caleb Biggers
Signed-off-by: Caleb Biggers
Acked-by: Kan Liang
Signed-off-by: Ian Rogers
---
 .../arch/x86/broadwellx/bdx-metrics.json | 344 ++++++++++--------
 .../pmu-events/arch/x86/broadwellx/cache.json | 10 +-
 .../arch/x86/broadwellx/frontend.json | 4 +-
 .../arch/x86/broadwellx/memory.json | 6 +-
 .../arch/x86/broadwellx/metricgroups.json | 5 +
 .../arch/x86/broadwellx/pipeline.json | 10 +-
 .../arch/x86/broadwellx/uncore-cache.json | 28 +-
 .../x86/broadwellx/uncore-interconnect.json | 36 +-
 .../arch/x86/broadwellx/uncore-memory.json | 1 +
 tools/perf/pmu-events/arch/x86/mapfile.csv | 2 +-
 10 files changed, 230 insertions(+), 216 deletions(-)

diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json b/t= ools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json index 0577d7460082..8016202bad1f 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/bdx-metrics.json @@ -55,7 +55,7 @@ "MetricName": "UNCORE_FREQ" }, { - "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles.", + "BriefDescription": "Cycles per instruction retired; indicating ho= w much time each executed instruction took; in units of cycles", "MetricExpr": "CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY", "MetricName": "cpi", "ScaleUnit": "1per_instr" @@ -76,24 +76,24 @@ "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data loads to the total number of complete= d instructions", "MetricExpr": "DTLB_LOAD_MISSES.WALK_COMPLETED / INST_RETIRED.ANY"= , "MetricName": "dtlb_load_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. 
This implies it missed in the DTLB and further levels of T= LB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data loads to the total number of complet= ed instructions. This implies it missed in the DTLB and further levels of T= LB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by demand data stores to the total number of complet= ed instructions", "MetricExpr": "DTLB_STORE_MISSES.WALK_COMPLETED / INST_RETIRED.ANY= ", "MetricName": "dtlb_store_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by demand data stores to the total number of comple= ted instructions. This implies it missed in the DTLB and further levels of = TLB", "ScaleUnit": "1per_instr" }, { - "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU.", + "BriefDescription": "Bandwidth of IO reads that are initiated by e= nd device controllers that are requesting memory from the CPU", "MetricExpr": "cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=3D0x19= e@ * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_read", "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU.", + "BriefDescription": "Bandwidth of IO writes that are initiated by = end device controllers that are writing memory to the CPU", "MetricExpr": "(cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=3D0x1= c8\\,filter_tid\\=3D0x3e@ + cbox@UNC_C_TOR_INSERTS.OPCODE\\,filter_opc\\=3D= 0x180\\,filter_tid\\=3D0x3e@) * 64 / 1e6 / duration_time", "MetricName": "io_bandwidth_write", "ScaleUnit": "1MB/s" @@ -102,14 +102,14 @@ "BriefDescription": "Ratio of number of completed page walks (for = 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total n= umber of completed instructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED_2M_4M / INST_RETIRED.ANY= ", "MetricName": "itlb_large_page_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= 2 megabyte and 4 megabyte page sizes) caused by a code fetch to the total = number of completed instructions. This implies it missed in the Instruction= Translation Lookaside Buffer (ITLB) and further levels of TLB", "ScaleUnit": "1per_instr" }, { "BriefDescription": "Ratio of number of completed page walks (for = all page sizes) caused by a code fetch to the total number of completed ins= tructions", "MetricExpr": "ITLB_MISSES.WALK_COMPLETED / INST_RETIRED.ANY", "MetricName": "itlb_mpi", - "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. 
This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB.", + "PublicDescription": "Ratio of number of completed page walks (for= all page sizes) caused by a code fetch to the total number of completed in= structions. This implies it missed in the ITLB (Instruction TLB) and furthe= r levels of TLB", "ScaleUnit": "1per_instr" }, { @@ -209,13 +209,13 @@ "ScaleUnit": "1MB/s" }, { - "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches.", + "BriefDescription": "Memory read that miss the last level cache (L= LC) addressed to local DRAM as a percentage of total memory read accesses, = does not include LLC prefetches", "MetricExpr": "cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_o= pc\\=3D0x182@ / (cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\=3D= 0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=3D0x182@)= ", "MetricName": "numa_reads_addressed_to_local_dram", "ScaleUnit": "100%" }, { - "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches.", + "BriefDescription": "Memory reads that miss the last level cache (= LLC) addressed to remote DRAM as a percentage of total memory read accesses= , does not include LLC prefetches", "MetricExpr": "cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_= opc\\=3D0x182@ / (cbox@UNC_C_TOR_INSERTS.MISS_LOCAL_OPCODE\\,filter_opc\\= =3D0x182@ + cbox@UNC_C_TOR_INSERTS.MISS_REMOTE_OPCODE\\,filter_opc\\=3D0x18= 2@)", "MetricName": "numa_reads_addressed_to_remote_dram", "ScaleUnit": "100%" @@ -276,12 +276,12 @@ "MetricExpr": "LD_BLOCKS_PARTIAL.ADDRESS_ALIAS / tma_info_thread_c= lks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_4k_aliasing", - "MetricThreshold": "tma_4k_aliasing > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. to L1_Bound).", + "MetricThreshold": "tma_4k_aliasing > 0.2 & tma_l1_bound > 0.1 & t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often memory load = accesses were aliased by preceding stores (in program order) with a 4K addr= ess offset. False match is possible; which incur a few cycles load re-issue= . However; the short re-issue duration is often hidden by the out-of-order = core and HW optimizations; hence a user may safely ignore a high value of t= his metric unless it manages to propagate up into parent nodes of the hiera= rchy (e.g. 
to L1_Bound)", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations.", + "BriefDescription": "This metric represents Core fraction of cycle= s CPU dispatched uops on execution ports for ALU operations", "MetricConstraint": "NO_GROUP_EVENTS_NMI", "MetricExpr": "(UOPS_DISPATCHED_PORT.PORT_0 + UOPS_DISPATCHED_PORT= .PORT_1 + UOPS_DISPATCHED_PORT.PORT_5 + UOPS_DISPATCHED_PORT.PORT_6) / tma_= info_thread_slots", "MetricGroup": "TopdownL5;tma_L5_group;tma_ports_utilized_3m_group= ", @@ -294,8 +294,8 @@ "MetricExpr": "66 * OTHER_ASSISTS.ANY_WB_ASSIST / tma_info_thread_= slots", "MetricGroup": "BvIO;TopdownL4;tma_L4_group;tma_microcode_sequence= r_group", "MetricName": "tma_assists", - "MetricThreshold": "tma_assists > 0.1 & (tma_microcode_sequencer >= 0.05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y", + "MetricThreshold": "tma_assists > 0.1 & tma_microcode_sequencer > = 0.05 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of slots the = CPU retired uops delivered by the Microcode_Sequencer as a result of Assist= s. Assists are long sequences of uops that are required in certain corner-c= ases for operations that cannot be handled natively by the execution pipeli= ne. For example; when working with very small floating point values (so-cal= led Denormals); the FP units are not set up to perform these operations nat= ively. Instead; a sequence of instructions to perform the computation on th= e Denormals is injected into the pipeline. Since these microcode sequences = might be dozens of uops long; Assists can be extremely deleterious to perfo= rmance and they can be avoided in many cases. Sample with: OTHER_ASSISTS.AN= Y_WB_ASSIST", "ScaleUnit": "100%" }, { @@ -306,7 +306,7 @@ "MetricName": "tma_backend_bound", "MetricThreshold": "tma_backend_bound > 0.2", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound.", + "PublicDescription": "This category represents fraction of slots w= here no uops are being delivered due to a lack of required resources for ac= cepting new uops in the Backend. 
Backend is the portion of the processor co= re where the out-of-order scheduler dispatches ready uops into their respec= tive execution units; and once completed these uops get retired according t= o program order. For example; stalls due to data-cache misses or stalls due= to the divider unit being overloaded are both categorized under Backend Bo= und. Backend Bound is further divided into two main categories: Memory Boun= d and Core Bound", "ScaleUnit": "100%" }, { @@ -316,7 +316,7 @@ "MetricName": "tma_bad_speculation", "MetricThreshold": "tma_bad_speculation > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example.", + "PublicDescription": "This category represents fraction of slots w= asted due to incorrect speculations. This include slots used to issue uops = that do not eventually get retired and slots for which the issue-pipeline w= as blocked due to recovery from earlier incorrect speculation. For example;= wasted work due to miss-predicted branches are categorized under Bad Specu= lation category. Incorrect data speculation followed by Memory Ordering Nuk= es is another example", "ScaleUnit": "100%" }, { @@ -327,7 +327,7 @@ "MetricName": "tma_branch_mispredicts", "MetricThreshold": "tma_branch_mispredicts > 0.1 & tma_bad_specula= tion > 0.15", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_info_bad_spec_branch_misprediction_cost, tma_mispredicts_resteers", + "PublicDescription": "This metric represents fraction of slots the= CPU has wasted due to Branch Misprediction. These slots are either wasted= by uops fetched from an incorrectly speculated program path; or stalls whe= n the out-of-order part of the machine needs to recover its state from a sp= eculative path. Sample with: BR_MISP_RETIRED.ALL_BRANCHES. Related metrics:= tma_mispredicts_resteers", "ScaleUnit": "100%" }, { @@ -335,8 +335,8 @@ "MetricExpr": "12 * (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS= .COUNT + BACLEARS.ANY) / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group", "MetricName": "tma_branch_resteers", - "MetricThreshold": "tma_branch_resteers > 0.05 & (tma_fetch_latenc= y > 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. 
Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES", + "MetricThreshold": "tma_branch_resteers > 0.05 & tma_fetch_latency= > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers. Branch Resteers estimates the Fro= ntend delay in fetching operations from corrected path; following all sorts= of miss-predicted branches. For example; branchy code with lots of miss-pr= edictions might get categorized under Branch Resteers. Note the value of th= is node may overlap with its siblings. Sample with: BR_MISP_RETIRED.ALL_BRA= NCHES. Related metrics: tma_l3_hit_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -345,8 +345,8 @@ "MetricExpr": "max(0, tma_microcode_sequencer - tma_assists)", "MetricGroup": "TopdownL4;tma_L4_group;tma_microcode_sequencer_gro= up", "MetricName": "tma_cisc", - "MetricThreshold": "tma_cisc > 0.1 & (tma_microcode_sequencer > 0.= 05 & tma_heavy_operations > 0.1)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources.", + "MetricThreshold": "tma_cisc > 0.1 & tma_microcode_sequencer > 0.0= 5 & tma_heavy_operations > 0.1", + "PublicDescription": "This metric estimates fraction of cycles the= CPU retired uops originated from CISC (complex instruction set computer) i= nstruction. A CISC instruction has multiple uops that are required to perfo= rm the instruction's functionality as in the case of read-modify-write as a= n example. Since these instructions require multiple uops they may or may n= ot imply sub-optimal use of machine resources", "ScaleUnit": "100%" }, { @@ -354,7 +354,7 @@ "MetricExpr": "MACHINE_CLEARS.COUNT * tma_branch_resteers / (BR_MI= SP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", "MetricGroup": "BadSpec;MachineClears;TopdownL4;tma_L4_group;tma_b= ranch_resteers_group;tma_issueMC", "MetricName": "tma_clears_resteers", - "MetricThreshold": "tma_clears_resteers > 0.05 & (tma_branch_reste= ers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_clears_resteers > 0.05 & tma_branch_restee= rs > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Machine Clears. 
Rel= ated metrics: tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tm= a_ms_switches", "ScaleUnit": "100%" }, @@ -362,10 +362,10 @@ "BriefDescription": "This metric estimates fraction of cycles whil= e the memory subsystem was handling synchronizations due to contested acces= ses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(60 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM * (1 = + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_= UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS= _L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LO= AD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_D= RAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RET= IRED.REMOTE_FWD))) + 43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS * (1 + ME= M_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS= _RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_= HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_U= OPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM = + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED= .REMOTE_FWD)))) / tma_info_thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_l3_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_l3_bound_group", "MetricName": "tma_contested_accesses", - "MetricThreshold": "tma_contested_accesses > 0.05 & (tma_l3_bound = > 0.05 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_L3_HIT_RETIRED.XSNP_HITM_PS;MEM_LOAD_L3_HIT_RETIRED.XSNP_MISS_PS. Re= lated metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_contested_accesses > 0.05 & tma_l3_bound >= 0.05 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to contested acce= sses. Contested accesses occur when data written by one Logical Processor a= re read by another Logical Processor on a different Physical Core. Examples= of contested accesses include synchronizations such as locks; true data sh= aring such as modified locked variables; and false sharing. Sample with: ME= M_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MIS= S. Related metrics: tma_data_sharing, tma_false_sharing, tma_machine_clears= , tma_remote_cache", "ScaleUnit": "100%" }, { @@ -376,7 +376,7 @@ "MetricName": "tma_core_bound", "MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2= ", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. 
Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations).", + "PublicDescription": "This metric represents fraction of slots whe= re Core non-memory issues were of a bottleneck. Shortage in hardware compu= te resources; or dependencies in software's instructions are both categoriz= ed under Core Bound. Hence it may indicate the machine ran out of an out-of= -order resource; certain execution units are overloaded or dependencies in = program's data- or instruction-flow are limiting the performance (e.g. FP-c= hained long-latency arithmetic operations)", "ScaleUnit": "100%" }, { @@ -385,8 +385,8 @@ "MetricExpr": "43 * (MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT * (1 + = MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UO= PS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L= 3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD= _UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRA= M + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIR= ED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "BvMS;Offcore;Snoop;TopdownL4;tma_L4_group;tma_issu= eSyncxn;tma_l3_bound_group", "MetricName": "tma_data_sharing", - "MetricThreshold": "tma_data_sharing > 0.05 & (tma_l3_bound > 0.05= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_L3_HIT_RETIRED.XSNP_HIT_PS. Related metrics: tma_contested_accesses, tma= _false_sharing, tma_machine_clears, tma_remote_cache", + "MetricThreshold": "tma_data_sharing > 0.05 & tma_l3_bound > 0.05 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling synchronizations due to data-sharing a= ccesses. Data shared by multiple Logical Processors (even just read shared)= may cause increased access latency due to cache coherency. Excessive data = sharing can drastically harm multithreaded performance. Sample with: MEM_LO= AD_UOPS_L3_HIT_RETIRED.XSNP_HIT. Related metrics: tma_contested_accesses, t= ma_false_sharing, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { @@ -394,8 +394,8 @@ "MetricExpr": "ARITH.FPU_DIV_ACTIVE / tma_info_core_core_clks", "MetricGroup": "BvCB;TopdownL3;tma_L3_group;tma_core_bound_group", "MetricName": "tma_divider", - "MetricThreshold": "tma_divider > 0.2 & (tma_core_bound > 0.1 & tm= a_backend_bound > 0.2)", - "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.DIVIDER_UOPS", + "MetricThreshold": "tma_divider > 0.2 & tma_core_bound > 0.1 & tma= _backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles wh= ere the Divider unit was active. 
Divide and square root instructions are pe= rformed by the Divider unit and can take considerably longer latency than i= nteger or Floating Point addition; subtraction; or multiplication. Sample w= ith: ARITH.FPU_DIV_ACTIVE", "ScaleUnit": "100%" }, { @@ -404,8 +404,8 @@ "MetricExpr": "(1 - MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_= RETIRED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS)) * CYCLE_ACTIVITY.STALL= S_L2_MISS / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_dram_bound", - "MetricThreshold": "tma_dram_bound > 0.1 & (tma_memory_bound > 0.2= & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS_PS", + "MetricThreshold": "tma_dram_bound > 0.1 & tma_memory_bound > 0.2 = & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled on accesses to external memory (DRAM) by loads. Better caching can = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_RE= TIRED.L3_MISS", "ScaleUnit": "100%" }, { @@ -414,7 +414,7 @@ "MetricGroup": "DSB;FetchBW;TopdownL3;tma_L3_group;tma_fetch_bandw= idth_group", "MetricName": "tma_dsb", "MetricThreshold": "tma_dsb > 0.15 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to DSB (decoded uop cache) fetch pip= eline. For example; inefficient utilization of the DSB cache structure or = bank conflict when reading from it; are categorized here", "ScaleUnit": "100%" }, { @@ -422,45 +422,45 @@ "MetricExpr": "DSB2MITE_SWITCHES.PENALTY_CYCLES / tma_info_thread_= clks", "MetricGroup": "DSBmiss;FetchLat;TopdownL3;tma_L3_group;tma_fetch_= latency_group;tma_issueFB", "MetricName": "tma_dsb_switches", - "MetricThreshold": "tma_dsb_switches > 0.05 & (tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_dsb_switches > 0.05 & tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to switches from DSB to MITE pipelines. The DSB (deco= ded i-cache) is a Uop Cache where the front-end directly delivers Uops (mic= ro operations) avoiding heavy x86 decoding. The DSB pipeline has shorter la= tency and delivered higher bandwidth than the MITE (legacy instruction deco= de pipeline). Switching between the two pipelines can cause penalties hence= this metric measures the exposed penalty. 
Related metrics: tma_fetch_bandw= idth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb, tma_lcp", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles where the Data TLB (DTLB) was missed by load accesses", - "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISS= ES.WALK_DURATION\\,cmask\\=3D1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / tm= a_info_thread_clks", + "MetricExpr": "(8 * DTLB_LOAD_MISSES.STLB_HIT + cpu@DTLB_LOAD_MISS= ES.WALK_DURATION\\,cmask\\=3D0x1@ + 7 * DTLB_LOAD_MISSES.WALK_COMPLETED) / = tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_l1_bound_group", "MetricName": "tma_dtlb_load", - "MetricThreshold": "tma_dtlb_load > 0.1 & (tma_l1_bound > 0.1 & (t= ma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS. Related metrics: t= ma_dtlb_store", + "MetricThreshold": "tma_dtlb_load > 0.1 & tma_l1_bound > 0.1 & tma= _memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles where the Data TLB (DTLB) was missed by load accesses. TLBs (Trans= lation Look-aside Buffers) are processor caches for recently used entries o= ut of the Page Tables that are used to map virtual- to physical-addresses b= y the operating system. This metric approximates the potential delay of dem= and loads missing the first-level data TLB (assuming worst case scenario wi= th back to back misses to different pages). This includes hitting in the se= cond-level TLB (STLB) as well as performing a hardware page walk on an STLB= miss. Sample with: MEM_UOPS_RETIRED.STLB_MISS_LOADS. Related metrics: tma_= dtlb_store", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates the fraction of= cycles spent handling first-level data TLB store misses", - "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MI= SSES.WALK_DURATION\\,cmask\\=3D1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED) /= tma_info_thread_clks", + "MetricExpr": "(8 * DTLB_STORE_MISSES.STLB_HIT + cpu@DTLB_STORE_MI= SSES.WALK_DURATION\\,cmask\\=3D0x1@ + 7 * DTLB_STORE_MISSES.WALK_COMPLETED)= / tma_info_thread_clks", "MetricGroup": "BvMT;MemoryTLB;TopdownL4;tma_L4_group;tma_issueTLB= ;tma_store_bound_group", "MetricName": "tma_dtlb_store", - "MetricThreshold": "tma_dtlb_store > 0.05 & (tma_store_bound > 0.2= & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. 
Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES_PS. Related metrics: tma_dtlb_load", + "MetricThreshold": "tma_dtlb_store > 0.05 & tma_store_bound > 0.2 = & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates the fraction o= f cycles spent handling first-level data TLB store misses. As with ordinar= y data caching; focus on improving data locality and reducing working-set s= ize to reduce DTLB overhead. Additionally; consider using profile-guided o= ptimization (PGO) to collocate frequently-used data on the same page. Try = using larger page sizes for large amounts of frequently-used data. Sample w= ith: MEM_UOPS_RETIRED.STLB_MISS_STORES. Related metrics: tma_dtlb_load", "ScaleUnit": "100%" }, { "BriefDescription": "This metric roughly estimates how often CPU w= as handling synchronizations due to False Sharing", "MetricExpr": "(200 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_MISS.REMOTE_= HITM + 60 * OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE) / tma_info= _thread_clks", - "MetricGroup": "BvMS;DataSharing;Offcore;Snoop;TopdownL4;tma_L4_gr= oup;tma_issueSyncxn;tma_store_bound_group", + "MetricGroup": "BvMS;DataSharing;LockCont;Offcore;Snoop;TopdownL4;= tma_L4_group;tma_issueSyncxn;tma_store_bound_group", "MetricName": "tma_false_sharing", - "MetricThreshold": "tma_false_sharing > 0.05 & (tma_store_bound > = 0.2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_L3_HIT_= RETIRED.XSNP_HITM_PS;OFFCORE_RESPONSE.DEMAND_RFO.L3_HIT.SNOOP_HITM. Related= metrics: tma_contested_accesses, tma_data_sharing, tma_machine_clears, tma= _remote_cache", + "MetricThreshold": "tma_false_sharing > 0.05 & tma_store_bound > 0= .2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates how often CPU = was handling synchronizations due to False Sharing. False Sharing is a mult= ithreading hiccup; where multiple Logical Processors contend on different d= ata-elements mapped into the same cache line. Sample with: MEM_LOAD_UOPS_L3= _HIT_RETIRED.XSNP_HITM, MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM, OFFCORE_= RESPONSE.DEMAND_RFO.LLC_HIT.HITM_OTHER_CORE, OFFCORE_RESPONSE.DEMAND_RFO.LL= C_MISS.REMOTE_HITM. 
Related metrics: tma_contested_accesses, tma_data_shari= ng, tma_machine_clears, tma_remote_cache", "ScaleUnit": "100%" }, { "BriefDescription": "This metric does a *rough estimation* of how = often L1D Fill Buffer unavailability limited additional L1D miss memory acc= ess requests to proceed", "MetricConstraint": "NO_GROUP_EVENTS", - "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D1@ / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", + "MetricExpr": "tma_info_memory_load_miss_real_latency * cpu@L1D_PE= ND_MISS.FB_FULL\\,cmask\\=3D0x1@ / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;TopdownL4;tma_L4_group;tma_issueBW;t= ma_issueSL;tma_issueSmSt;tma_l1_bound_group", "MetricName": "tma_fb_full", "MetricThreshold": "tma_fb_full > 0.3", - "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency, tma_streaming_stores", + "PublicDescription": "This metric does a *rough estimation* of how= often L1D Fill Buffer unavailability limited additional L1D miss memory ac= cess requests to proceed. The higher the metric value; the deeper the memor= y hierarchy level the misses are satisfied from (metric values >1 are valid= ). Often it hints on approaching bandwidth limits (to L2 cache; L3 cache or= external memory). Related metrics: tma_info_system_dram_bw_use, tma_mem_ba= ndwidth, tma_sq_full, tma_store_latency", "ScaleUnit": "100%" }, { @@ -489,7 +489,7 @@ "MetricGroup": "HPC;TopdownL3;tma_L3_group;tma_light_operations_gr= oup", "MetricName": "tma_fp_arith", "MetricThreshold": "tma_fp_arith > 0.2 & tma_light_operations > 0.= 6", - "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting.", + "PublicDescription": "This metric represents overall arithmetic fl= oating-point (FP) operations fraction the CPU has executed (retired). Note = this metric's value may exceed its parent due to use of \"Uops\" CountDomai= n and FMA double-counting", "ScaleUnit": "100%" }, { @@ -497,8 +497,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.SCALAR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_scalar", - "MetricThreshold": "tma_fp_scalar > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, t= ma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_scalar > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) scalar uops fraction the CPU has retired. 
May overcount due to = FMA double counting. Related metrics: tma_fp_vector, tma_fp_vector_128b, tm= a_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports= _utilized_2", "ScaleUnit": "100%" }, { @@ -506,8 +506,8 @@ "MetricExpr": "FP_ARITH_INST_RETIRED.VECTOR / UOPS_RETIRED.RETIRE_= SLOTS", "MetricGroup": "Compute;Flops;TopdownL4;tma_L4_group;tma_fp_arith_= group;tma_issue2P", "MetricName": "tma_fp_vector", - "MetricThreshold": "tma_fp_vector > 0.1 & (tma_fp_arith > 0.2 & tm= a_light_operations > 0.6)", - "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b, t= ma_port_0, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector > 0.1 & tma_fp_arith > 0.2 & tma= _light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic floating= -point (FP) vector uops fraction the CPU has retired aggregated across all = vector widths. May overcount due to FMA double counting. Related metrics: t= ma_fp_scalar, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tma_port_= 1, tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -515,8 +515,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.128B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.128B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_128b", - "MetricThreshold": "tma_fp_vector_128b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting. Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_128b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 128-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_256b, tma_port_0, tma_port_1, tma_port_5, tma_p= ort_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -524,8 +524,8 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.256B_PACKED_DOUBLE + FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / UOPS_RETIRED.RETIRE_SLOTS", "MetricGroup": "Compute;Flops;TopdownL5;tma_L5_group;tma_fp_vector= _group;tma_issue2P", "MetricName": "tma_fp_vector_256b", - "MetricThreshold": "tma_fp_vector_256b > 0.1 & (tma_fp_vector > 0.= 1 & (tma_fp_arith > 0.2 & tma_light_operations > 0.6))", - "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting. 
Related metrics: tma_fp_scalar, tma_fp_vector,= tma_fp_vector_128b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5= , tma_port_6, tma_ports_utilized_2", + "MetricThreshold": "tma_fp_vector_256b > 0.1 & tma_fp_vector > 0.1= & tma_fp_arith > 0.2 & tma_light_operations > 0.6", + "PublicDescription": "This metric approximates arithmetic FP vecto= r uops fraction the CPU has retired for 256-bit wide vectors. May overcount= due to FMA double counting prior to LNL. Related metrics: tma_fp_scalar, t= ma_fp_vector, tma_fp_vector_128b, tma_port_0, tma_port_1, tma_port_5, tma_p= ort_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -535,33 +535,33 @@ "MetricName": "tma_frontend_bound", "MetricThreshold": "tma_frontend_bound > 0.15", "MetricgroupNoGroup": "TopdownL1", - "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound.", + "PublicDescription": "This category represents fraction of slots w= here the processor's Frontend undersupplies its Backend. Frontend denotes t= he first part of the processor core responsible to fetch operations that ar= e executed later on by the Backend part. Within the Frontend; a branch pred= ictor predicts the next address to fetch; cache-lines are fetched from the = memory subsystem; parsed into instructions; and lastly decoded into micro-o= perations (uops). Ideally the Frontend can issue Pipeline_Width uops every = cycle to the Backend. Frontend Bound denotes unutilized issue-slots when th= ere is no Backend stall; i.e. bubbles where Frontend delivered no uops whil= e Backend could have accepted them. For example; stalls due to instruction-= cache misses would be categorized under Frontend Bound", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations -- instructions that require= two or more uops or micro-coded sequences", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring heavy-weight operations , instructions that require = two or more uops or micro-coded sequences", "MetricExpr": "tma_microcode_sequencer", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_heavy_operations", "MetricThreshold": "tma_heavy_operations > 0.1", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations -- instructions that requir= e two or more uops or micro-coded sequences. This highly-correlates with th= e uop length of these instructions/sequences. 
([ICL+] Note this may overcou= nt due to approximation using indirect events; [ADL+] .)", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring heavy-weight operations , instructions that require= two or more uops or micro-coded sequences. This highly-correlates with the= uop length of these instructions/sequences.([ICL+] Note this may overcount= due to approximation using indirect events; [ADL+])", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses.", + "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to instruction cache misses", "MetricExpr": "ICACHE.IFDATA_STALL / tma_info_thread_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;IcMiss;TopdownL3;tma_L3= _group;tma_fetch_latency_group", "MetricName": "tma_icache_misses", - "MetricThreshold": "tma_icache_misses > 0.05 & (tma_fetch_latency = > 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_icache_misses > 0.05 & tma_fetch_latency >= 0.1 & tma_frontend_bound > 0.15", "ScaleUnit": "100%" }, { - "BriefDescription": "Instructions per retired mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate).", + "BriefDescription": "Instructions per retired Mispredicts for indi= rect CALL or JMP branches (lower number means higher occurrence rate)", "MetricExpr": "tma_info_inst_mix_instructions / (UOPS_RETIRED.RETI= RE_SLOTS / UOPS_ISSUED.ANY * BR_MISP_EXEC.INDIRECT)", "MetricGroup": "Bad;BrMispredicts", "MetricName": "tma_info_bad_spec_ipmisp_indirect", - "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1e3" + "MetricThreshold": "tma_info_bad_spec_ipmisp_indirect < 1000" }, { "BriefDescription": "Number of Instructions per non-speculative Br= anch Misprediction (JEClear) (lower number means higher occurrence rate)", @@ -572,7 +572,7 @@ }, { "BriefDescription": "Core actual clocks when any Logical Processor= is active on the Physical Core", - "MetricExpr": "(CPU_CLK_UNHALTED.THREAD / 2 * (1 + CPU_CLK_UNHALTE= D.ONE_THREAD_ACTIVE / CPU_CLK_UNHALTED.REF_XCLK) if #core_wide < 1 else (CP= U_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tma_info_thread_clks))", + "MetricExpr": "(CPU_CLK_UNHALTED.THREAD_ANY / 2 if #SMT_on else tm= a_info_thread_clks)", "MetricGroup": "SMT", "MetricName": "tma_info_core_core_clks" }, @@ -593,11 +593,11 @@ "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + FP_ARITH_INST_RETIR= ED.VECTOR) / (2 * tma_info_core_core_clks)", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_core_fp_arith_utilization", - "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)." + "PublicDescription": "Actual per-core usage of the Floating Point = non-X87 execution units (regardless of precision or vector-width). 
Values >= 1 are possible due to ([BDW+] Fused-Multiply Add (FMA) counting - common; = [ADL+] use all of ADD/MUL/FMA in Scalar or 128/256-bit vectors - less commo= n)" }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per thread (logical-processor)", - "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D1@", + "MetricExpr": "UOPS_EXECUTED.THREAD / cpu@UOPS_EXECUTED.THREAD\\,c= mask\\=3D0x1@", "MetricGroup": "Backend;Cor;Pipeline;PortsUtil", "MetricName": "tma_info_core_ilp" }, @@ -616,7 +616,13 @@ "MetricName": "tma_info_frontend_ipunknown_branch" }, { - "BriefDescription": "Branch instructions per taken branch.", + "BriefDescription": "Taken Branches retired Per Cycle", + "MetricExpr": "BR_INST_RETIRED.NEAR_TAKEN / tma_info_thread_clks", + "MetricGroup": "Branches;FetchBW", + "MetricName": "tma_info_frontend_tbpc" + }, + { + "BriefDescription": "Branch instructions per taken branch", "MetricExpr": "BR_INST_RETIRED.ALL_BRANCHES / BR_INST_RETIRED.NEAR= _TAKEN", "MetricGroup": "Branches;Fed;PGO", "MetricName": "tma_info_inst_mix_bptkbranch" @@ -634,7 +640,7 @@ "MetricGroup": "Flops;InsType", "MetricName": "tma_info_inst_mix_iparith", "MetricThreshold": "tma_info_inst_mix_iparith < 10", - "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW." + "PublicDescription": "Instructions per FP Arithmetic instruction (= lower number means higher occurrence rate). Values < 1 are possible due to = intentional FMA double counting. Approximated prior to BDW" }, { "BriefDescription": "Instructions per FP Arithmetic AVX/SSE 128-bi= t instruction (lower number means higher occurrence rate)", @@ -642,7 +648,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx128", "MetricThreshold": "tma_info_inst_mix_iparith_avx128 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX/SSE 128-b= it instruction (lower number means higher occurrence rate). Values < 1 are = possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic AVX* 256-bit i= nstruction (lower number means higher occurrence rate)", @@ -650,7 +656,7 @@ "MetricGroup": "Flops;FpVector;InsType", "MetricName": "tma_info_inst_mix_iparith_avx256", "MetricThreshold": "tma_info_inst_mix_iparith_avx256 < 10", - "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic AVX* 256-bit = instruction (lower number means higher occurrence rate). Values < 1 are pos= sible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Double-= Precision instruction (lower number means higher occurrence rate)", @@ -658,7 +664,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_dp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_dp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). 
Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Double= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per FP Arithmetic Scalar Single-= Precision instruction (lower number means higher occurrence rate)", @@ -666,7 +672,7 @@ "MetricGroup": "Flops;FpScalar;InsType", "MetricName": "tma_info_inst_mix_iparith_scalar_sp", "MetricThreshold": "tma_info_inst_mix_iparith_scalar_sp < 10", - "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting." + "PublicDescription": "Instructions per FP Arithmetic Scalar Single= -Precision instruction (lower number means higher occurrence rate). Values = < 1 are possible due to intentional FMA double counting" }, { "BriefDescription": "Instructions per Branch (lower number means h= igher occurrence rate)", @@ -708,7 +714,7 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.NEAR_TAKEN", "MetricGroup": "Branches;Fed;FetchBW;Frontend;PGO;tma_issueFB", "MetricName": "tma_info_inst_mix_iptb", - "MetricThreshold": "tma_info_inst_mix_iptb < 9", + "MetricThreshold": "tma_info_inst_mix_iptb < 4 * 2 + 1", "PublicDescription": "Instructions per taken branch. Related metri= cs: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage, = tma_lcp" }, { @@ -731,7 +737,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L1 data cache [GB / sec]", - "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / duration_time", + "MetricExpr": "64 * L1D.REPLACEMENT / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l1d_cache_fill_bw" }, @@ -743,7 +749,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L2 cache [GB / sec]", - "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / duration_time", + "MetricExpr": "64 * L2_LINES_IN.ALL / 1e9 / tma_info_system_time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l2_cache_fill_bw" }, @@ -785,7 +791,7 @@ }, { "BriefDescription": "Average per-thread data fill bandwidth to the= L3 cache [GB / sec]", - "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / duration_time", + "MetricExpr": "64 * LONGEST_LAT_CACHE.MISS / 1e9 / tma_info_system= _time", "MetricGroup": "Mem;MemoryBW", "MetricName": "tma_info_memory_l3_cache_fill_bw" }, @@ -804,7 +810,7 @@ { "BriefDescription": "Average Latency for L2 cache miss demand Load= s", "MetricExpr": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD / OFFCO= RE_REQUESTS.DEMAND_DATA_RD", - "MetricGroup": "Memory_Lat;Offcore", + "MetricGroup": "LockCont;Memory_Lat;Offcore", "MetricName": "tma_info_memory_latency_load_l2_miss_latency" }, { @@ -837,19 +843,19 @@ }, { "BriefDescription": "Instruction-Level-Parallelism (average number= of uops executed when there is execution) per core", - "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", + "MetricExpr": "UOPS_EXECUTED.THREAD / (cpu@UOPS_EXECUTED.CORE\\,cm= ask\\=3D0x1@ / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC)", "MetricGroup": "Cor;Pipeline;PortsUtil;SMT", "MetricName": "tma_info_pipeline_execute" }, { - "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired.", - "MetricExpr": 
"UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D1@", + "BriefDescription": "Average number of Uops retired in cycles wher= e at least one uop has retired", + "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / cpu@UOPS_RETIRED.RETIRE= _SLOTS\\,cmask\\=3D0x1@", "MetricGroup": "Pipeline;Ret", "MetricName": "tma_info_pipeline_retire" }, { "BriefDescription": "Measured Average Core Frequency for unhalted = processors [GHz]", - "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / dur= ation_time", + "MetricExpr": "tma_info_system_turbo_utilization * TSC / 1e9 / tma= _info_system_time", "MetricGroup": "Power;Summary", "MetricName": "tma_info_system_core_frequency" }, @@ -867,14 +873,14 @@ }, { "BriefDescription": "Average external Memory Bandwidth Use for rea= ds and writes [GB / sec]", - "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / duration_time", + "MetricExpr": "64 * (UNC_M_CAS_COUNT.RD + UNC_M_CAS_COUNT.WR) / 1e= 9 / tma_info_system_time", "MetricGroup": "HPC;MemOffcore;MemoryBW;SoC;tma_issueBW", "MetricName": "tma_info_system_dram_bw_use", "PublicDescription": "Average external Memory Bandwidth Use for re= ads and writes [GB / sec]. Related metrics: tma_fb_full, tma_mem_bandwidth,= tma_sq_full" }, { "BriefDescription": "Giga Floating Point Operations Per Second", - "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / duration_time", + "MetricExpr": "(FP_ARITH_INST_RETIRED.SCALAR + 2 * FP_ARITH_INST_R= ETIRED.128B_PACKED_DOUBLE + 4 * FP_ARITH_INST_RETIRED.4_FLOPS + 8 * FP_ARIT= H_INST_RETIRED.256B_PACKED_SINGLE) / 1e9 / tma_info_system_time", "MetricGroup": "Cor;Flops;HPC", "MetricName": "tma_info_system_gflops", "PublicDescription": "Giga Floating Point Operations Per Second. A= ggregate across all supported options of: FP precisions, scalar and vector = instructions, vector-width" @@ -884,13 +890,14 @@ "MetricExpr": "INST_RETIRED.ANY / BR_INST_RETIRED.FAR_BRANCH:u", "MetricGroup": "Branches;OS", "MetricName": "tma_info_system_ipfarbranch", - "MetricThreshold": "tma_info_system_ipfarbranch < 1e6" + "MetricThreshold": "tma_info_system_ipfarbranch < 1000000" }, { "BriefDescription": "Cycles Per Instruction for the Operating Syst= em (OS) Kernel mode", "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P:k / INST_RETIRED.ANY_P:k"= , "MetricGroup": "OS", - "MetricName": "tma_info_system_kernel_cpi" + "MetricName": "tma_info_system_kernel_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "Fraction of cycles spent in the Operating Sys= tem (OS) Kernel mode", @@ -901,18 +908,31 @@ }, { "BriefDescription": "Average number of parallel data read requests= to external memory", - "MetricExpr": "UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=3D0x18= 2@ / UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\=3D0x182\\,thresh\\=3D1@", + "MetricExpr": "cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\= =3D0x182@ / cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filter_opc\\=3D0x182@", "MetricGroup": "Mem;MemoryBW;SoC", "MetricName": "tma_info_system_mem_parallel_reads", "PublicDescription": "Average number of parallel data read request= s to external memory. 
Accounts for demand loads and L1/L2 prefetches" }, { "BriefDescription": "Average latency of data read request to exter= nal memory (in nanoseconds)", - "MetricExpr": "1e9 * (UNC_C_TOR_OCCUPANCY.MISS_OPCODE@filter_opc\\= =3D0x182@ / UNC_C_TOR_INSERTS.MISS_OPCODE@filter_opc\\=3D0x182@) / (tma_inf= o_system_socket_clks / duration_time)", + "MetricExpr": "1e9 * (cbox@UNC_C_TOR_OCCUPANCY.MISS_OPCODE\\,filte= r_opc\\=3D0x182@ / cbox@UNC_C_TOR_INSERTS.MISS_OPCODE\\,filter_opc\\=3D0x18= 2@) / (tma_info_system_socket_clks / tma_info_system_time)", "MetricGroup": "Mem;MemoryLat;SoC", "MetricName": "tma_info_system_mem_read_latency", "PublicDescription": "Average latency of data read request to exte= rnal memory (in nanoseconds). Accounts for demand loads and L1/L2 prefetche= s. ([RKL+]memory-controller only)" }, + { + "BriefDescription": "PerfMon Event Multiplexing accuracy indicator= ", + "MetricExpr": "CPU_CLK_UNHALTED.THREAD_P / CPU_CLK_UNHALTED.THREAD= ", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_mux", + "MetricThreshold": "tma_info_system_mux > 1.1 | tma_info_system_mu= x < 0.9" + }, + { + "BriefDescription": "Total package Power in Watts", + "MetricExpr": "(power@energy\\-pkg@ * 61 + 15.6 * power@energy\\-r= am@) / (duration_time * 1e6)", + "MetricGroup": "Power;SoC", + "MetricName": "tma_info_system_power" + }, { "BriefDescription": "Fraction of cycles where both hardware Logica= l Processors were active", "MetricExpr": "(1 - CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE / (CPU_CLK_= UNHALTED.REF_XCLK_ANY / 2) if #SMT_on else 0)", @@ -925,6 +945,13 @@ "MetricGroup": "SoC", "MetricName": "tma_info_system_socket_clks" }, + { + "BriefDescription": "Run duration time in seconds", + "MetricExpr": "duration_time", + "MetricGroup": "Summary", + "MetricName": "tma_info_system_time", + "MetricThreshold": "tma_info_system_time < 1" + }, { "BriefDescription": "Average Frequency Utilization relative nomina= l frequency", "MetricExpr": "tma_info_thread_clks / CPU_CLK_UNHALTED.REF_TSC", @@ -933,12 +960,12 @@ }, { "BriefDescription": "Measured Average Uncore Frequency for the SoC= [GHz]", - "MetricExpr": "tma_info_system_socket_clks / 1e9 / duration_time", + "MetricExpr": "tma_info_system_socket_clks / 1e9 / tma_info_system= _time", "MetricGroup": "SoC", "MetricName": "tma_info_system_uncore_frequency" }, { - "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active.", + "BriefDescription": "Per-Logical Processor actual clocks when the = Logical Processor is active", "MetricExpr": "CPU_CLK_UNHALTED.THREAD", "MetricGroup": "Pipeline", "MetricName": "tma_info_thread_clks" @@ -947,14 +974,15 @@ "BriefDescription": "Cycles Per Instruction (per Logical Processor= )", "MetricExpr": "1 / tma_info_thread_ipc", "MetricGroup": "Mem;Pipeline", - "MetricName": "tma_info_thread_cpi" + "MetricName": "tma_info_thread_cpi", + "ScaleUnit": "1per_instr" }, { "BriefDescription": "The ratio of Executed- by Issued-Uops", "MetricExpr": "UOPS_EXECUTED.THREAD / UOPS_ISSUED.ANY", "MetricGroup": "Cor;Pipeline", "MetricName": "tma_info_thread_execute_per_issue", - "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. Ratio < 1 suggest high rate o= f \"execute\" at rename stage." + "PublicDescription": "The ratio of Executed- by Issued-Uops. Ratio= > 1 suggests high rate of uop micro-fusions. 
Ratio < 1 suggest high rate o= f \"execute\" at rename stage" }, { "BriefDescription": "Instructions Per Cycle (per Logical Processor= )", @@ -980,24 +1008,24 @@ "MetricExpr": "UOPS_RETIRED.RETIRE_SLOTS / BR_INST_RETIRED.NEAR_TA= KEN", "MetricGroup": "Branches;Fed;FetchBW", "MetricName": "tma_info_thread_uptb", - "MetricThreshold": "tma_info_thread_uptb < 6" + "MetricThreshold": "tma_info_thread_uptb < 4 * 1.5" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU was stalled due to Instruction TLB (ITLB) misses", - "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_D= URATION\\,cmask\\=3D1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_thread_= clks", + "MetricExpr": "(14 * ITLB_MISSES.STLB_HIT + cpu@ITLB_MISSES.WALK_D= URATION\\,cmask\\=3D0x1@ + 7 * ITLB_MISSES.WALK_COMPLETED) / tma_info_threa= d_clks", "MetricGroup": "BigFootprint;BvBC;FetchLat;MemoryTLB;TopdownL3;tma= _L3_group;tma_fetch_latency_group", "MetricName": "tma_itlb_misses", - "MetricThreshold": "tma_itlb_misses > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", + "MetricThreshold": "tma_itlb_misses > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Instruction TLB (ITLB) misses. Sample with: ITLB_M= ISSES.WALK_COMPLETED", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 data cache", + "BriefDescription": "This metric estimates how often the CPU was s= talled without loads missing the L1 Data (L1D) cache", "MetricExpr": "max((CYCLE_ACTIVITY.STALLS_MEM_ANY - CYCLE_ACTIVITY= .STALLS_L1D_MISS) / tma_info_thread_clks, 0)", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_issueL1;tma_issueMC;tma_memory_bound_group", "MetricName": "tma_l1_bound", - "MetricThreshold": "tma_l1_bound > 0.1 & (tma_memory_bound > 0.2 &= tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 data cache. The L1 data cache typical= ly has the shortest latency. However; in certain cases like loads blocked = on older stores; a load might suffer due to high latency even though it is = being satisfied by the L1. Another example is loads who miss in the TLB. Th= ese cases are characterized by execution unit stalls; while some non-comple= ted demand load lives in the machine without having that demand load missin= g the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT_PS;MEM_LOAD_UOPS_= RETIRED.HIT_LFB_PS. Related metrics: tma_clears_resteers, tma_machine_clear= s, tma_microcode_sequencer, tma_ms_switches, tma_ports_utilized_1", + "MetricThreshold": "tma_l1_bound > 0.1 & tma_memory_bound > 0.2 & = tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled without loads missing the L1 Data (L1D) cache. The L1D cache typic= ally has the shortest latency. However; in certain cases like loads blocke= d on older stores; a load might suffer due to high latency even though it i= s being satisfied by the L1D. Another example is loads who miss in the TLB.= These cases are characterized by execution unit stalls; while some non-com= pleted demand load lives in the machine without having that demand load mis= sing the L1 cache. Sample with: MEM_LOAD_UOPS_RETIRED.L1_HIT. 
Related metri= cs: tma_clears_resteers, tma_machine_clears, tma_microcode_sequencer, tma_m= s_switches, tma_ports_utilized_1", "ScaleUnit": "100%" }, { @@ -1005,8 +1033,8 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_L1D_MISS - CYCLE_ACTIVITY.ST= ALLS_L2_MISS) / tma_info_thread_clks", "MetricGroup": "BvML;CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_= L3_group;tma_memory_bound_group", "MetricName": "tma_l2_bound", - "MetricThreshold": "tma_l2_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT_PS", + "MetricThreshold": "tma_l2_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to L2 cache accesses by loads. Avoiding cache misses (i.e. L1 = misses/L2 hits) can improve the latency and increase performance. Sample wi= th: MEM_LOAD_UOPS_RETIRED.L2_HIT", "ScaleUnit": "100%" }, { @@ -1015,8 +1043,8 @@ "MetricExpr": "MEM_LOAD_UOPS_RETIRED.L3_HIT / (MEM_LOAD_UOPS_RETIR= ED.L3_HIT + 7 * MEM_LOAD_UOPS_RETIRED.L3_MISS) * CYCLE_ACTIVITY.STALLS_L2_M= ISS / tma_info_thread_clks", "MetricGroup": "CacheHits;MemoryBound;TmaL3mem;TopdownL3;tma_L3_gr= oup;tma_memory_bound_group", "MetricName": "tma_l3_bound", - "MetricThreshold": "tma_l3_bound > 0.05 & (tma_memory_bound > 0.2 = & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS", + "MetricThreshold": "tma_l3_bound > 0.05 & tma_memory_bound > 0.2 &= tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often the CPU was = stalled due to loads accesses to L3 cache or contended with a sibling Core.= Avoiding cache misses (i.e. L2 misses/L3 hits) can improve the latency an= d increase performance. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT", "ScaleUnit": "100%" }, { @@ -1025,8 +1053,8 @@ "MetricExpr": "41 * (MEM_LOAD_UOPS_RETIRED.L3_HIT * (1 + MEM_LOAD_= UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOAD_UOPS_RETIRE= D.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UOPS_L3_HIT_RET= IRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_L3_= MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM + MEM_L= OAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE= _FWD))) / tma_info_thread_clks", "MetricGroup": "BvML;MemoryLat;TopdownL4;tma_L4_group;tma_issueLat= ;tma_l3_bound_group", "MetricName": "tma_l3_hit_latency", - "MetricThreshold": "tma_l3_hit_latency > 0.1 & (tma_l3_bound > 0.0= 5 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT_PS. 
Related metric= s: tma_mem_latency", + "MetricThreshold": "tma_l3_hit_latency > 0.1 & tma_l3_bound > 0.05= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles wit= h demand load accesses that hit the L3 cache under unloaded scenarios (poss= ibly L3 latency limited). Avoiding private cache misses (i.e. L2 misses/L3= hits) will improve the latency; reduce contention with sibling physical co= res and increase performance. Note the value of this node may overlap with= its siblings. Sample with: MEM_LOAD_UOPS_RETIRED.L3_HIT. Related metrics: = tma_branch_resteers, tma_mem_latency, tma_store_latency", "ScaleUnit": "100%" }, { @@ -1034,18 +1062,18 @@ "MetricExpr": "ILD_STALL.LCP / tma_info_thread_clks", "MetricGroup": "FetchLat;TopdownL3;tma_L3_group;tma_fetch_latency_= group;tma_issueFB", "MetricName": "tma_lcp", - "MetricThreshold": "tma_lcp > 0.05 & (tma_fetch_latency > 0.1 & tm= a_frontend_bound > 0.15)", - "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. #Link: Optim= ization Guide about LCP BKMs. Related metrics: tma_dsb_switches, tma_fetch_= bandwidth, tma_info_frontend_dsb_coverage, tma_info_inst_mix_iptb", + "MetricThreshold": "tma_lcp > 0.05 & tma_fetch_latency > 0.1 & tma= _frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles CP= U was stalled due to Length Changing Prefixes (LCPs). Using proper compiler= flags or Intel Compiler by default will certainly avoid this. Related metr= ics: tma_dsb_switches, tma_fetch_bandwidth, tma_info_frontend_dsb_coverage,= tma_info_inst_mix_iptb", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations -- instructions that require= no more than one uop (micro-operation)", + "BriefDescription": "This metric represents fraction of slots wher= e the CPU was retiring light-weight operations , instructions that require = no more than one uop (micro-operation)", "MetricExpr": "tma_retiring - tma_heavy_operations", "MetricGroup": "Retire;TmaL2;TopdownL2;tma_L2_group;tma_retiring_g= roup", "MetricName": "tma_light_operations", "MetricThreshold": "tma_light_operations > 0.6", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations -- instructions that requir= e no more than one uop (micro-operation). This correlates with total number= of instructions used by the program. A uops-per-instruction (see UopPI met= ric) ratio of 1 or less should be expected for decently optimized code runn= ing on Intel Core/Xeon products. While this often indicates efficient X86 i= nstructions were executed; high value does not necessarily mean better perf= ormance cannot be achieved. ([ICL+] Note this may undercount due to approxi= mation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIS= T", + "PublicDescription": "This metric represents fraction of slots whe= re the CPU was retiring light-weight operations , instructions that require= no more than one uop (micro-operation). This correlates with total number = of instructions used by the program. A uops-per-instruction (see UopPI metr= ic) ratio of 1 or less should be expected for decently optimized code runni= ng on Intel Core/Xeon products. 
While this often indicates efficient X86 in= structions were executed; high value does not necessarily mean better perfo= rmance cannot be achieved. ([ICL+] Note this may undercount due to approxim= ation using indirect events; [ADL+] .). Sample with: INST_RETIRED.PREC_DIST= ", "ScaleUnit": "100%" }, { @@ -1063,18 +1091,18 @@ "MetricExpr": "200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM * (= 1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LOA= D_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_UO= PS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM_= LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE= _DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_R= ETIRED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "Server;TopdownL5;tma_L5_group;tma_mem_latency_grou= p", "MetricName": "tma_local_mem", - "MetricThreshold": "tma_local_mem > 0.1 & (tma_mem_latency > 0.1 &= (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2)= ))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.LOCAL_DRAM_PS", + "MetricThreshold": "tma_local_mem > 0.1 & tma_mem_latency > 0.1 & = tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from local memory. Caching will = improve the latency and increase performance. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.LOCAL_DRAM", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles the= CPU spent handling cache misses due to lock operations", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "MEM_UOPS_RETIRED.LOCK_LOADS / MEM_UOPS_RETIRED.ALL_= STORES * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTSTANDING.CYCLES_W= ITH_DEMAND_RFO) / tma_info_thread_clks", - "MetricGroup": "Offcore;TopdownL4;tma_L4_group;tma_issueRFO;tma_l1= _bound_group", + "MetricGroup": "LockCont;Offcore;TopdownL4;tma_L4_group;tma_issueR= FO;tma_l1_bound_group", "MetricName": "tma_lock_latency", - "MetricThreshold": "tma_lock_latency > 0.2 & (tma_l1_bound > 0.1 &= (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS_PS. Related metrics: tma_store_latency", + "MetricThreshold": "tma_lock_latency > 0.2 & tma_l1_bound > 0.1 & = tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles th= e CPU spent handling cache misses due to lock operations. Due to the microa= rchitecture handling of locks; they are classified as L1_Bound regardless o= f what memory source satisfied them. Sample with: MEM_UOPS_RETIRED.LOCK_LOA= DS. 
Related metrics: tma_store_latency", "ScaleUnit": "100%" }, { @@ -1090,10 +1118,10 @@ }, { "BriefDescription": "This metric estimates fraction of cycles wher= e the core's performance was likely hurt due to approaching bandwidth limit= s of external memory - DRAM ([SPR-HBM] and/or HBM)", - "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D4@) / tma_info_thread_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", + "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, cpu@OFFCORE_REQUESTS_O= UTSTANDING.ALL_DATA_RD\\,cmask\\=3D0x4@) / tma_info_thread_clks", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_d= ram_bound_group;tma_issueBW", "MetricName": "tma_mem_bandwidth", - "MetricThreshold": "tma_mem_bandwidth > 0.2 & (tma_dram_bound > 0.= 1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_mem_bandwidth > 0.2 & tma_dram_bound > 0.1= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whe= re the core's performance was likely hurt due to approaching bandwidth limi= ts of external memory - DRAM ([SPR-HBM] and/or HBM). The underlying heuris= tic assumes that a similar off-core traffic is generated by all IA cores. T= his metric does not aggregate non-data-read requests by this logical proces= sor; requests from other IA Logical Processors/Physical Cores/sockets; or o= ther non-IA devices like GPU; hence the maximum external memory bandwidth l= imits may or may not be approached when this metric is flagged (see Uncore = counters for that). Related metrics: tma_fb_full, tma_info_system_dram_bw_u= se, tma_sq_full", "ScaleUnit": "100%" }, @@ -1102,7 +1130,7 @@ "MetricExpr": "min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS_OUTST= ANDING.CYCLES_WITH_DATA_RD) / tma_info_thread_clks - tma_mem_bandwidth", "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= dram_bound_group;tma_issueLat", "MetricName": "tma_mem_latency", - "MetricThreshold": "tma_mem_latency > 0.1 & (tma_dram_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_mem_latency > 0.1 & tma_dram_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric estimates fraction of cycles whe= re the performance was likely hurt due to latency from external memory - DR= AM ([SPR-HBM] and/or HBM). This metric does not aggregate requests from ot= her Logical Processors/Physical Cores/sockets (see Uncore counters for that= ). Related metrics: tma_l3_hit_latency", "ScaleUnit": "100%" }, @@ -1114,7 +1142,7 @@ "MetricName": "tma_memory_bound", "MetricThreshold": "tma_memory_bound > 0.2 & tma_backend_bound > 0= .2", "MetricgroupNoGroup": "TopdownL2", - "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= .", + "PublicDescription": "This metric represents fraction of slots the= Memory subsystem within the Backend was a bottleneck. 
Memory Bound estima= tes fraction of slots where pipeline is likely stalled due to demand load o= r store instructions. This accounts mainly for (1) non-completed in-flight = memory demand loads which coincides with execution units starvation; in add= ition to (2) cases where stores could impose backpressure on the pipeline w= hen many of them get buffered at the same time (less common out of the two)= ", "ScaleUnit": "100%" }, { @@ -1131,8 +1159,8 @@ "MetricExpr": "BR_MISP_RETIRED.ALL_BRANCHES * tma_branch_resteers = / (BR_MISP_RETIRED.ALL_BRANCHES + MACHINE_CLEARS.COUNT + BACLEARS.ANY)", "MetricGroup": "BadSpec;BrMispredicts;BvMP;TopdownL4;tma_L4_group;= tma_branch_resteers_group;tma_issueBM", "MetricName": "tma_mispredicts_resteers", - "MetricThreshold": "tma_mispredicts_resteers > 0.05 & (tma_branch_= resteers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", - "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Related metrics: tma_branch_mispredicts, tma_info_bad= _spec_branch_misprediction_cost", + "MetricThreshold": "tma_mispredicts_resteers > 0.05 & tma_branch_r= esteers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to Branch Resteers as a result of Branch Mispredictio= n at execution stage. Related metrics: tma_branch_mispredicts", "ScaleUnit": "100%" }, { @@ -1141,7 +1169,7 @@ "MetricGroup": "DSBmiss;FetchBW;TopdownL3;tma_L3_group;tma_fetch_b= andwidth_group", "MetricName": "tma_mite", "MetricThreshold": "tma_mite > 0.1 & tma_fetch_bandwidth > 0.2", - "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck.", + "PublicDescription": "This metric represents Core fraction of cycl= es in which CPU was likely limited due to the MITE pipeline (the legacy dec= ode pipeline). This pipeline is used for code that was not pre-cached in th= e DSB or LSD. For example; inefficiencies due to asymmetric decoders; use o= f long immediate or LCP can manifest as MITE fetch bandwidth bottleneck", "ScaleUnit": "100%" }, { @@ -1149,8 +1177,8 @@ "MetricExpr": "2 * IDQ.MS_SWITCHES / tma_info_thread_clks", "MetricGroup": "FetchLat;MicroSeq;TopdownL3;tma_L3_group;tma_fetch= _latency_group;tma_issueMC;tma_issueMS;tma_issueMV;tma_issueSO", "MetricName": "tma_ms_switches", - "MetricThreshold": "tma_ms_switches > 0.05 & (tma_fetch_latency > = 0.1 & tma_frontend_bound > 0.15)", - "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. 
Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer, tma_mix= ing_vectors, tma_serializing_operation", + "MetricThreshold": "tma_ms_switches > 0.05 & tma_fetch_latency > 0= .1 & tma_frontend_bound > 0.15", + "PublicDescription": "This metric estimates the fraction of cycles= when the CPU was stalled due to switches of uop delivery to the Microcode = Sequencer (MS). Commonly used instructions are optimized for delivery by th= e DSB (decoded i-cache) or MITE (legacy instruction decode) pipelines. Cert= ain operations cannot be handled natively by the execution pipeline; and mu= st be performed by microcode (small programs injected into the execution st= ream). Switching to the MS too often can negatively impact performance. The= MS is designated to deliver long uop flows required by CISC instructions l= ike CPUID; or uncommon conditions like Floating Point Assists when dealing = with Denormals. Sample with: IDQ.MS_SWITCHES. Related metrics: tma_clears_r= esteers, tma_l1_bound, tma_machine_clears, tma_microcode_sequencer", "ScaleUnit": "100%" }, { @@ -1159,7 +1187,7 @@ "MetricGroup": "Compute;TopdownL6;tma_L6_group;tma_alu_op_utilizat= ion_group;tma_issue2P", "MetricName": "tma_port_0", "MetricThreshold": "tma_port_0 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vecto= r_512b, tma_port_1, tma_port_5, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 0 ([SNB+] ALU; [HSW+] ALU and 2nd = branch). Sample with: UOPS_DISPATCHED_PORT.PORT_0. Related metrics: tma_fp_= scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_1, = tma_port_5, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1168,7 +1196,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_1", "MetricThreshold": "tma_port_1 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_fp_vector_512b, tma_port_0, tma_port_5, tm= a_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 1 (ALU). Sample with: UOPS_DISPATC= HED_PORT.PORT_1. Related metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vect= or_128b, tma_fp_vector_256b, tma_port_0, tma_port_5, tma_port_6, tma_ports_= utilized_2", "ScaleUnit": "100%" }, { @@ -1204,7 +1232,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_5", "MetricThreshold": "tma_port_5 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED.PORT_5. 
Related metrics: tma_fp_scalar,= tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector_512b,= tma_port_0, tma_port_1, tma_port_6, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 5 ([SNB+] Branches and ALU; [HSW+]= ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_5. Related metrics: tma_fp_sc= alar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, tm= a_port_1, tma_port_6, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1213,7 +1241,7 @@ "MetricGroup": "TopdownL6;tma_L6_group;tma_alu_op_utilization_grou= p;tma_issue2P", "MetricName": "tma_port_6", "MetricThreshold": "tma_port_6 > 0.6", - "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_6. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_fp_vector= _512b, tma_port_0, tma_port_1, tma_port_5, tma_ports_utilized_2", + "PublicDescription": "This metric represents Core fraction of cycl= es CPU dispatched uops on execution port 6 ([HSW+] Primary Branch and simpl= e ALU). Sample with: UOPS_DISPATCHED_PORT.PORT_1. Related metrics: tma_fp_s= calar, tma_fp_vector, tma_fp_vector_128b, tma_fp_vector_256b, tma_port_0, t= ma_port_1, tma_port_5, tma_ports_utilized_2", "ScaleUnit": "100%" }, { @@ -1231,43 +1259,43 @@ "MetricExpr": "(CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES= _GE_1_UOP_EXEC - (UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC if tma_info_thread_ip= c > 1.8 else UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) - (RS_EVENTS.EMPTY_CYCLES= if tma_fetch_latency > 0.1 else 0) + RESOURCE_STALLS.SB - RESOURCE_STALLS.= SB - CYCLE_ACTIVITY.STALLS_MEM_ANY) / tma_info_thread_clks", "MetricGroup": "PortsUtil;TopdownL3;tma_L3_group;tma_core_bound_gr= oup", "MetricName": "tma_ports_utilization", - "MetricThreshold": "tma_ports_utilization > 0.15 & (tma_core_bound= > 0.1 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. For example; when there are too many multiply operations.", + "MetricThreshold": "tma_ports_utilization > 0.15 & tma_core_bound = > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU performance was potentially limited due to Core computation issues (no= n divider-related). Two distinct categories can be attributed into this me= tric: (1) heavy data-dependency among contiguous instructions would manifes= t in this metric - such cases are often referred to as low Instruction Leve= l Parallelism (ILP). (2) Contention on some hardware execution unit other t= han Divider. 
For example; when there are too many multiply operations", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed no uops on any execution port (Logical Processor cycles since ICL= , Physical Core cycles otherwise)", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\,cmask\\=3D1@ / 2 if= #SMT_on else (CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCLES if tma= _fetch_latency > 0.1 else 0)) / tma_info_core_core_clks)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,inv\\=3D0x1\\,cmask\\=3D0= x1@ / 2 if #SMT_on else CYCLE_ACTIVITY.STALLS_TOTAL - (RS_EVENTS.EMPTY_CYCL= ES if tma_fetch_latency > 0.1 else 0)) / tma_info_core_core_clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_ports_utiliza= tion_group", "MetricName": "tma_ports_utilized_0", - "MetricThreshold": "tma_ports_utilized_0 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric.", + "MetricThreshold": "tma_ports_utilized_0 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed no uops on any execution port (Logical Processor cycles since IC= L, Physical Core cycles otherwise). Long-latency instructions like divides = may contribute to this metric", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles whe= re the CPU executed total of 1 uop per cycle on all execution ports (Logica= l Processor cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D1@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D2@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_clks= )", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x1@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x2@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_= GE_1_UOP_EXEC - UOPS_EXECUTED.CYCLES_GE_2_UOPS_EXEC) / tma_info_core_core_c= lks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issueL1;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_1", - "MetricThreshold": "tma_ports_utilized_1 > 0.2 & (tma_ports_utiliz= ation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_1 > 0.2 & tma_ports_utiliza= tion > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "PublicDescription": "This metric represents fraction of cycles wh= ere the CPU executed total of 1 uop per cycle on all execution ports (Logic= al Processor cycles since ICL, Physical Core cycles otherwise). This can be= due to heavy data-dependency among software instructions; or over oversubs= cribing a particular hardware resource. In some other cases with high 1_Por= t_Utilized and L1_Bound; this metric can point to L1 data-cache latency bot= tleneck that may not necessarily manifest with complete execution starvatio= n (due to the short L1 latency e.g. walking a linked list) - looking at the= assembly can be helpful. 
Related metrics: tma_l1_bound", "ScaleUnit": "100%" }, { "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 2 uops per cycle on all execution ports (Logical Process= or cycles since ICL, Physical Core cycles otherwise)", - "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D2@ - cpu@UOPS_= EXECUTED.CORE\\,cmask\\=3D3@) / 2 if #SMT_on else (UOPS_EXECUTED.CYCLES_GE_= 2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clk= s)", + "MetricExpr": "((cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x2@ - cpu@UOP= S_EXECUTED.CORE\\,cmask\\=3D0x3@) / 2 if #SMT_on else UOPS_EXECUTED.CYCLES_= GE_2_UOPS_EXEC - UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_= clks", "MetricGroup": "PortsUtil;TopdownL4;tma_L4_group;tma_issue2P;tma_p= orts_utilization_group", "MetricName": "tma_ports_utilized_2", - "MetricThreshold": "tma_ports_utilized_2 > 0.15 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_fp_vector_512b, tma_port_0, tma_port_1, tma_port_5, tma_port= _6", + "MetricThreshold": "tma_ports_utilized_2 > 0.15 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents fraction of cycles CP= U executed total of 2 uops per cycle on all execution ports (Logical Proces= sor cycles since ICL, Physical Core cycles otherwise). Loop Vectorization = -most compilers feature auto-Vectorization options today- reduces pressure = on the execution ports as multiple elements are calculated with same uop. 
R= elated metrics: tma_fp_scalar, tma_fp_vector, tma_fp_vector_128b, tma_fp_ve= ctor_256b, tma_port_0, tma_port_1, tma_port_5, tma_port_6", "ScaleUnit": "100%" }, { - "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise).", - "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D3@ / 2 if #SMT_= on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks", + "BriefDescription": "This metric represents fraction of cycles CPU= executed total of 3 or more uops per cycle on all execution ports (Logical= Processor cycles since ICL, Physical Core cycles otherwise)", + "MetricExpr": "(cpu@UOPS_EXECUTED.CORE\\,cmask\\=3D0x3@ / 2 if #SM= T_on else UOPS_EXECUTED.CYCLES_GE_3_UOPS_EXEC) / tma_info_core_core_clks", "MetricGroup": "BvCB;PortsUtil;TopdownL4;tma_L4_group;tma_ports_ut= ilization_group", "MetricName": "tma_ports_utilized_3m", - "MetricThreshold": "tma_ports_utilized_3m > 0.4 & (tma_ports_utili= zation > 0.15 & (tma_core_bound > 0.1 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_ports_utilized_3m > 0.4 & tma_ports_utiliz= ation > 0.15 & tma_core_bound > 0.1 & tma_backend_bound > 0.2", "ScaleUnit": "100%" }, { @@ -1276,8 +1304,8 @@ "MetricExpr": "(200 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM *= (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_L= OAD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_= UOPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + ME= M_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMO= TE_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS= _RETIRED.REMOTE_FWD))) + 180 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD * = (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LO= AD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM= _LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOT= E_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_= RETIRED.REMOTE_FWD)))) / tma_info_thread_clks", "MetricGroup": "Offcore;Server;Snoop;TopdownL5;tma_L5_group;tma_is= sueSyncxn;tma_mem_latency_group", "MetricName": "tma_remote_cache", - "MetricThreshold": "tma_remote_cache > 0.05 & (tma_mem_latency > 0= .1 & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > = 0.2)))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. #link to NUMA article. Sample with: MEM_LOAD_UOPS_L3_M= ISS_RETIRED.REMOTE_HITM_PS;MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD_PS. Rel= ated metrics: tma_contested_accesses, tma_data_sharing, tma_false_sharing, = tma_machine_clears", + "MetricThreshold": "tma_remote_cache > 0.05 & tma_mem_latency > 0.= 1 & tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2= ", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote cache in other socke= ts including synchronizations issues. This is caused often due to non-optim= al NUMA allocations. Sample with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM= , MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_FWD. 
Related metrics: tma_contested_= accesses, tma_data_sharing, tma_false_sharing, tma_machine_clears", "ScaleUnit": "100%" }, { @@ -1285,8 +1313,8 @@ "MetricExpr": "310 * (MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM * = (1 + MEM_LOAD_UOPS_RETIRED.HIT_LFB / (MEM_LOAD_UOPS_RETIRED.L2_HIT + MEM_LO= AD_UOPS_RETIRED.L3_HIT + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_HIT + MEM_LOAD_U= OPS_L3_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_L3_HIT_RETIRED.XSNP_MISS + MEM= _LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOT= E_DRAM + MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_HITM + MEM_LOAD_UOPS_L3_MISS_= RETIRED.REMOTE_FWD))) / tma_info_thread_clks", "MetricGroup": "Server;Snoop;TopdownL5;tma_L5_group;tma_mem_latenc= y_group", "MetricName": "tma_remote_mem", - "MetricThreshold": "tma_remote_mem > 0.1 & (tma_mem_latency > 0.1 = & (tma_dram_bound > 0.1 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2= )))", - "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. #link to NUMA article. Sample= with: MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM_PS", + "MetricThreshold": "tma_remote_mem > 0.1 & tma_mem_latency > 0.1 &= tma_dram_bound > 0.1 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles whi= le the memory subsystem was handling loads from remote memory. This is caus= ed often due to non-optimal NUMA allocations. Sample with: MEM_LOAD_UOPS_L3= _MISS_RETIRED.REMOTE_DRAM", "ScaleUnit": "100%" }, { @@ -1305,8 +1333,8 @@ "MetricExpr": "tma_info_memory_load_miss_real_latency * LD_BLOCKS.= NO_SR / tma_info_thread_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_split_loads", - "MetricThreshold": "tma_split_loads > 0.2 & (tma_l1_bound > 0.1 & = (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS_PS", + "MetricThreshold": "tma_split_loads > 0.3", + "PublicDescription": "This metric estimates fraction of cycles han= dling memory load split accesses - load that cross 64-byte cache line bound= ary. Sample with: MEM_UOPS_RETIRED.SPLIT_LOADS", "ScaleUnit": "100%" }, { @@ -1314,16 +1342,16 @@ "MetricExpr": "2 * MEM_UOPS_RETIRED.SPLIT_STORES / tma_info_core_c= ore_clks", "MetricGroup": "TopdownL4;tma_L4_group;tma_issueSpSt;tma_store_bou= nd_group", "MetricName": "tma_split_stores", - "MetricThreshold": "tma_split_stores > 0.2 & (tma_store_bound > 0.= 2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES_PS. Related metrics: tma_port_= 4", + "MetricThreshold": "tma_split_stores > 0.2 & tma_store_bound > 0.2= & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric represents rate of split store a= ccesses. Consider aligning your data to the 64-byte cache line granularity= . Sample with: MEM_UOPS_RETIRED.SPLIT_STORES. 
Related metrics: tma_port_4", "ScaleUnit": "100%" }, { "BriefDescription": "This metric measures fraction of cycles where= the Super Queue (SQ) was full taking into account all request-types and bo= th hardware SMT threads (Logical Processors)", "MetricExpr": "(OFFCORE_REQUESTS_BUFFER.SQ_FULL / 2 if #SMT_on els= e OFFCORE_REQUESTS_BUFFER.SQ_FULL) / tma_info_core_core_clks", - "MetricGroup": "BvMS;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", + "MetricGroup": "BvMB;MemoryBW;Offcore;TopdownL4;tma_L4_group;tma_i= ssueBW;tma_l3_bound_group", "MetricName": "tma_sq_full", - "MetricThreshold": "tma_sq_full > 0.3 & (tma_l3_bound > 0.05 & (tm= a_memory_bound > 0.2 & tma_backend_bound > 0.2))", + "MetricThreshold": "tma_sq_full > 0.3 & tma_l3_bound > 0.05 & tma_= memory_bound > 0.2 & tma_backend_bound > 0.2", "PublicDescription": "This metric measures fraction of cycles wher= e the Super Queue (SQ) was full taking into account all request-types and b= oth hardware SMT threads (Logical Processors). Related metrics: tma_fb_full= , tma_info_system_dram_bw_use, tma_mem_bandwidth", "ScaleUnit": "100%" }, @@ -1332,8 +1360,8 @@ "MetricExpr": "RESOURCE_STALLS.SB / tma_info_thread_clks", "MetricGroup": "MemoryBound;TmaL3mem;TopdownL3;tma_L3_group;tma_me= mory_bound_group", "MetricName": "tma_store_bound", - "MetricThreshold": "tma_store_bound > 0.2 & (tma_memory_bound > 0.= 2 & tma_backend_bound > 0.2)", - "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES_PS", + "MetricThreshold": "tma_store_bound > 0.2 & tma_memory_bound > 0.2= & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates how often CPU was stal= led due to RFO store memory accesses; RFO store issue a read-for-ownership= request before the write. Even though store accesses do not typically stal= l out-of-order CPUs; there are few cases where stores can lead to actual st= alls. This metric will be flagged should RFO stores be a bottleneck. Sample= with: MEM_UOPS_RETIRED.ALL_STORES", "ScaleUnit": "100%" }, { @@ -1341,18 +1369,18 @@ "MetricExpr": "13 * LD_BLOCKS.STORE_FORWARD / tma_info_thread_clks= ", "MetricGroup": "TopdownL4;tma_L4_group;tma_l1_bound_group", "MetricName": "tma_store_fwd_blk", - "MetricThreshold": "tma_store_fwd_blk > 0.1 & (tma_l1_bound > 0.1 = & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. 
For example; when the prior st= ore is writing a smaller region than the load is reading.", + "MetricThreshold": "tma_store_fwd_blk > 0.1 & tma_l1_bound > 0.1 &= tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric roughly estimates fraction of cy= cles when the memory subsystem had loads blocked since they could not forwa= rd data from earlier (in program order) overlapping stores. To streamline m= emory operations in the pipeline; a load can avoid waiting for memory if a = prior in-flight store is writing the data that the load wants to read (stor= e forwarding process). However; in some cases the load may be blocked for a= significant time pending the store forward. For example; when the prior st= ore is writing a smaller region than the load is reading", "ScaleUnit": "100%" }, { "BriefDescription": "This metric estimates fraction of cycles the = CPU spent handling L1D store misses", "MetricConstraint": "NO_GROUP_EVENTS", "MetricExpr": "(L2_RQSTS.RFO_HIT * 9 * (1 - MEM_UOPS_RETIRED.LOCK_= LOADS / MEM_UOPS_RETIRED.ALL_STORES) + (1 - MEM_UOPS_RETIRED.LOCK_LOADS / M= EM_UOPS_RETIRED.ALL_STORES) * min(CPU_CLK_UNHALTED.THREAD, OFFCORE_REQUESTS= _OUTSTANDING.CYCLES_WITH_DEMAND_RFO)) / tma_info_thread_clks", - "MetricGroup": "BvML;MemoryLat;Offcore;TopdownL4;tma_L4_group;tma_= issueRFO;tma_issueSL;tma_store_bound_group", + "MetricGroup": "BvML;LockCont;MemoryLat;Offcore;TopdownL4;tma_L4_g= roup;tma_issueRFO;tma_issueSL;tma_store_bound_group", "MetricName": "tma_store_latency", - "MetricThreshold": "tma_store_latency > 0.1 & (tma_store_bound > 0= .2 & (tma_memory_bound > 0.2 & tma_backend_bound > 0.2))", - "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_fb_full, tma_lock_latency", + "MetricThreshold": "tma_store_latency > 0.1 & tma_store_bound > 0.= 2 & tma_memory_bound > 0.2 & tma_backend_bound > 0.2", + "PublicDescription": "This metric estimates fraction of cycles the= CPU spent handling L1D store misses. Store accesses usually less impact ou= t-of-order core performance; however; holding resources for longer time can= lead into undesired implications (e.g. contention on L1D fill-buffer entri= es - see FB_Full). Related metrics: tma_branch_resteers, tma_fb_full, tma_l= 3_hit_latency, tma_lock_latency", "ScaleUnit": "100%" }, { @@ -1368,7 +1396,7 @@ "MetricExpr": "tma_branch_resteers - tma_mispredicts_resteers - tm= a_clears_resteers", "MetricGroup": "BigFootprint;BvBC;FetchLat;TopdownL4;tma_L4_group;= tma_branch_resteers_group", "MetricName": "tma_unknown_branches", - "MetricThreshold": "tma_unknown_branches > 0.05 & (tma_branch_rest= eers > 0.05 & (tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15))", + "MetricThreshold": "tma_unknown_branches > 0.05 & tma_branch_reste= ers > 0.05 & tma_fetch_latency > 0.1 & tma_frontend_bound > 0.15", "PublicDescription": "This metric represents fraction of cycles th= e CPU was stalled due to new branch address clears. These are fetched branc= hes the Branch Prediction Unit was unable to recognize (e.g. first time the= branch is fetched or hitting BTB capacity limit) hence called Unknown Bran= ches. 
Sample with: BACLEARS.ANY", "ScaleUnit": "100%" }, @@ -1377,8 +1405,8 @@ "MetricExpr": "INST_RETIRED.X87 * tma_info_thread_uoppi / UOPS_RET= IRED.RETIRE_SLOTS", "MetricGroup": "Compute;TopdownL4;tma_L4_group;tma_fp_arith_group"= , "MetricName": "tma_x87_use", - "MetricThreshold": "tma_x87_use > 0.1 & (tma_fp_arith > 0.2 & tma_= light_operations > 0.6)", - "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint.", + "MetricThreshold": "tma_x87_use > 0.1 & tma_fp_arith > 0.2 & tma_l= ight_operations > 0.6", + "PublicDescription": "This metric serves as an approximation of le= gacy x87 usage. It accounts for instructions beyond X87 FP arithmetic opera= tions; hence may be used as a thermometer to avoid X87 high usage and prefe= rably upgrade to modern ISA. See Tip under Tuning Hint", "ScaleUnit": "100%" }, { diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/cache.json b/tools/p= erf/pmu-events/arch/x86/broadwellx/cache.json index beeda41b428a..59bc8c71487e 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/cache.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/cache.json @@ -22,7 +22,7 @@ "Counter": "2", "EventCode": "0x48", "EventName": "L1D_PEND_MISS.PENDING", - "PublicDescription": "This event counts duration of L1D miss outst= anding, that is each cycle number of Fill Buffers (FB) outstanding required= by Demand Reads. FB either is held by demand loads, or it is held by non-d= emand loads and gets hit at least once by demand. The valid outstanding int= erval is defined until the FB deallocation by one of the following ways: fr= om FB allocation, if FB is allocated by demand; from the demand Hit FB, if = it is allocated by hardware or software prefetch.\nNote: In the L1D, a Dema= nd Read contains cacheable or noncacheable demand loads, including ones cau= sing cache-line splits and reads due to page walks resulted from any reques= t type.", + "PublicDescription": "This event counts duration of L1D miss outst= anding, that is each cycle number of Fill Buffers (FB) outstanding required= by Demand Reads. FB either is held by demand loads, or it is held by non-d= emand loads and gets hit at least once by demand. The valid outstanding int= erval is defined until the FB deallocation by one of the following ways: fr= om FB allocation, if FB is allocated by demand; from the demand Hit FB, if = it is allocated by hardware or software prefetch. Note: In the L1D, a Deman= d Read contains cacheable or noncacheable demand loads, including ones caus= ing cache-line splits and reads due to page walks resulted from any request= type.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -434,7 +434,7 @@ "EventCode": "0xD1", "EventName": "MEM_LOAD_UOPS_RETIRED.HIT_LFB", "PEBS": "1", - "PublicDescription": "This event counts retired load uops which da= ta sources were load uops missed L1 but hit a fill buffer due to a precedin= g miss to the same cache line with the data not ready.\nNote: Only two data= -sources of L1/FB are applicable for AVX-256bit even though the correspond= ing AVX load could be serviced by a deeper level in the memory hierarchy. 
D= ata source is reported for the Low-half load.", + "PublicDescription": "This event counts retired load uops which da= ta sources were load uops missed L1 but hit a fill buffer due to a precedin= g miss to the same cache line with the data not ready. Note: Only two data-= sources of L1/FB are applicable for AVX-256bit even though the correspondi= ng AVX load could be serviced by a deeper level in the memory hierarchy. Da= ta source is reported for the Low-half load.", "SampleAfterValue": "100003", "UMask": "0x40" }, @@ -445,7 +445,7 @@ "EventCode": "0xD1", "EventName": "MEM_LOAD_UOPS_RETIRED.L1_HIT", "PEBS": "1", - "PublicDescription": "This event counts retired load uops which da= ta sources were hits in the nearest-level (L1) cache.\nNote: Only two data-= sources of L1/FB are applicable for AVX-256bit even though the correspondi= ng AVX load could be serviced by a deeper level in the memory hierarchy. Da= ta source is reported for the Low-half load. This event also counts SW pref= etches independent of the actual data source.", + "PublicDescription": "This event counts retired load uops which da= ta sources were hits in the nearest-level (L1) cache. Note: Only two data-s= ources of L1/FB are applicable for AVX-256bit even though the correspondin= g AVX load could be serviced by a deeper level in the memory hierarchy. Dat= a source is reported for the Low-half load. This event also counts SW prefe= tches independent of the actual data source.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -634,7 +634,7 @@ "Counter": "0,1,2,3", "EventCode": "0xb2", "EventName": "OFFCORE_REQUESTS_BUFFER.SQ_FULL", - "PublicDescription": "This event counts the number of cases when t= he offcore requests buffer cannot take more entries for the core. This can = happen when the superqueue does not contain eligible entries, or when L1D w= riteback pending FIFO requests is full.\nNote: Writeback pending FIFO has s= ix entries.", + "PublicDescription": "This event counts the number of cases when t= he offcore requests buffer cannot take more entries for the core. This can = happen when the superqueue does not contain eligible entries, or when L1D w= riteback pending FIFO requests is full. Note: Writeback pending FIFO has si= x entries.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -697,7 +697,7 @@ "Errata": "BDM76", "EventCode": "0x60", "EventName": "OFFCORE_REQUESTS_OUTSTANDING.DEMAND_DATA_RD", - "PublicDescription": "This event counts the number of offcore outs= tanding Demand Data Read transactions in the super queue (SQ) every cycle. = A transaction is considered to be in the Offcore outstanding state between = L2 miss and transaction completion sent to requestor. See the corresponding= Umask under OFFCORE_REQUESTS.\nNote: A prefetch promoted to Demand is coun= ted from the promotion point.", + "PublicDescription": "This event counts the number of offcore outs= tanding Demand Data Read transactions in the super queue (SQ) every cycle. = A transaction is considered to be in the Offcore outstanding state between = L2 miss and transaction completion sent to requestor. See the corresponding= Umask under OFFCORE_REQUESTS. 
Note: A prefetch promoted to Demand is count= ed from the promotion point.", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/frontend.json b/tool= s/perf/pmu-events/arch/x86/broadwellx/frontend.json index db3488abf9fc..018020a51436 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/frontend.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/frontend.json @@ -12,7 +12,7 @@ "Counter": "0,1,2,3", "EventCode": "0xAB", "EventName": "DSB2MITE_SWITCHES.PENALTY_CYCLES", - "PublicDescription": "This event counts Decode Stream Buffer (DSB)= -to-MITE switch true penalty cycles. These cycles do not include uops route= d through because of the switch itself, for example, when Instruction Decod= e Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (I= DQ) is full. SBD-to-MITE switch true penalty cycles happen after the merge = mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receivin= g the first MITE uop. \nMM is placed before Instruction Decode Queue (IDQ) = to merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths.= Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode S= tream Buffer (DSB)-to-MITE switch occurs.\nPenalty: A Decode Stream Buffer = (DSB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six = cycles in which no uops are delivered to the IDQ. Most often, such switches= from the Decode Stream Buffer (DSB) to the legacy pipeline cost 02 cycles.= ", + "PublicDescription": "This event counts Decode Stream Buffer (DSB)= -to-MITE switch true penalty cycles. These cycles do not include uops route= d through because of the switch itself, for example, when Instruction Decod= e Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (I= DQ) is full. SBD-to-MITE switch true penalty cycles happen after the merge = mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receivin= g the first MITE uop. MM is placed before Instruction Decode Queue (IDQ) t= o merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths. = Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode St= ream Buffer (DSB)-to-MITE switch occurs. Penalty: A Decode Stream Buffer (D= SB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six cy= cles in which no uops are delivered to the IDQ. Most often, such switches f= rom the Decode Stream Buffer (DSB) to the legacy pipeline cost 02 cycles.", "SampleAfterValue": "2000003", "UMask": "0x2" }, @@ -212,7 +212,7 @@ "Counter": "0,1,2,3", "EventCode": "0x9C", "EventName": "IDQ_UOPS_NOT_DELIVERED.CORE", - "PublicDescription": "This event counts the number of uops not del= ivered to Resource Allocation Table (RAT) per thread adding 4 x when Resou= rce Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ= ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0= ,1,2,3}). Counting does not cover cases when:\n a. IDQ-Resource Allocation = Table (RAT) pipe serves the other thread;\n b. Resource Allocation Table (R= AT) is stalled for the thread (including uop drops and clear BE conditions)= ; \n c. 
Instruction Decode Queue (IDQ) delivers four uops.", + "PublicDescription": "This event counts the number of uops not del= ivered to Resource Allocation Table (RAT) per thread adding 4 x when Resou= rce Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ= ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0= ,1,2,3}). Counting does not cover cases when: a. IDQ-Resource Allocation T= able (RAT) pipe serves the other thread; b. Resource Allocation Table (RAT= ) is stalled for the thread (including uop drops and clear BE conditions); = c. Instruction Decode Queue (IDQ) delivers four uops.", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/memory.json b/tools/= perf/pmu-events/arch/x86/broadwellx/memory.json index 86246f632d79..093c8b564fc0 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/memory.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/memory.json @@ -68,7 +68,7 @@ "Counter": "0,1,2,3", "EventCode": "0xc8", "EventName": "HLE_RETIRED.START", - "PublicDescription": "Number of times we entered an HLE region\n d= oes not count nested transactions.", + "PublicDescription": "Number of times we entered an HLE region do= es not count nested transactions.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -77,7 +77,7 @@ "Counter": "0,1,2,3", "EventCode": "0xC3", "EventName": "MACHINE_CLEARS.MEMORY_ORDERING", - "PublicDescription": "This event counts the number of memory order= ing Machine Clears detected. Memory Ordering Machine Clears can result from= one of the following:\n1. memory disambiguation,\n2. external snoop, or\n3= . cross SMT-HW-thread snoop (stores) hitting load buffer.", + "PublicDescription": "This event counts the number of memory order= ing Machine Clears detected. Memory Ordering Machine Clears can result from= one of the following: 1. memory disambiguation, 2. external snoop, or 3. 
c= ross SMT-HW-thread snoop (stores) hitting load buffer.", "SampleAfterValue": "100003", "UMask": "0x2" }, @@ -470,7 +470,7 @@ "Counter": "0,1,2,3", "EventCode": "0xc9", "EventName": "RTM_RETIRED.START", - "PublicDescription": "Number of times we entered an RTM region\n d= oes not count nested transactions.", + "PublicDescription": "Number of times we entered an RTM region do= es not count nested transactions.", "SampleAfterValue": "2000003", "UMask": "0x1" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/metricgroups.json b/= tools/perf/pmu-events/arch/x86/broadwellx/metricgroups.json index 4193c90c3459..0863375bdead 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/metricgroups.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/metricgroups.json @@ -9,6 +9,7 @@ "BvCB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvFB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvIO": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", + "BvMB": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvML": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMP": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", "BvMS": "Grouping from Top-down Microarchitecture Analysis Metrics spr= eadsheet", @@ -34,6 +35,7 @@ "InsType": "Grouping from Top-down Microarchitecture Analysis Metrics = spreadsheet", "L2Evicts": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "LSD": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", + "LockCont": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "MachineClears": "Grouping from Top-down Microarchitecture Analysis Me= trics spreadsheet", "Machine_Clears": "Grouping from Top-down Microarchitecture Analysis M= etrics spreadsheet", "Mem": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -51,6 +53,7 @@ "Pipeline": "Grouping from Top-down Microarchitecture Analysis Metrics= spreadsheet", "PortsUtil": "Grouping from Top-down Microarchitecture Analysis Metric= s spreadsheet", "Power": "Grouping from Top-down Microarchitecture Analysis Metrics sp= readsheet", + "Prefetches": "Grouping from Top-down Microarchitecture Analysis Metri= cs spreadsheet", "Ret": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", "Retire": "Grouping from Top-down Microarchitecture Analysis Metrics s= preadsheet", "SMT": "Grouping from Top-down Microarchitecture Analysis Metrics spre= adsheet", @@ -78,6 +81,7 @@ "tma_bad_speculation_group": "Metrics contributing to tma_bad_speculat= ion category", "tma_branch_resteers_group": "Metrics contributing to tma_branch_reste= ers category", "tma_core_bound_group": "Metrics contributing to tma_core_bound catego= ry", + "tma_divider_group": "Metrics contributing to tma_divider category", "tma_dram_bound_group": "Metrics contributing to tma_dram_bound catego= ry", "tma_dtlb_load_group": "Metrics contributing to tma_dtlb_load category= ", "tma_dtlb_store_group": "Metrics contributing to tma_dtlb_store catego= ry", @@ -103,6 +107,7 @@ "tma_issueSpSt": "Metrics related by the issue $issueSpSt", "tma_issueSyncxn": "Metrics related by the issue $issueSyncxn", "tma_issueTLB": "Metrics related by the issue $issueTLB", + "tma_itlb_misses_group": "Metrics contributing to tma_itlb_misses cate= gory", "tma_l1_bound_group": "Metrics contributing to tma_l1_bound category", 
"tma_l3_bound_group": "Metrics contributing to tma_l3_bound category", "tma_light_operations_group": "Metrics contributing to tma_light_opera= tions category", diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/pipeline.json b/tool= s/perf/pmu-events/arch/x86/broadwellx/pipeline.json index c03f77539362..962cd07eb307 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/pipeline.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/pipeline.json @@ -379,7 +379,7 @@ "BriefDescription": "Reference cycles when the core is not in halt= state.", "Counter": "Fixed counter 2", "EventName": "CPU_CLK_UNHALTED.REF_TSC", - "PublicDescription": "This event counts the number of reference cy= cles when the core is not in a halt state. The core enters the halt state w= hen it is running the HLT instruction or the MWAIT instruction. This event = is not affected by core frequency changes (for example, P states, TM2 trans= itions) but has the same incrementing frequency as the time stamp counter. = This event can approximate elapsed time while the core was not in a halt st= ate. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK eve= nt. It is counted on a dedicated fixed counter, leaving the four (eight whe= n Hyperthreading is disabled) programmable counters available for other eve= nts. \nNote: On all current platforms this event stops counting during 'thr= ottling (TM)' states duty off periods the processor is 'halted'. This even= t is clocked by base clock (100 Mhz) on Sandy Bridge. The counter update is= done at a lower clock rate then the core clock the overflow status bit for= this counter may appear 'sticky'. After the counter has overflowed and so= ftware clears the overflow status bit and resets the counter to less than M= AX. The reset value to the counter is not clocked immediately so the overfl= ow status bit will flip 'high (1)' and generate another PMI (if enabled) af= ter which the reset value gets clocked into the counter. Therefore, softwar= e will get the interrupt, read the overflow status bit '1 for bit 34 while = the counter value is less than MAX. Software should ignore this case.", + "PublicDescription": "This event counts the number of reference cy= cles when the core is not in a halt state. The core enters the halt state w= hen it is running the HLT instruction or the MWAIT instruction. This event = is not affected by core frequency changes (for example, P states, TM2 trans= itions) but has the same incrementing frequency as the time stamp counter. = This event can approximate elapsed time while the core was not in a halt st= ate. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK eve= nt. It is counted on a dedicated fixed counter, leaving the four (eight whe= n Hyperthreading is disabled) programmable counters available for other eve= nts. Note: On all current platforms this event stops counting during 'thro= ttling (TM)' states duty off periods the processor is 'halted'. This event= is clocked by base clock (100 Mhz) on Sandy Bridge. The counter update is = done at a lower clock rate then the core clock the overflow status bit for = this counter may appear 'sticky'. After the counter has overflowed and sof= tware clears the overflow status bit and resets the counter to less than MA= X. The reset value to the counter is not clocked immediately so the overflo= w status bit will flip 'high (1)' and generate another PMI (if enabled) aft= er which the reset value gets clocked into the counter. 
Therefore, software= will get the interrupt, read the overflow status bit '1 for bit 34 while t= he counter value is less than MAX. Software should ignore this case.", "SampleAfterValue": "2000003", "UMask": "0x3" }, @@ -579,7 +579,7 @@ "BriefDescription": "Instructions retired from execution.", "Counter": "Fixed counter 0", "EventName": "INST_RETIRED.ANY", - "PublicDescription": "This event counts the number of instructions= retired from execution. For instructions that consist of multiple micro-op= s, this event counts the retirement of the last micro-op of the instruction= . Counting continues during hardware interrupts, traps, and inside interrup= t handlers. \nNotes: INST_RETIRED.ANY is counted by a designated fixed coun= ter, leaving the four (eight when Hyperthreading is disabled) programmable = counters available for other events. INST_RETIRED.ANY_P is counted by a pro= grammable counter and it is an architectural performance event. \nCounting:= Faulting executions of GETSEC/VM entry/VM Exit/MWait will not count as ret= ired instructions.", + "PublicDescription": "This event counts the number of instructions= retired from execution. For instructions that consist of multiple micro-op= s, this event counts the retirement of the last micro-op of the instruction= . Counting continues during hardware interrupts, traps, and inside interrup= t handlers. Notes: INST_RETIRED.ANY is counted by a designated fixed count= er, leaving the four (eight when Hyperthreading is disabled) programmable c= ounters available for other events. INST_RETIRED.ANY_P is counted by a prog= rammable counter and it is an architectural performance event. Counting: F= aulting executions of GETSEC/VM entry/VM Exit/MWait will not count as retir= ed instructions.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -654,7 +654,7 @@ "Counter": "0,1,2,3", "EventCode": "0x03", "EventName": "LD_BLOCKS.STORE_FORWARD", - "PublicDescription": "This event counts how many times the load op= eration got the true Block-on-Store blocking code preventing store forwardi= ng. This includes cases when:\n - preceding store conflicts with the load (= incomplete overlap);\n - store forwarding is impossible due to u-arch limit= ations;\n - preceding lock RMW operations are not forwarded;\n - store has = the no-forward bit set (uncacheable/page-split/masked stores);\n - all-bloc= king stores are used (mostly, fences and port I/O);\nand others.\nThe most = common case is a load blocked due to its address range overlapping with a p= receding smaller uncompleted store. Note: This event does not take into acc= ount cases of out-of-SW-control (for example, SbTailHit), unknown physical = STA, and cases of blocking loads on store due to being non-WB memory type o= r a lock. These cases are covered by other events.\nSee the table of not su= pported store forwards in the Optimization Guide.", + "PublicDescription": "This event counts how many times the load op= eration got the true Block-on-Store blocking code preventing store forwardi= ng. This includes cases when: - preceding store conflicts with the load (i= ncomplete overlap); - store forwarding is impossible due to u-arch limitat= ions; - preceding lock RMW operations are not forwarded; - store has the = no-forward bit set (uncacheable/page-split/masked stores); - all-blocking = stores are used (mostly, fences and port I/O); and others. The most common = case is a load blocked due to its address range overlapping with a precedin= g smaller uncompleted store. 
Note: This event does not take into account ca= ses of out-of-SW-control (for example, SbTailHit), unknown physical STA, an= d cases of blocking loads on store due to being non-WB memory type or a loc= k. These cases are covered by other events. See the table of not supported = store forwards in the Optimization Guide.", "SampleAfterValue": "100003", "UMask": "0x2" }, @@ -822,7 +822,7 @@ "Counter": "0,1,2,3", "EventCode": "0x5E", "EventName": "RS_EVENTS.EMPTY_CYCLES", - "PublicDescription": "This event counts cycles during which the re= servation station (RS) is empty for the thread.\nNote: In ST-mode, not acti= ve thread should drive 0. This is usually caused by severely costly branch = mispredictions, or allocator/FE issues.", + "PublicDescription": "This event counts cycles during which the re= servation station (RS) is empty for the thread. Note: In ST-mode, not activ= e thread should drive 0. This is usually caused by severely costly branch m= ispredictions, or allocator/FE issues.", "SampleAfterValue": "2000003", "UMask": "0x1" }, @@ -1177,7 +1177,7 @@ "Counter": "0,1,2,3", "EventCode": "0x0E", "EventName": "UOPS_ISSUED.FLAGS_MERGE", - "PublicDescription": "Number of flags-merge uops being allocated. = Such uops considered perf sensitive\n added by GSR u-arch.", + "PublicDescription": "Number of flags-merge uops being allocated. = Such uops considered perf sensitive added by GSR u-arch.", "SampleAfterValue": "2000003", "UMask": "0x10" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json b/= tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json index b55b305aecaa..8f6955615bfd 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-cache.json @@ -802,7 +802,7 @@ "EventCode": "0x12", "EventName": "UNC_C_RxR_EXT_STARVED.IPQ", "PerPkg": "1", - "PublicDescription": "Counts cycles in external starvation. This = occurs when one of the ingress queues is being starved by the other queues.= ; IPQ is externally startved and therefore we are blocking the IRQ.", + "PublicDescription": "Counts cycles in external starvation. This = occurs when one of the ingress queues is being starved by the other queues.= ; IPQ is externally starved and therefore we are blocking the IRQ.", "UMask": "0x2", "Unit": "CBOX" }, @@ -1859,7 +1859,7 @@ "EventCode": "0x14", "EventName": "UNC_H_BYPASS_IMC.NOT_TAKEN", "PerPkg": "1", - "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filted = by when the bypass was taken and when it was not.; Filter for transactions = that could not take the bypass.", + "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filtere= d by when the bypass was taken and when it was not.; Filter for transaction= s that could not take the bypass.", "UMask": "0x2", "Unit": "HA" }, @@ -1869,7 +1869,7 @@ "EventCode": "0x14", "EventName": "UNC_H_BYPASS_IMC.TAKEN", "PerPkg": "1", - "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. 
This can be filted = by when the bypass was taken and when it was not.; Filter for transactions = that succeeded in taking the bypass.", + "PublicDescription": "Counts the number of times when the HA was a= ble to bypass was attempted. This is a latency optimization for situations= when there is light loadings on the memory subsystem. This can be filtere= d by when the bypass was taken and when it was not.; Filter for transaction= s that succeeded in taking the bypass.", "UMask": "0x1", "Unit": "HA" }, @@ -2884,7 +2884,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN0", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 0 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 0 only.", "UMask": "0x1", "Unit": "HA" }, @@ -2894,7 +2894,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN1", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 1 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). 
This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 1 only.", "UMask": "0x2", "Unit": "HA" }, @@ -2904,7 +2904,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN2", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 2 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 2 only.", "UMask": "0x4", "Unit": "HA" }, @@ -2914,7 +2914,7 @@ "EventCode": "0x15", "EventName": "UNC_H_RPQ_CYCLES_NO_REG_CREDITS.CHN3", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high banwidth workloads should be able to make use of all of the regular b= uffers, but it will be difficult (and uncommon) to make use of both the reg= ular and special buffers at the same time. One can filter based on the mem= ory controller channel. One or more channels can be tracked at a given tim= e.; Filter for memory controller channel 3 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting reads from the HA into the iMC. In= order to send reads into the memory controller, the HA must first acquire = a credit for the iMC's RPQ (read pending queue). 
This queue is broken into= regular credits/buffers that are used by general reads, and special reques= ts such as ISOCH reads. This count only tracks the regular credits Common= high bandwidth workloads should be able to make use of all of the regular = buffers, but it will be difficult (and uncommon) to make use of both the re= gular and special buffers at the same time. One can filter based on the me= mory controller channel. One or more channels can be tracked at a given ti= me.; Filter for memory controller channel 3 only.", "UMask": "0x8", "Unit": "HA" }, @@ -3448,7 +3448,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.ALL", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; Requests coming from both local and remote sockets.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; Requests coming from both local and remote sockets.", "UMask": "0x3", "Unit": "HA" }, @@ -3458,7 +3458,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.LOCAL", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. 
HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; This filter includes only requests coming from the local socket.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; This filter includes only requests coming from the local socket.", "UMask": "0x1", "Unit": "HA" }, @@ -3468,7 +3468,7 @@ "EventCode": "0x3", "EventName": "UNC_H_TRACKER_CYCLES_NE.REMOTE", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occpancy. HA tra= ckers are allocated as soon as a request enters the HA if an HT (Home Track= er) entry is available and is released after the snoop response and data re= turn (or post in the case of a write) and the response is returned on the r= ing.; This filter includes only requests coming from remote sockets.", + "PublicDescription": "Counts the number of cycles when the local H= A tracker pool is not empty. This can be used with edge detect to identify= the number of situations when the pool became empty. This should not be c= onfused with RTID credit usage -- which must be tracked inside each cbo ind= ividually -- but represents the actual tracker buffer structure. In other = words, this buffer could be completely empty, but there may still be credit= s in use by the CBos. This stat can be used in conjunction with the occupa= ncy accumulation stat in order to calculate average queue occupancy. HA tr= ackers are allocated as soon as a request enters the HA if an HT (Home Trac= ker) entry is available and is released after the snoop response and data r= eturn (or post in the case of a write) and the response is returned on the = ring.; This filter includes only requests coming from remote sockets.", "UMask": "0x2", "Unit": "HA" }, @@ -3888,7 +3888,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN0", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. 
I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 0 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 0 only.", "UMask": "0x1", "Unit": "HA" }, @@ -3898,7 +3898,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN1", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 1 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 1 only.", "UMask": "0x2", "Unit": "HA" }, @@ -3908,7 +3908,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN2", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. 
I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 2 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 2 only.", "UMask": "0x4", "Unit": "HA" }, @@ -3918,7 +3918,7 @@ "EventCode": "0x18", "EventName": "UNC_H_WPQ_CYCLES_NO_REG_CREDITS.CHN3", "PerPkg": "1", - "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high banwidth workloads should be able to make use of all of the regu= lar buffers, but it will be difficult (and uncommon) to make use of both th= e regular and special buffers at the same time. One can filter based on th= e memory controller channel. One or more channels can be tracked at a give= n time.; Filter for memory controller channel 3 only.", + "PublicDescription": "Counts the number of cycles when there are n= o regular credits available for posting writes from the HA into the iMC. I= n order to send writes into the memory controller, the HA must first acquir= e a credit for the iMC's WPQ (write pending queue). This queue is broken i= nto regular credits/buffers that are used by general writes, and special re= quests such as ISOCH writes. This count only tracks the regular credits C= ommon high bandwidth workloads should be able to make use of all of the reg= ular buffers, but it will be difficult (and uncommon) to make use of both t= he regular and special buffers at the same time. One can filter based on t= he memory controller channel. 
One or more channels can be tracked at a giv= en time.; Filter for memory controller channel 3 only.", "UMask": "0x8", "Unit": "HA" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.= json b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.json index 765d44012bba..56d729c8e946 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-interconnect.json @@ -1,24 +1,4 @@ [ - { - "BriefDescription": "Number of non data (control) flits transmitte= d . Derived from unc_q_txl_flits_g0.non_data", - "Counter": "0,1,2,3", - "EventName": "QPI_CTL_BANDWIDTH_TX", - "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted acros= s the QPI Link. It includes filters for Idle, protocol, and Data Flits. E= ach flit is made up of 80 bits of information (in addition to some ECC data= ). In full-width (L0) mode, flits are made up of four fits, each of which = contains 20 bits of data (along with some additional ECC data). In half-w= idth (L0p) mode, the fits are only 10 bits, and therefore it takes twice as= many fits to transmit a flit. When one talks about QPI speed (for example= , 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the syste= m will transfer 1 flit at the rate of 1/4th the QPI speed. One can calcula= te the bandwidth of the link by taking: flits*80b/time. Note that this is = not the same as data bandwidth. For example, when we are transferring a 64= B cacheline across QPI, we will break it into 9 flits -- 1 with header info= rmation and 8 with 64 bits of actual data and an additional 16 bits of othe= r information. To calculate data bandwidth, one should therefore do: data = flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of non-NULL= non-data flits transmitted across QPI. This basically tracks the protocol= overhead on the QPI link. One can get a good picture of the QPI-link char= acteristics by evaluating the protocol flits, data flits, and idle/null fli= ts. This includes the header flits for data packets.", - "ScaleUnit": "8Bytes", - "UMask": "0x4", - "Unit": "QPI" - }, - { - "BriefDescription": "Number of data flits transmitted . Derived fr= om unc_q_txl_flits_g0.data", - "Counter": "0,1,2,3", - "EventName": "QPI_DATA_BANDWIDTH_TX", - "PerPkg": "1", - "PublicDescription": "Counts the number of flits transmitted acros= s the QPI Link. It includes filters for Idle, protocol, and Data Flits. E= ach flit is made up of 80 bits of information (in addition to some ECC data= ). In full-width (L0) mode, flits are made up of four fits, each of which = contains 20 bits of data (along with some additional ECC data). In half-w= idth (L0p) mode, the fits are only 10 bits, and therefore it takes twice as= many fits to transmit a flit. When one talks about QPI speed (for example= , 8.0 GT/s), the transfers here refer to fits. Therefore, in L0, the syste= m will transfer 1 flit at the rate of 1/4th the QPI speed. One can calcula= te the bandwidth of the link by taking: flits*80b/time. Note that this is = not the same as data bandwidth. For example, when we are transferring a 64= B cacheline across QPI, we will break it into 9 flits -- 1 with header info= rmation and 8 with 64 bits of actual data and an additional 16 bits of othe= r information. To calculate data bandwidth, one should therefore do: data = flits * 8B / time (for L0) or 4B instead of 8B for L0p.; Number of data fli= ts transmitted over QPI. Each flit contains 64b of data. 
This includes bo= th DRS and NCB data flits (coherent and non-coherent). This can be used to= calculate the data bandwidth of the QPI link. One can get a good picture = of the QPI-link characteristics by evaluating the protocol flits, data flit= s, and idle/null flits. This does not include the header flits that go in = data packets.", - "ScaleUnit": "8Bytes", - "UMask": "0x2", - "Unit": "QPI" - }, { "BriefDescription": "Total Write Cache Occupancy; Any Source", "Counter": "0,1", @@ -53,7 +33,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.CLFLUSH", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x80", "Unit": "IRP" }, @@ -63,7 +43,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.CRD", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x2", "Unit": "IRP" }, @@ -73,7 +53,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.DRD", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x4", "Unit": "IRP" }, @@ -83,7 +63,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCIDCAHINT", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x20", "Unit": "IRP" }, @@ -93,7 +73,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCIRDCUR", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x1", "Unit": "IRP" }, @@ -103,7 +83,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.PCITOM", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x10", "Unit": "IRP" }, @@ -113,7 +93,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.RFO", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x8", "Unit": "IRP" }, @@ -123,7 +103,7 @@ "EventCode": "0x13", "EventName": "UNC_I_COHERENT_OPS.WBMTOI", "PerPkg": "1", - "PublicDescription": "Counts the number of coherency related opera= tions servied by the IRP", + "PublicDescription": "Counts the number of coherency related opera= tions serviced by the IRP", "UMask": "0x40", "Unit": "IRP" }, diff --git a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-memory.json b= /tools/perf/pmu-events/arch/x86/broadwellx/uncore-memory.json index 45555316f8ea..ca09c1286485 100644 --- a/tools/perf/pmu-events/arch/x86/broadwellx/uncore-memory.json +++ b/tools/perf/pmu-events/arch/x86/broadwellx/uncore-memory.json @@ -184,6 +184,7 @@ { "BriefDescription": "This event is deprecated. 
Refer to new event = UNC_M_CLOCKTICKS_P", "Counter": "0,1,2,3", + "Deprecated": "1", "EventName": "UNC_M_DCLOCKTICKS", "PerPkg": "1", "Unit": "iMC" diff --git a/tools/perf/pmu-events/arch/x86/mapfile.csv b/tools/perf/pmu-ev= ents/arch/x86/mapfile.csv index ccbe9be240b3..6405c7627e27 100644 --- a/tools/perf/pmu-events/arch/x86/mapfile.csv +++ b/tools/perf/pmu-events/arch/x86/mapfile.csv @@ -5,7 +5,7 @@ GenuineIntel-6-C[56],v1.07,arrowlake,core GenuineIntel-6-(1C|26|27|35|36),v5,bonnell,core GenuineIntel-6-(3D|47),v30,broadwell,core GenuineIntel-6-56,v12,broadwellde,core -GenuineIntel-6-4F,v22,broadwellx,core +GenuineIntel-6-4F,v23,broadwellx,core GenuineIntel-6-55-[56789ABCDEF],v1.22,cascadelakex,core GenuineIntel-6-9[6C],v1.05,elkhartlake,core GenuineIntel-6-CF,v1.09,emeraldrapids,core --=20 2.48.1.502.g6dc24dfdaf-goog
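
A minimal sketch of one way to exercise the updated BroadwellX tables on a Broadwell Xeon host (illustrative only; it assumes a kernel tree with this series applied, and the metric group name TopdownL1 is taken from the TMA groupings in bdx-metrics.json):

  $ # Rebuild perf so the regenerated pmu-events tables pick up the v23 JSON files.
  $ make -C tools/perf

  $ # Check that the BroadwellX metric groups are exposed by the new tables.
  $ ./tools/perf/perf list metricgroup

  $ # Count the level-1 TMA metrics system-wide for one second to confirm the
  $ # updated metric expressions parse and produce values.
  $ ./tools/perf/perf stat -M TopdownL1 -a -- sleep 1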