[PATCH AUTOSEL 6.13 09/17] perf/x86/intel: Use better start period for frequency mode

public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed

From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Kan Liang <kan.liang@linux.intel.com>,
	Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Sasha Levin <sashal@kernel.org>,
	mingo@redhat.com, acme@kernel.org, namhyung@kernel.org,
	tglx@linutronix.de, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, linux-perf-users@vger.kernel.org
Subject: [PATCH AUTOSEL 6.13 09/17] perf/x86/intel: Use better start period for frequency mode
Date: Mon,  3 Mar 2025 11:29:41 -0500	[thread overview]
Message-ID: <20250303162951.3763346-9-sashal@kernel.org> (raw)
In-Reply-To: <20250303162951.3763346-1-sashal@kernel.org>

From: Kan Liang <kan.liang@linux.intel.com>

[ Upstream commit a26b24b2e21f6222635a95426b9ef9eec63d69b1 ]

Freqency mode is the current default mode of Linux perf. A period of 1 is
used as a starting period. The period is auto-adjusted on each tick or an
overflow, to meet the frequency target.

The start period of 1 is too low and may trigger some issues:

- Many HWs do not support period 1 well.
  https://lore.kernel.org/lkml/875xs2oh69.ffs@tglx/

- For an event that occurs frequently, period 1 is too far away from the
  real period. Lots of samples are generated at the beginning.
  The distribution of samples may not be even.

- A low starting period for frequently occurring events also challenges
  virtualization, which has a longer path to handle a PMI.

The limit_period value only checks the minimum acceptable value for HW.
It cannot be used to set the start period, because some events may
need a very low period. The limit_period cannot be set too high. It
doesn't help with the events that occur frequently.

It's hard to find a universal starting period for all events. The idea
implemented by this patch is to only give an estimate for the popular
HW and HW cache events. For the rest of the events, start from the lowest
possible recommended value.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250117151913.3043942-3-kan.liang@linux.intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/events/intel/core.c | 85 ++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 2bba1d934efb0..6651360d6a525 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3949,6 +3949,85 @@ static inline bool intel_pmu_has_cap(struct perf_event *event, int idx)
 	return test_bit(idx, (unsigned long *)&intel_cap->capabilities);
 }
 
+static u64 intel_pmu_freq_start_period(struct perf_event *event)
+{
+	int type = event->attr.type;
+	u64 config, factor;
+	s64 start;
+
+	/*
+	 * The 127 is the lowest possible recommended SAV (sample after value)
+	 * for a 4000 freq (default freq), according to the event list JSON file.
+	 * Also, assume the workload is idle 50% time.
+	 */
+	factor = 64 * 4000;
+	if (type != PERF_TYPE_HARDWARE && type != PERF_TYPE_HW_CACHE)
+		goto end;
+
+	/*
+	 * The estimation of the start period in the freq mode is
+	 * based on the below assumption.
+	 *
+	 * For a cycles or an instructions event, 1GHZ of the
+	 * underlying platform, 1 IPC. The workload is idle 50% time.
+	 * The start period = 1,000,000,000 * 1 / freq / 2.
+	 *		    = 500,000,000 / freq
+	 *
+	 * Usually, the branch-related events occur less than the
+	 * instructions event. According to the Intel event list JSON
+	 * file, the SAV (sample after value) of a branch-related event
+	 * is usually 1/4 of an instruction event.
+	 * The start period of branch-related events = 125,000,000 / freq.
+	 *
+	 * The cache-related events occurs even less. The SAV is usually
+	 * 1/20 of an instruction event.
+	 * The start period of cache-related events = 25,000,000 / freq.
+	 */
+	config = event->attr.config & PERF_HW_EVENT_MASK;
+	if (type == PERF_TYPE_HARDWARE) {
+		switch (config) {
+		case PERF_COUNT_HW_CPU_CYCLES:
+		case PERF_COUNT_HW_INSTRUCTIONS:
+		case PERF_COUNT_HW_BUS_CYCLES:
+		case PERF_COUNT_HW_STALLED_CYCLES_FRONTEND:
+		case PERF_COUNT_HW_STALLED_CYCLES_BACKEND:
+		case PERF_COUNT_HW_REF_CPU_CYCLES:
+			factor = 500000000;
+			break;
+		case PERF_COUNT_HW_BRANCH_INSTRUCTIONS:
+		case PERF_COUNT_HW_BRANCH_MISSES:
+			factor = 125000000;
+			break;
+		case PERF_COUNT_HW_CACHE_REFERENCES:
+		case PERF_COUNT_HW_CACHE_MISSES:
+			factor = 25000000;
+			break;
+		default:
+			goto end;
+		}
+	}
+
+	if (type == PERF_TYPE_HW_CACHE)
+		factor = 25000000;
+end:
+	/*
+	 * Usually, a prime or a number with less factors (close to prime)
+	 * is chosen as an SAV, which makes it less likely that the sampling
+	 * period synchronizes with some periodic event in the workload.
+	 * Minus 1 to make it at least avoiding values near power of twos
+	 * for the default freq.
+	 */
+	start = DIV_ROUND_UP_ULL(factor, event->attr.sample_freq) - 1;
+
+	if (start > x86_pmu.max_period)
+		start = x86_pmu.max_period;
+
+	if (x86_pmu.limit_period)
+		x86_pmu.limit_period(event, &start);
+
+	return start;
+}
+
 static int intel_pmu_hw_config(struct perf_event *event)
 {
 	int ret = x86_pmu_hw_config(event);
@@ -3960,6 +4039,12 @@ static int intel_pmu_hw_config(struct perf_event *event)
 	if (ret)
 		return ret;
 
+	if (event->attr.freq && event->attr.sample_freq) {
+		event->hw.sample_period = intel_pmu_freq_start_period(event);
+		event->hw.last_period = event->hw.sample_period;
+		local64_set(&event->hw.period_left, event->hw.sample_period);
+	}
+
 	if (event->attr.precise_ip) {
 		if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == INTEL_FIXED_VLBR_EVENT)
 			return -EINVAL;
-- 
2.39.5

next prev parent reply	other threads:[~2025-03-03 16:30 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-03 16:29 [PATCH AUTOSEL 6.13 01/17] phy: ti: gmii-sel: Do not use syscon helper to build regmap Sasha Levin
2025-03-03 16:29 ` [PATCH AUTOSEL 6.13 02/17] ASoC: tas2770: Fix volume scale Sasha Levin
2025-03-03 16:29 ` [PATCH AUTOSEL 6.13 03/17] ASoC: tas2764: Fix power control mask Sasha Levin
2025-03-03 16:29 ` [PATCH AUTOSEL 6.13 04/17] ASoC: tas2764: Set the SDOUT polarity correctly Sasha Levin
2025-03-03 16:29 ` [PATCH AUTOSEL 6.13 05/17] fuse: don't truncate cached, mutated symlink Sasha Levin
2025-03-03 16:29 ` [PATCH AUTOSEL 6.13 06/17] ASoC: dapm-graph: set fill colour of turned on nodes Sasha Levin
2025-03-03 16:29 ` [PATCH AUTOSEL 6.13 07/17] ASoC: SOF: Intel: don't check number of sdw links when set dmic_fixup Sasha Levin
2025-03-03 16:29 ` [PATCH AUTOSEL 6.13 08/17] drm/vkms: Round fixp2int conversion in lerp_u16 Sasha Levin
2025-03-03 16:29 ` Sasha Levin [this message]
2025-03-03 16:29 ` [PATCH AUTOSEL 6.13 10/17] x86/of: Don't use DTB for SMP setup if ACPI is enabled Sasha Levin
2025-03-03 16:29 ` [PATCH AUTOSEL 6.13 11/17] x86/irq: Define trace events conditionally Sasha Levin
2025-03-03 16:29 ` [PATCH AUTOSEL 6.13 12/17] perf/x86/rapl: Add support for Intel Arrow Lake U Sasha Levin
2025-03-03 16:29 ` [PATCH AUTOSEL 6.13 13/17] mptcp: safety check before fallback Sasha Levin
2025-03-03 17:05   ` Matthieu Baerts
2025-03-15  1:39     ` Sasha Levin
2025-03-03 16:29 ` [PATCH AUTOSEL 6.13 14/17] drm/nouveau: Do not override forced connector status Sasha Levin
2025-03-03 16:29 ` [PATCH AUTOSEL 6.13 15/17] net: Handle napi_schedule() calls from non-interrupt Sasha Levin
2025-03-03 16:29 ` [PATCH AUTOSEL 6.13 16/17] block: fix 'kmem_cache of name 'bio-108' already exists' Sasha Levin
2025-03-03 16:29 ` [PATCH AUTOSEL 6.13 17/17] vhost: return task creation error instead of NULL Sasha Levin

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:2bba1d934efb dfblob:6651360d6a52 )
 OR (
bs:"[PATCH AUTOSEL 6.13 09/17] perf/x86/intel: Use better start period for frequency mode" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250303162951.3763346-9-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=acme@kernel.org \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=kan.liang@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=mingo@redhat.com \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox