* [RFC PATCH] perf: New start period for the freq mode
@ 2024-08-29 15:20 kan.liang
2024-08-30 6:13 ` Namhyung Kim
From: kan.liang @ 2024-08-29 15:20 UTC
To: peterz, mingo, tglx, acme, namhyung, mark.rutland,
alexander.shishkin, jolsa, irogers, adrian.hunter,
linux-perf-users, linux-kernel
Cc: ravi.bangoria, sandipan.das, atrajeev, luogengkun, ak, Kan Liang
From: Kan Liang <kan.liang@linux.intel.com>
The freq mode is the current default mode of Linux perf. A period of 1
is used as the start period. The period is auto-adjusted on each tick
or overflow to meet the frequency target.

The start period of 1 is too low and may trigger some issues:
- Many HWs do not support a period of 1 well.
  https://lore.kernel.org/lkml/875xs2oh69.ffs@tglx/
- For an event that occurs frequently, a period of 1 is too far from
  the real period. Lots of samples are generated at the beginning, so
  the distribution of samples may be uneven.
It's hard to find a universal start period for all events. The idea is
only to give an estimate for the popular HW and HW cache events. The
rest of the events start from the lowest possible recommended value.

Only the Intel event list JSON file provides a recommended SAV (sample
after value) for each event, so the estimation is based on Intel's SAV.
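For illustration, with the default 4000 Hz frequency the numbers below
follow directly from the factors in the patch: the start period becomes
500,000,000 / 4000 = 125,000 for cycles and instructions,
125,000,000 / 4000 = 31,250 for branch events, 25,000,000 / 4000 =
6,250 for cache events, and 127 * 4000 / 4000 = 127 for everything
else.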
This patch implements a generic perf_freq_start_period(), which
impacts all ARCHs.
If other ARCHs don't like the start period, a per-pmu callback
(*freq_start_period) may be introduced instead, or it could be made a
__weak function.
Another option would be to expose a start_period knob in sysfs or a
per-event config, so end users can set their preferred start period.
Please let me know your thoughts.

SW events may need to be specially handled, which is not implemented
in this patch.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
kernel/events/core.c | 65 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 64 insertions(+), 1 deletion(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4b855b018a79..7a028474caef 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12017,6 +12017,69 @@ static void account_event(struct perf_event *event)
 	account_pmu_sb_event(event);
 }
 
+static u64 perf_freq_start_period(struct perf_event *event)
+{
+	int type = event->attr.type;
+	u64 config, factor;
+
+	/*
+	 * The 127 is the lowest possible recommended SAV (sample after
+	 * value) for a 4000 freq (the default), according to the Intel
+	 * event list JSON file, the only JSON file that provides one.
+	 */
+	factor = 127 * 4000;
+	if (type != PERF_TYPE_HARDWARE && type != PERF_TYPE_HW_CACHE)
+		goto end;
+
+	/*
+	 * The estimation of the start period in the freq mode is
+	 * based on the assumptions below.
+	 *
+	 * For a cycles or an instructions event: 1GHz underlying
+	 * platform, 1 IPC, and a workload that is idle 50% of the time.
+	 * The start period = 1,000,000,000 * 1 / freq / 2
+	 *		    = 500,000,000 / freq.
+	 *
+	 * Usually, the branch-related events occur less often than the
+	 * instructions event. According to the Intel event list JSON
+	 * file, the SAV (sample after value) of a branch-related event
+	 * is usually 1/4 of that of an instruction event.
+	 * The start period of branch-related events = 125,000,000 / freq.
+	 *
+	 * The cache-related events occur even less often. The SAV is
+	 * usually 1/20 of that of an instruction event.
+	 * The start period of cache-related events = 25,000,000 / freq.
+	 */
+	config = event->attr.config & PERF_HW_EVENT_MASK;
+	if (type == PERF_TYPE_HARDWARE) {
+		switch (config) {
+		case PERF_COUNT_HW_CPU_CYCLES:
+		case PERF_COUNT_HW_INSTRUCTIONS:
+		case PERF_COUNT_HW_BUS_CYCLES:
+		case PERF_COUNT_HW_STALLED_CYCLES_FRONTEND:
+		case PERF_COUNT_HW_STALLED_CYCLES_BACKEND:
+		case PERF_COUNT_HW_REF_CPU_CYCLES:
+			factor = 500000000;
+			break;
+		case PERF_COUNT_HW_BRANCH_INSTRUCTIONS:
+		case PERF_COUNT_HW_BRANCH_MISSES:
+			factor = 125000000;
+			break;
+		case PERF_COUNT_HW_CACHE_REFERENCES:
+		case PERF_COUNT_HW_CACHE_MISSES:
+			factor = 25000000;
+			break;
+		default:
+			goto end;
+		}
+	}
+
+	if (type == PERF_TYPE_HW_CACHE)
+		factor = 25000000;
+end:
+	return DIV_ROUND_UP_ULL(factor, event->attr.sample_freq);
+}
+
 /*
  * Allocate and initialize an event structure
  */
@@ -12140,7 +12203,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	hwc = &event->hw;
 	hwc->sample_period = attr->sample_period;
 	if (attr->freq && attr->sample_freq)
-		hwc->sample_period = 1;
+		hwc->sample_period = perf_freq_start_period(event);
 	hwc->last_period = hwc->sample_period;
 
 	local64_set(&hwc->period_left, hwc->sample_period);
--
2.38.1
* Re: [RFC PATCH] perf: New start period for the freq mode
2024-08-29 15:20 [RFC PATCH] perf: New start period for the freq mode kan.liang
@ 2024-08-30 6:13 ` Namhyung Kim
2024-08-30 14:49 ` Liang, Kan
2024-09-02 10:38 ` Peter Zijlstra
From: Namhyung Kim @ 2024-08-30 6:13 UTC
To: kan.liang
Cc: peterz, mingo, tglx, acme, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, linux-perf-users, linux-kernel,
ravi.bangoria, sandipan.das, atrajeev, luogengkun, ak
Hi Kan,
On Thu, Aug 29, 2024 at 08:20:36AM -0700, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
>
> [...]
>
> This patch implements a generic perf_freq_start_period(), which
> impacts all ARCHs.
> If other ARCHs don't like the start period, a per-pmu callback
> (*freq_start_period) may be introduced instead, or it could be made a
> __weak function.
> Another option would be to expose a start_period knob in sysfs or a
> per-event config, so end users can set their preferred start period.
> Please let me know your thoughts.
>
> SW events may need to be specially handled, which is not implemented
> in this patch.
Sounds like a per-pmu callback is fine. PMUs that don't have the
callback (including SW) can use 1, same as now.
Thanks,
Namhyung
* Re: [RFC PATCH] perf: New start period for the freq mode
2024-08-30 6:13 ` Namhyung Kim
@ 2024-08-30 14:49 ` Liang, Kan
2024-09-02 10:38 ` Peter Zijlstra
From: Liang, Kan @ 2024-08-30 14:49 UTC
To: Namhyung Kim
Cc: peterz, mingo, tglx, acme, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, linux-perf-users, linux-kernel,
ravi.bangoria, sandipan.das, atrajeev, luogengkun, ak
On 2024-08-30 2:13 a.m., Namhyung Kim wrote:
> Hi Kan,
>
> On Thu, Aug 29, 2024 at 08:20:36AM -0700, kan.liang@linux.intel.com wrote:
>> [...]
>>
>> SW events may need to be specially handled, which is not implemented
>> in this patch.
>
> Sounds like a per-pmu callback is fine. PMUs that don't have the
> callback (including SW) can use 1, same as now.
I had hoped the new start period could benefit more ARCHs. But yes, a
per-pmu setting should be more practical.
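Something like the minimal sketch below is what I have in mind (the
callback name, its placement, and the fallback are placeholders, not a
tested implementation). Note the period setup would likely have to
move to after perf_init_event(), since event->pmu is not yet assigned
at the point where the current patch hooks into perf_event_alloc():

	/* hypothetical optional hook in struct pmu */
	struct pmu {
		/* ... existing members ... */

		/* return the freq-mode start period, or 0 for the default */
		u64 (*freq_start_period)(struct perf_event *event);
	};

	/* in the core, PMUs without the hook (including SW) keep 1 */
	static u64 perf_freq_start_period(struct perf_event *event)
	{
		u64 period = 0;

		if (event->pmu->freq_start_period)
			period = event->pmu->freq_start_period(event);

		return period ?: 1;
	}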
Thanks,
Kan
* Re: [RFC PATCH] perf: New start period for the freq mode
2024-08-30 6:13 ` Namhyung Kim
2024-08-30 14:49 ` Liang, Kan
@ 2024-09-02 10:38 ` Peter Zijlstra
2024-09-03 15:23 ` Liang, Kan
From: Peter Zijlstra @ 2024-09-02 10:38 UTC
To: Namhyung Kim
Cc: kan.liang, mingo, tglx, acme, mark.rutland, alexander.shishkin,
jolsa, irogers, adrian.hunter, linux-perf-users, linux-kernel,
ravi.bangoria, sandipan.das, atrajeev, luogengkun, ak
On Thu, Aug 29, 2024 at 11:13:42PM -0700, Namhyung Kim wrote:
> Hi Kan,
>
> On Thu, Aug 29, 2024 at 08:20:36AM -0700, kan.liang@linux.intel.com wrote:
> > From: Kan Liang <kan.liang@linux.intel.com>
> >
> > The freq mode is the current default mode of Linux perf. A period
> > of 1 is used as the start period. The period is auto-adjusted on
> > each tick or overflow to meet the frequency target.
> >
> > The start period of 1 is too low and may trigger some issues:
> > - Many HWs do not support a period of 1 well.
> >   https://lore.kernel.org/lkml/875xs2oh69.ffs@tglx/
So we already have x86_pmu::limit_period and pmu::check_period to deal
with this. Don't they already capture the 1 and increase it where
appropriate?
> > - For an event that occurs frequently, a period of 1 is too far
> >   from the real period. Lots of samples are generated at the
> >   beginning, so the distribution of samples may be uneven.
Which is why samples include a WEIGHT option IIRC.
> Sounds like a per-pmu callback is fine. PMUs that don't have the
> callback (including SW) can use 1, same as now.
This, but also, be very careful to not over-estimate, because ramping up
is fast, but having to adjust down can take a while.
* Re: [RFC PATCH] perf: New start period for the freq mode
2024-09-02 10:38 ` Peter Zijlstra
@ 2024-09-03 15:23 ` Liang, Kan
From: Liang, Kan @ 2024-09-03 15:23 UTC
To: Peter Zijlstra, Namhyung Kim
Cc: mingo, tglx, acme, mark.rutland, alexander.shishkin, jolsa,
irogers, adrian.hunter, linux-perf-users, linux-kernel,
ravi.bangoria, sandipan.das, atrajeev, luogengkun, ak
On 2024-09-02 6:38 a.m., Peter Zijlstra wrote:
> On Thu, Aug 29, 2024 at 11:13:42PM -0700, Namhyung Kim wrote:
>> Hi Kan,
>>
>> On Thu, Aug 29, 2024 at 08:20:36AM -0700, kan.liang@linux.intel.com wrote:
>>> From: Kan Liang <kan.liang@linux.intel.com>
>>>
>>> The freq mode is the current default mode of Linux perf. A period
>>> of 1 is used as the start period. The period is auto-adjusted on
>>> each tick or overflow to meet the frequency target.
>>>
>>> The start period of 1 is too low and may trigger some issues:
>>> - Many HWs do not support a period of 1 well.
>>>   https://lore.kernel.org/lkml/875xs2oh69.ffs@tglx/
>
> So we already have x86_pmu::limit_period and pmu::check_period to deal
> with this. Don't they already capture the 1 and increase it where
> appropriate?
The limit_period only checks the minimum acceptable value for the HW;
if the value is lower than that, HW errors may be triggered. It's a
mandatory requirement.

However, the minimum acceptable value doesn't make a good start value,
which is what perf uses in the default freq mode.

As you can see in Thomas's experiment, setting the start period to 1
doesn't trigger a HW issue, but the message "perf: interrupt took too
long (2503 > 2500), lowering ..." is printed, which should be a false
alarm. To avoid it, 32 was eventually used for the limit_period.
https://lore.kernel.org/lkml/87plq9l5d2.ffs@tglx/
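For reference, that workaround boils down to a clamp along the lines
of the sketch below (illustrative only; the limit_period signature
here follows recent kernels and differs in older trees):

	static void example_limit_period(struct perf_event *event, s64 *left)
	{
		/* never program a period below the HW-safe minimum */
		*left = max(*left, 32LL);
	}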
We cannot always rely on such a clamp to address the issue:
- It's impossible to test all platforms to find a perfect "32" for
  each of them.
- Some events may need a very low period, so we cannot set the
  limit_period too high.

Furthermore, a low start period for frequently occurring events
challenges both HW and virtualization, which have a longer path to
handle a PMI.
I think we need a better start period for the default freq mode.
Yes, there is already a pmu::check_period, which is period related. I
will check whether it can be modified to feed back a start value
somehow.
>
>>> - For an event that occurs frequently, a period of 1 is too far
>>>   from the real period. Lots of samples are generated at the
>>>   beginning, so the distribution of samples may be uneven.
>
> Which is why samples include a WEIGHT option IIRC.
>
The WEIGHT option reports the various latencies, which helps interpret
the samples, but it doesn't change their distribution.
>> Sounds like a per-pmu callback is fine. PMUs that don't have the
>> callback (including SW) can use 1, same as now.
>
> This, but also, be very careful to not over-estimate, because ramping up
> is fast, but having to adjust down can take a while.
Sure.
Thanks,
Kan