[PATCH] perf/x86/intel: Restrict period on Haswell

* [PATCH] perf/x86/intel: Restrict period on Haswell
@ 2024-07-29 22:33 Li Huafei
  2024-07-31 19:20 ` Thomas Gleixner
  0 siblings, 1 reply; 17+ messages in thread
From: Li Huafei @ 2024-07-29 22:33 UTC
  To: peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, kan.liang, tglx, bp, dave.hansen, x86, hpa,
	linux-perf-users, linux-kernel, lihuafei1

On my Haswell machine, running the ltp test cve-2015-3290 concurrently
reports the following warnings:

  perfevents: irq loop stuck!
  WARNING: CPU: 31 PID: 32438 at arch/x86/events/intel/core.c:3174 intel_pmu_handle_irq+0x285/0x370
  CPU: 31 UID: 0 PID: 32438 Comm: cve-2015-3290 Kdump: loaded Tainted: G S      W          6.11.0-rc1+ #3
  ...
  Call Trace:
   <NMI>
   ? __warn+0xa4/0x220
   ? intel_pmu_handle_irq+0x285/0x370
   ? __report_bug+0x123/0x130
   ? intel_pmu_handle_irq+0x285/0x370
   ? __report_bug+0x123/0x130
   ? intel_pmu_handle_irq+0x285/0x370
   ? report_bug+0x3e/0xa0
   ? handle_bug+0x3c/0x70
   ? exc_invalid_op+0x18/0x50
   ? asm_exc_invalid_op+0x1a/0x20
   ? irq_work_claim+0x1e/0x40
   ? intel_pmu_handle_irq+0x285/0x370
   perf_event_nmi_handler+0x3d/0x60
   nmi_handle+0x104/0x330
   ? ___ratelimit+0xe4/0x1b0
   default_do_nmi+0x40/0x100
   exc_nmi+0x104/0x180
   end_repeat_nmi+0xf/0x53
   ...
   ? intel_pmu_lbr_enable_all+0x2a/0x90
   ? __intel_pmu_enable_all.constprop.0+0x16d/0x1b0
   ? __intel_pmu_enable_all.constprop.0+0x16d/0x1b0
   perf_ctx_enable+0x8e/0xc0
   __perf_install_in_context+0x146/0x3e0
   ? __pfx___perf_install_in_context+0x10/0x10
   remote_function+0x7c/0xa0
   ? __pfx_remote_function+0x10/0x10
   generic_exec_single+0xf8/0x150
   smp_call_function_single+0x1dc/0x230
   ? __pfx_remote_function+0x10/0x10
   ? __pfx_smp_call_function_single+0x10/0x10
   ? __pfx_remote_function+0x10/0x10
   ? lock_is_held_type+0x9e/0x120
   ? exclusive_event_installable+0x4f/0x140
   perf_install_in_context+0x197/0x330
   ? __pfx_perf_install_in_context+0x10/0x10
   ? __pfx___perf_install_in_context+0x10/0x10
   __do_sys_perf_event_open+0xb80/0x1100
   ? __pfx___do_sys_perf_event_open+0x10/0x10
   ? __pfx___lock_release+0x10/0x10
   ? lockdep_hardirqs_on_prepare+0x135/0x200
   ? ktime_get_coarse_real_ts64+0xee/0x100
   ? ktime_get_coarse_real_ts64+0x92/0x100
   do_syscall_64+0x70/0x180
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
   ...

My machine has 32 physical cores, each with two logical cores. During
testing, it executes the CVE-2015-3290 test case 100 times concurrently.

This warning was already present in [1] and a patch was given there to
limit period to 128 on Haswell, but that patch was not merged into the
mainline.  In [2] the period on Nehalem was limited to 32. I tested
limits of 16 and 32 on my machine and found that the problem could be
reproduced with a limit of 16, but did not reproduce with a limit of
32. It looks like we can limit the period to 32 on Haswell as well.

[1] https://lore.kernel.org/lkml/20150501070226.GB18957@gmail.com/#r
[2] https://lore.kernel.org/all/1566256411-18820-1-git-send-email-johunt@akamai.com/T/#mf1479ab3f25d3f7f3a899244081baa2e7b7bc0b9

Signed-off-by: Li Huafei <lihuafei1@huawei.com>
---
 arch/x86/events/intel/core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 0c9c2706d4ec..459dec2f07e3 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4625,6 +4625,11 @@ static void glc_limit_period(struct perf_event *event, s64 *left)
 		*left = max(*left, 128LL);
 }
 
+static void hsw_limit_period(struct perf_event *event, s64 *left)
+{
+	*left = max(*left, 32LL);
+}
+
 PMU_FORMAT_ATTR(event,	"config:0-7"	);
 PMU_FORMAT_ATTR(umask,	"config:8-15"	);
 PMU_FORMAT_ATTR(edge,	"config:18"	);
@@ -6767,6 +6772,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.hw_config = hsw_hw_config;
 		x86_pmu.get_event_constraints = hsw_get_event_constraints;
 		x86_pmu.lbr_double_abort = true;
+		x86_pmu.limit_period = hsw_limit_period;
 		extra_attr = boot_cpu_has(X86_FEATURE_RTM) ?
 			hsw_format_attr : nhm_format_attr;
 		td_attr  = hsw_events_attrs;
-- 
2.25.1



* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-07-29 22:33 [PATCH] perf/x86/intel: Restrict period on Haswell Li Huafei
@ 2024-07-31 19:20 ` Thomas Gleixner
  2024-08-13 13:13   ` Li Huafei
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2024-07-31 19:20 UTC
  To: Li Huafei, peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, kan.liang, bp, dave.hansen, x86, hpa,
	linux-perf-users, linux-kernel, lihuafei1

On Tue, Jul 30 2024 at 06:33, Li Huafei wrote:
> On my Haswell machine, running the ltp test cve-2015-3290 concurrently
> reports the following warnings:
>
>   perfevents: irq loop stuck!
>   WARNING: CPU: 31 PID: 32438 at arch/x86/events/intel/core.c:3174 intel_pmu_handle_irq+0x285/0x370
>   CPU: 31 UID: 0 PID: 32438 Comm: cve-2015-3290 Kdump: loaded Tainted: G S      W          6.11.0-rc1+ #3
>   ...
>   Call Trace:
>    <NMI>
>    ? __warn+0xa4/0x220
>    ? intel_pmu_handle_irq+0x285/0x370
>    ? __report_bug+0x123/0x130
>    ? intel_pmu_handle_irq+0x285/0x370
>    ? __report_bug+0x123/0x130
>    ? intel_pmu_handle_irq+0x285/0x370
>    ? report_bug+0x3e/0xa0
>    ? handle_bug+0x3c/0x70
>    ? exc_invalid_op+0x18/0x50
>    ? asm_exc_invalid_op+0x1a/0x20
>    ? irq_work_claim+0x1e/0x40
>    ? intel_pmu_handle_irq+0x285/0x370
>    perf_event_nmi_handler+0x3d/0x60
>    nmi_handle+0x104/0x330
>    ? ___ratelimit+0xe4/0x1b0
>    default_do_nmi+0x40/0x100
>    exc_nmi+0x104/0x180
>    end_repeat_nmi+0xf/0x53
>    ...
>    ? intel_pmu_lbr_enable_all+0x2a/0x90
>    ? __intel_pmu_enable_all.constprop.0+0x16d/0x1b0
>    ? __intel_pmu_enable_all.constprop.0+0x16d/0x1b0
>    perf_ctx_enable+0x8e/0xc0
>    __perf_install_in_context+0x146/0x3e0
>    ? __pfx___perf_install_in_context+0x10/0x10
>    remote_function+0x7c/0xa0
>    ? __pfx_remote_function+0x10/0x10
>    generic_exec_single+0xf8/0x150
>    smp_call_function_single+0x1dc/0x230
>    ? __pfx_remote_function+0x10/0x10
>    ? __pfx_smp_call_function_single+0x10/0x10
>    ? __pfx_remote_function+0x10/0x10
>    ? lock_is_held_type+0x9e/0x120
>    ? exclusive_event_installable+0x4f/0x140
>    perf_install_in_context+0x197/0x330
>    ? __pfx_perf_install_in_context+0x10/0x10
>    ? __pfx___perf_install_in_context+0x10/0x10
>    __do_sys_perf_event_open+0xb80/0x1100
>    ? __pfx___do_sys_perf_event_open+0x10/0x10
>    ? __pfx___lock_release+0x10/0x10
>    ? lockdep_hardirqs_on_prepare+0x135/0x200
>    ? ktime_get_coarse_real_ts64+0xee/0x100
>    ? ktime_get_coarse_real_ts64+0x92/0x100
>    do_syscall_64+0x70/0x180
>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>    ...

Please trim the backtrace to something useful:

https://www.kernel.org/doc/html/latest/process/submitting-patches.html#backtraces

> My machine has 32 physical cores, each with two logical cores. During
> testing, it executes the CVE-2015-3290 test case 100 times concurrently.
>
> This warning was already present in [1] and a patch was given there to
> limit period to 128 on Haswell, but that patch was not merged into the
> mainline.  In [2] the period on Nehalem was limited to 32. I tested 16
> and 32 period on my machine and found that the problem could be
> reproduced with a limit of 16, but the problem did not reproduce when
> set to 32. It looks like we can limit the cycles to 32 on Haswell as
> well.

It looks like? Either it works or not.

>  
> +static void hsw_limit_period(struct perf_event *event, s64 *left)
> +{
> +	*left = max(*left, 32LL);
> +}

And why do we need a copy of nhm_limit_period() ?

Thanks,

        tglx


* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-07-31 19:20 ` Thomas Gleixner
@ 2024-08-13 13:13   ` Li Huafei
  2024-08-14 14:43     ` Thomas Gleixner
  2024-08-14 14:52     ` Thomas Gleixner
  0 siblings, 2 replies; 17+ messages in thread
From: Li Huafei @ 2024-08-13 13:13 UTC
  To: Thomas Gleixner, peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, kan.liang, bp, dave.hansen, x86, hpa,
	linux-perf-users, linux-kernel


Hi Thomas, sorry for the late reply.

On 2024/8/1 3:20, Thomas Gleixner wrote:
> On Tue, Jul 30 2024 at 06:33, Li Huafei wrote:
>> On my Haswell machine, running the ltp test cve-2015-3290 concurrently
>> reports the following warnings:
>>
>>   perfevents: irq loop stuck!
>>   WARNING: CPU: 31 PID: 32438 at arch/x86/events/intel/core.c:3174 intel_pmu_handle_irq+0x285/0x370
>>   CPU: 31 UID: 0 PID: 32438 Comm: cve-2015-3290 Kdump: loaded Tainted: G S      W          6.11.0-rc1+ #3
>>   ...
>>   Call Trace:
>>    <NMI>
>>    ? __warn+0xa4/0x220
>>    ? intel_pmu_handle_irq+0x285/0x370
>>    ? __report_bug+0x123/0x130
>>    ? intel_pmu_handle_irq+0x285/0x370
>>    ? __report_bug+0x123/0x130
>>    ? intel_pmu_handle_irq+0x285/0x370
>>    ? report_bug+0x3e/0xa0
>>    ? handle_bug+0x3c/0x70
>>    ? exc_invalid_op+0x18/0x50
>>    ? asm_exc_invalid_op+0x1a/0x20
>>    ? irq_work_claim+0x1e/0x40
>>    ? intel_pmu_handle_irq+0x285/0x370
>>    perf_event_nmi_handler+0x3d/0x60
>>    nmi_handle+0x104/0x330
>>    ? ___ratelimit+0xe4/0x1b0
>>    default_do_nmi+0x40/0x100
>>    exc_nmi+0x104/0x180
>>    end_repeat_nmi+0xf/0x53
>>    ...
>>    ? intel_pmu_lbr_enable_all+0x2a/0x90
>>    ? __intel_pmu_enable_all.constprop.0+0x16d/0x1b0
>>    ? __intel_pmu_enable_all.constprop.0+0x16d/0x1b0
>>    perf_ctx_enable+0x8e/0xc0
>>    __perf_install_in_context+0x146/0x3e0
>>    ? __pfx___perf_install_in_context+0x10/0x10
>>    remote_function+0x7c/0xa0
>>    ? __pfx_remote_function+0x10/0x10
>>    generic_exec_single+0xf8/0x150
>>    smp_call_function_single+0x1dc/0x230
>>    ? __pfx_remote_function+0x10/0x10
>>    ? __pfx_smp_call_function_single+0x10/0x10
>>    ? __pfx_remote_function+0x10/0x10
>>    ? lock_is_held_type+0x9e/0x120
>>    ? exclusive_event_installable+0x4f/0x140
>>    perf_install_in_context+0x197/0x330
>>    ? __pfx_perf_install_in_context+0x10/0x10
>>    ? __pfx___perf_install_in_context+0x10/0x10
>>    __do_sys_perf_event_open+0xb80/0x1100
>>    ? __pfx___do_sys_perf_event_open+0x10/0x10
>>    ? __pfx___lock_release+0x10/0x10
>>    ? lockdep_hardirqs_on_prepare+0x135/0x200
>>    ? ktime_get_coarse_real_ts64+0xee/0x100
>>    ? ktime_get_coarse_real_ts64+0x92/0x100
>>    do_syscall_64+0x70/0x180
>>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>    ...
> 
> Please trim the backtrace to something useful:
> 
> https://www.kernel.org/doc/html/latest/process/submitting-patches.html#backtraces
> 

Okay, thanks for the tip!

>> My machine has 32 physical cores, each with two logical cores. During
>> testing, it executes the CVE-2015-3290 test case 100 times concurrently.
>>
>> This warning was already present in [1] and a patch was given there to
>> limit period to 128 on Haswell, but that patch was not merged into the
>> mainline.  In [2] the period on Nehalem was limited to 32. I tested 16
>> and 32 period on my machine and found that the problem could be
>> reproduced with a limit of 16, but the problem did not reproduce when
>> set to 32. It looks like we can limit the cycles to 32 on Haswell as
>> well.
> 
> It looks like? Either it works or not.
> 

It worked for my test scenario. I say "looks like" because I'm not sure
how it circumvents the problem, or whether the limit of 32 still works
if I increase the number of test cases executed in parallel. Any
suggestions?

>>  
>> +static void hsw_limit_period(struct perf_event *event, s64 *left)
>> +{
>> +	*left = max(*left, 32LL);
>> +}
> 
> And why do we need a copy of nhm_limit_period() ?
> 

Do you mean why the period is limited to 32 like nhm_limit_period()? I
referred to nhm_limit_period() and found that the problem cannot be
reproduced when the limit is 32, while it can be reproduced when the
limit is 16. Therefore, as for nhm, I set the period limit to 32. As
mentioned earlier, I am not sure how it works and need expert advice.

Thanks,
Huafei

> Thanks,
> 
>         tglx
> 
> .
> 


* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-08-13 13:13   ` Li Huafei
@ 2024-08-14 14:43     ` Thomas Gleixner
  2024-08-14 14:52     ` Thomas Gleixner
  1 sibling, 0 replies; 17+ messages in thread
From: Thomas Gleixner @ 2024-08-14 14:43 UTC
  To: Li Huafei, peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, kan.liang, bp, dave.hansen, x86, hpa,
	linux-perf-users, linux-kernel

Li!

On Tue, Aug 13 2024 at 21:13, Li Huafei wrote:
> On 2024/8/1 3:20, Thomas Gleixner wrote:
>>> My machine has 32 physical cores, each with two logical cores. During
>>> testing, it executes the CVE-2015-3290 test case 100 times concurrently.
>>>
>>> This warning was already present in [1] and a patch was given there to
>>> limit period to 128 on Haswell, but that patch was not merged into the
>>> mainline.  In [2] the period on Nehalem was limited to 32. I tested 16
>>> and 32 period on my machine and found that the problem could be
>>> reproduced with a limit of 16, but the problem did not reproduce when
>>> set to 32. It looks like we can limit the cycles to 32 on Haswell as
>>> well.
>> 
>> It looks like? Either it works or not.
>
> It worked for my test scenario. I say "looks like" because I'm not sure
> how it circumvents the problem, and if the limit of 32 no longer works
> if I increase the number of test cases executed in parallel. Any
> suggestions?

If you read back through the email history of these limits, then you can
see that too short periods cause that problem on Broadwell due to an
erratum, which is explained on top of the BDW limit.

Now looking at the HSW specification update specifically erratum HSW11:

  Performance Monitor Precise Instruction Retired Event May Present
  Wrong Indications

  Problem:
         When the Precise Distribution for Instructions Retired (PDIR)
         mechanism is activated (INST_RETIRED.ALL (event C0H, umask
         value 00H) on Counter 1 programmed in PEBS mode), the processor
         may return wrong PEBS or Performance Monitoring Interrupt (PMI)
         interrupts and/or incorrect counter values if the counter is
         reset with a Sample-After-Value (SAV) below 100 (the SAV is
         the counter reset value software programs in the MSR
         IA32_PMC1[47:0] in order to control interrupt frequency).

  Implication:
         Due to this erratum, when using low SAV values, the program may
         get incorrect PEBS or PMI interrupts and/or an invalid counter
         state.

  Workaround:
         The sampling driver should avoid using SAV<100.

IOW, that's exactly the same issue as the BDM11 erratum.
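
For illustration only, a minimal userspace sketch of the workaround the
erratum text asks for, read literally: keep the sample-after-value at 100
or above when INST_RETIRED.ALL (event C0H, umask 00H) is programmed in
PEBS mode. The struct, the names and the `precise` flag are hypothetical
and are not the kernel's perf structures. The Counter 1 restriction in
the erratum text is not modelled, and neither is the question of whether
the erratum also applies without PEBS.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's s64. */
typedef int64_t s64;

/* Hypothetical, simplified event description (not struct perf_event). */
struct sketch_event {
	uint8_t event_select;	/* event code, C0H for INST_RETIRED */
	uint8_t umask;		/* unit mask, 00H for .ALL per the erratum text */
	int precise;		/* non-zero when programmed in PEBS mode */
};

#define SKETCH_MIN_SAV 100	/* "avoid using SAV<100" */

/* Return the period to program: unchanged unless the erratum case applies. */
static s64 sketch_hsw11_period(const struct sketch_event *ev, s64 left)
{
	if (!ev->precise || ev->event_select != 0xC0 || ev->umask != 0x00)
		return left;

	return left < SKETCH_MIN_SAV ? SKETCH_MIN_SAV : left;
}
```

With this model a requested period of 10 becomes 100 for the PDIR event
and stays 10 for any other event.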

Kan: Can you please go through the various specification updates and
identify which generations are affected by this and fix it once and
forever in a sane way instead of relying on 'tried until it works by
some definition of works' hacks. These errata are there for a reason.

But that does not explain the fallout with that cve test because that
does not use PEBS AFAICT. It's using fixed counter 0.

Li, you added that huge useless backtrace to your changelog, but omitted
the output of perf_event_print_debug() after that. It should tell us
about which counter is causing that.

>>> +static void hsw_limit_period(struct perf_event *event, s64 *left)
>>> +{
>>> +	*left = max(*left, 32LL);
>>> +}
>> 
>> And why do we need a copy of nhm_limit_period() ?
>> 
>
> Do you mean why the period is limited to 32 like nhm_limit_period()?

No. If 32 is the correct limit, then we don't need another function
which does exactly the same. So you can assign exactly that function for
HSW, no?
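
A minimal userspace sketch of that suggestion, assuming 32 turns out to
be the right limit: both model setups point at the same callback instead
of carrying two identical functions. All names here are hypothetical, and
the callback signature is simplified (the kernel callback in the patch
also takes a struct perf_event pointer).

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's s64. */
typedef int64_t s64;

/* Hypothetical, cut-down stand-in for the PMU descriptor. */
struct sketch_pmu {
	void (*limit_period)(s64 *left);
};

/* One shared clamp: raise the period to at least 32. */
static void sketch_nhm_limit_period(s64 *left)
{
	if (*left < 32)
		*left = 32;
}

static void sketch_init_nehalem(struct sketch_pmu *pmu)
{
	pmu->limit_period = sketch_nhm_limit_period;
}

/* Haswell setup reuses the Nehalem callback, no second copy needed. */
static void sketch_init_haswell(struct sketch_pmu *pmu)
{
	pmu->limit_period = sketch_nhm_limit_period;
}

/* Apply the model-specific limit, if any, and return the result. */
static s64 sketch_apply_limit(const struct sketch_pmu *pmu, s64 left)
{
	if (pmu->limit_period)
		pmu->limit_period(&left);
	return left;
}
```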

Thanks,

        tglx


* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-08-13 13:13   ` Li Huafei
  2024-08-14 14:43     ` Thomas Gleixner
@ 2024-08-14 14:52     ` Thomas Gleixner
  2024-08-14 18:15       ` Liang, Kan
  1 sibling, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2024-08-14 14:52 UTC
  To: Li Huafei, peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, kan.liang, bp, dave.hansen, x86, hpa,
	linux-perf-users, linux-kernel

Li!

On Tue, Aug 13 2024 at 21:13, Li Huafei wrote:
> On 2024/8/1 3:20, Thomas Gleixner wrote:
>>> My machine has 32 physical cores, each with two logical cores. During
>>> testing, it executes the CVE-2015-3290 test case 100 times concurrently.
>>>
>>> This warning was already present in [1] and a patch was given there to
>>> limit period to 128 on Haswell, but that patch was not merged into the
>>> mainline.  In [2] the period on Nehalem was limited to 32. I tested 16
>>> and 32 period on my machine and found that the problem could be
>>> reproduced with a limit of 16, but the problem did not reproduce when
>>> set to 32. It looks like we can limit the cycles to 32 on Haswell as
>>> well.
>> 
>> It looks like? Either it works or not.
>
> It worked for my test scenario. I say "looks like" because I'm not sure
> how it circumvents the problem, and if the limit of 32 no longer works
> if I increase the number of test cases executed in parallel. Any
> suggestions?

If you read back through the email history of these limits, then you can
see that too short periods cause that problem on Broadwell due to an
erratum, which is explained on top of the BDW limit.

Now looking at the HSW specification update specifically erratum HSW11:

  Performance Monitor Precise Instruction Retired Event May Present
  Wrong Indications

  Problem:
         When the Precise Distribution for Instructions Retired (PDIR)
         mechanism is activated (INST_RETIRED.ALL (event C0H, umask
         value 00H) on Counter 1 programmed in PEBS mode), the processor
         may return wrong PEBS or Performance Monitoring Interrupt (PMI)
         interrupts and/or incorrect counter values if the counter is
         reset with a Sample-After-Value (SAV) below 100 (the SAV is
         the counter reset value software programs in the MSR
         IA32_PMC1[47:0] in order to control interrupt frequency).

  Implication:
         Due to this erratum, when using low SAV values, the program may
         get incorrect PEBS or PMI interrupts and/or an invalid counter
         state.

  Workaround:
         The sampling driver should avoid using SAV<100.

IOW, that's exactly the same issue as the BDM11 erratum.

Kan: Can you please go through the various specification updates and
identify which generations are affected by this and fix it once and
forever in a sane way instead of relying on 'tried until it works by
some definition of works' hacks. These errata are there for a reason.


But that does not explain the fallout with that cve test because that
does not use PEBS. It's using fixed counter 0.

Li, you added that huge useless backtrace but cut off the output of
perf_event_print_debug() after it. Can you please provide that
information so we can see what the counter states are?

>>> +static void hsw_limit_period(struct perf_event *event, s64 *left)
>>> +{
>>> +	*left = max(*left, 32LL);
>>> +}
>> 
>> And why do we need a copy of nhm_limit_period() ?
>> 
>
> Do you mean why the period is limited to 32 like nhm_limit_period()?

No. If 32 is the correct limit, then we don't need another function
which does exactly the same. So you can assign exactly that function for
HSW, no?

Thanks,

        tglx


* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-08-14 14:52     ` Thomas Gleixner
@ 2024-08-14 18:15       ` Liang, Kan
  2024-08-14 19:01         ` Thomas Gleixner
  0 siblings, 1 reply; 17+ messages in thread
From: Liang, Kan @ 2024-08-14 18:15 UTC
  To: Thomas Gleixner, Li Huafei, peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, bp, dave.hansen, x86, hpa, linux-perf-users,
	linux-kernel



On 2024-08-14 10:52 a.m., Thomas Gleixner wrote:
> Li!
> 
> On Tue, Aug 13 2024 at 21:13, Li Huafei wrote:
>> On 2024/8/1 3:20, Thomas Gleixner wrote:
>>>> My machine has 32 physical cores, each with two logical cores. During
>>>> testing, it executes the CVE-2015-3290 test case 100 times concurrently.
>>>>
>>>> This warning was already present in [1] and a patch was given there to
>>>> limit period to 128 on Haswell, but that patch was not merged into the
>>>> mainline.  In [2] the period on Nehalem was limited to 32. I tested 16
>>>> and 32 period on my machine and found that the problem could be
>>>> reproduced with a limit of 16, but the problem did not reproduce when
>>>> set to 32. It looks like we can limit the cycles to 32 on Haswell as
>>>> well.
>>>
>>> It looks like? Either it works or not.
>>
>> It worked for my test scenario. I say "looks like" because I'm not sure
>> how it circumvents the problem, and if the limit of 32 no longer works
>> if I increase the number of test cases executed in parallel. Any
>> suggestions?
> 
> If you read back through the email history of these limits, then you can
> see that too short periods cause that problem on Broadwell due to an
> erratum, which is explained on top of the BDW limit.
> 
> Now looking at the HSW specification update specifically erratum HSW11:
> 
>   Performance Monitor Precise Instruction Retired Event May Present
>   Wrong Indications
> 
>   Problem:
>          When the Precise Distribution for Instructions Retired (PDIR)
>          mechanism is activated (INST_RETIRED.ALL (event C0H, umask
>          value 00H) on Counter 1 programmed in PEBS mode), the processor
>          may return wrong PEBS or Performance Monitoring Interrupt (PMI)
>          interrupts and/or incorrect counter values if the counter is
>          reset with a Sample-After-Value (SAV) below 100 (the SAV is
>          the counter reset value software programs in the MSR
>          IA32_PMC1[47:0] in order to control interrupt frequency).
> 
>   Implication:
>          Due to this erratum, when using low SAV values, the program may
>          get incorrect PEBS or PMI interrupts and/or an invalid counter
>          state.
> 
>   Workaround:
>          The sampling driver should avoid using SAV<100.
> 
> IOW, that's exactly the same issue as the BDM11 erratum.
> 
> Kan: Can you please go through the various specification updates and
> identify which generations are affected by this and fix it once and
> forever in a sane way instead of relying on 'tried until it works by
> some definition of works' hacks. These errata are there for a reason.

Sure. I will check all the related errata and propose a fix.

> 
> 
> But that does not explain the fallout with that cve test because that
> does not use PEBS. It's using fixed counter 0.

The erratum also mentions PMI interrupts, which may imply that the
non-PEBS case is affected too. I will double check with the architect.

According to the description of the patch, if I understand correctly, it
runs 100 CVE-2015-3290 tests at the same time. If so, all the GP
counters are used. Huafei, could you please confirm?

Thanks,
Kan
> 
> Li, you added that huge useless backtrace but cut off the output of
> perf_event_print_debug() after it. Can you please provide that
> information so we can see what the counter states are?
> 
>>>> +static void hsw_limit_period(struct perf_event *event, s64 *left)
>>>> +{
>>>> +	*left = max(*left, 32LL);
>>>> +}
>>>
>>> And why do we need a copy of nhm_limit_period() ?
>>>
>>
>> Do you mean why the period is limited to 32 like nhm_limit_period()?
> 
> No. If 32 is the correct limit, then we don't need another function
> which does exactly the same. So you can assign exactly that function for
> HSW, no?
> 
> Thanks,
> 
>         tglx


* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-08-14 18:15       ` Liang, Kan
@ 2024-08-14 19:01         ` Thomas Gleixner
  2024-08-14 19:37           ` Liang, Kan
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2024-08-14 19:01 UTC
  To: Liang, Kan, Li Huafei, peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, bp, dave.hansen, x86, hpa, linux-perf-users,
	linux-kernel

On Wed, Aug 14 2024 at 14:15, Kan Liang wrote:
> On 2024-08-14 10:52 a.m., Thomas Gleixner wrote:
>> Now looking at the HSW specification update specifically erratum HSW11:
>> 
>>   Performance Monitor Precise Instruction Retired Event May Present
>>   Wrong Indications
>> 
>>   Problem:
>>          When the Precise Distribution for Instructions Retired (PDIR)
>>          mechanism is activated (INST_RETIRED.ALL (event C0H, umask
>>          value 00H) on Counter 1 programmed in PEBS mode), the processor
>>          may return wrong PEBS or Performance Monitoring Interrupt (PMI)
>>          interrupts and/or incorrect counter values if the counter is
>>          reset with a Sample-After-Value (SAV) below 100 (the SAV is
>>          the counter reset value software programs in the MSR
>>          IA32_PMC1[47:0] in order to control interrupt frequency).
>> 
>>   Implication:
>>          Due to this erratum, when using low SAV values, the program may
>>          get incorrect PEBS or PMI interrupts and/or an invalid counter
>>          state.
>> 
>>   Workaround:
>>          The sampling driver should avoid using SAV<100.
>> 
>> IOW, that's exactly the same issue as the BDM11 erratum.
>> 
>> Kan: Can you please go through the various specification updates and
>> identify which generations are affected by this and fix it once and
>> forever in a sane way instead of relying on 'tried until it works by
>> some definition of works' hacks. These errata are there for a reason.
>
> Sure. I will check all the related errata and propose a fix.
>
>> But that does not explain the fallout with that cve test because that
>> does not use PEBS. It's using fixed counter 0.
>
> The erratum also mentions PMI interrupts, which may imply that the
> non-PEBS case is affected too. I will double check with the architect.

Ah. Indeed.

> According to the description of the patch, if I understand correctly, it
> runs 100 CVE-2015-3290 tests at the same time. If so, all the GP
> counters are used. Huafei, could you please confirm?

I can reproduce that way on my quad socket HSW almost instantaneously:

[10473.376928] CPU#16: ctrl:       0000000000000000
[10473.376930] CPU#16: status:     0000000000000000
[10473.376931] CPU#16: overflow:   0000000000000000
[10473.376932] CPU#16: fixed:      00000000000000bb
[10473.376933] CPU#16: pebs:       0000000000000000
[10473.376934] CPU#16: debugctl:   0000000000004000
[10473.376935] CPU#16: active:     0000000300000000
[10473.376937] CPU#16:   gen-PMC0 ctrl:  0000000000134f2e
[10473.376938] CPU#16:   gen-PMC0 count: 0000ffffffffffca
[10473.376940] CPU#16:   gen-PMC0 left:  000000000000003b
[10473.376941] CPU#16:   gen-PMC1 ctrl:  0000000000000000
[10473.376943] CPU#16:   gen-PMC1 count: 0000000000000000
[10473.376944] CPU#16:   gen-PMC1 left:  0000000000000000
[10473.376946] CPU#16:   gen-PMC2 ctrl:  0000000000000000
[10473.376947] CPU#16:   gen-PMC2 count: 0000000000000000
[10473.376948] CPU#16:   gen-PMC2 left:  0000000000000000
[10473.376949] CPU#16:   gen-PMC3 ctrl:  0000000000000000
[10473.376950] CPU#16:   gen-PMC3 count: 0000000000000000
[10473.376952] CPU#16:   gen-PMC3 left:  0000000000000000
[10473.376953] CPU#16: fixed-PMC0 count: 0000fffffffffffe
[10473.376954] CPU#16: fixed-PMC1 count: 0000fffbabf57908
[10473.376955] CPU#16: fixed-PMC2 count: 0000000000000000

[10473.376928] CPU#88: ctrl:       0000000000000000
[10473.376930] CPU#88: status:     0000000000000000
[10473.376931] CPU#88: overflow:   0000000000000000
[10473.376932] CPU#88: fixed:      00000000000000bb
[10473.376933] CPU#88: pebs:       0000000000000000
[10473.376934] CPU#88: debugctl:   0000000000004000
[10473.376935] CPU#88: active:     0000000300000000
[10473.376937] CPU#88:   gen-PMC0 ctrl:  0000000000134f2e
[10473.376939] CPU#88:   gen-PMC0 count: 0000fffffffffff2
[10473.376940] CPU#88:   gen-PMC0 left:  00000000000000a8
[10473.376942] CPU#88:   gen-PMC1 ctrl:  0000000000000000
[10473.376944] CPU#88:   gen-PMC1 count: 0000000000000000
[10473.376945] CPU#88:   gen-PMC1 left:  0000000000000000
[10473.376946] CPU#88:   gen-PMC2 ctrl:  0000000000000000
[10473.376947] CPU#88:   gen-PMC2 count: 0000000000000000
[10473.376949] CPU#88:   gen-PMC2 left:  0000000000000000
[10473.376950] CPU#88:   gen-PMC3 ctrl:  0000000000000000
[10473.376951] CPU#88:   gen-PMC3 count: 0000000000000000
[10473.376952] CPU#88:   gen-PMC3 left:  0000000000000000
[10473.376953] CPU#88: fixed-PMC0 count: 0000fffffffffffe
[10473.376955] CPU#88: fixed-PMC1 count: 0000fffa79a83958
[10473.376956] CPU#88: fixed-PMC2 count: 0000000000000000

This happens at the very same time and CPU#88 is the HT sibling of
CPU#16.
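
The count values above are easier to read as a distance to overflow. A
small sketch, assuming the counters are 48 bits wide (the erratum text
refers to IA32_PMC1[47:0], and the upper 16 bits of every count in the
dump are zero); the names are made up for illustration:

```c
#include <stdint.h>

/* Assumed counter width, see the note above. */
#define SKETCH_CNTVAL_BITS	48
#define SKETCH_CNTVAL_MASK	((UINT64_C(1) << SKETCH_CNTVAL_BITS) - 1)

/*
 * Number of events left until a 48-bit counter wraps around to zero,
 * which is the point where it signals an overflow.
 */
static uint64_t sketch_events_until_overflow(uint64_t count)
{
	return (SKETCH_CNTVAL_MASK + 1) - (count & SKETCH_CNTVAL_MASK);
}
```

For the dump above this gives 54 and 14 remaining events for gen-PMC0 on
CPU#16 and CPU#88, and 2 for fixed-PMC0 on both CPUs.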

Thanks,

        tglx


* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-08-14 19:01         ` Thomas Gleixner
@ 2024-08-14 19:37           ` Liang, Kan
  2024-08-14 22:47             ` Thomas Gleixner
  0 siblings, 1 reply; 17+ messages in thread
From: Liang, Kan @ 2024-08-14 19:37 UTC
  To: Thomas Gleixner, Li Huafei, peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, bp, dave.hansen, x86, hpa, linux-perf-users,
	linux-kernel



On 2024-08-14 3:01 p.m., Thomas Gleixner wrote:
> On Wed, Aug 14 2024 at 14:15, Kan Liang wrote:
>> On 2024-08-14 10:52 a.m., Thomas Gleixner wrote:
>>> Now looking at the HSW specification update specifically erratum HSW11:
>>>
>>>   Performance Monitor Precise Instruction Retired Event May Present
>>>   Wrong Indications
>>>
>>>   Problem:
>>>          When the Precise Distribution for Instructions Retired (PDIR)
>>>          mechanism is activated (INST_RETIRED.ALL (event C0H, umask
>>>          value 00H) on Counter 1 programmed in PEBS mode), the processor
>>>          may return wrong PEBS or Performance Monitoring Interrupt (PMI)
>>>          interrupts and/or incorrect counter values if the counter is
>>>          reset with a Sample-After-Value (SAV) below 100 (the SAV is
>>>          the counter reset value software programs in the MSR
>>>          IA32_PMC1[47:0] in order to control interrupt frequency).
>>>
>>>   Implication:
>>>          Due to this erratum, when using low SAV values, the program may
>>>          get incorrect PEBS or PMI interrupts and/or an invalid counter
>>>          state.
>>>
>>>   Workaround:
>>>          The sampling driver should avoid using SAV<100.
>>>
>>> IOW, that's exactly the same issue as the BDM11 erratum.
>>>
>>> Kan: Can you please go through the various specification updates and
>>> identify which generations are affected by this and fix it once and
>>> forever in a sane way instead of relying on 'tried until it works by
>>> some definition of works' hacks. These errata are there for a reason.
>>
>> Sure. I will check all the related errata and propose a fix.
>>
>>> But that does not explain the fallout with that cve test because that
>>> does not use PEBS. It's using fixed counter 0.
>>
>> The erratum also mentions PMI interrupts, which may imply that the
>> non-PEBS case is affected too. I will double check with the architect.
> 
> Ah. Indeed.
> 
>> According to the description of the patch, if I understand correctly, it
>> runs 100 CVE-2015-3290 tests at the same time. If so, all the GP
>> counters are used. Huafei, could you please confirm?
> 
> I can reproduce that way on my quad socket HSW almost instantaneously:
> 
> [10473.376928] CPU#16: ctrl:       0000000000000000
> [10473.376930] CPU#16: status:     0000000000000000
> [10473.376931] CPU#16: overflow:   0000000000000000
> [10473.376932] CPU#16: fixed:      00000000000000bb
> [10473.376933] CPU#16: pebs:       0000000000000000
> [10473.376934] CPU#16: debugctl:   0000000000004000
> [10473.376935] CPU#16: active:     0000000300000000
> [10473.376937] CPU#16:   gen-PMC0 ctrl:  0000000000134f2e
> [10473.376938] CPU#16:   gen-PMC0 count: 0000ffffffffffca
> [10473.376940] CPU#16:   gen-PMC0 left:  000000000000003b
> [10473.376941] CPU#16:   gen-PMC1 ctrl:  0000000000000000
> [10473.376943] CPU#16:   gen-PMC1 count: 0000000000000000
> [10473.376944] CPU#16:   gen-PMC1 left:  0000000000000000
> [10473.376946] CPU#16:   gen-PMC2 ctrl:  0000000000000000
> [10473.376947] CPU#16:   gen-PMC2 count: 0000000000000000
> [10473.376948] CPU#16:   gen-PMC2 left:  0000000000000000
> [10473.376949] CPU#16:   gen-PMC3 ctrl:  0000000000000000
> [10473.376950] CPU#16:   gen-PMC3 count: 0000000000000000
> [10473.376952] CPU#16:   gen-PMC3 left:  0000000000000000
> [10473.376953] CPU#16: fixed-PMC0 count: 0000fffffffffffe
> [10473.376954] CPU#16: fixed-PMC1 count: 0000fffbabf57908
> [10473.376955] CPU#16: fixed-PMC2 count: 0000000000000000
> 
> [10473.376928] CPU#88: ctrl:       0000000000000000
> [10473.376930] CPU#88: status:     0000000000000000
> [10473.376931] CPU#88: overflow:   0000000000000000
> [10473.376932] CPU#88: fixed:      00000000000000bb
> [10473.376933] CPU#88: pebs:       0000000000000000
> [10473.376934] CPU#88: debugctl:   0000000000004000
> [10473.376935] CPU#88: active:     0000000300000000
> [10473.376937] CPU#88:   gen-PMC0 ctrl:  0000000000134f2e
> [10473.376939] CPU#88:   gen-PMC0 count: 0000fffffffffff2
> [10473.376940] CPU#88:   gen-PMC0 left:  00000000000000a8
> [10473.376942] CPU#88:   gen-PMC1 ctrl:  0000000000000000
> [10473.376944] CPU#88:   gen-PMC1 count: 0000000000000000
> [10473.376945] CPU#88:   gen-PMC1 left:  0000000000000000
> [10473.376946] CPU#88:   gen-PMC2 ctrl:  0000000000000000
> [10473.376947] CPU#88:   gen-PMC2 count: 0000000000000000
> [10473.376949] CPU#88:   gen-PMC2 left:  0000000000000000
> [10473.376950] CPU#88:   gen-PMC3 ctrl:  0000000000000000
> [10473.376951] CPU#88:   gen-PMC3 count: 0000000000000000
> [10473.376952] CPU#88:   gen-PMC3 left:  0000000000000000
> [10473.376953] CPU#88: fixed-PMC0 count: 0000fffffffffffe
> [10473.376955] CPU#88: fixed-PMC1 count: 0000fffa79a83958
> [10473.376956] CPU#88: fixed-PMC2 count: 0000000000000000
> 
> This happens at the very same time and CPU#88 is the HT sibling of
> CPU#16
> 

Fixed counter 0 is used, which doesn't match what HSW11 describes. I
will check whether HSW11 missed this case or whether there is another
issue.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-08-14 19:37           ` Liang, Kan
@ 2024-08-14 22:47             ` Thomas Gleixner
  2024-08-15 15:39               ` Liang, Kan
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2024-08-14 22:47 UTC (permalink / raw)
  To: Liang, Kan, Li Huafei, peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, bp, dave.hansen, x86, hpa, linux-perf-users,
	linux-kernel

On Wed, Aug 14 2024 at 15:37, Kan Liang wrote:
> On 2024-08-14 3:01 p.m., Thomas Gleixner wrote:
>> This happens at the very same time and CPU#88 is the HT sibling of
>> CPU#16
>
> The fixed counter 0 is used which doesn't match of what the HSW11
> describes. I will check if the HSW11 missed the case, or if there is
> another issue.

This looks like a plain stupid software issue at first glance. The
hack I use to look at that is at the end of the mail. Most of the output
here is heavily trimmed for readability:

65.147782: x86_perf_event_set_period: idx:    32 period:         1 left:  2
65.147783: intel_pmu_handle_irq:      loops: 001 status: 100000000

65.147784: x86_perf_event_set_period: idx:    32 period:         1 left:  2
65.147784: intel_pmu_handle_irq:      loops: 002 status: 100000000         

and this continues up to 100 times.

If I'm not missing something then a period of 1 or even 2 is way too
small for fixed counter 0 which is rearmed in the NMI and counts user
_and_ kernel.

But what's weird is that earlier in the trace I can see in the context
of a different task the following w/o looping in the handler:

65.084029: x86_perf_event_set_period: idx: 32 period:          1 left:          2
65.084033: x86_perf_event_set_period: idx: 32 period:          1 left:          2
65.085654: x86_perf_event_set_period: idx: 32 period:          1 left:          2
65.085660: x86_perf_event_set_period: idx: 32 period:          1 left:          2
65.085667: x86_perf_event_set_period: idx: 32 period:          1 left:          2
65.085673: x86_perf_event_set_period: idx: 32 period:          2 left:          2
65.085681: x86_perf_event_set_period: idx: 32 period:          4 left:          4
65.085687: x86_perf_event_set_period: idx: 32 period:          7 left:          7
65.085693: x86_perf_event_set_period: idx: 32 period:         14 left:         14
65.085699: x86_perf_event_set_period: idx: 32 period:         26 left:         26
65.085705: x86_perf_event_set_period: idx: 32 period:         49 left:         49
65.085708: x86_perf_event_set_period: idx: 32 period:         95 left:         95
65.085711: x86_perf_event_set_period: idx: 32 period:        303 left:        303
65.085713: x86_perf_event_set_period: idx: 32 period:        967 left:        550
65.085716: x86_perf_event_set_period: idx: 32 period:       3118 left:       2799
65.085722: x86_perf_event_set_period: idx: 32 period:       9723 left:       9411

This goes on to almost 100k period and then goes back down to 50k.

The test case sets it up with

    attr::freq        = 1
    attr::sample_freq = max_sample_rate / 5

max_sample_rate is read from /proc/sys/kernel/perf_event_max_sample_rate,
which contains 100000 after boot, so the requested value is 20000.

So in the good case period = 1 manages not to have the status bit
set after handling.

The bad case stays there forever. Of course setting a limit makes this
magically go away, but honestly this is not a solution.

Another one magically cures itself:

65.131743: x86_perf_event_set_period: idx:    32 period:         1 left:          2
65.131745: x86_perf_event_set_period: idx:    32 period:         1 left:          2
65.131745: intel_pmu_handle_irq:      loops: 001 status: 100000000
65.131746: x86_perf_event_set_period: idx:    32 period:         1 left:          2
65.131746: intel_pmu_handle_irq:      loops: 002 status: 100000000
65.131747: x86_perf_event_set_period: idx:    32 period:         1 left:          2
65.131944: x86_perf_event_set_period: idx:    32 period:         1 left:          2
65.131950: x86_perf_event_set_period: idx:    32 period:         1 left:          2
65.131955: x86_perf_event_set_period: idx:    32 period:         1 left:          2
65.131961: x86_perf_event_set_period: idx:    32 period:         2 left:          2
65.131965: x86_perf_event_set_period: idx:    32 period:         5 left:          5
....
65.132331: x86_perf_event_set_period: idx:    32 period:     83183 left:      82871

I just wanted to look at something else and started a single instance of
the cve test right after booting and that ran into the same problem.

Full trace at: https://tglx.de/~tglx/t.txt

I think I see a pattern with that now. In all cases I saw so far the
problem happens when two HT siblings get the PMI at the very same time.

# grep handle t.txt

shows you the cases where it goes into the loop. I checked all the
previous traces and the pattern is always identical.

That aside. What really puzzles me is this period adjustment
algorithm.

# grep 'cve-2015-3290-2715.*idx: 32' t.txt

316.966607: x86_perf_event_set_period: idx: 32 period:          1 left:          2
316.966621: x86_perf_event_set_period: idx: 32 period:          1 left:          2
316.966977: x86_perf_event_set_period: idx: 32 period:          1 left:          2
316.966985: x86_perf_event_set_period: idx: 32 period:          1 left:          2
316.970507: x86_perf_event_set_period: idx: 32 period:       9980 left:       9980
316.970516: x86_perf_event_set_period: idx: 32 period:       9980 left:       9616
316.970562: x86_perf_event_set_period: idx: 32 period:       9980 left:       9674
316.970580: x86_perf_event_set_period: idx: 32 period:       8751 left:       8446
316.970596: x86_perf_event_set_period: idx: 32 period:      10733 left:      10428

This looks more than broken .... Seriously. 

    attr::freq        = 1
    attr::sample_freq = 20000

means 20000 samples per second, i.e. one sample every 50 microseconds,
unless this uses some magic newfangled math.

That CPU runs at 3.3GHz. Let's assume 1.0 IPC for simplicity. That
means in 50us it executes 165000 instructions, right?

So why on earth start with 1 as the estimate for the frequency
especially for this particular event which is guaranteed to fire
immediately? That makes no sense at all.
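
As a rough check of the arithmetic above, here is a minimal user-space
sketch. The function name expected_period() and the flat 1.0 IPC figure
are assumptions made for illustration only; this is not how the kernel
estimates the period.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of events expected between two samples,
 * given the event rate (events per second) and the requested sample
 * frequency (samples per second).
 */
uint64_t expected_period(uint64_t events_per_sec, uint64_t sample_freq)
{
	assert(sample_freq > 0);
	return events_per_sec / sample_freq;
}
```

With 3.3GHz at 1.0 IPC (3300000000 instructions per second) and
sample_freq = 20000, expected_period() returns 165000, which is orders
of magnitude away from a starting period of 1.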

But even when you start with 1, then at the latest by the third event
in the loop, or the third event within a couple of microseconds, the
frequency estimation algorithm should notice that a period of 1 is
bonkers.

But ranting^Wreasoning about all of this made me understand what goes
actually wrong.

Look at the "good" case:

316.966607: x86_perf_event_set_period: idx: 32 period:          1 left:          2
316.966621: x86_perf_event_set_period: idx: 32 period:          1 left:          2

Now the bad case:

# grep -E '\[063|135\]' t.txt

   cve-2015-3290-2725    [063] d..3.   316.967339: x86_perf_event_set_period: idx: 32 period:          1 left:          2
   cve-2015-3290-2725    [063] d.Z3.   316.967343: x86_perf_event_set_period: idx: 32 period:          1 left:          2
           <...>-2743    [135] d..3.   316.968473: x86_perf_event_set_period: idx: 32 period:          1 left:          2
           <...>-2743    [135] d.Z3.   316.968478: x86_perf_event_set_period: idx: 32 period:          1 left:          2
           <...>-2743    [135] d.h2.   316.970502: x86_perf_event_set_period: idx: 32 period:       5596 left:       5596
   cve-2015-3290-2725    [063] d.h2.   316.970503: x86_perf_event_set_period: idx: 32 period:       9385 left:       9385

Here the two hyperthread NMIs are interleaved by a few microseconds,
which is still good by some definition of good, but later it goes south:

   cve-2015-3290-2808    [063] d.Z3.   316.970712: x86_perf_event_set_period: idx: 32 period:          1 left:          2
   cve-2015-3290-2806    [135] d.Z3.   316.970712: x86_perf_event_set_period: idx: 32 period:          1 left:          2

Starting here they are no longer interleaved. They happen simultaneously.

   cve-2015-3290-2806    [135] d.Z3.   316.970713: intel_pmu_handle_irq: 001        100000000
   cve-2015-3290-2808    [063] d.Z3.   316.970713: intel_pmu_handle_irq: 001        100000000
   cve-2015-3290-2808    [063] d.Z3.   316.970713: x86_perf_event_set_period: idx: 32 period:          1 left:          2
   cve-2015-3290-2806    [135] d.Z3.   316.970713: x86_perf_event_set_period: idx: 32 period:          1 left:          2

...
   cve-2015-3290-2808    [063] d.Z3.   316.970819: intel_pmu_handle_irq: 099        100000000
   cve-2015-3290-2806    [135] d.Z3.   316.970819: x86_perf_event_set_period: idx: 32 period:          1 left:          2
   cve-2015-3290-2808    [063] d.Z3.   316.970819: x86_perf_event_set_period: idx: 32 period:          1 left:          2
   cve-2015-3290-2806    [135] d.Z3.   316.970819: intel_pmu_handle_irq: 099        100000000
   cve-2015-3290-2808    [063] d.Z3.   316.970820: intel_pmu_handle_irq: 100        100000000
   cve-2015-3290-2806    [135] d.Z3.   316.970820: x86_perf_event_set_period: idx: 32 period:          1 left:          2
   cve-2015-3290-2806    [135] d.Z3.   316.970820: intel_pmu_handle_irq: 100        100000000

Which means they are almost in lockstep. TBH, I could not be bothered to
repeat the experiment and turn on nanoseconds resolution for the trace
because it's too obvious what's going on.

In the single threaded case period == 1 (left == 2) does not matter
because the status register stays zero after handling the event and is
only updated after the NMI returns which makes the NMI come back
immediately, but that does not cause a loop.
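
For reference, a small sketch of how left == 2 maps to the raw counter
value seen in the register dump earlier in the thread (fixed-PMC0
count: 0000fffffffffffe). It assumes a 48-bit counter, as suggested by
the IA32_PMC1[47:0] note in HSW11; the function names are hypothetical
and this is a user-space model, not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define CNTVAL_BITS	48	/* assumed counter width */
#define CNTVAL_MASK	((UINT64_C(1) << CNTVAL_BITS) - 1)

/* Raw value to program so that the counter overflows after 'left' events. */
uint64_t counter_start_value(uint64_t left)
{
	assert(left > 0 && left <= CNTVAL_MASK);
	return (0 - left) & CNTVAL_MASK;
}

/* Events remaining until a counter holding 'raw' wraps around to zero. */
uint64_t events_until_overflow(uint64_t raw)
{
	return (0 - raw) & CNTVAL_MASK;
}
```

counter_start_value(2) gives 0xfffffffffffe, i.e. the counter is two
events away from overflowing when the handler rearms it.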

But in the HT sibling concurrent case the hardware behaves differently
and the status register is updated for whatever reason before returning
from the NMI, which causes the endless loop because both hyper threads
get that treatment.

To prove my point I disabled hyperthreading via /sys/.../cpu/smt/control
and as expected the test case can't trigger the problem anymore.
Grepping for the loop trace_printk() comes back empty. I disabled the
other one to reduce the noise over several runs, which keeps the trace
completely empty.

Reverse engineering hardware is fun, isn't it?

It's not hard either because every reproducible problem has a pattern.
You just have to look for it.

Now the conclusion of this fun exercise is:

    1) The hardware behaves differently when the perf event happens
       concurrently on HT siblings

    2) The frequency estimation algorithm is broken

    3) Using a 'limit' guesstimate is just papering over the underlying
       problems

Thanks,

        tglx
---
 arch/x86/events/core.c       |    1 +
 arch/x86/events/intel/core.c |    5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1400,6 +1400,7 @@ int x86_perf_event_set_period(struct per

 	static_call_cond(x86_pmu_limit_period)(event, &left);

+	trace_printk("idx: %2d period: %10lld left: %10lld\n", idx, period, left);
 	this_cpu_write(pmc_prev_left[idx], left);

 	/*
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3164,13 +3164,16 @@ static int intel_pmu_handle_irq(struct p

 	loops = 0;
 again:
+	if (loops)
+		trace_printk("%03d %16llx\n", loops, status);
 	intel_pmu_lbr_read();
 	intel_pmu_ack_status(status);
 	if (++loops > 100) {
 		static bool warned;

 		if (!warned) {
-			WARN(1, "perfevents: irq loop stuck!\n");
+			tracing_off();
+			//WARN(1, "perfevents: irq loop stuck!\n");
 			perf_event_print_debug();
 			warned = true;
 		}

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-08-14 22:47             ` Thomas Gleixner
@ 2024-08-15 15:39               ` Liang, Kan
  2024-08-15 18:26                 ` Thomas Gleixner
  2024-08-15 19:01                 ` Vince Weaver
  0 siblings, 2 replies; 17+ messages in thread
From: Liang, Kan @ 2024-08-15 15:39 UTC (permalink / raw)
  To: Thomas Gleixner, Li Huafei, peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, bp, dave.hansen, x86, hpa, linux-perf-users,
	linux-kernel, Andi Kleen, Vince Weaver

Hi Thomas,

Thank you very much for the detailed analysis.

On 2024-08-14 6:47 p.m., Thomas Gleixner wrote:
> On Wed, Aug 14 2024 at 15:37, Kan Liang wrote:
>> On 2024-08-14 3:01 p.m., Thomas Gleixner wrote:
>>> This happens at the very same time and CPU#88 is the HT sibling of
>>> CPU#16
>>
>> The fixed counter 0 is used which doesn't match of what the HSW11
>> describes. I will check if the HSW11 missed the case, or if there is
>> another issue.
> 
> This looks like a plain stupid software issue at first glance. The
> hack I use to look at that is at the end of the mail. Most of the output
> here is heavily trimmed for readability:
> 
> 65.147782: x86_perf_event_set_period: idx:    32 period:         1 left:  2
> 65.147783: intel_pmu_handle_irq:      loops: 001 status: 100000000
> 
> 65.147784: x86_perf_event_set_period: idx:    32 period:         1 left:  2
> 65.147784: intel_pmu_handle_irq:      loops: 002 status: 100000000         
> 
> and this continues up to 100 times.
> 
> If I'm not missing something then a period of 1 or even 2 is way too
> small for fixed counter 0 which is rearmed in the NMI and counts user
> _and_ kernel.
> 
> But what's weird is that earlier in the trace I can see in the context
> of a different task the following w/o looping in the handler:
> 
> 65.084029: x86_perf_event_set_period: idx: 32 period:          1 left:          2
> 65.084033: x86_perf_event_set_period: idx: 32 period:          1 left:          2
> 65.085654: x86_perf_event_set_period: idx: 32 period:          1 left:          2
> 65.085660: x86_perf_event_set_period: idx: 32 period:          1 left:          2
> 65.085667: x86_perf_event_set_period: idx: 32 period:          1 left:          2
> 65.085673: x86_perf_event_set_period: idx: 32 period:          2 left:          2
> 65.085681: x86_perf_event_set_period: idx: 32 period:          4 left:          4
> 65.085687: x86_perf_event_set_period: idx: 32 period:          7 left:          7
> 65.085693: x86_perf_event_set_period: idx: 32 period:         14 left:         14
> 65.085699: x86_perf_event_set_period: idx: 32 period:         26 left:         26
> 65.085705: x86_perf_event_set_period: idx: 32 period:         49 left:         49
> 65.085708: x86_perf_event_set_period: idx: 32 period:         95 left:         95
> 65.085711: x86_perf_event_set_period: idx: 32 period:        303 left:        303
> 65.085713: x86_perf_event_set_period: idx: 32 period:        967 left:        550
> 65.085716: x86_perf_event_set_period: idx: 32 period:       3118 left:       2799
> 65.085722: x86_perf_event_set_period: idx: 32 period:       9723 left:       9411
> 
> This goes on to almost 100k period and then goes back down to 50k.
> 
> The test case sets it up with
> 
>     attr::freq        = 1
>     attr::sample_freq = max_sample_rate / 5
> 
> max_sample_rate is read from /proc/sys/kernel/perf_event_max_sample_rate,
> which contains 100000 after boot, so the requested value is 20000.
> 
> So in the good case period = 1 manages not to have the status bit
> set after handling.
> 
> The bad case stays there forever. Of course setting a limit makes this
> magically go away, but honestly this is not a solution.
> 
> Another one magically cures itself:
> 
> 65.131743: x86_perf_event_set_period: idx:    32 period:         1 left:          2
> 65.131745: x86_perf_event_set_period: idx:    32 period:         1 left:          2
> 65.131745: intel_pmu_handle_irq:      loops: 001 status: 100000000
> 65.131746: x86_perf_event_set_period: idx:    32 period:         1 left:          2
> 65.131746: intel_pmu_handle_irq:      loops: 002 status: 100000000
> 65.131747: x86_perf_event_set_period: idx:    32 period:         1 left:          2
> 65.131944: x86_perf_event_set_period: idx:    32 period:         1 left:          2
> 65.131950: x86_perf_event_set_period: idx:    32 period:         1 left:          2
> 65.131955: x86_perf_event_set_period: idx:    32 period:         1 left:          2
> 65.131961: x86_perf_event_set_period: idx:    32 period:         2 left:          2
> 65.131965: x86_perf_event_set_period: idx:    32 period:         5 left:          5
> ....
> 65.132331: x86_perf_event_set_period: idx:    32 period:     83183 left:      82871
> 
> I just wanted to look at something else and started a single instance of
> the cve test right after booting and that ran into the same problem.
> 
> Full trace at: https://tglx.de/~tglx/t.txt
> 
> I think I see a pattern with that now. In all cases I saw so far the
> problem happens when two HT siblings get the PMI at the very same time.
> 
> # grep handle t.txt
> 
> shows you the cases where it goes into the loop. I checked all the
> previous traces and the pattern is always identical.
> 
> That aside. What really puzzles me is this period adjustment
> algorithm.
> 
> # grep 'cve-2015-3290-2715.*idx: 32' t.txt
> 
> 316.966607: x86_perf_event_set_period: idx: 32 period:          1 left:          2
> 316.966621: x86_perf_event_set_period: idx: 32 period:          1 left:          2
> 316.966977: x86_perf_event_set_period: idx: 32 period:          1 left:          2
> 316.966985: x86_perf_event_set_period: idx: 32 period:          1 left:          2
> 316.970507: x86_perf_event_set_period: idx: 32 period:       9980 left:       9980
> 316.970516: x86_perf_event_set_period: idx: 32 period:       9980 left:       9616
> 316.970562: x86_perf_event_set_period: idx: 32 period:       9980 left:       9674
> 316.970580: x86_perf_event_set_period: idx: 32 period:       8751 left:       8446
> 316.970596: x86_perf_event_set_period: idx: 32 period:      10733 left:      10428
> 
> This looks more than broken .... Seriously. 
> 
>     attr::freq        = 1
>     attr::sample_freq = 20000
> 
> means 20000 samples per second, i.e. one sample every 50 microseconds,
> unless this uses some magic newfangled math.
> 
> That CPU runs at 3.3GHz. Let's assume 1.0 IPC for simplicity. That
> means in 50us it executes 165000 instructions, right?
> 
> So why on earth start with 1 as the estimate for the frequency
> especially for this particular event which is guaranteed to fire
> immediately? That makes no sense at all.
> 
> But even when you start with 1, then at the latest by the third event
> in the loop, or the third event within a couple of microseconds, the
> frequency estimation algorithm should notice that a period of 1 is
> bonkers.
> 
> 
> But ranting^Wreasoning about all of this made me understand what goes
> actually wrong.
> 
> Look at the "good" case:
> 
> 316.966607: x86_perf_event_set_period: idx: 32 period:          1 left:          2
> 316.966621: x86_perf_event_set_period: idx: 32 period:          1 left:          2
> 
> Now the bad case:
> 
> # grep -E '\[063|135\]' t.txt
> 
>    cve-2015-3290-2725    [063] d..3.   316.967339: x86_perf_event_set_period: idx: 32 period:          1 left:          2
>    cve-2015-3290-2725    [063] d.Z3.   316.967343: x86_perf_event_set_period: idx: 32 period:          1 left:          2
>            <...>-2743    [135] d..3.   316.968473: x86_perf_event_set_period: idx: 32 period:          1 left:          2
>            <...>-2743    [135] d.Z3.   316.968478: x86_perf_event_set_period: idx: 32 period:          1 left:          2
>            <...>-2743    [135] d.h2.   316.970502: x86_perf_event_set_period: idx: 32 period:       5596 left:       5596
>    cve-2015-3290-2725    [063] d.h2.   316.970503: x86_perf_event_set_period: idx: 32 period:       9385 left:       9385
> 
> Here the two hyperthread NMIs are interleaved by a few microseconds,
> which is still good by some definition of good, but later it goes south:
> 
>    cve-2015-3290-2808    [063] d.Z3.   316.970712: x86_perf_event_set_period: idx: 32 period:          1 left:          2
>    cve-2015-3290-2806    [135] d.Z3.   316.970712: x86_perf_event_set_period: idx: 32 period:          1 left:          2
> 
> Starting here they are no longer interleaved. They happen simultaneously.
> 
>    cve-2015-3290-2806    [135] d.Z3.   316.970713: intel_pmu_handle_irq: 001        100000000
>    cve-2015-3290-2808    [063] d.Z3.   316.970713: intel_pmu_handle_irq: 001        100000000
>    cve-2015-3290-2808    [063] d.Z3.   316.970713: x86_perf_event_set_period: idx: 32 period:          1 left:          2
>    cve-2015-3290-2806    [135] d.Z3.   316.970713: x86_perf_event_set_period: idx: 32 period:          1 left:          2
> 
> ...
>    cve-2015-3290-2808    [063] d.Z3.   316.970819: intel_pmu_handle_irq: 099        100000000
>    cve-2015-3290-2806    [135] d.Z3.   316.970819: x86_perf_event_set_period: idx: 32 period:          1 left:          2
>    cve-2015-3290-2808    [063] d.Z3.   316.970819: x86_perf_event_set_period: idx: 32 period:          1 left:          2
>    cve-2015-3290-2806    [135] d.Z3.   316.970819: intel_pmu_handle_irq: 099        100000000
>    cve-2015-3290-2808    [063] d.Z3.   316.970820: intel_pmu_handle_irq: 100        100000000
>    cve-2015-3290-2806    [135] d.Z3.   316.970820: x86_perf_event_set_period: idx: 32 period:          1 left:          2
>    cve-2015-3290-2806    [135] d.Z3.   316.970820: intel_pmu_handle_irq: 100        100000000
> 
> Which means they are almost in lockstep. TBH, I could not be bothered to
> repeat the experiment and turn on nanoseconds resolution for the trace
> because it's too obvious what's going on.
> 
> In the single threaded case period == 1 (left == 2) does not matter
> because the status register stays zero after handling the event and is
> only updated after the NMI returns which makes the NMI come back
> immediately, but that does not cause a loop.
> 
> But in the HT sibling concurrent case the hardware behaves differently
> and the status register is updated for whatever reason before returning
> from the NMI, which causes the endless loop because both hyper threads
> get that treatment.
> 
> To prove my point I disabled hyperthreading via /sys/.../cpu/smt/control
> and as expected the test case can't trigger the problem anymore.
> Grepping for the loop trace_printk() comes back empty. I disabled the
> other one to reduce the noise over several runs, which keeps the trace
> completely empty.
> 
> Reverse engineering hardware is fun, isn't it?
> 
> It's not hard either because every reproducible problem has a pattern.
> You just have to look for it.
> 
> Now the conclusion of this fun exercise is:
> 
>     1) The hardware behaves differently when the perf event happens
>        concurrently on HT siblings

I think I found a related erratum.
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/4th-gen-core-family-mobile-specification-update.pdf

HSM154. Fixed-Function Performance Counter May Over Count Instructions
Retired by 32 When Intel® Hyper-Threading Technology is Enabled

Problem: If, while Intel Hyper-Threading Technology is enabled, the
IA32_FIXED_CTR0 MSR (309H) is enabled by setting bits 0 and/or 1 in the
IA32_PERF_FIXED_CTR_CTRL MSR (38DH) before setting bit 32 in the
IA32_PERF_GLOBAL_CTRL MSR (38FH) then IA32_FIXED_CTR0 may over count by
up to 32.

Implication: When this erratum occurs, the fixed-function performance
counter IA32_FIXED_CTR0 may over count by up to 32.

Workaround: The following sequence avoids this erratum (steps 1 and 2
are needed if the counter was previously enabled):
1. Clear bit 32 in the IA32_PERF_GLOBAL_CTRL MSR (38FH) and clear bits 1
and 0 in the IA32_PERF_FIXED_CTR_CTRL MSR (38DH).
2. Zero the IA32_FIXED_CTR0 MSR.
3. Set bit 32 in the IA32_PERF_GLOBAL_CTRL MSR.
4. Set bits 0 and/or 1 in the IA32_PERF_FIXED_CTR_CTRL MSR as desired.
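
The ordering in the workaround can be sketched as follows. This is only
an in-memory model: struct fake_pmu and hsm154_enable_fixed_ctr0() are
hypothetical names, no MSR is actually written, and the bit positions
are taken from the erratum text above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the three MSRs named in the erratum. */
struct fake_pmu {
	uint64_t fixed_ctr0;		/* IA32_FIXED_CTR0 (309H) */
	uint64_t fixed_ctr_ctrl;	/* IA32_PERF_FIXED_CTR_CTRL (38DH) */
	uint64_t global_ctrl;		/* IA32_PERF_GLOBAL_CTRL (38FH) */
};

/*
 * Apply the HSM154 sequence to the model. 'enable_bits' selects bit 0
 * and/or bit 1 of the fixed counter control register.
 */
void hsm154_enable_fixed_ctr0(struct fake_pmu *pmu, uint64_t enable_bits)
{
	assert((enable_bits & ~UINT64_C(0x3)) == 0);

	/* Step 1: clear global enable bit 32 and control bits 1:0. */
	pmu->global_ctrl &= ~(UINT64_C(1) << 32);
	pmu->fixed_ctr_ctrl &= ~UINT64_C(0x3);

	/* Step 2: zero the counter. */
	pmu->fixed_ctr0 = 0;

	/* Step 3: set global enable bit 32 first ... */
	pmu->global_ctrl |= UINT64_C(1) << 32;

	/* Step 4: ... and only then the per-counter enable bits. */
	pmu->fixed_ctr_ctrl |= enable_bits;
}
```

The point of the sequence is the order of steps 3 and 4: the global
enable bit goes in before the per-counter enable bits, the reverse of
the order the erratum describes as problematic.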

It should explain why the issue is gone with the magic number 32 or
with Hyper-Threading disabled.

I also found a related discussion from about 9 years ago.
https://lore.kernel.org/lkml/alpine.DEB.2.11.1505181343090.32481@vincent-weaver-1.umelst.maine.edu/
Vince tried the workaround, but it does not seem to work.

So limiting the min period of fixed counter 0 to 32 seems to be the
only workaround for now.

The errata of the later platforms don't mention the issue. It should
only impact Haswell. I will double check.

> 
>     2) The frequency estimation algorithm is broken


For events which occur frequently, e.g., instructions, cycles, yes,
the frequency estimation algorithm doesn't work well.

But there are events that may not occur frequently. If a big init period
is set, it may be impossible to get the required freq for those events.

It's really hard to pick a universal init period that works for all events.

I'm thinking perf may only calculate/pre-set an init period for the
Linux defined architectural events, e.g., instructions, cycles,
branches, cache related events, etc. For the other ARCH specific
events, I'm afraid the period has to start at 1.

> 
>     3) Using a 'limit' guesstimate is just papering over the underlying
>        problems

It's possible that a user sets a small number with -c. If the number is
less than the 'limit', it needs to be adjusted to avoid HW failure.
I think the 'limit' is still required.
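
A minimal sketch of such a clamp, as a user-space model. The name
clamp_period() is hypothetical, and the value 32 used in the example
below is only the number discussed in this thread, not a confirmed
hardware limit.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical clamp: raise a too-small period to 'min_period' and
 * leave larger periods untouched.
 */
int64_t clamp_period(int64_t left, int64_t min_period)
{
	assert(min_period > 0);
	return left < min_period ? min_period : left;
}
```

For example, clamp_period(1, 32) returns 32, while clamp_period(1000, 32)
returns 1000 unchanged.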

Thanks,
Kan

> 
> Thanks,
> 
>         tglx
> ---
>  arch/x86/events/core.c       |    1 +
>  arch/x86/events/intel/core.c |    5 ++++-
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -1400,6 +1400,7 @@ int x86_perf_event_set_period(struct per
>  
>  	static_call_cond(x86_pmu_limit_period)(event, &left);
>  
> +	trace_printk("idx: %2d period: %10lld left: %10lld\n", idx, period, left);
>  	this_cpu_write(pmc_prev_left[idx], left);
>  
>  	/*
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3164,13 +3164,16 @@ static int intel_pmu_handle_irq(struct p
>  
>  	loops = 0;
>  again:
> +	if (loops)
> +		trace_printk("%03d %16llx\n", loops, status);
>  	intel_pmu_lbr_read();
>  	intel_pmu_ack_status(status);
>  	if (++loops > 100) {
>  		static bool warned;
>  
>  		if (!warned) {
> -			WARN(1, "perfevents: irq loop stuck!\n");
> +			tracing_off();
> +			//WARN(1, "perfevents: irq loop stuck!\n");
>  			perf_event_print_debug();
>  			warned = true;
>  		}
> 
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-08-15 15:39               ` Liang, Kan
@ 2024-08-15 18:26                 ` Thomas Gleixner
  2024-08-15 20:15                   ` Liang, Kan
  2024-08-15 19:01                 ` Vince Weaver
  1 sibling, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2024-08-15 18:26 UTC (permalink / raw)
  To: Liang, Kan, Li Huafei, peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, bp, dave.hansen, x86, hpa, linux-perf-users,
	linux-kernel, Andi Kleen, Vince Weaver

Kan!

On Thu, Aug 15 2024 at 11:39, Kan Liang wrote:
> On 2024-08-14 6:47 p.m., Thomas Gleixner wrote:
>> Now the conclusion of this fun exercise is:
>> 
>>     1) The hardware behaves differently when the perf event happens
>>        concurrently on HT siblings
>
> I think I found a related erratum.

> HSM154. Fixed-Function Performance Counter May Over Count Instructions
> Retired by 32 When Intel® Hyper-Threading Technology is Enabled
>
> Problem: If, while Intel Hyper-Threading Technology is enabled, the
> IA32_FIXED_CTR0 MSR (309H) is enabled by setting bits 0 and/or 1 in the
> IA32_PERF_FIXED_CTR_CTRL MSR (38DH) before setting bit 32 in the
> IA32_PERF_GLOBAL_CTRL MSR (38FH) then IA32_FIXED_CTR0 may over count by
> up to 32.
>
> Implication: When this erratum occurs, the fixed-function performance
> counter IA32_FIXED_CTR0 may over count by up to 32.

Sure. That's only explaining half of the problem.

As I demonstrated in the non-contended case even with a count of 2 (I
tried 1 too) the status bit is never set on the second check.

Which is weird, because the number of instructions between setting the
count and re-checking the status MSR is definitely larger than 2 (or 1).

> Workaround: The following sequence avoids this erratum (steps 1 and 2
> are needed if the counter was previously enabled):
> 1. Clear bit 32 in the IA32_PERF_GLOBAL_CTRL MSR (38FH) and clear bits 1
> and 0 in the IA32_PERF_FIXED_CTR_CTRL MSR (38DH).
> 2. Zero the IA32_FIXED_CTR0 MSR.
> 3. Set bit 32 in the IA32_PERF_GLOBAL_CTRL MSR.
> 4. Set bits 0 and/or 1 in the IA32_PERF_FIXED_CTR_CTRL MSR as desired.
>
> It should explain why the issue is gone with the magic number 32 or
> with Hyper-Threading disabled.

It explains only half of it. If you use 32, then the counter is set to
-32 so the overcount of 32 will still bring it to 0, which should set
the status bit, no?
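
As a user-space illustration of the arithmetic above (a sketch, not kernel
code; the 48-bit counter width and the helper names arm_counter() and wraps()
are assumptions made for this example):

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: a 48-bit wide counter. */
#define CNTVAL_MASK ((1ULL << 48) - 1)

/* Hypothetical helper: the raw value written to the counter for 'left'. */
static uint64_t arm_counter(int64_t left)
{
	assert(left > 0);
	return (uint64_t)(-left) & CNTVAL_MASK;
}

/* Hypothetical helper: does counting 'events' more events wrap past the top? */
static int wraps(uint64_t raw, uint64_t events)
{
	return raw + events > CNTVAL_MASK;
}
```

With left = 32 the raw value is 2^48 - 32, so an over count of exactly 32
wraps the counter to 0, while any smaller over count leaves it short of the
overflow.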

> I also found a related discussion about 9 years ago.
> https://lore.kernel.org/lkml/alpine.DEB.2.11.1505181343090.32481@vincent-weaver-1.umelst.maine.edu/
> Vince tried the workaround but it seems not work.

Let me play with that. :)

>>     2) The frequency estimation algorithm is broken
>
> For the events which occurs frequently, e.g., instructions, cycles, yes,
> the frequency estimation algorithm doesn't work well.
>
> But there are events that may not occur frequently. If a big init period
> is set, it may be impossible to get the required freq for those events.
>
> It's really hard to pick a universal init period that works for all
> events.

I understand that, but especially for RETIRED it's obvious :)

> I'm thinking perf may only calculate/pre-set a init period for the Linux
> defined architectural events, e.g., instructions, cycles, branches,
> cache related events, etc. For the other ARCH specific events, I'm
> afraid the period has to start 1.

Yes, that would be way better than what we have now.

>> 
>>     3) Using a 'limit' guestimate is just papering over the underlying
>>        problems
>
> It's possible that a user set a small number with -c. If the number is
> less than the 'limit', it needs to be adjusted to avoid HW failure.
> I think the 'limit' is still required.

I'm not against the limit per se. I'm against guestimated limits which
are thrown into the code without understanding the underlying problem.

They just paper over it up to the point where they bite back because the
guestimate was off by $N.

Thanks,

        tglx


* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-08-15 18:26                 ` Thomas Gleixner
@ 2024-08-15 20:15                   ` Liang, Kan
  2024-08-15 23:43                     ` Thomas Gleixner
  0 siblings, 1 reply; 17+ messages in thread
From: Liang, Kan @ 2024-08-15 20:15 UTC (permalink / raw)
  To: Thomas Gleixner, Li Huafei, peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, bp, dave.hansen, x86, hpa, linux-perf-users,
	linux-kernel, Andi Kleen, Vince Weaver



On 2024-08-15 2:26 p.m., Thomas Gleixner wrote:
> Kan!
> 
> On Thu, Aug 15 2024 at 11:39, Kan Liang wrote:
>> On 2024-08-14 6:47 p.m., Thomas Gleixner wrote:
>>> Now the conclusion of this fun exercise is:
>>>
>>>     1) The hardware behaves differently when the perf event happens
>>>        concurrently on HT siblings
>>
>> I think I found a related erratum.
> 
>> HSM154. Fixed-Function Performance Counter May Over Count Instructions
>> Retired by 32 When Intel® Hyper-Threading Technology is Enabled
>>
>> Problem: If, while Intel Hyper-Threading Technology is enabled, the
>> IA32_FIXED_CTR0 MSR
>> (309H) is enabled by setting bits 0 and/or 1 in the
>> IA32_PERF_FIXED_CTR_CTRL MSR
>> (38DH) before setting bit 32 in the IA32_PERF_GLOBAL_CTRL MSR (38FH) then
>> IA32_FIXED_CTR0 may over count by up to 32.
>>
>> Implication: When this erratum occurs, the fixed-function performance
>> counter IA32_FIXED_CTR0 may over count by up to 32.
> 
> Sure. That's only explaining half of the problem.
> 
> As I demonstrated in the non-contended case even with a count of 2 (I
> tried 1 too) the status bit is never set on the second check.
> 

Do you mean the below example? The status bit (32) of the fixed counter
0 is always set.


65.147782: x86_perf_event_set_period: idx:    32 period:         1 left:  2
65.147783: intel_pmu_handle_irq:      loops: 001 status: 100000000

65.147784: x86_perf_event_set_period: idx:    32 period:         1 left:  2
65.147784: intel_pmu_handle_irq:      loops: 002 status: 100000000
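
For reference, the status value in these trace lines has only bit 32 set,
which is the overflow bit of fixed counter 0 in the global status register.
A minimal sketch of that decoding (user-space C; the helper name
fixed_ctr_overflowed() and the limit of three fixed counters are assumptions
for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Fixed counter overflow bits start at bit 32 of the global status value. */
#define FIXED_STATUS_SHIFT 32

/* Hypothetical helper: is the overflow bit of fixed counter 'n' set? */
static int fixed_ctr_overflowed(uint64_t status, unsigned int n)
{
	assert(n < 3); /* assumption: three fixed counters */
	return (int)((status >> (FIXED_STATUS_SHIFT + n)) & 1);
}
```

Feeding it the traced value 0x100000000 reports an overflow for fixed
counter 0 only.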

> Which is weird, because the number of instructions between setting the
> count and re-checking the status MSR is definitely larger than 2 (or 1).
> 
>> Workaround: The following sequence avoids this erratum (steps 1 and 2
>> are needed if the counter was previously enabled):
>> 1. Clear bit 32 in the IA32_PERF_GLOBAL_CTRL MSR (38FH) and clear bits 1
>> and 0 in the IA32_PERF_FIXED_CTR_CTRL MSR (38DH).
>> 2. Zero the IA32_FIXED_CTR0 MSR.
>> 3. Set bit 32 in the IA32_PERF_GLOBAL_CTRL MSR.
>> 4. Set bits 0 and/or 1 in the IA32_PERF_FIXED_CTR_CTRL MSR as desired.
>>
>> It should explains that the issue is gone with the magic number 32 or
>> disabling the Hyper-Threading.
> 
> It explains only half of it. If you use 32, then the counter is set to
> -32 so the overcount of 32 will still bring it to 0, which should set
> the status bit, no?

I think it's up to 32, not always 32.
I don't have more details regarding the issue. The architect of HSW has
left. I'm asking around internally to find the original bug report of
the erratum. Hope there are more details in the report.

> 
>> I also found a related discussion about 9 years ago.
>> https://lore.kernel.org/lkml/alpine.DEB.2.11.1505181343090.32481@vincent-weaver-1.umelst.maine.edu/
>> Vince tried the workaround but it seems not work.
> 
> Let me play with that. :)

Appreciate it.

> 
>>>     2) The frequency estimation algorithm is broken
>>
>> For the events which occurs frequently, e.g., instructions, cycles, yes,
>> the frequency estimation algorithm doesn't work well.
>>
>> But there are events that may not occur frequently. If a big init period
>> is set, it may be impossible to get the required freq for those events.
>>
>> It's really hard to pick a universal init period that works for all
>> events.
> 
> I understand that, but especially for RETIRED it's obvious :)
> 
>> I'm thinking perf may only calculate/pre-set a init period for the Linux
>> defined architectural events, e.g., instructions, cycles, branches,
>> cache related events, etc. For the other ARCH specific events, I'm
>> afraid the period has to start 1.
> 
> Yes, that would be way better than what we have now.

Great. I will post a patch to improve it.

> 
>>>
>>>     3) Using a 'limit' guestimate is just papering over the underlying
>>>        problems
>>
>> It's possible that a user set a small number with -c. If the number is
>> less than the 'limit', it needs to be adjusted to avoid HW failure.
>> I think the 'limit' is still required.
> 
> I'm not against the limit per se. I'm against guestimated limits which
> are thrown into the code without understanding the underlying problem.
> 
> They just paper over it up to the point where they bite back because the
> guestimate was off by $N.
> 

Got it.

Thanks,
Kan


* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-08-15 20:15                   ` Liang, Kan
@ 2024-08-15 23:43                     ` Thomas Gleixner
  2024-08-16 19:27                       ` Liang, Kan
  0 siblings, 1 reply; 17+ messages in thread
From: Thomas Gleixner @ 2024-08-15 23:43 UTC (permalink / raw)
  To: Liang, Kan, Li Huafei, peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, bp, dave.hansen, x86, hpa, linux-perf-users,
	linux-kernel, Andi Kleen, Vince Weaver

On Thu, Aug 15 2024 at 16:15, Kan Liang wrote:
> On 2024-08-15 2:26 p.m., Thomas Gleixner wrote:
>>> Implication: When this erratum occurs, the fixed-function performance
>>> counter IA32_FIXED_CTR0 may over count by up to 32.
>> 
>> Sure. That's only explaining half of the problem.
>> 
>> As I demonstrated in the non-contended case even with a count of 2 (I
>> tried 1 too) the status bit is never set on the second check.
>> 
>
> Do you mean the below example? The status bit (32) of the fixed counter
> 0 is always set.

When HT is off or the threads are not running the handler concurrently
then there is zero looping. Once they start to fiddle concurrently the
looping starts and potentially never ends.

> 65.147782: x86_perf_event_set_period: idx:    32 period:         1 left:  2
> 65.147783: intel_pmu_handle_irq:      loops: 001 status: 100000000
>
> 65.147784: x86_perf_event_set_period: idx:    32 period:         1 left:  2
> 65.147784: intel_pmu_handle_irq:      loops: 002 status: 100000000

So in the non-concurrent (which includes !HT) case the status check
after handling the event is always 0. This never gets into a loop, not
even once.

>>> It should explains that the issue is gone with the magic number 32 or
>>> disabling the Hyper-Threading.
>> 
>> It explains only half of it. If you use 32, then the counter is set to
>> -32 so the overcount of 32 will still bring it to 0, which should set
>> the status bit, no?
>
> I think it's up to 32, not always 32.
> I don't have more details regarding the issue. The architect of HSW has
> left. I'm asking around internally to find the original bug report of
> the erratum. Hope there are more details in the report.

See below.

>>> I also found a related discussion about 9 years ago.
>>> https://lore.kernel.org/lkml/alpine.DEB.2.11.1505181343090.32481@vincent-weaver-1.umelst.maine.edu/
>>> Vince tried the workaround but it seems not work.
>> 
>> Let me play with that. :)
>
> Appreciate it.

I got it actually working. The initial sequence which "worked" is:

    1) Clear bit 32  in IA32_PERF_GLOBAL_CTRL
    2) Clear bit 0/1 in IA32_PERF_FIXED_CTR_CTRL
    3) Zero the IA32_FIXED_CTR0 MSR
    4) Set IA32_FIXED_CTR0 to (-left) & mask;
    5) Set bit 0/1 in IA32_PERF_FIXED_CTR_CTRL
    6) Set bit 32  in IA32_PERF_GLOBAL_CTRL

If I omit #3 it does not work. If I flip #5 and #6 it does not work.

So the initial "working" variant I had was hacking this sequence into
x86_perf_event_set_period() (omitting the fixed counter 0 conditionals
for readability):

	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));

	rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);

	wrmsrl(hwc->event_base, 0);
	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);

	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);

Now I thought, that needs to be in intel/core.c and implemented a proper
quirk. And of course being smart I decided it's a brilliant idea to use
the cached values instead of the rdmsrl()s.

	cglbl = hybrid(cpuc->pmu, intel_ctrl) & ~cpuc->intel_ctrl_guest_mask;
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));

	cfixd = cpuc->fixed_ctrl_val;
	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);

	wrmsrl(hwc->event_base, 0);
	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);

	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);

Surprise, surprise, that does not work. So I went back and wanted to
know which rdmsrl() is curing it:

	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));

	cfixd = cpuc->fixed_ctrl_val;
	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);

	wrmsrl(hwc->event_base, 0);
	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);

	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);

This worked. Using the rdmsrl() only for MSR_ARCH_PERFMON_FIXED_CTR_CTRL
did not.

Now I got bold and boiled it down to:

	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);

	wrmsrl(hwc->event_base, 0);
	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);

and the whole thing still worked. *BLINK*

Exactly zero loop entries in the trace when running 100 instances of
that cve test case, which otherwise spams the trace with entries and
ends up in the loop > 100 path within a split second.

Removing the zeroing of the counter makes it come back, but reducing the
whole nonsense to:

	wrmsrl(hwc->event_base, 0);
	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);

makes the loop problem go away, but it "works" so well that the stupid
frequency adjustment algorithm keeps the left == 1, i.e. count == 2, case
around long enough to trigger 'hung task messages' ....

Now I looked more at the dmesg output of all the experiments. In all
"working" cases except one, running these 100 instances of cve... results
in a varying cascade of

   perf: interrupt took too long (2503 > 2500), lowering ...

messages.

The one case where this does not happen is when the limit is
unconditionally set to 32. But when I apply this limit only for the
fixed counter 0 it comes back. 

Now I looked at when these 'took too long' problems surface aside of the
obvious case of extensive looping. They are unrelated to the hyper
threading issue as I can reproduce with smt=off too.

They always happen when a counter was armed with a count < 32 and two
events expired in the same NMI. The test case uses fixed counter 0 and
general counter 0 for the other event.

So that erratum is a good hint, but that hardware does have more issues
than it tells.

So I think we should just apply that limit patch with a proper change
log and also make it:

hsw_limit(...)
{
	*left = max(*left, erratum_hsw11(event) ? 128 : 32);
}

or such.

That limit won't cure the overcount issue from that HSM154 erratum, but
*SHRUG*.

Thanks,

        tglx


* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-08-15 23:43                     ` Thomas Gleixner
@ 2024-08-16 19:27                       ` Liang, Kan
  2024-08-17 12:22                         ` Liang, Kan
  2024-08-17 12:23                         ` Thomas Gleixner
  0 siblings, 2 replies; 17+ messages in thread
From: Liang, Kan @ 2024-08-16 19:27 UTC (permalink / raw)
  To: Thomas Gleixner, Li Huafei, peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, bp, dave.hansen, x86, hpa, linux-perf-users,
	linux-kernel, Andi Kleen, Vince Weaver

Hi Thomas,

On 2024-08-15 7:43 p.m., Thomas Gleixner wrote:
> On Thu, Aug 15 2024 at 16:15, Kan Liang wrote:
>> On 2024-08-15 2:26 p.m., Thomas Gleixner wrote:
>>>> Implication: When this erratum occurs, the fixed-function performance
>>>> counter IA32_FIXED_CTR0 may over count by up to 32.
>>>
>>> Sure. That's only explaining half of the problem.
>>>
>>> As I demonstrated in the non-contended case even with a count of 2 (I
>>> tried 1 too) the status bit is never set on the second check.
>>>
>>
>> Do you mean the below example? The status bit (32) of the fixed counter
>> 0 is always set.
> 
> When HT is off or the threads are not running the handler concurrently
> then there is zero looping. Once they start to fiddle concurrently the
> looping starts and potentially never ends.
> 
>> 65.147782: x86_perf_event_set_period: idx:    32 period:         1 left:  2
>> 65.147783: intel_pmu_handle_irq:      loops: 001 status: 100000000
>>
>> 65.147784: x86_perf_event_set_period: idx:    32 period:         1 left:  2
>> 65.147784: intel_pmu_handle_irq:      loops: 002 status: 100000000
> 
> So in the non-concurrent (which includes !HT) case the status check
> after handling the event is always 0. This never gets into a loop, not
> even once.
> 
>>>> It should explains that the issue is gone with the magic number 32 or
>>>> disabling the Hyper-Threading.
>>>
>>> It explains only half of it. If you use 32, then the counter is set to
>>> -32 so the overcount of 32 will still bring it to 0, which should set
>>> the status bit, no?
>>
>> I think it's up to 32, not always 32.
>> I don't have more details regarding the issue. The architect of HSW has
>> left. I'm asking around internally to find the original bug report of
>> the erratum. Hope there are more details in the report.
> 
> See below.
> 
>>>> I also found a related discussion about 9 years ago.
>>>> https://lore.kernel.org/lkml/alpine.DEB.2.11.1505181343090.32481@vincent-weaver-1.umelst.maine.edu/
>>>> Vince tried the workaround but it seems not work.
>>>
>>> Let me play with that. :)
>>
>> Appreciate it.
> 
> I got it actually working. The initial sequence which "worked" is:
> 
>     1) Clear bit 32  in IA32_PERF_GLOBAL_CTRL
>     2) Clear bit 0/1 in IA32_PERF_FIXED_CTR_CTRL
>     3) Zero the IA32_FIXED_CTR0 MSR
>     4) Set IA32_FIXED_CTR0 to (-left) & mask;
>     5) Set bit 0/1 in IA32_PERF_FIXED_CTR_CTRL
>     6) Set bit 32  in IA32_PERF_GLOBAL_CTRL
> 
> If I omit #3 it does not work. If I flip #5 and #6 it does not work.
> 
> So the initial "working" variant I had was hacking this sequence into
> x86_perf_event_set_period() (omitting the fixed counter 0 conditionals
> for readability):
> 
> 	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
> 	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));
> 
> 	rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
> 	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);
> 
> 	wrmsrl(hwc->event_base, 0);
> 	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
> 
> 	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
> 	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
> 
> Now I thought, that needs to be in intel/core.c and implemented a proper
> quirk. And of course being smart I decided it's a brilliant idea to use
> the cached values instead of the rdmsrl()s.
> 
> 	cglbl = hybrid(cpuc->pmu, intel_ctrl) & ~cpuc->intel_ctrl_guest_mask;
> 	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));
> 
>         cfixd = cpuc->fixed_ctrl_val;
> 	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);
> 
> 	wrmsrl(hwc->event_base, 0);
> 	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
> 
> 	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
> 	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
> 
> Surprise, surprise, that does not work. So I went back and wanted to
> know which rdmsrl() is curing it:
> 
> 	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
> 	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));
> 
>         cfixd = cpuc->fixed_ctrl_val;
> 	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);
> 
> 	wrmsrl(hwc->event_base, 0);
> 	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
> 
> 	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
> 	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
> 
> This worked. Using the rdmsrl() only for MSR_ARCH_PERFMON_FIXED_CTR_CTRL
> did not.
> 
> Now I got bold and boiled it down to:
> 
> 	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
> 
> 	wrmsrl(hwc->event_base, 0);
> 	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
> 
> and the whole thing still worked. *BLINK*
> 
> Exactly zero loop entries in the trace when running 100 instances of
> that cve test case, which otherwise spams the trace with entries and
> ends up in the loop > 100 path within a split second.
> 
> Removing the zeroing of the counter makes it come back, but reducing the
> whole nonsense to:
> 
> 	wrmsrl(hwc->event_base, 0);
> 	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
> 

Thanks for all the testing. So the trick is to clear the counter before
writing to it. That sounds like a usual way to trigger some ucode magic.


> makes the loop problem go away, but it "works" so well that the stupid
> frequency adjustment algorithm keeps the left == 1, i.e count == 2 case
> stay long enough around to trigger 'hung task messages' ....
> 
> Now I looked more at the dmesg output of all the experiments. In all
> "working" cases except one running these 100 instances of cve... results
> in a varying cascade of
> 
>    perf: interrupt took too long (2503 > 2500), lowering ...
> 
> messages.
> 
> The one case where this does not happen is when the limit is
> unconditionally set to 32. But when I apply this limit only for the
> fixed counter 0 it comes back.


Yes, it sounds like there is something wrong with counts < 32 as well.

> 
> Now I looked at when these 'took too long' problems surface aside of the
> obvious case of extensive looping. They are unrelated to the hyper
> threading issue as I can reproduce with smt=off too.
> 
> They always happen when a counter was armed with a count < 32 and two
> events expired in the same NMI. The test case uses fixed counter 0 and
> general counter 0 for the other event.
> 
> So that erratum is a good hint, but that hardware does have more issues
> than it tells.
> 
> So I think we should just apply that limit patch with a proper change
> log and also make it:
> 
> hsw_limit(...)
> {
> 	*left = max(*left, erratum_hsw11(event) ? 128 : 32);
> }
>

The HSW11 is also BDM11. It sounds like we need the trick from both bdw
and nhm.

How about this?

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index e8bd45556c30..42f557a128b9 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4664,6 +4664,12 @@ static void nhm_limit_period(struct perf_event *event, s64 *left)
 	*left = max(*left, 32LL);
 }

+static void hsw_limit_period(struct perf_event *event, s64 *left)
+{
+	nhm_limit_period(event, left);
+	bdw_limit_period(event, left);
+}
 static void glc_limit_period(struct perf_event *event, s64 *left)
 {
 	if (event->attr.precise_ip == 3)

Do you plan to post the "limit" patch for HSW?
Or should I send the patch?

> or such.
> 
> That limit won't cure the overcount issue from that HSM154 erratum, but
> *SHRUG*.
> 

The code has been there without the fix for ~10 years. I didn't find any
complaints. I guess it should be fine to leave it as is.

Thanks,
Kan


* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-08-16 19:27                       ` Liang, Kan
@ 2024-08-17 12:22                         ` Liang, Kan
  2024-08-17 12:23                         ` Thomas Gleixner
  1 sibling, 0 replies; 17+ messages in thread
From: Liang, Kan @ 2024-08-17 12:22 UTC (permalink / raw)
  To: Thomas Gleixner, Li Huafei, peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, bp, dave.hansen, x86, hpa, linux-perf-users,
	linux-kernel, Andi Kleen, Vince Weaver



On 2024-08-16 3:27 p.m., Liang, Kan wrote:
> The HSW11 is also BDM11. It sounds like we need the trick from both bdw
> and nhm.
> 
> How about this?
> 
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index e8bd45556c30..42f557a128b9 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -4664,6 +4664,12 @@ static void nhm_limit_period(struct perf_event
> *event, s64 *left)
>  	*left = max(*left, 32LL);
>  }
> 
> +static void hsw_limit_period(struct perf_event *event, s64 *left)
> +{
> +	nhm_limit_period(event, left);


Sigh, apparently I used an old specification update (Rev 003) for HSW.
It claims that BDM55 also applies to HSW (as HSW75).
https://www.mouser.com/pdfdocs/xeone31200v3specupdate.PDF
So I thought the nhm_limit_period() should be used for HSW as well.

However, a newer version (Rev 016) dropped HSW75 for HSW.
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e3-1200v3-spec-update-oct2016.pdf

Yes, as you suggested, something like the below is required.


diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index e8bd45556c30..b22a4289553b 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4634,6 +4634,17 @@ static enum hybrid_cpu_type adl_get_hybrid_cpu_type(void)
 	return HYBRID_INTEL_CORE;
 }

+static inline bool erratum_hsw11(struct perf_event *event)
+{
+	return (event->hw.config & INTEL_ARCH_EVENT_MASK) ==
+		X86_CONFIG(.event=0xc0, .umask=0x01);
+}
+
+static void hsw_limit_period(struct perf_event *event, s64 *left)
+{
+	*left = max(*left, erratum_hsw11(event) ? 128 : 32);
+}
+
 /*
  * Broadwell:
  *
@@ -4651,8 +4662,7 @@ static enum hybrid_cpu_type adl_get_hybrid_cpu_type(void)
  */
 static void bdw_limit_period(struct perf_event *event, s64 *left)
 {
-	if ((event->hw.config & INTEL_ARCH_EVENT_MASK) ==
-			X86_CONFIG(.event=0xc0, .umask=0x01)) {
+	if (erratum_hsw11(event)) {
 		if (*left < 128)
 			*left = 128;
 		*left &= ~0x3fULL;
@@ -6821,6 +6831,7 @@ __init int intel_pmu_init(void)

 		x86_pmu.hw_config = hsw_hw_config;
 		x86_pmu.get_event_constraints = hsw_get_event_constraints;
+		x86_pmu.limit_period = hsw_limit_period;
 		x86_pmu.lbr_double_abort = true;
 		extra_attr = boot_cpu_has(X86_FEATURE_RTM) ?
 			hsw_format_attr : nhm_format_attr;
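
For illustration only, a user-space sketch of what the two clamps do to a
requested period. Plain integers stand in for struct perf_event, and the
names hsw_limit(), bdw_limit() and the is_hsw11_event flag are made up for
this sketch; they are not the kernel functions themselves:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the proposed Haswell clamp: only enforce a minimum period. */
static int64_t hsw_limit(int64_t left, int is_hsw11_event)
{
	int64_t min = is_hsw11_event ? 128 : 32;

	assert(left > 0);
	return left < min ? min : left;
}

/*
 * Sketch of the existing Broadwell clamp: for the affected event, enforce a
 * minimum of 128 and clear the low 6 bits; other events are left untouched.
 */
static int64_t bdw_limit(int64_t left, int is_hsw11_event)
{
	assert(left > 0);
	if (is_hsw11_event) {
		if (left < 128)
			left = 128;
		left &= ~0x3fLL;
	}
	return left;
}
```

For a requested period of 200 on the affected event, hsw_limit() keeps 200
while bdw_limit() rounds down to 192; for a period of 1 on any other event,
hsw_limit() raises it to 32 while bdw_limit() leaves it at 1.
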

Thanks,
Kan

> +	bdw_limit_period(event, left);
> +}
>  static void glc_limit_period(struct perf_event *event, s64 *left)
>  {
>  	if (event->attr.precise_ip == 3)
> 
> Do you plan to post the "limit" patch for HSW?
> Or should I send the patch?


* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-08-16 19:27                       ` Liang, Kan
  2024-08-17 12:22                         ` Liang, Kan
@ 2024-08-17 12:23                         ` Thomas Gleixner
  1 sibling, 0 replies; 17+ messages in thread
From: Thomas Gleixner @ 2024-08-17 12:23 UTC (permalink / raw)
  To: Liang, Kan, Li Huafei, peterz, mingo
  Cc: acme, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, bp, dave.hansen, x86, hpa, linux-perf-users,
	linux-kernel, Andi Kleen, Vince Weaver

On Fri, Aug 16 2024 at 15:27, Kan Liang wrote:
> On 2024-08-15 7:43 p.m., Thomas Gleixner wrote:
>
> The HSW11 is also BDM11. It sounds like we need the trick from both bdw
> and nhm.
>
> How about this?
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index e8bd45556c30..42f557a128b9 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -4664,6 +4664,12 @@ static void nhm_limit_period(struct perf_event
> *event, s64 *left)
>  	*left = max(*left, 32LL);
>  }
>
> +static void hsw_limit_period(struct perf_event *event, s64 *left)
> +{
> +	nhm_limit_period(event, left);
> +	bdw_limit_period(event, left);
> +}
>  static void glc_limit_period(struct perf_event *event, s64 *left)
>  {
>  	if (event->attr.precise_ip == 3)
>
> Do you plan to post the "limit" patch for HSW?
> Or should I send the patch?

Go wild...


* Re: [PATCH] perf/x86/intel: Restrict period on Haswell
  2024-08-15 15:39               ` Liang, Kan
  2024-08-15 18:26                 ` Thomas Gleixner
@ 2024-08-15 19:01                 ` Vince Weaver
  1 sibling, 0 replies; 17+ messages in thread
From: Vince Weaver @ 2024-08-15 19:01 UTC (permalink / raw)
  To: Liang, Kan
  Cc: Thomas Gleixner, Li Huafei, peterz, mingo, acme, namhyung,
	mark.rutland, alexander.shishkin, jolsa, irogers, adrian.hunter,
	bp, dave.hansen, x86, hpa, linux-perf-users, linux-kernel,
	Andi Kleen, Vince Weaver

On Thu, 15 Aug 2024, Liang, Kan wrote:

> I also found a related discussion about 9 years ago.
> https://lore.kernel.org/lkml/alpine.DEB.2.11.1505181343090.32481@vincent-weaver-1.umelst.maine.edu/
> Vince tried the workaround but it seems not work.
> 
> So limiting the min period of the fixed counter 0 to 32 seems the only
> workaround for now.

I'm actually still lurking on this discussion.  My regular fuzzing machine 
is still a Haswell machine (the same one from 9 years ago) and it reliably 
hits this issue within hours.  I hadn't realized there was an official 
reproducer.

If a patch does come out of this I'll be glad to test it.

Vince Weaver
vincent.weaver@maine.edu


end of thread (newest message: 2024-08-17 12:23 UTC)

Thread overview: 17+ messages
2024-07-29 22:33 [PATCH] perf/x86/intel: Restrict period on Haswell Li Huafei
2024-07-31 19:20 ` Thomas Gleixner
2024-08-13 13:13   ` Li Huafei
2024-08-14 14:43     ` Thomas Gleixner
2024-08-14 14:52     ` Thomas Gleixner
2024-08-14 18:15       ` Liang, Kan
2024-08-14 19:01         ` Thomas Gleixner
2024-08-14 19:37           ` Liang, Kan
2024-08-14 22:47             ` Thomas Gleixner
2024-08-15 15:39               ` Liang, Kan
2024-08-15 18:26                 ` Thomas Gleixner
2024-08-15 20:15                   ` Liang, Kan
2024-08-15 23:43                     ` Thomas Gleixner
2024-08-16 19:27                       ` Liang, Kan
2024-08-17 12:22                         ` Liang, Kan
2024-08-17 12:23                         ` Thomas Gleixner
2024-08-15 19:01                 ` Vince Weaver
