public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: "Liang, Kan" <kan.liang@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: mingo@redhat.com, linux-kernel@vger.kernel.org,
	irogers@google.com, eranian@google.com, ak@linux.intel.com
Subject: Re: [PATCH] perf/core: Fix endless multiplex timer
Date: Tue, 3 Mar 2020 20:40:10 -0500	[thread overview]
Message-ID: <b71515e4-484e-d80a-37db-2e51abe69928@linux.intel.com> (raw)
In-Reply-To: <20200303210812.GA4745@worktop.programming.kicks-ass.net>



On 3/3/2020 4:08 PM, Peter Zijlstra wrote:
> On Tue, Mar 03, 2020 at 12:28:19PM -0800, kan.liang@linux.intel.com wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> A lot of time are spent in writing uncore MSRs even though no perf is
>> running.
>>
>>    4.66%  swapper      [kernel.kallsyms]        [k] native_write_msr
>>              |
>>               --4.56%--native_write_msr
>>                         |
>>                         |--1.68%--snbep_uncore_msr_enable_box
>>                         |          perf_mux_hrtimer_handler
>>                         |          __hrtimer_run_queues
>>                         |          hrtimer_interrupt
>>                         |          smp_apic_timer_interrupt
>>                         |          apic_timer_interrupt
>>                         |          cpuidle_enter_state
>>                         |          cpuidle_enter
>>                         |          do_idle
>>                         |          cpu_startup_entry
>>                         |          start_kernel
>>                         |          secondary_startup_64
>>
>> The root cause is that multiplex timer was not stopped when perf stat
>> finished.
>> Current perf relies on rotate_necessary to determine whether the
>> multiplex timer should be stopped. The variable only be reset in
>> ctx_sched_out(), which is not enough for system-wide event.
>> Perf stat invokes PERF_EVENT_IOC_DISABLE to stop system-wide event
>> before closing it.
>>    perf_ioctl()
>>      perf_event_disable()
>>        event_sched_out()
>> The rotate_necessary will never be reset.
>>
>> The issue is a generic issue, not just impact the uncore.
>>
>> Check whether we had been multiplexing. If yes, reset rotate_necessary
>> for the last active event in __perf_event_disable().
>>
>> Fixes: fd7d55172d1e ("perf/cgroups: Don't rotate events for cgroups unnecessarily")
>> Reported-by: Andi Kleen <ak@linux.intel.com>
>> Reviewed-by: Andi Kleen <ak@linux.intel.com>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> ---
>>   kernel/events/core.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 3f1f77de7247..50688de56181 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -2242,6 +2242,16 @@ static void __perf_event_disable(struct perf_event *event,
>>   		update_cgrp_time_from_event(event);
>>   	}
>>   
>> +	/*
>> +	 * If we had been multiplexing,
>> +	 * stop the rotations for the last active event.
>> +	 * Only need to check system wide events.
>> +	 * For task events, it will be checked in ctx_sched_out().
>> +	 */
>> +	if ((cpuctx->ctx.nr_events != cpuctx->ctx.nr_active) &&
>> +	    (cpuctx->ctx.nr_active == 1))
>> +		cpuctx->ctx.rotate_necessary = 0;
>> +
>>   	if (event == event->group_leader)
>>   		group_sched_out(event, cpuctx, ctx);
>>   	else
> 
> 
> I'm thinking this is wrong.
> 
> That is, yes, this fixes the observed problem, but it also misses at
> least one other site. Which seems to suggest we ought to take a
> different approach.
> 
> But even with that; I wonder if the actual condition isn't wrong.
> Suppose the event was exclusive, and other events weren't scheduled
> because of that. Then you disable the one exclusive event _and_ kill
> rotation, so then nothing else will ever get on.
> 
> So what I think was supposed to happen is rotation killing itself;
> rotation will schedule out the context -- which will clear the flag, and
> then schedule the thing back in -- which will set the flag again when
> needed.
> 
> Now, that isn't happening, and I think I see why, because when we drop
> to !nr_active, we terminate ctx_sched_out() before we get to clearing
> the flag, oops!
> 
> So how about something like this?
> 
> ---
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index e453589da97c..7947bd3271a9 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2182,6 +2182,7 @@ __perf_remove_from_context(struct perf_event *event,
>   
>   	if (!ctx->nr_events && ctx->is_active) {
>   		ctx->is_active = 0;
> +		ctx->rotate_necessary = 0;
>   		if (ctx->task) {
>   			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
>   			cpuctx->task_ctx = NULL;


The patch can fix the observed problem with uncore PMU.
But it cannot fix all the cases with core PMU, especially when NMI 
watchdog is enabled.
Because the ctx->nr_events never be 0 with NMI watchdog enabled.

For example,
$echo 1 > /proc/sys/kernel/nmi_watchdog
$sudo perf stat -e ref-cycles,ref-cycles -a sleep 1
  Performance counter stats for 'system wide':
     549,432,900      ref-cycles   (50.01%) 

     545,358,967      ref-cycles   (49.99%) 

  1.002367894 seconds time elapsed
$perf probe --add perf_rotate_context
$perf record -e probe:perf_rotate_context -aR -g sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.896 MB perf.data (1750 samples) ]
$perf report --stdio
    100.00%   100.00%  (ffffffffa5604e84)
             |
              --99.83%--0xffffffffa54000d4
                        start_secondary
                        cpu_startup_entry
                        do_idle
                        call_cpuidle
                        cpuidle_enter
                        cpuidle_enter_state
                        apic_timer_interrupt
                        smp_apic_timer_interrupt
                        hrtimer_interrupt
                        __hrtimer_run_queues
                        perf_mux_hrtimer_handler

Thanks,
Kan

> @@ -3074,15 +3075,15 @@ static void ctx_sched_out(struct perf_event_context *ctx,
>   
>   	is_active ^= ctx->is_active; /* changed bits */
>   
> -	if (!ctx->nr_active || !(is_active & EVENT_ALL))
> -		return;
> -
>   	/*
>   	 * If we had been multiplexing, no rotations are necessary, now no events
>   	 * are active.
>   	 */
>   	ctx->rotate_necessary = 0;
>   
> +	if (!ctx->nr_active || !(is_active & EVENT_ALL))
> +		return;
> +
>   	perf_pmu_disable(ctx->pmu);
>   	if (is_active & EVENT_PINNED) {
>   		list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
> 

  reply	other threads:[~2020-03-04  1:40 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-03 20:28 [PATCH] perf/core: Fix endless multiplex timer kan.liang
2020-03-03 21:08 ` Peter Zijlstra
2020-03-04  1:40   ` Liang, Kan [this message]
2020-03-04  9:33     ` Peter Zijlstra
2020-03-04 14:20       ` Liang, Kan
2020-03-05 12:38         ` Peter Zijlstra
2020-03-05 17:56           ` Liang, Kan
2020-03-20 12:58           ` [tip: perf/core] " tip-bot2 for Peter Zijlstra
2020-08-06 18:11             ` Robin Murphy
2020-08-06 18:53               ` Greg KH
2020-08-06 20:40                 ` Robin Murphy
2020-03-24  6:00 ` [perf/core] 92b1f046a2: BUG:kernel_NULL_pointer_dereference,address kernel test robot
2020-03-24 12:52   ` Liang, Kan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b71515e4-484e-d80a-37db-2e51abe69928@linux.intel.com \
    --to=kan.liang@linux.intel.com \
    --cc=ak@linux.intel.com \
    --cc=eranian@google.com \
    --cc=irogers@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox