From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0978D33CD2; Wed, 7 Aug 2024 15:17:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.12 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1723043845; cv=none; b=J9CFwCsvWmMRiRi3yYeLQfvXlEEfnomyHCKjDM/9LQ86aLYPrn9rGtwbwHJTXUxNg9+uye5xmbTJ2pHpXuK31w0rumBkNDCnoHywIxTt1v5CSJSofnWMnoAeNl43bajscEREW4asL1fwvrf3KWFjVKEoWE8XM269uv9B0lghLsc= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1723043845; c=relaxed/simple; bh=uK3eMVCO9lr4xJ4ncEL+z7nxBH9o1xcjKd+B8geHLFo=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=mNReqJkvAP/o52/pfHQnhwPtnIHJBV92t+BPeyR2ybkzG+CUsWCrZ7sPU40bktzYj2M0lT6tBQ8/+P2kpO2Lmbp0GBhl/SmKy7tdDq5I9tooBZqgoEhHE5cXboPejQNMLqZFLez7BxKxb1EgaJQRTOTHV3ftcpVXTJ2qjOQI7ic= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=none smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=HBDf3MXC; arc=none smtp.client-ip=198.175.65.12 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="HBDf3MXC" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1723043843; x=1754579843; h=message-id:date:mime-version:subject:to:cc:references: from:in-reply-to:content-transfer-encoding; bh=uK3eMVCO9lr4xJ4ncEL+z7nxBH9o1xcjKd+B8geHLFo=; b=HBDf3MXCvplm0VrtfRdsSjBQgEQpTdx3nsRgGkQz8BFt4ue2MQmJoSrz QkuAnMJdjfCTGnamd8wMSD749GoRKLDbVdL2MAu1FTpfFFfG2elvU0hdO dHuTErSbojzKr0oNji7Jpyctj8z9qpvEj1MlmkoS5fMafYrAkoNRUMl7+ tvLpI/ihBrf7N7QyDuo/09JFCiOSD4RPE9K36nG4MjLjkOmjopSX+aRyM hs2JWfjXs/oguBUH6hB9SaDBc5m8zNYE72ELJfBf1B+SRmmW0R8QOvbVF ELI84B1S9eS2Z56QKqxQPHAuu0Z3iosGFFwmDhBpZL20kKekPaDZ7xP61 Q==; X-CSE-ConnectionGUID: J/h4fC8OTEOBWqFob7x/GQ== X-CSE-MsgGUID: 4EDE9KN9QX6I0owlU/xW4Q== X-IronPort-AV: E=McAfee;i="6700,10204,11157"; a="32511419" X-IronPort-AV: E=Sophos;i="6.09,270,1716274800"; d="scan'208";a="32511419" Received: from orviesa006.jf.intel.com ([10.64.159.146]) by orvoesa104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 07 Aug 2024 08:17:22 -0700 X-CSE-ConnectionGUID: KhtWx5ugTzSv9+F7VYJOkg== X-CSE-MsgGUID: yAtXTmGRQ7iuFp3qgDJfQg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.09,270,1716274800"; d="scan'208";a="57099236" Received: from linux.intel.com ([10.54.29.200]) by orviesa006.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 07 Aug 2024 08:17:21 -0700 Received: from [10.212.63.105] (kliang2-mobl1.ccr.corp.intel.com [10.212.63.105]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by linux.intel.com (Postfix) with ESMTPS id 465FC20CFED6; Wed, 7 Aug 2024 08:17:20 -0700 (PDT) Message-ID: Date: Wed, 7 Aug 2024 11:17:18 -0400 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH 4/5] perf: Add context time freeze To: Peter Zijlstra , mingo@kernel.org Cc: acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com, alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com, adrian.hunter@intel.com, linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org References: <20240807112924.448091402@infradead.org> <20240807115550.250637571@infradead.org> Content-Language: en-US From: "Liang, Kan" In-Reply-To: <20240807115550.250637571@infradead.org> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit On 2024-08-07 7:29 a.m., Peter Zijlstra wrote: > Many of the the context reschedule users are of the form: > > ctx_sched_out(.type = EVENT_TIME); > ... modify context > ctx_resched(); > > With the idea that the whole reschedule happens with a single > time-stamp, rather than with each ctx_sched_out() advancing time and > ctx_sched_in() re-starting time, creating a non-atomic experience. > > However, Kan noticed that since this completely stops time, it > actually looses a bit of time between the stop and start. Worse, now > that we can do partial (per PMU) reschedules, the PMUs that are not > scheduled out still observe the time glitch. > > Replace this with: > > ctx_time_freeze(); > ... modify context > ctx_resched(); > > With the assumption that this happens in a perf_ctx_lock() / > perf_ctx_unlock() pair. > > The new ctx_time_freeze() will update time and sets EVENT_FROZEN, and > ensures EVENT_TIME and EVENT_FROZEN remain set, this avoids > perf_event_time_now() from observing a time wobble from not seeing > EVENT_TIME for a little while. > > Additionally, this avoids loosing time between > ctx_sched_out(EVENT_TIME) and ctx_sched_in(), which would re-set the > timestamp. > > Reported-by: Kan Liang > Signed-off-by: Peter Zijlstra (Intel) > --- > kernel/events/core.c | 128 ++++++++++++++++++++++++++++++++++----------------- > 1 file changed, 86 insertions(+), 42 deletions(-) > > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -155,20 +155,55 @@ static int cpu_function_call(int cpu, re > return data.ret; > } > > +enum event_type_t { > + EVENT_FLEXIBLE = 0x01, > + EVENT_PINNED = 0x02, > + EVENT_TIME = 0x04, > + EVENT_FROZEN = 0x08, > + /* see ctx_resched() for details */ > + EVENT_CPU = 0x10, > + EVENT_CGROUP = 0x20, > + > + /* compound helpers */ > + EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED, > + EVENT_TIME_FROZEN = EVENT_TIME | EVENT_FROZEN, > +}; > + > +static inline void __perf_ctx_lock(struct perf_event_context *ctx) > +{ > + raw_spin_lock(&ctx->lock); > + WARN_ON_ONCE(ctx->is_active & EVENT_FROZEN); > +} > + > static void perf_ctx_lock(struct perf_cpu_context *cpuctx, > struct perf_event_context *ctx) > { > - raw_spin_lock(&cpuctx->ctx.lock); > + __perf_ctx_lock(&cpuctx->ctx); > if (ctx) > - raw_spin_lock(&ctx->lock); > + __perf_ctx_lock(ctx); > +} > + > +static inline void __perf_ctx_unlock(struct perf_event_context *ctx) > +{ > + /* > + * If ctx_sched_in() didn't again set any ALL flags, clean up > + * after ctx_sched_out() by clearing is_active. > + */ > + if (ctx->is_active & EVENT_FROZEN) { > + if (!(ctx->is_active & EVENT_ALL)) Nit: It may be better to add a macro/inline function to replace all the (ctx->is_active & EVENT_ALL) check? For example, +static inline bool perf_ctx_has_active_events(struct perf_event_context *ctx) +{ + return ctx->is_active & EVENT_ALL; +} ... + if (ctx->is_active & EVENT_FROZEN) { + if (!perf_ctx_has_active_events(ctx)) + ctx->is_active = 0; + else + ctx->is_active &= ~EVENT_FROZEN; It can tell very straightforwardly that we want to clear all flags if there is no active event. The EVENT_ALL may bring confusion. It actually means all events, not all event types. The developer may have to go to the define and figure out what exactly the EVENT_ALL includes. Thanks, Kan > + ctx->is_active = 0; > + else > + ctx->is_active &= ~EVENT_FROZEN; > + } > + raw_spin_unlock(&ctx->lock); > } > > static void perf_ctx_unlock(struct perf_cpu_context *cpuctx, > struct perf_event_context *ctx) > { > if (ctx) > - raw_spin_unlock(&ctx->lock); > - raw_spin_unlock(&cpuctx->ctx.lock); > + __perf_ctx_unlock(ctx); > + __perf_ctx_unlock(&cpuctx->ctx); > } > > #define TASK_TOMBSTONE ((void *)-1L) > @@ -370,16 +405,6 @@ static void event_function_local(struct > (PERF_SAMPLE_BRANCH_KERNEL |\ > PERF_SAMPLE_BRANCH_HV) > > -enum event_type_t { > - EVENT_FLEXIBLE = 0x1, > - EVENT_PINNED = 0x2, > - EVENT_TIME = 0x4, > - /* see ctx_resched() for details */ > - EVENT_CPU = 0x8, > - EVENT_CGROUP = 0x10, > - EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED, > -}; > - > /* > * perf_sched_events : >0 events exist > */ > @@ -2332,18 +2357,39 @@ group_sched_out(struct perf_event *group > } > > static inline void > -ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx) > +__ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx, bool final) > { > if (ctx->is_active & EVENT_TIME) { > + if (ctx->is_active & EVENT_FROZEN) > + return; > update_context_time(ctx); > - update_cgrp_time_from_cpuctx(cpuctx, false); > + update_cgrp_time_from_cpuctx(cpuctx, final); > } > } > > static inline void > +ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx) > +{ > + __ctx_time_update(cpuctx, ctx, false); > +} > + > +/* > + * To be used inside perf_ctx_lock() / perf_ctx_unlock(). Lasts until perf_ctx_unlock(). > + */ > +static inline void > +ctx_time_freeze(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx) > +{ > + ctx_time_update(cpuctx, ctx); > + if (ctx->is_active & EVENT_TIME) > + ctx->is_active |= EVENT_FROZEN; > +} > + > +static inline void > ctx_time_update_event(struct perf_event_context *ctx, struct perf_event *event) > { > if (ctx->is_active & EVENT_TIME) { > + if (ctx->is_active & EVENT_FROZEN) > + return; > update_context_time(ctx); > update_cgrp_time_from_event(event); > } > @@ -2822,7 +2868,7 @@ static int __perf_install_in_context(vo > #endif > > if (reprogram) { > - ctx_sched_out(ctx, NULL, EVENT_TIME); > + ctx_time_freeze(cpuctx, ctx); > add_event_to_ctx(event, ctx); > ctx_resched(cpuctx, task_ctx, event->pmu_ctx->pmu, > get_event_type(event)); > @@ -2968,8 +3014,7 @@ static void __perf_event_enable(struct p > event->state <= PERF_EVENT_STATE_ERROR) > return; > > - if (ctx->is_active) > - ctx_sched_out(ctx, NULL, EVENT_TIME); > + ctx_time_freeze(cpuctx, ctx); > > perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE); > perf_cgroup_event_enable(event, ctx); > @@ -2977,19 +3022,15 @@ static void __perf_event_enable(struct p > if (!ctx->is_active) > return; > > - if (!event_filter_match(event)) { > - ctx_sched_in(ctx, NULL, EVENT_TIME); > + if (!event_filter_match(event)) > return; > - } > > /* > * If the event is in a group and isn't the group leader, > * then don't put it on unless the group is on. > */ > - if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) { > - ctx_sched_in(ctx, NULL, EVENT_TIME); > + if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) > return; > - } > > task_ctx = cpuctx->task_ctx; > if (ctx->task) > @@ -3263,7 +3304,7 @@ static void __pmu_ctx_sched_out(struct p > struct perf_event *event, *tmp; > struct pmu *pmu = pmu_ctx->pmu; > > - if (ctx->task && !ctx->is_active) { > + if (ctx->task && !(ctx->is_active & EVENT_ALL)) { > struct perf_cpu_pmu_context *cpc; > > cpc = this_cpu_ptr(pmu->cpu_pmu_context); > @@ -3338,24 +3379,29 @@ ctx_sched_out(struct perf_event_context > * > * would only update time for the pinned events. > */ > - if (is_active & EVENT_TIME) { > - /* update (and stop) ctx time */ > - update_context_time(ctx); > - update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx); > + __ctx_time_update(cpuctx, ctx, ctx == &cpuctx->ctx); > + > + /* > + * CPU-release for the below ->is_active store, > + * see __load_acquire() in perf_event_time_now() > + */ > + barrier(); > + ctx->is_active &= ~event_type; > + > + if (!(ctx->is_active & EVENT_ALL)) { > /* > - * CPU-release for the below ->is_active store, > - * see __load_acquire() in perf_event_time_now() > + * For FROZEN, preserve TIME|FROZEN such that perf_event_time_now() > + * does not observe a hole. perf_ctx_unlock() will clean up. > */ > - barrier(); > + if (ctx->is_active & EVENT_FROZEN) > + ctx->is_active &= EVENT_TIME_FROZEN; > + else > + ctx->is_active = 0; > } > > - ctx->is_active &= ~event_type; > - if (!(ctx->is_active & EVENT_ALL)) > - ctx->is_active = 0; > - > if (ctx->task) { > WARN_ON_ONCE(cpuctx->task_ctx != ctx); > - if (!ctx->is_active) > + if (!(ctx->is_active & EVENT_ALL)) > cpuctx->task_ctx = NULL; > } > > @@ -3943,7 +3989,7 @@ ctx_sched_in(struct perf_event_context * > > ctx->is_active |= (event_type | EVENT_TIME); > if (ctx->task) { > - if (!is_active) > + if (!(is_active & EVENT_ALL)) > cpuctx->task_ctx = ctx; > else > WARN_ON_ONCE(cpuctx->task_ctx != ctx); > @@ -4424,7 +4470,7 @@ static void perf_event_enable_on_exec(st > > cpuctx = this_cpu_ptr(&perf_cpu_context); > perf_ctx_lock(cpuctx, ctx); > - ctx_sched_out(ctx, NULL, EVENT_TIME); > + ctx_time_freeze(cpuctx, ctx); > > list_for_each_entry(event, &ctx->event_list, event_entry) { > enabled |= event_enable_on_exec(event, ctx); > @@ -4437,8 +4483,6 @@ static void perf_event_enable_on_exec(st > if (enabled) { > clone_ctx = unclone_ctx(ctx); > ctx_resched(cpuctx, ctx, NULL, event_type); > - } else { > - ctx_sched_in(ctx, NULL, EVENT_TIME); > } > perf_ctx_unlock(cpuctx, ctx); > > > >