From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.15]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 69D73221FB1; Tue, 3 Jun 2025 20:22:50 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.175.65.15 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748982172; cv=none; b=tntBeCXsIemwSl2RWvgyv99O12H2cb1jTIwouUXnOgvzFAg8gPRM6zYnFbS59/9+WxhmdFpMnqt5H7ireyfpUZ7u3KZ4Mh92+VP4T3JnMq7vxL3GSHzcgnlkAhgC7+oxDAuSngLwezaV+y6jXeTE5/+oszgoR/7Z0lZXvmgv0GU= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1748982172; c=relaxed/simple; bh=N89cRnHvv1Hm7SUg0rGql42+3NsIqu7kNkk+aRE7gPg=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=Me6aStX4XDMY3/zEIDyvzbelL+4AYJB2c4T0+f0ofKFISnHstV+2933csIq3fTlUr1yUeFDSCR/EYJWu3bg7XXBifHcZ6wDp/B/QTiU/X8xaQ9Pw3mdJ8dE2sCT6Eatw4MWLYH27E/N4Lga5g+fQZhX8sY8gZUs4g4Fq0NIvzbQ= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=none smtp.mailfrom=linux.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=JUShLuT6; arc=none smtp.client-ip=198.175.65.15 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="JUShLuT6" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1748982171; x=1780518171; h=message-id:date:mime-version:subject:to:cc:references: from:in-reply-to:content-transfer-encoding; bh=N89cRnHvv1Hm7SUg0rGql42+3NsIqu7kNkk+aRE7gPg=; b=JUShLuT6dhf7IdtZggoPIJqoF3mwoJ5BtQCHe4fX5PFI3qPm8D8ObTn6 8z4bJmj0N4NISTtqXRZvAuo2vE6k3dXr+cbMnJ3HhD8iREJwJUa7RVaF+ x+LYFNyPEmvZUBAxLP25TtvFSnrdipm+U/DGOXwTxh9RV2B9D4RdxXPg+ cTZoOoR0s8zKOg8yaumL3MNBUIiGu675NEA7+VA8Oy0n0iX3C381ijqlL VHhrgBC6Z0KwocFtg6ldx2AjDy3tL3ITe2DEZZLX2U9+Z8UlqYRvuABML rt0XMcGuzG3n3NezmkDI9+exl34NiNOJGsLHZbaJF2NmlmOOArGq2JHlZ g==; X-CSE-ConnectionGUID: xTqoz7+WTOO1CtDfXu0rgg== X-CSE-MsgGUID: di5ZClMFRRiX/bQ5m8iMVw== X-IronPort-AV: E=McAfee;i="6700,10204,11453"; a="54703808" X-IronPort-AV: E=Sophos;i="6.16,207,1744095600"; d="scan'208";a="54703808" Received: from orviesa002.jf.intel.com ([10.64.159.142]) by orvoesa107.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Jun 2025 13:22:50 -0700 X-CSE-ConnectionGUID: FWzugVgRQ9Kyypvn0HHYjA== X-CSE-MsgGUID: J9eBQXsDQy2R5Fxnn6KLVw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.16,207,1744095600"; d="scan'208";a="175908057" Received: from linux.intel.com ([10.54.29.200]) by orviesa002.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 03 Jun 2025 13:22:50 -0700 Received: from [10.246.136.52] (kliang2-mobl1.ccr.corp.intel.com [10.246.136.52]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by linux.intel.com (Postfix) with ESMTPS id 4CA4720B5736; Tue, 3 Jun 2025 13:22:48 -0700 (PDT) Message-ID: <50ed4910-381a-40cf-bdc5-e008a5b28efb@linux.intel.com> Date: Tue, 3 Jun 2025 16:22:47 -0400 Precedence: bulk X-Mailing-List: linux-perf-users@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH V2] perf: Fix the throttle error of some clock events To: Ian Rogers Cc: peterz@infradead.org, mingo@redhat.com, namhyung@kernel.org, mark.rutland@arm.com, linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org, eranian@google.com, ctshao@google.com, tmricht@linux.ibm.com, leo.yan@arm.com, Aishwarya TCV References: <20250528175832.2999139-1-kan.liang@linux.intel.com> Content-Language: en-US From: "Liang, Kan" In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On 2025-06-03 12:35 p.m., Ian Rogers wrote: > On Wed, May 28, 2025 at 10:59 AM wrote: >> >> From: Kan Liang >> >> The Arm CI reports RCU stall, which can be reproduced by the below perf >> command. >> perf record -a -e cpu-clock -- sleep 2 >> >> The cpu-clock and task_clock are two special SW events, which rely on >> the hrtimer. Instead of invoking the stop(), the HRTIMER_NORESTART is >> returned to stop the timer. Because the hrtimer interrupt handler cannot >> cancel itself, which causes infinite loop. > > I'm having a hard time understanding this. Currently in the code we > have two PMUs for cpu_clock and task_clock: > https://web.git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/kernel/events/core.c#n10729 > ``` > static struct pmu perf_cpu_clock; /* fwd declaration */ > static struct pmu perf_task_clock; > ``` > This is weird as both of them are programmed using type > PERF_TYPE_SOFTWARE (1) and there is a corresponding sysfs PMU at > `/sys/bus/event_source/devices/software/type` so wouldn't it make > sense for there to be 1 PMU? No, they don't use the type PERF_TYPE_SOFTWARE. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/events/core.c#n14729 They need different operators from the regular software events. I guess that's why they have a dedicated PMU. > Anyway, the stop functions are similar: > https://web.git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/tree/kernel/events/core.c#n11804 > ``` > static void cpu_clock_event_stop(struct perf_event *event, int flags) > { > perf_swevent_cancel_hrtimer(event); > cpu_clock_event_update(event); > } > ... > static void task_clock_event_stop(struct perf_event *event, int flags) > { > perf_swevent_cancel_hrtimer(event); > task_clock_event_update(event, event->ctx->time); > } > ``` > and I think you are talking about perf_swevent_cancel_hrtimer: > ``` > static void perf_swevent_cancel_hrtimer(struct perf_event *event) > { > struct hw_perf_event *hwc = &event->hw; > > if (is_sampling_event(event)) { > ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer); > local64_set(&hwc->period_left, ktime_to_ns(remaining)); > > hrtimer_cancel(&hwc->hrtimer); > } > } > ``` > but I'm not seeing where HRTIMER_NORESTART comes into this. That's an issue introduced by the generic throttle patch set, which unconditionally invoke the event_stop() when throttle is triggered. It's wrong to call the hrtimer_cancel() to stop the timer in the timer handler. Because the hrtimer_cancel() waits for the handler to finish, which is a deadlock. The correct way (or original way) is to set a HRTIMER_NORESTART. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/events/core.c#n11717 So we need a method to specially handle the cpu_clock and task_clock. > >> There may be two ways to fix it. >> - Add a check of MAX_INTERRUPTS in the event_stop. Return immediately if >> the stop is invoked by the throttle. >> - Introduce a PMU flag to track the case. Avoid the event_stop in >> perf_event_throttle() if the flag is detected. > > I'm not sure what the issue is with the hrtimer_cancel from the > explanation up to here. If we can't cancel the timer can we have the > timer callback perf_swevent_hrtimer drop the event? A callback is not required. Just ignore the perf_swevent_cancel_hrtimer() when it's invoked from perf_swevent_hrtimer(). It's similar to the first method (way) above. > > I can see how the patch is working around the problem but from an API > point-of-view having PERF_PMU_CAP_NO_THROTTLE_STOP seems to be adding > more possible complexity to PMUs. Possibly if the software event stop > worked better it would help other future users too. > The PERF_PMU_CAP_NO_THROTTLE_STOP should only be useful if there are some future special PMUs which doesn't want to stop by the generic code. To be honest, I have no idea if there will be such PMUs. If no such PMUs, we should not need the flag. > Sorry for my ignorance in this and I'm likely missing the reason this > is the best solution. I think improving the commit message would help > other future people and not just me.I will send a V3 to drop the flag and only apply the fix for the two clock events. We can have a comparison then. Thanks, Kan> > Thanks, > Ian > >> The latter looks more generic. It may be used if there are more other >> cases that want to avoid the stop later. The latter is implemented. >> >> Reported-by: Leo Yan >> Reported-by: Aishwarya TCV >> Closes: https://lore.kernel.org/lkml/20250527161656.GJ2566836@e132581.arm.com/ >> Tested-by: Leo Yan >> Signed-off-by: Kan Liang >> --- >> >> Changes since V1: >> - Rebase on top of the latest tip.git >> - Add Tested-by from Leo >> >> include/linux/perf_event.h | 1 + >> kernel/events/core.c | 23 ++++++++++++++++++++--- >> 2 files changed, 21 insertions(+), 3 deletions(-) >> >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h >> index 52dc7cfab0e0..97a747a97a50 100644 >> --- a/include/linux/perf_event.h >> +++ b/include/linux/perf_event.h >> @@ -305,6 +305,7 @@ struct perf_event_pmu_context; >> #define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0100 >> #define PERF_PMU_CAP_AUX_PAUSE 0x0200 >> #define PERF_PMU_CAP_AUX_PREFER_LARGE 0x0400 >> +#define PERF_PMU_CAP_NO_THROTTLE_STOP 0x0800 >> >> /** >> * pmu::scope >> diff --git a/kernel/events/core.c b/kernel/events/core.c >> index f34c99f8ce8f..abd19bb571e3 100644 >> --- a/kernel/events/core.c >> +++ b/kernel/events/core.c >> @@ -2656,7 +2656,22 @@ static void perf_event_unthrottle(struct perf_event *event, bool start) >> >> static void perf_event_throttle(struct perf_event *event) >> { >> - event->pmu->stop(event, 0); >> + /* >> + * Some PMUs, e.g., cpu-clock and task_clock, may rely on >> + * a special mechanism (hrtimer) to manipulate counters. >> + * The regular stop doesn't work, since the hrtimer interrupt >> + * handler cannot cancel itself. >> + * >> + * The stop should be avoided for such cases. Let the >> + * driver-specific code handle it. >> + * >> + * The counters will eventually be disabled in the driver-specific >> + * code. In unthrottle, they still need to be re-enabled. >> + * There is no handling for PERF_PMU_CAP_NO_THROTTLE_STOP in >> + * the perf_event_unthrottle(). >> + */ >> + if (!(event->pmu->capabilities & PERF_PMU_CAP_NO_THROTTLE_STOP)) >> + event->pmu->stop(event, 0); >> event->hw.interrupts = MAX_INTERRUPTS; >> if (event == event->group_leader) >> perf_log_throttle(event, 0); >> @@ -11848,7 +11863,8 @@ static int cpu_clock_event_init(struct perf_event *event) >> static struct pmu perf_cpu_clock = { >> .task_ctx_nr = perf_sw_context, >> >> - .capabilities = PERF_PMU_CAP_NO_NMI, >> + .capabilities = PERF_PMU_CAP_NO_NMI | >> + PERF_PMU_CAP_NO_THROTTLE_STOP, >> .dev = PMU_NULL_DEV, >> >> .event_init = cpu_clock_event_init, >> @@ -11930,7 +11946,8 @@ static int task_clock_event_init(struct perf_event *event) >> static struct pmu perf_task_clock = { >> .task_ctx_nr = perf_sw_context, >> >> - .capabilities = PERF_PMU_CAP_NO_NMI, >> + .capabilities = PERF_PMU_CAP_NO_NMI | >> + PERF_PMU_CAP_NO_THROTTLE_STOP, >> .dev = PMU_NULL_DEV, >> >> .event_init = task_clock_event_init, >> -- >> 2.38.1 >> >