Message-ID: <705dc2fe-7ab3-458a-9b5a-7ea0b30756b8@linux.intel.com>
Date: Fri, 16 Aug 2024 15:27:10 -0400
Subject: Re: [PATCH] perf/x86/intel: Restrict period on Haswell
To: Thomas Gleixner , Li Huafei , peterz@infradead.org, mingo@redhat.com
Cc: acme@kernel.org, namhyung@kernel.org, mark.rutland@arm.com,
    alexander.shishkin@linux.intel.com, jolsa@kernel.org, irogers@google.com,
    adrian.hunter@intel.com, bp@alien8.de, dave.hansen@linux.intel.com,
    x86@kernel.org, hpa@zytor.com, linux-perf-users@vger.kernel.org,
    linux-kernel@vger.kernel.org, Andi Kleen , Vince Weaver
References: <87sev7nom4.ffs@tglx> <87frr7nd28.ffs@tglx>
    <70657c5e-f771-456b-a5ac-3df590249288@linux.intel.com> <875xs2oh69.ffs@tglx>
    <166fdbdf-b24d-4267-b42d-f11348b87b1b@linux.intel.com> <87wmkhlk1l.ffs@tglx>
    <059d6217-10a5-4d2a-b639-90806c04a13b@linux.intel.com> <87plq9l5d2.ffs@tglx>
From: "Liang, Kan"
In-Reply-To: <87plq9l5d2.ffs@tglx>

Hi Thomas,

On 2024-08-15 7:43 p.m., Thomas Gleixner wrote:
> On Thu, Aug 15 2024 at 16:15, Kan Liang wrote:
>> On 2024-08-15 2:26 p.m., Thomas Gleixner wrote:
>>>> Implication: When this erratum occurs, the fixed-function performance
>>>> counter IA32_FIXED_CTR0 may over count by up to 32.
>>>
>>> Sure. That's only explaining half of the problem.
>>>
>>> As I demonstrated, in the non-contended case even with a count of 2 (I
>>> tried 1 too) the status bit is never set on the second check.
>>
>> Do you mean the below example? The status bit (32) of the fixed counter
>> 0 is always set.
>
> When HT is off or the threads are not running the handler concurrently,
> then there is zero looping. Once they start to fiddle concurrently, the
> looping starts and potentially never ends.
>
>> 65.147782: x86_perf_event_set_period: idx: 32 period: 1 left: 2
>> 65.147783: intel_pmu_handle_irq: loops: 001 status: 100000000
>>
>> 65.147784: x86_perf_event_set_period: idx: 32 period: 1 left: 2
>> 65.147784: intel_pmu_handle_irq: loops: 002 status: 100000000
>
> So in the non-concurrent (which includes !HT) case the status check
> after handling the event is always 0. This never gets into a loop, not
> even once.
>
>>>> It should explain why the issue is gone with the magic number 32 or
>>>> with Hyper-Threading disabled.
>>>
>>> It explains only half of it. If you use 32, then the counter is set to
>>> -32, so the overcount of 32 will still bring it to 0, which should set
>>> the status bit, no?
>>
>> I think it's up to 32, not always 32.
>> I don't have more details regarding the issue. The architect of HSW has
>> left. I'm asking around internally to find the original bug report of
>> the erratum. Hope there are more details in the report.
>
> See below.
>
>>>> I also found a related discussion from about 9 years ago.
>>>> https://lore.kernel.org/lkml/alpine.DEB.2.11.1505181343090.32481@vincent-weaver-1.umelst.maine.edu/
>>>> Vince tried the workaround, but it did not seem to work.
>>>
>>> Let me play with that. :)
>>
>> Appreciate it.
>
> I got it actually working. The initial sequence which "worked" is:
>
>  1) Clear bit 32 in IA32_PERF_GLOBAL_CTRL
>  2) Clear bit 0/1 in IA32_PERF_FIXED_CTR_CTRL
>  3) Zero the IA32_FIXED_CTR0 MSR
>  4) Set IA32_FIXED_CTR0 to (-left) & mask
>  5) Set bit 0/1 in IA32_PERF_FIXED_CTR_CTRL
>  6) Set bit 32 in IA32_PERF_GLOBAL_CTRL
>
> If I omit #3 it does not work. If I flip #5 and #6 it does not work.
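
(A side note on #4: "mask" there is x86_pmu.cntval_mask, as in your code
below. As a worked example of the arithmetic, assuming the 48-bit HSW
counters and the left == 2 case from the trace above:

	u64 mask = (1ULL << 48) - 1;	/* x86_pmu.cntval_mask on HSW */
	u64 val  = (u64)(-2LL) & mask;	/* == 0x0000fffffffffffe */
	/* The counter counts upwards from val, so it overflows and
	 * raises the PMI after exactly two increments. */

i.e. the counter is armed to overflow after exactly "left" events.)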
> So the initial "working" variant I had was hacking this sequence into
> x86_perf_event_set_period() (omitting the fixed counter 0 conditionals
> for readability):
>
>	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
>	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));
>
>	rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
>	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);
>
>	wrmsrl(hwc->event_base, 0);
>	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
>	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
>	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
>
> Now I thought that this needs to be in intel/core.c and implemented a
> proper quirk. And of course, being smart, I decided it's a brilliant
> idea to use the cached values instead of the rdmsrl()s:
>
>	cglbl = hybrid(cpuc->pmu, intel_ctrl) & ~cpuc->intel_ctrl_guest_mask;
>	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));
>
>	cfixd = cpuc->fixed_ctrl_val;
>	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);
>
>	wrmsrl(hwc->event_base, 0);
>	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
>	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
>	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
>
> Surprise, surprise, that does not work. So I went back and wanted to
> know which rdmsrl() is curing it:
>
>	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
>	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));
>
>	cfixd = cpuc->fixed_ctrl_val;
>	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);
>
>	wrmsrl(hwc->event_base, 0);
>	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
>	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
>	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
>
> This worked. Using the rdmsrl() only for MSR_ARCH_PERFMON_FIXED_CTR_CTRL
> did not.
>
> Now I got bold and boiled it down to:
>
>	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
>
>	wrmsrl(hwc->event_base, 0);
>	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
> and the whole thing still worked. *BLINK*
>
> Exactly zero loop entries in the trace when running 100 instances of
> that cve test case, which otherwise spams the trace with entries
> and ends up in the loop > 100 path within a split second.
>
> Removing the zeroing of the counter makes it come back, but reducing
> the whole nonsense to:
>
>	wrmsrl(hwc->event_base, 0);
>	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>

Thanks for all the testing. So the trick is to clear the counter before
writing to it. That sounds like the usual way to trigger some ucode
magic.

> makes the loop problem go away, but it "works" so well that the stupid
> frequency adjustment algorithm keeps the left == 1, i.e. count == 2,
> case around long enough to trigger 'hung task' messages ....
>
> Now I looked more at the dmesg output of all the experiments. In all
> "working" cases except one, running these 100 instances of cve...
> results in a varying cascade of
>
>	perf: interrupt took too long (2503 > 2500), lowering ...
>
> messages.
>
> The one case where this does not happen is when the limit is
> unconditionally set to 32. But when I apply this limit only for the
> fixed counter 0 it comes back.

Yes, it sounds like there is something wrong with < 32 as well.
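
For reference, that lower bound is exactly what the existing Nehalem
workaround already enforces unconditionally; from
arch/x86/events/intel/core.c (the same helper the diff below reuses):

	static void nhm_limit_period(struct perf_event *event, s64 *left)
	{
		/* Clamp the minimum period to 32 */
		*left = max(*left, 32LL);
	}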
> Now I looked at when these 'took too long' problems surface, aside from
> the obvious case of extensive looping. They are unrelated to the
> hyper-threading issue, as I can reproduce them with smt=off too.
>
> They always happen when a counter was armed with a count < 32 and two
> events expired in the same NMI. The test case uses fixed counter 0 and
> general counter 0 for the other event.
>
> So that erratum is a good hint, but that hardware has more issues than
> it admits.
>
> So I think we should just apply that limit patch with a proper
> changelog and also make it:
>
>	hsw_limit(...)
>	{
>		*left = max(*left, erratum_hsw11(event) ? 128 : 32);
>	}
>

HSW11 is also BDM11. It sounds like we need the tricks from both bdw
and nhm. How about this?

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index e8bd45556c30..42f557a128b9 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4664,6 +4664,12 @@ static void nhm_limit_period(struct perf_event *event, s64 *left)
 	*left = max(*left, 32LL);
 }
 
+static void hsw_limit_period(struct perf_event *event, s64 *left)
+{
+	nhm_limit_period(event, left);
+	bdw_limit_period(event, left);
+}
+
 static void glc_limit_period(struct perf_event *event, s64 *left)
 {
 	if (event->attr.precise_ip == 3)

Do you plan to post the "limit" patch for HSW? Or should I send the patch?

> or such.
>
> That limit won't cure the overcount issue from that HSM154 erratum, but
> *SHRUG*.

The code has been there without the fix for ~10 years, and I didn't find
any complaints. I guess it should be fine to leave it as is.

Thanks,
Kan
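
P.S. Since the diff above calls bdw_limit_period() without showing it:
from memory (please double-check against the tree), the existing
Broadwell workaround for BDM11/BDM55 clamps INST_RETIRED.ALL to a
period of at least 128 and clears the low 6 bits:

	static void bdw_limit_period(struct perf_event *event, s64 *left)
	{
		/* Only INST_RETIRED.ALL (event 0xc0, umask 0x01) is affected */
		if ((event->hw.config & INTEL_ARCH_EVENT_MASK) ==
				X86_CONFIG(.event=0xc0, .umask=0x01)) {
			if (*left < 128)
				*left = 128;
			/* BDM55: the low 6 bits of the period must be 0 */
			*left &= ~0x3fULL;
		}
	}

So hsw_limit_period() would give the HSW11/BDM11 event the 128 floor and
everything else the 32 floor. The remaining wiring would presumably just
be setting x86_pmu.limit_period = hsw_limit_period in the HSW branch of
intel_pmu_init().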