From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <775d8f1d-437d-47a3-b4b2-da476e914cf5@linux.intel.com>
Date: Fri, 19 Jul 2024 12:35:35 -0400
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 3/6] perf pmu: Add support for event.cpus files in sysfs
To: Ian Rogers
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter, Bjorn Helgaas,
 Jonathan Corbet, James Clark, Ravi Bangoria, Dominique Martinet,
 linux-kernel@vger.kernel.org, linux-perf-users@vger.kernel.org,
 Dhananjay Ugwekar, ananth.narayan@amd.com, gautham.shenoy@amd.com,
 kprateek.nayak@amd.com, sandipan.das@amd.com
References: <20240718003025.1486232-1-irogers@google.com>
 <20240718003025.1486232-4-irogers@google.com>
 <92ceb8b5-240a-4715-98db-c73e8e9d3e50@linux.intel.com>
 <64030eab-6e95-494a-ab72-bc33792ef723@linux.intel.com>
From: "Liang, Kan"
Content-Type: text/plain; charset=UTF-8

On 2024-07-19 10:59 a.m., Ian Rogers wrote:
> On Fri, Jul 19, 2024 at 6:56 AM Liang, Kan wrote:
>>
>> On 2024-07-18 4:50 p.m., Ian Rogers wrote:
>>> On Thu, Jul 18, 2024 at 10:48 AM Liang, Kan wrote:
>>>>
>>>> On 2024-07-18 11:39 a.m., Ian Rogers wrote:
>>>>> On Thu, Jul 18, 2024 at 7:33 AM Liang, Kan wrote:
>>>>>>
>>>>>> On 2024-07-17 8:30 p.m., Ian Rogers wrote:
>>>>>>> If an event file exists in sysfs, check if an event.cpus file exists
>>>>>>> and read a perf_cpu_map from it if it does. This allows particular
>>>>>>> events to have a different set of CPUs compared to the PMU.
>>>>>>>
>>>>>>> One scenario where this could be useful is when a PMU is set up with a
>>>>>>> cpumask/events per SMT thread but some events record for all SMT
>>>>>>> threads. Programming an event on each SMT thread will cause
>>>>>>> unnecessary counters to be programmed and the aggregate value to be
>>>>>>> too large.
>>>>>>
>>>>>> If I understand the scenario correctly, I think the issue should have
>>>>>> been addressed since ICX, with the introduction of the "SMT-aware
>>>>>> events". Is there a specific event which still has the issue on newer
>>>>>> platforms?
>>>>>
>>>>> Nothing comes to mind, but it is there on popular machines like Skylake.
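As a side note, the tool-side parsing for such a file is small. Below is a
rough illustrative sketch (not the actual perf_cpu_map code, and the helper
name is invented) of reading a sysfs-style CPU range list such as "0-3,8",
the format an event.cpus file would presumably share with the existing
cpus/cpumask files:

```c
/*
 * Illustrative sketch only: parse a sysfs-style CPU range list such as
 * "0-3,8" into an array of CPU numbers. The real tool builds a
 * perf_cpu_map instead; this just shows the format.
 */
#include <stdlib.h>

/* Fill cpus[] with each CPU in the list; return the count parsed. */
static int parse_cpu_list(const char *s, int *cpus, int max)
{
	int n = 0;

	while (*s && n < max) {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s)
			break;	/* no digits, e.g. a trailing newline: stop */
		if (*end == '-')
			hi = strtol(end + 1, &end, 10);
		for (long c = lo; c <= hi && n < max; c++)
			cpus[n++] = (int)c;
		if (*end == ',')
			end++;
		s = end;
	}
	return n;
}
```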
>>>>>
>>>>>>>
>>>>>>> Another scenario where this could be useful is when a PMU has
>>>>>>> historically had a cpumask at the package level, but now newer per
>>>>>>> die, core or CPU information is available.
>>>>>>
>>>>>> The traditional way to support new counters with a different scope is to
>>>>>> add a new PMU.
>>>>>>
>>>>>>>
>>>>>>> Additional context for the motivation is in these patches and
>>>>>>> conversation:
>>>>>>> https://lore.kernel.org/lkml/20240711102436.4432-1-Dhananjay.Ugwekar@amd.com/
>>>>>>
>>>>>> I don't see the advantages of the new event.cpus way.
>>>>>> The aggregation should be the same.
>>>>>
>>>>> Agreed. My concern is that we may end up with a pattern of
>>>>> _per_pkg, _per_core, _per_cpu, _per_l3, etc. just
>>>>> for the sake of the cpumask.
>>>>
>>>> The cstate PMUs already do so, e.g., cstate_core, cstate_pkg.
>>>>
>>>> From another perspective, it discloses the scope information in a PMU
>>>> name. If a user only cares about the information of a package, they just
>>>> need to focus on everything under _pkg. Otherwise, they have to go
>>>> through all the events.
>>>>
>>>>>
>>>>>> The RAPL counters are free-running counters. So there is no difference
>>>>>> whether grouping or not grouping.
>>>>>
>>>>> Should the RAPL counters return true for perf_pmu__is_software? In the
>>>>> tool we currently return false and won't allow grouping these events
>>>>> with other hardware events. The intent in perf_pmu__is_software was to
>>>>> emulate the way the kernel allows/disallows groups - software context
>>>>> events can be in a hardware context but otherwise I don't believe the
>>>>> grouping is allowed.
>>>>
>>>> No, it's not allowed for the RAPL counters.
>>>>
>>>> If the motivation is to find another way to group counters with
>>>> different scope, it may not work.
>>>>
>>>> We once tried to mix the perf_invalid_context PMUs in a group. But it's
>>>> denied.
>>>> https://yhbt.net/lore/all/20150415172856.GA5029@twins.programming.kicks-ass.net/
>>>>
>>>> The event.cpus still faces the same issues.
>>>
>>> Why so? The events would share the same perf_event_context, so I thought
>>> the perf_event_open would succeed.
>>
>> I think it breaks what groups are as well. At least, you cannot
>> guarantee that two events can be co-scheduled on the same CPU. Even
>> worse, there could be no events on some CPU.
>> The first thing that pops up is the sample read feature. On CPU_A, the
>> event_A is the leader event, but on CPU_B, the event_B could be the
>> leader event if event_A's cpumask doesn't include the CPU_B.
>> There could be more similar features we have to specially handle.
>
> Thanks Kan. I'm not wondering about a case of 2 CPUs; say on CPU0 and
> solely its perf event context, I want to know its core power and
> package power as a group so I never record one without the other. That
> grouping wouldn't be possible with 2 PMUs.

For power, to be honest, I don't think it improves anything. It gives
users a false image that perf can group these counters. But the truth is
that perf cannot.

The power counters are all free-running counters. It's impossible to
co-schedule them (which requires a global mechanism to disable/enable
all counters, e.g., GLOBAL_CTRL for the core PMU). The kernel still has
to read the counters one by one while the counters keep running. There
are no differences with or without a group for the power events.

>
>>>
>>>>>
>>>>>> But it makes the kernel driver complex, since it has to maintain at
>>>>>> least two different cpumasks.
>>>>>
>>>>> Two drivers each maintaining a cpumask or 1 driver maintaining 2
>>>>> cpumasks, it seems like there is a chance for code reuse in both cases.
>>>>> I'm not seeing how it adds to complexity particularly.
>>>>
>>>> Yes, there are some cleanup opportunities for the cpumask/hotplug codes.
>>>>
>>>> We may even abstract some generic interfaces for pkg or core level PMUs.
>>>>
>>>> Eventually, the complexity/duplication should be able to be reduced.
>>>>
>>>>>
>>>>>> The tool becomes complex as well, since it has to take care of the
>>>>>> per-event CPU override case.
>>>>>
>>>>> I'm not sure I agree with this. What we need for perf_event_open is a
>>>>> perf_event_attr; we dress these up as evsels which also have the
>>>>> ability to be for >1 CPU or thread. Longer term I think evsels can
>>>>> also have >1 PMU, for the wildcard cases like uncore memory
>>>>> controllers - this would remove the need for resorting evsels except
>>>>> for topdown events which have thrown us a giant bundle of challenges.
>>>>> Anyway, an evsel is perf_event_attr information paired with CPU
>>>>> information, and it makes sense to me that the parser should do this
>>>>> pairing. What's harder in the evsel/evlist setup is trying to fix CPU
>>>>> maps up not in parsing, like with __perf_evlist__propagate_maps where
>>>>> the parsing is trying to leave crumbs around (like system_wide,
>>>>> has_user_cpus, is_pmu_core) so the map propagation works properly.
>>>>>
>>>>>> The json file must also be updated to add a
>>>>>> new field cpumask.
>>>>>
>>>>> Yeah, I don't like this as it means we end up putting CPU information
>>>>> into the json that isn't the same for every CPU variant of the same
>>>>> architecture model. Maybe we can have some kind of "scope" enum value
>>>>> in the json and then, when the scope differs from the PMU's (core scope
>>>>> vs the PMU's hyperthread scope), the tool can figure out the
>>>>> cpumask from the topology in sysfs. Maybe we should just always use
>>>>> the topology and get rid of cpumask files in sysfs, replacing them
>>>>> with "scope" files. Will Deacon pushed back on having ARM PMUs
>>>>> supporting hot plugging
>>>>> (https://lore.kernel.org/lkml/20240701142222.GA2691@willie-the-truck/)
>>>>> where the main thing the hot plugging handler needs to maintain is the
>>>>> cpumask.
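To make the "scope" idea concrete: a hypothetical tool-side helper (the name
and the in-memory topology representation are invented for illustration;
real code would read, e.g., topology/core_id from sysfs) could derive a
core-scope CPU list by keeping one representative CPU per core instead of
one per SMT thread:

```c
/*
 * Hypothetical sketch of tool-side "scope" handling: given each CPU's
 * core id, keep one representative CPU per core -- roughly the cpumask
 * a core-scope event would want instead of programming every SMT thread.
 */
static int scope_core_cpus(const int *core_id, int ncpus, int *out)
{
	int n = 0;

	for (int cpu = 0; cpu < ncpus; cpu++) {
		int seen = 0;

		/* linear scan is fine for a sketch; real code would use a set */
		for (int i = 0; i < n; i++)
			if (core_id[out[i]] == core_id[cpu])
				seen = 1;
		if (!seen)
			out[n++] = cpu;
	}
	return n;
}
```

The same shape would work for pkg or die scope by swapping in the
corresponding topology id.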
>>>> Not just the cpumask, but also migrate the context for some PMUs; see
>>>> perf_pmu_migrate_context().
>>>
>>> Will do, thanks.
>>>
>>>> It seems we really need a shared cpumask in the generic code, so the
>>>> drivers don't need to handle the hotplug everywhere. I will think about it.
>>>
>>> Thanks. There are other problems on ARM PMUs, like having no or empty
>>> cpumasks, which the tool takes to mean open the event on every online
>>> CPU (there is no cpus or cpumask file on core PMUs historically, so we
>>> adopt this behavior when the cpumask is empty or not present).
>>
>> The no cpus/cpumask file case and the empty cpus/cpumask file case
>> should be different.
>> No cpus/cpumask file means that the counters/events are available for
>> all the CPUs.
>> But if it's an empty cpus/cpumask file, it means that there are no
>> proper CPUs. It may happen on a hybrid machine when all e-core CPUs are
>> hot-removed. Since it's possible that the CPUs can be hot-added later,
>> the kernel driver doesn't unregister the cpu_atom PMU.
>>
>>> I've
>>> been thinking to expand `tools/perf/tests/pmu.c` with basic PMU sanity
>>> tests. Some ideas:
>>>
>>
>> Thanks.
>>
>>> 1) some kind of cpumask sanity check - we could open an event with the
>>> cpumask and see if it yields multiplexing, which would highlight the
>>> ARM no-cpumask bug
>>
>> The multiplexing is triggered when there are not enough counters. It
>> should not be related to the cpumask.
>
> Agreed. Here I'm thinking about the bugs I see in PMUs. One of them is
> to always succeed in opening siblings within a group and for the
> unavailable counters to just report "not counted". This breaks weak
> groups used by metrics.
>
>> For the PMU without a cpumask, I think the test case should try to open an
>> event on all CPUs and check if the open succeeds.
>> For the PMU with a cpumask, the test case should try to open an event on
>> the masked CPUs and check if the open succeeds.
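On the tool side, the missing-vs-empty distinction could look roughly like
this (a hypothetical helper with invented names, not perf's actual code):

```c
/*
 * Sketch of the distinction discussed above: a missing cpus/cpumask
 * file means "open on every online CPU", while a present-but-empty file
 * means "no usable CPUs right now" (e.g. all e-cores of a hybrid system
 * hot-removed, with the PMU left registered for later hot-add).
 */
#include <stdio.h>
#include <string.h>

enum cpumask_kind { MASK_ALL_CPUS, MASK_EMPTY, MASK_LISTED };

static enum cpumask_kind classify_cpumask(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");
	enum cpumask_kind kind = MASK_EMPTY;

	buf[0] = '\0';
	if (!f)
		return MASK_ALL_CPUS;	/* no file: event valid on every CPU */
	if (fgets(buf, (int)len, f)) {
		buf[strcspn(buf, "\n")] = '\0';	/* trim trailing newline */
		if (buf[0] != '\0')
			kind = MASK_LISTED;	/* caller parses the range list */
	}
	fclose(f);
	return kind;
}
```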
>
> I agree without a cpumask means all CPUs; the bug I see on ARM PMUs is
> that they have uncore PMUs with no or empty cpumasks, leading to the
> uncore events being programmed on every CPU and over counting,
> multiplexing counters and so on. I'm trying to devise tests so that
> they can see they are broken.
>
>> The behavior of opening an event on unmasked CPUs seems not defined.
>> For uncore, it's OK. The kernel treats the CPUs from the same socket
>> exactly the same. But I'm not sure about the other PMUs.
>
> My understanding had been that for core PMUs a "perf stat -C" option
> would choose the particular CPU to count the event on, and for an uncore
> PMU the -C option would override the cpumask's "default" value. We
> have code to validate this:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/evlist.c?h=perf-tools-next#n2522
> But it seems now that overriding an uncore PMU's default CPU is
> ignored.

For the uncore driver, no matter what -C sets, it writes the default CPU
back.
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/arch/x86/events/intel/uncore.c#n760

> If you did:
> $ perf stat -C 1 -e data_read -a sleep 0.1
> Then the tool thinks data_read is on CPU1 and will set its thread
> affinity to CPU1 to avoid IPIs. It seems to fix this we need to just
> throw away the -C option.

The perf tool can still read the counter from CPU1 with no IPIs because
of the PERF_EV_CAP_READ_ACTIVE_PKG flag. Not quite sure, but it seems
only the open and close may be impacted and silently changed to CPU0.

Thanks,
Kan

>
>>> 2) do the /sys/devices//events/event.(unit|scale|per-pkg|snapshot)
>>> files parse correctly and have a corresponding event.
>>> 3) keep adding opening events on the PMU to a group to make sure that
>>> when counters are exhausted the perf_event_open fails (I've seen this
>>> bug on AMD)
>>> 4) are the values in the type file unique
>>
>> The rest sounds good to me.
>
> Cool. Let me know if you can think of more.
>
> Thanks,
> Ian
>
>> Thanks,
>> Kan
>