Message-ID: <1fcdbfa7-5d99-4337-a473-eb711f27b8a3@linux.intel.com>
Date: Tue, 20 May 2025 12:45:24 -0400
Subject: Re: [PATCH v3] perf pmu intel: Adjust cpumaks for sub-NUMA clusters on graniterapids
To: Namhyung Kim, Ian Rogers
Cc: Weilin Wang, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
 Mark Rutland, Alexander Shishkin, Jiri Olsa, Adrian Hunter,
 Ravi Bangoria, linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20250515181417.491401-1-irogers@google.com>
 <96c8fae8-b8f9-4094-b03a-9dba3ca234c2@linux.intel.com>
From: "Liang, Kan"

On 2025-05-19 10:00 p.m., Namhyung Kim wrote:
> On Sun, May 18, 2025 at 10:45:52AM -0700, Ian Rogers wrote:
>> On Sun, May 18, 2025 at 10:09 AM Namhyung Kim wrote:
>>>
>>> Hello,
>>>
>>> On Thu, May 15, 2025 at 03:35:41PM -0700, Ian Rogers wrote:
>>>> On Thu, May 15, 2025 at 2:01 PM Liang, Kan wrote:
>>>>>
>>>>> On 2025-05-15 2:14 p.m., Ian Rogers wrote:
>>>>>> On graniterapids the cache home agent (CHA) and memory controller
>>>>>> (IMC) PMUs all have their cpumask set to per-socket information. In
>>>>>> order for per-NUMA-node aggregation to work correctly, the PMUs'
>>>>>> cpumasks need to be set to CPUs for the relevant sub-NUMA grouping.
>>>>>>
>>>>>> For example, on a 2-socket graniterapids machine with sub-NUMA
>>>>>> clustering of 3, the cpumask of the uncore_cha and uncore_imc PMUs is
>>>>>> "0,120", leading to aggregation only on NUMA nodes 0 and 3:
>>>>>> ```
>>>>>> $ perf stat --per-node -e 'UNC_CHA_CLOCKTICKS,UNC_M_CLOCKTICKS' -a sleep 1
>>>>>>
>>>>>> Performance counter stats for 'system wide':
>>>>>>
>>>>>> N0    1    277,835,681,344    UNC_CHA_CLOCKTICKS
>>>>>> N0    1     19,242,894,228    UNC_M_CLOCKTICKS
>>>>>> N3    1    277,803,448,124    UNC_CHA_CLOCKTICKS
>>>>>> N3    1     19,240,741,498    UNC_M_CLOCKTICKS
>>>>>>
>>>>>> 1.002113847 seconds time elapsed
>>>>>> ```
>>>>>>
>>>>>> By updating the PMUs' cpumasks to "0,120", "40,160" and "80,200", the
>>>>>> correct 6 NUMA node aggregations are achieved:
>>>>>> ```
>>>>>> $ perf stat --per-node -e 'UNC_CHA_CLOCKTICKS,UNC_M_CLOCKTICKS' -a sleep 1
>>>>>>
>>>>>> Performance counter stats for 'system wide':
>>>>>>
>>>>>> N0    1    92,748,667,796    UNC_CHA_CLOCKTICKS
>>>>>> N0    0     6,424,021,142    UNC_M_CLOCKTICKS
>>>>>> N1    0    92,753,504,424    UNC_CHA_CLOCKTICKS
>>>>>> N1    1     6,424,308,338    UNC_M_CLOCKTICKS
>>>>>> N2    0    92,751,170,084    UNC_CHA_CLOCKTICKS
>>>>>> N2    0     6,424,227,402    UNC_M_CLOCKTICKS
>>>>>> N3    1    92,745,944,144    UNC_CHA_CLOCKTICKS
>>>>>> N3    0     6,423,752,086    UNC_M_CLOCKTICKS
>>>>>> N4    0    92,725,793,788    UNC_CHA_CLOCKTICKS
>>>>>> N4    1     6,422,393,266    UNC_M_CLOCKTICKS
>>>>>> N5    0    92,717,504,388    UNC_CHA_CLOCKTICKS
>>>>>> N5    0     6,421,842,618    UNC_M_CLOCKTICKS
>>>>>> ```
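[Editor's aside: the cpumask values in the commit message apparently pair the first CPU of corresponding sub-NUMA nodes on each socket (e.g. 0 and 120). A minimal sketch, assuming only the standard sysfs NUMA topology layout, that prints each node's CPU list so the mapping can be checked on any machine; it is illustration only, not code from the patch.]

```c
/* Illustration only: walk /sys/devices/system/node/node<N>/cpulist and
 * print each NUMA node's CPUs. With SNC enabled, the first CPU of each
 * node is where per-node cpumask entries like "0,120" appear to come from. */
#include <stdio.h>

int main(void)
{
	char path[64], line[4096];

	for (int node = 0; ; node++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/cpulist", node);
		FILE *f = fopen(path, "r");

		if (!f)
			break;	/* no more NUMA nodes */
		if (fgets(line, sizeof(line), f))
			printf("node%d: %s", node, line);
		fclose(f);
	}
	return 0;
}
```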
>>>>>
>>>>> Is the second column the number of units?
>>>>> If so, it's wrong.
>>>>>
>>>>> On my GNR with SNC-2, I observed a similar issue.
>>>>>
>>>>> $ sudo ./perf stat -e 'UNC_M_CLOCKTICKS' --per-node -a sleep 1
>>>>> Performance counter stats for 'system wide':
>>>>>
>>>>> N0    0    6,405,811,284    UNC_M_CLOCKTICKS
>>>>> N1    1    6,405,895,988    UNC_M_CLOCKTICKS
>>>>> N2    0    6,152,906,692    UNC_M_CLOCKTICKS
>>>>> N3    1    6,063,415,630    UNC_M_CLOCKTICKS
>>>>>
>>>>> Isn't it supposed to be 4?
>>>>
>>>> Agreed it is weird, but it is what has historically been displayed.
>>>> The number is the aggregation number:
>>>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/stat-display.c?h=perf-tools-next#n307
>>>> which comes from:
>>>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/stat-display.c?h=perf-tools-next#n135
>>>> which comes from:
>>>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/stat.c?h=perf-tools-next#n435
>>>> However, I think it is missing updates from:
>>>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/stat.c?h=perf-tools-next#n526
>>>> but there is a comment there saying "don't increase aggr.nr for
>>>> aliases", and all the uncore events are aliases. I don't understand
>>>> what the aggregation number is supposed to be; it is commented as
>>>> "number of entries (CPUs) aggregated":
>>>> https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/stat.h?h=perf-tools-next#n26
>>>> It would seem to make sense in the CHA case, with SNC3 and 42 evsels
>>>> per NUMA node, that the value should be 42. Maybe Namhyung (who did the
>>>> evsel__merge_aggr_counters clean-up) knows why it is this way, but in
>>>> my eyes it just seems like something that has been broken for a long
>>>> time.
>>>
>>> I think it's the number of aggregated entries (FDs?).
>>>
>>> For core events, it's the number of CPUs for the given aggregation, as it
>>> collects from all CPUs. But it'd be confusing with uncore events, which
>>> have a cpumask to collect data from only a few CPUs.
>>>
>>> On my laptop, --per-socket gives different numbers depending on the
>>> events/PMUs.
>>>
>>> For core events, it's 4.
>>>
>>> $ sudo ./perf stat -a --per-socket -e cycles sleep 1
>>>
>>> Performance counter stats for 'system wide':
>>>
>>> S0    4    225,297,257    cycles
>>>
>>> 1.002581995 seconds time elapsed
>>>
>>> While uncore events give 1.
>>>
>>> $ sudo ./perf stat -a --per-socket -e unc_mc0_rdcas_count_freerun sleep 1
>>>
>>> Performance counter stats for 'system wide':
>>>
>>> S0    1    23,665,510    unc_mc0_rdcas_count_freerun
>>>
>>> 1.002148012 seconds time elapsed
>>
>> I think we're agreeing. I wonder whether the intent of the aggregation
>> number is to make it possible to work out an average from the
>> aggregated count. For core PMUs you divide the count by the
>> aggregation number and get the average count per core (CPU?). If we're
>> getting an aggregated count of, say, uncore memory controller events,
>> then it would make sense to me to show the aggregated total with
>> the aggregation count being the number of memory controller PMUs, so we
>> can have an average per memory controller. This should line up with
>> using the number of file descriptors.
>
> Sounds right.
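[Editor's aside: a tiny sketch of the arithmetic being proposed, using the numbers quoted above. The struct is a simplified stand-in, not perf's real one: if the aggregation number consistently counted the entries (FDs) merged into a line, count / nr would give the per-unit average for core and uncore events alike. The problematic case, multiple merged PMU instances still reporting 1, is shown in the example that follows.]

```c
/* Sketch of the proposed meaning of the aggregation number: with nr
 * counting every entry merged into an output line, count / nr is the
 * per-CPU (core events) or per-PMU (uncore events) average.
 * Simplified stand-in types; numbers taken from the examples above. */
#include <stdio.h>
#include <stddef.h>

struct aggr_line {
	const char *label;
	double count;
	int nr;		/* entries (FDs) aggregated into this line */
};

int main(void)
{
	struct aggr_line lines[] = {
		{ "S0 cycles, 4 CPUs",             225297257.0, 4 },
		{ "S0 unc_mc0_rdcas..., 1 IMC PMU",  23665510.0, 1 },
	};

	for (size_t i = 0; i < sizeof(lines) / sizeof(lines[0]); i++)
		printf("%-32s total=%14.0f avg/unit=%14.2f\n",
		       lines[i].label, lines[i].count,
		       lines[i].count / lines[i].nr);
	return 0;
}
```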
>
>>
>> I think this isn't the current behavior; on perf v6.12:
>> ```
>> $ sudo perf stat --per-socket -e data_read -a sleep 1
>>
>> Performance counter stats for 'system wide':
>>
>> S0    1    2,484.96 MiB    data_read
>>
>> 1.001365319 seconds time elapsed
>>
>> $ sudo perf stat -A -e data_read -a sleep 1
>>
>> Performance counter stats for 'system wide':
>>
>> CPU0    1,336.48 MiB    data_read [uncore_imc_free_running_0]
>> CPU0    1,337.06 MiB    data_read [uncore_imc_free_running_1]
>>
>> 1.001049096 seconds time elapsed
>> ```
>> so the aggregation number shows 1, but 2 events were aggregated together.
>
> Ugh.. right. Merging uncore PMU instances can add more confusion. :(
>
>>
>> I think computing the aggregation number in the stat code is probably
>> wrong. The value should be constant for an evsel and aggr_cpu_id; it's
>> just that computing it for an aggr_cpu_id is a pain because it needs
>> topology and/or PMU information. The code is ripe for refactoring. I'd
>> prefer not to do that as part of this change, though, which addresses a
>> particular Intel Granite Rapids issue.
>
> That's ok. Just one more TODO item..

Sounds good to me as well.

For this patch, I've verified it with SNC-2. The rest looks good to me.

Tested-by: Kan Liang

Thanks,
Kan
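[Editor's aside: a sketch of the refactoring direction described above, under the assumption that the aggregation number for an (evsel, aggr_cpu_id) pair can be derived once from the evsel's own CPU map plus a topology callback, rather than accumulated while reading counts. All types and helpers are simplified stand-ins, not perf's actual API.]

```c
/* Sketch: derive a stable aggregation number per (evsel, aggr_id) from
 * the evsel's CPU map. Everything here is a simplified stand-in. */
#include <stdio.h>
#include <stddef.h>

struct aggr_cpu_id { int id; };

/* toy topology: SNC-2-style nodes of 40 CPUs each (0-39 -> node 0, ...) */
static struct aggr_cpu_id toy_get_node_id(int cpu)
{
	return (struct aggr_cpu_id){ .id = cpu / 40 };
}

/* constant for a given evsel CPU map and aggregate id: the number of
 * opened FDs whose counts feed that output line */
static int aggr_nr_for(const int *evsel_cpus, size_t n,
		       struct aggr_cpu_id target,
		       struct aggr_cpu_id (*get_id)(int cpu))
{
	int nr = 0;

	for (size_t i = 0; i < n; i++)
		if (get_id(evsel_cpus[i]).id == target.id)
			nr++;
	return nr;
}

int main(void)
{
	int uncore_cpus[] = { 0, 40, 80, 120 };	/* one CPU per node */

	for (int node = 0; node < 4; node++)
		printf("node%d: nr=%d\n", node,
		       aggr_nr_for(uncore_cpus, 4,
				   (struct aggr_cpu_id){ .id = node },
				   toy_get_node_id));
	return 0;
}
```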