From: Namhyung Kim
Date: Wed, 8 Nov 2023 20:48:50 -0800
Subject: Re: [RFC 00/48] perf tools: Introduce data type profiling (v1)
To: Joe Mario
Cc: Arnaldo Carvalho de Melo, Jiri Olsa, Peter Zijlstra, Ian Rogers,
	Adrian Hunter, Ingo Molnar, LKML, linux-perf-users@vger.kernel.org,
	Linus Torvalds, Stephane Eranian, Masami Hiramatsu,
	linux-toolchains@vger.kernel.org, linux-trace-devel@vger.kernel.org,
	Ben Woodard, Kees Cook, David Blaikie, Xu Liu, Kan Liang,
	Ravi Bangoria
In-Reply-To: <82cd8b7e-bd46-49ed-9160-eabcfd4c3c20@redhat.com>
References: <20231012035111.676789-1-namhyung@kernel.org>
	<82cd8b7e-bd46-49ed-9160-eabcfd4c3c20@redhat.com>

Hello,

On Wed, Nov 8, 2023 at 9:12 AM Joe Mario wrote:
>
> Hi Namhyung:
>
> I've been playing with your datatype profile patch and it looks really promising.
> I think it would be a big help if it could be integrated into perf c2c.

Great!  Yeah, I think we can collaborate on it.

>
> Perf c2c gives great insight into what's contributing to cpu cacheline contention, but it
> can be difficult to understand the output.  Having visuals with your datatype profile output
> would be a big help.

Exactly.

>
> I have a simple test program with readers and writers tugging on the data below:
>
>   uint64_t hotVar;
>   typedef struct __foo {
>           uint64_t m1;
>           uint64_t m2;
>           uint64_t m3;
>   } FOO;
>
> The rest of this reply looks at both your datatype output and c2c to see where they
> might complement each other.
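(For anyone who wants to reproduce something similar without tugtest: a
minimal pthread tug along these lines should do.  This is just a sketch
with arbitrary thread counts and none of tugtest's cpu-pinning options;
build with `gcc -g -O1 tug.c -lpthread` so the DWARF type info is there
for perf.)

  #include <stdint.h>
  #include <pthread.h>

  uint64_t hotVar;
  typedef struct __foo {
          uint64_t m1;
          uint64_t m2;
          uint64_t m3;
  } FOO;

  static FOO foo;                 /* can share a cacheline with hotVar */

  #define ITERS 2000000L          /* cf. tugtest's -H2000000 */

  static void *reader(void *arg)
  {
          volatile uint64_t *m1 = &foo.m1;
          uint64_t sum = 0;

          for (long i = 0; i < ITERS; i++)
                  sum += *m1;     /* repeated loads tug on the line */
          return (void *)(uintptr_t)sum;
  }

  static void *writer(void *arg)
  {
          volatile uint64_t *m1 = &foo.m1, *m3 = &foo.m3, *hot = &hotVar;

          for (long i = 0; i < ITERS; i++) {
                  (*m1)++;        /* stores invalidate the readers' copies */
                  (*m3)++;
                  (*hot)++;
          }
          return NULL;
  }

  int main(void)
  {
          pthread_t t[8];

          for (int i = 0; i < 8; i++)
                  pthread_create(&t[i], NULL, i < 4 ? reader : writer, NULL);
          for (int i = 0; i < 8; i++)
                  pthread_join(t[i], NULL);
          return 0;
  }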
>
>
> When I run perf with your patches on a simple program to cause contention on the above data, I get the following:
>
>   # perf mem record --ldlat=1 --all-user -- ./tugtest -r3 -r5 -r7 -r9 -r11 -w10 -w12 -w14 -w16 -b5 -H2000000
>   # perf report -s type,typeoff --hierarchy --stdio
>
>   # Samples: 26K of event 'cpu/mem-loads,ldlat=1/P'
>   # Event count (approx.): 2958226
>   #
>   #    Overhead  Data Type / Data Type Offset
>   # ...........  ............................
>   #
>       54.50%     int
>          54.50%     int +0 (no field)
>       23.21%     long int
>          23.21%     long int +0 (no field)
>       18.30%     struct __foo
>           9.57%     struct __foo +8 (m2)
>           8.73%     struct __foo +0 (m1)
>        3.86%     long unsigned int
>           3.86%     long unsigned int +0 (no field)
>
>
>   # Samples: 30K of event 'cpu/mem-stores/P'
>   # Event count (approx.): 33880197
>   #
>   #    Overhead  Data Type / Data Type Offset
>   # ...........  ............................
>   #
>       99.85%     struct __foo
>          70.48%     struct __foo +0 (m1)
>          29.34%     struct __foo +16 (m3)
>           0.03%     struct __foo +8 (m2)
>        0.09%     long unsigned int
>           0.09%     long unsigned int +0 (no field)
>        0.06%     (unknown)
>           0.06%     (unknown) +0 (no field)
>
>
> Then I run perf annotate with your patches, and I get the following:
>
>   # perf annotate --data-type
>
>   Annotate type: 'long int' in /home/joe/tugtest/tugtest (2901 samples):
>   ============================================================================
>    samples  offset  size  field
>       2901       0     8  long int  ;
>
>   Annotate type: 'struct __foo' in /home/joe/tugtest/tugtest (5593 samples):
>   ============================================================================
>    samples  offset  size  field
>       5593       0    24  struct __foo  {
>       2755       0     8      uint64_t  m1;
>       2838       8     8      uint64_t  m2;
>          0      16     8      uint64_t  m3;
>   };
>
> Now when I run that same simple test using perf c2c, and I focus on the cacheline that the struct and hotVar reside in, I get:
>
>   # perf c2c record --all-user -- ./tugtest -r3 -r5 -r7 -r9 -r11 -w10 -w12 -w14 -w16 -b5 -H2000000
>   # perf c2c report -NNN --stdio
>   #
>   #        ----- HITM -----  ------- Store Refs ------  ------ Data address ------              ---------- cycles ----------    Total  cpu                                       Shared
>   #   Num  RmtHitm  LclHitm   L1 Hit  L1 Miss      N/A        Offset  Node  PA cnt  Code address  rmt hitm  lcl hitm      load  records  cnt      Symbol   Object    Source:Line  Node{cpu list}
>   #  ....  .......  .......  .......  .......  .......  ............  ....  ......  ............  ........  ........  ........  .......  .....  ..........  .......  .............  ....
>   #
>   ---------------------------------------------------------------
>       0     1094     2008    17071    13762        0      0x406100
>   ---------------------------------------------------------------
>      0.00%    0.20%    0.00%    0.00%    0.00%           0x8     1       1      0x401355         0       978      1020     2962      4  [.] writer  tugtest  tugtest.c:129  0{10,12,14,16}
>      0.00%    0.00%    0.12%    0.02%    0.00%           0x8     1       1      0x401360         0         0         0       23      4  [.] writer  tugtest  tugtest.c:129  0{10,12,14,16}
>     68.10%   60.26%    0.00%    0.00%    0.00%          0x10     1       1      0x401505      2181      1541      1393     5813      5  [.] reader  tugtest  tugtest.c:163  1{3,5,7,9,11}
>     31.63%   39.34%    0.00%    0.00%    0.00%          0x10     1       1      0x401331      1242      1095       936     3393      4  [.] writer  tugtest  tugtest.c:127  0{10,12,14,16}
>      0.00%    0.00%   40.03%   40.25%    0.00%          0x10     1       1      0x40133c         0         0         0    12372      4  [.] writer  tugtest  tugtest.c:127  0{10,12,14,16}
>      0.27%    0.15%    0.00%    0.00%    0.00%          0x18     1       1      0x401343       834      1136      1032     2930      4  [.] writer  tugtest  tugtest.c:128  0{10,12,14,16}
>      0.00%    0.05%    0.00%    0.00%    0.00%          0x18     1       1      0x40150c         0       933      1567     5050      5  [.] reader  tugtest  tugtest.c:163  1{3,5,7,9,11}
>      0.00%    0.00%    0.06%    0.00%    0.00%          0x18     1       1      0x40134e         0         0         0       10      4  [.] writer  tugtest  tugtest.c:128  0{10,12,14,16}
>      0.00%    0.00%   59.80%   59.73%    0.00%          0x20     1       1      0x401516         0         0         0    18428      5  [.] reader  tugtest  tugtest.c:163  1{3,5,7,9,11}
>
> With the above c2c output, we can see:
>  - the hottest contended addresses, and the load latencies they caused.
>  - the cacheline offset for the contended addresses.
>  - the cpus and numa nodes where the accesses came from.
>  - the cacheline alignment for the data of interest.
>  - the number of cpus and threads concurrently accessing each address.
>  - the breakdown of reads causing HITM (contention) and writes hitting or missing the cacheline.
>  - the object name, source line and line number for where the accesses occurred.
>  - the numa node where the data is allocated.
>  - the number of physical pages the virtual addresses were mapped to (e.g. numa_balancing).
>
> What would really help make the c2c output more usable is a better visual for it.
> It's likely the current c2c output can be trimmed a bit.
>
> Here's one idea that incorporates your datatype info, though I'm sure there are better ways, as this may get unwieldy:
>
>   #        ----- HITM -----  ------- Store Refs ------  ------ Data address ------              ---------- cycles ----------    Total  cpu                                       Shared
>   #   Num  RmtHitm  LclHitm   L1 Hit  L1 Miss      N/A        Offset  Node  PA cnt  Code address  rmt hitm  lcl hitm      load  records  cnt      Symbol   Object    Source:Line  Node{cpu list}
>   #  ....  .......  .......  .......  .......  .......  ............  ....  ......  ............  ........  ........  ........  .......  .....  ..........  .......  .............  ....
>   #
>   ---------------------------------------------------------------
>       0     1094     2008    17071    13762        0      0x406100
>   ---------------------------------------------------------------
>   uint64_t hotVar:  tugtest.c:38
>      0.00%    0.20%    0.00%    0.00%    0.00%           0x8     1       1      0x401355         0       978      1020     2962      4  [.] writer  tugtest  tugtest.c:129  0{10,12,14,16}
>      0.00%    0.00%    0.12%    0.02%    0.00%           0x8     1       1      0x401360         0         0         0       23      4  [.] writer  tugtest  tugtest.c:129  0{10,12,14,16}
>   struct __foo uint64_t m1:  tugtest.c:39
>     68.10%   60.26%    0.00%    0.00%    0.00%          0x10     1       1      0x401505      2181      1541      1393     5813      5  [.] reader  tugtest  tugtest.c:163  1{3,5,7,9,11}
>     31.63%   39.34%    0.00%    0.00%    0.00%          0x10     1       1      0x401331      1242      1095       936     3393      4  [.] writer  tugtest  tugtest.c:127  0{10,12,14,16}
>      0.00%    0.00%   40.03%   40.25%    0.00%          0x10     1       1      0x40133c         0         0         0    12372      4  [.] writer  tugtest  tugtest.c:127  0{10,12,14,16}
>   struct __foo uint64_t m2:  tugtest.c:40
>      0.27%    0.15%    0.00%    0.00%    0.00%          0x18     1       1      0x401343       834      1136      1032     2930      4  [.] writer  tugtest  tugtest.c:128  0{10,12,14,16}
>      0.00%    0.05%    0.00%    0.00%    0.00%          0x18     1       1      0x40150c         0       933      1567     5050      5  [.] reader  tugtest  tugtest.c:163  1{3,5,7,9,11}
>      0.00%    0.00%    0.06%    0.00%    0.00%          0x18     1       1      0x40134e         0         0         0       10      4  [.] writer  tugtest  tugtest.c:128  0{10,12,14,16}
>   struct __foo uint64_t m3:  tugtest.c:41
>      0.00%    0.00%   59.80%   59.73%    0.00%          0x20     1       1      0x401516         0         0         0    18428      5  [.] reader  tugtest  tugtest.c:163  1{3,5,7,9,11}
>
> And then it would be good to find a clean way to incorporate your sample counts.

I'm not sure we can get the exact source line for the data type/fields.
Of course, we can aggregate the results for each field.
Actually you can use `perf report -s type,typeoff,symoff --hierarchy`
for something similar. :)
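E.g. with your same record command (untested off the top of my head,
just a sketch):

  # perf mem record --ldlat=1 --all-user -- ./tugtest -r3 -r5 -r7 -r9 -r11 -w10 -w12 -w14 -w16 -b5 -H2000000
  # perf report -s type,typeoff,symoff --hierarchy --stdio

It won't give the declaration line of each field, but it nests the code
locations (symbol+offset) under each type and field offset, which gets
close to your grouped view above.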
>
> On a related note, is there a way the accesses could be broken down into read counts
> and write counts?  That, with the above source line info for all the accesses,
> helps to convey a picture of "the affinity of the accesses".

Sure, perf report already supports showing events in a group together.
You can use the --group option to force grouping individual events.

perf annotate with --data-type doesn't have that yet.
I'll update it in v2.

>
> For example, while it's normally good to separate read-mostly data from hot
> written data, if the reads and writes are done together in the same block of
> code by the same thread, then keeping the two data symbols in the same cacheline
> could be a win.  I've seen this often.  Your datatype info might be able to
> make these affinities more visible to the user.
>
> Thanks for doing this.  This is great.
> Joe

Thanks for your feedback!
Namhyung
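P.S. On the affinity point, for concreteness, the two layouts look like
this (a sketch with hypothetical field names, not from tugtest):

  #include <stdint.h>

  /* Usual advice: keep hot-written data away from read-mostly data so
   * writers don't keep invalidating the readers' cachelines. */
  struct separated {
          uint64_t hot_counter __attribute__((aligned(64)));
          uint64_t mostly_read __attribute__((aligned(64)));
  };

  /* But when the same thread reads and writes both fields back to back,
   * co-locating them lets a single cacheline fill serve both accesses. */
  struct colocated {
          uint64_t hot_counter;
          uint64_t mostly_read;
  };

Which one wins depends on the access pattern, as you said, and that's
exactly what the data type profile should help make visible.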