* [BUG] perf/x86/intel: HitM false-positives on Ice Lake / Tiger Lake (I think?)
@ 2024-02-19 13:00 Jann Horn
2024-02-19 23:20 ` Ian Rogers
0 siblings, 1 reply; 7+ messages in thread
From: Jann Horn @ 2024-02-19 13:00 UTC (permalink / raw)
To: Jiri Olsa, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Mark Rutland, Alexander Shishkin, Ian Rogers,
Adrian Hunter
Cc: Feng Tang, Andi Kleen, the arch/x86 maintainers, kernel list,
linux-perf-users, Liang, Kan
Hi!
From what I understand, "perf c2c" shows bogus HitM events on Ice Lake
(and newer) because Intel added some feature where *clean* cachelines
can get snoop-forwarded ("cross-core FWD"), and the PMU apparently
treats this mostly the same as snoop-forwarding of modified cache
lines (HitM)? On a Tiger Lake CPU, I can see addresses from the kernel
rodata section in "perf c2c report".
This is mentioned in the SDM, Volume 3B, section "20.9.7 Load Latency
Facility", table "Table 20-101. Data Source Encoding for Memory
Accesses (Ice Lake and Later Microarchitectures)", encoding 07H:
"XCORE FWD. This request was satisfied by a sibling core where either
a modified (cross-core HITM) or a non-modified (cross-core FWD)
cache-line copy was found."
I don't see anything about this in arch/x86/events/intel/ds.c - if I
understand correctly, the kernel's PEBS data source decoding assumes
that 0x07 means "L3 hit, snoop hitm" on these CPUs. I think this needs
to be adjusted somehow - and maybe it just isn't possible to actually
distinguish between HitM and cross-core FWD in PEBS events on these
CPUs (without big-hammer chicken bit trickery)? Maybe someone from
Intel can clarify?
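To make the ambiguity concrete, here is a minimal C sketch of how a consumer of the data source encoding could treat 07H if it followed the SDM wording quoted above. All names in it (DSE_XCORE_FWD, classify_dse, count_possible_hitm) are hypothetical and are not taken from arch/x86/events/intel/ds.c; the only fact it encodes is that 07H covers both the modified and the clean case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration only; not the kernel's definitions. */
#define DSE_XCORE_FWD 0x07u /* SDM Table 20-101, encoding 07H */

enum snoop_result {
	SNOOP_OTHER,       /* any encoding other than 07H */
	SNOOP_HITM_OR_FWD, /* 07H: a modified OR a clean line was forwarded */
};

/* Classify one data source encoding according to the SDM description. */
enum snoop_result classify_dse(uint8_t dse)
{
	return dse == DSE_XCORE_FWD ? SNOOP_HITM_OR_FWD : SNOOP_OTHER;
}

/*
 * Count samples that *may* be HitM. Because 07H does not separate the
 * two cases, this is only an upper bound on the real HitM count.
 */
size_t count_possible_hitm(const uint8_t *dse, size_t n)
{
	size_t count = 0;

	assert(dse != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (classify_dse(dse[i]) == SNOOP_HITM_OR_FWD)
			count++;
	return count;
}
```

Under this reading, a tool that reports the count as "HitM" overstates it by however many of those samples were clean forwards, which is what the rodata addresses suggest is happening.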
(The SDM describes that E-cores on the newer 12th Gen have more
precise PEBS encodings that distinguish between "L3 HITM" and "L3
HITF"; but I guess the P-cores there maybe still don't let you
distinguish HITM/HITF?)
I think https://perfmon-events.intel.com/tigerLake.html is also
outdated, or at least it uses ambiguous grammar: The
MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD event (EventSel=D2H UMask=04H) is
documented as "Counts retired load instructions where a cross-core
snoop hit in another cores caches on this socket, the data was
forwarded back to the requesting core as the data was modified
(SNOOP_HITM) or the L3 did not have the data(SNOOP_HIT_WITH_FWD)" -
from what I understand, a "cross-core FWD" should be a case where the
L3 does have the data, unless L3 has become non-inclusive on Ice Lake?
On a Tiger Lake CPU, I can see this event trigger for the
sys_call_table, which is located in the rodata region and probably
shouldn't be containing Modified cache lines:
# grep -A1 -w sys_call_table /proc/kallsyms
ffffffff82800280 D sys_call_table
ffffffff82801100 d vdso_mapping
# perf record -e mem_load_l3_hit_retired.xsnp_fwd:ppp --all-kernel -c 100 --data
^C[ perf record: Woken up 11 times to write data ]
[ perf record: Captured and wrote 22.851 MB perf.data (43176 samples) ]
# perf script -F event,ip,sym,addr | egrep --color 'ffffffff828002[89abcdef]'
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800280
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002d8
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800280
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800280
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff82800288
ffffffff82526275 do_syscall_64
mem_load_l3_hit_retired.xsnp_fwd:ppp: ffffffff828002b8
ffffffff82526275 do_syscall_64
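The range check implied by the kallsyms output can be sketched in C as below. The helper names (addr_in_symbol, table_slot) are hypothetical, the start of the next symbol is used only as an upper bound on the table's extent, and the 8-byte slot size is an assumption about pointer-sized entries on x86-64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr lies in [start, next), where next is the following symbol. */
bool addr_in_symbol(uint64_t addr, uint64_t start, uint64_t next)
{
	assert(start < next);
	return addr >= start && addr < next;
}

/* Index of the 8-byte slot that addr falls into (assumed entry size). */
uint64_t table_slot(uint64_t addr, uint64_t start)
{
	assert(addr >= start);
	return (addr - start) / sizeof(uint64_t);
}
```

With start = 0xffffffff82800280 and next = 0xffffffff82801100, every sampled address listed above passes addr_in_symbol(); for example, 0xffffffff828002d8 is 0x58 bytes into the table, so it falls into slot 11.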
(For what it's worth, there is a thread on LKML where "cross-core FWD"
got mentioned: <https://lore.kernel.org/lkml/b4aaf1ed-124d-1339-3e99-a120f6cc4d28@linux.intel.com/>)
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] perf/x86/intel: HitM false-positives on Ice Lake / Tiger Lake (I think?)
2024-02-19 13:00 [BUG] perf/x86/intel: HitM false-positives on Ice Lake / Tiger Lake (I think?) Jann Horn
@ 2024-02-19 23:20 ` Ian Rogers
2024-02-20 15:42 ` Arnaldo Carvalho de Melo
0 siblings, 1 reply; 7+ messages in thread
From: Ian Rogers @ 2024-02-19 23:20 UTC (permalink / raw)
To: Jann Horn
Cc: Jiri Olsa, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Mark Rutland, Alexander Shishkin, Adrian Hunter,
Feng Tang, Andi Kleen, the arch/x86 maintainers, kernel list,
linux-perf-users, Liang, Kan, Stephane Eranian, Taylor, Perry,
Alt, Samantha, Biggers, Caleb, Wang, Weilin
On Mon, Feb 19, 2024 at 5:01 AM Jann Horn <jannh@google.com> wrote:
> [...]
+others better qualified than me to respond.
Hi Jann,
I'm not overly familiar with the issue, but it appears a similar issue
has been reported for Broadwell Xeon here:
https://community.intel.com/t5/Software-Tuning-Performance/Broadwell-Xeon-perf-c2c-showing-remote-HITM-but-remote-socket-is/td-p/1172120
I'm not sure that thread will be particularly useful, but having the
Intel people better qualified than me to answer is probably the better
service of this email.
Thanks,
Ian
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] perf/x86/intel: HitM false-positives on Ice Lake / Tiger Lake (I think?)
2024-02-19 23:20 ` Ian Rogers
@ 2024-02-20 15:42 ` Arnaldo Carvalho de Melo
2024-02-22 20:05 ` Liang, Kan
0 siblings, 1 reply; 7+ messages in thread
From: Arnaldo Carvalho de Melo @ 2024-02-20 15:42 UTC (permalink / raw)
To: Ian Rogers
Cc: Jann Horn, Joe Mario, Jiri Olsa, Peter Zijlstra, Ingo Molnar,
Namhyung Kim, Mark Rutland, Alexander Shishkin, Adrian Hunter,
Feng Tang, Andi Kleen, the arch/x86 maintainers, kernel list,
linux-perf-users, Liang, Kan, Stephane Eranian, Taylor, Perry,
Alt, Samantha, Biggers, Caleb, Wang, Weilin
Just adding Joe Mario to the CC list.
On Mon, Feb 19, 2024 at 03:20:00PM -0800, Ian Rogers wrote:
> [...]
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] perf/x86/intel: HitM false-positives on Ice Lake / Tiger Lake (I think?)
2024-02-20 15:42 ` Arnaldo Carvalho de Melo
@ 2024-02-22 20:05 ` Liang, Kan
2024-02-22 20:07 ` Jann Horn
0 siblings, 1 reply; 7+ messages in thread
From: Liang, Kan @ 2024-02-22 20:05 UTC (permalink / raw)
To: Arnaldo Carvalho de Melo, Ian Rogers, Jann Horn
Cc: Joe Mario, Jiri Olsa, Peter Zijlstra, Ingo Molnar, Namhyung Kim,
Mark Rutland, Alexander Shishkin, Adrian Hunter, Feng Tang,
Andi Kleen, the arch/x86 maintainers, kernel list,
linux-perf-users, Stephane Eranian, Taylor, Perry, Alt, Samantha,
Biggers, Caleb, Wang, Weilin
Hi Jann,
Sorry for the late response.
On 2024-02-20 10:42 a.m., Arnaldo Carvalho de Melo wrote:
> Just adding Joe Mario to the CC list.
>
> On Mon, Feb 19, 2024 at 03:20:00PM -0800, Ian Rogers wrote:
>> On Mon, Feb 19, 2024 at 5:01 AM Jann Horn <jannh@google.com> wrote:
>>> [...]
>>> (The SDM describes that E-cores on the newer 12th Gen have more
>>> precise PEBS encodings that distinguish between "L3 HITM" and "L3
>>> HITF"; but I guess the P-cores there maybe still don't let you
>>> distinguish HITM/HITF?)
Right, there is no way to distinguish HITM/HITF on Tiger Lake.
I think what we can do is to add both HITM and HITF for the 0x07 to
match the SDM description.
How about the below patch (not tested yet)?
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index d49d661ec0a7..8c966b5b23cb 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -84,7 +84,7 @@ static u64 pebs_data_source[] = {
 	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, NONE), /* 0x04: L3 hit */
 	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, MISS), /* 0x05: L3 hit, snoop miss */
 	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HIT), /* 0x06: L3 hit, snoop hit */
-	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HITM), /* 0x07: L3 hit, snoop hitm */
+	OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HITM) | P(SNOOPX, FWD), /* 0x07: L3 hit, snoop hitm & fwd */
 	OP_LH | P(LVL, REM_CCE1) | REM | LEVEL(L3) | P(SNOOP, HIT), /* 0x08: L3 miss snoop hit */
 	OP_LH | P(LVL, REM_CCE1) | REM | LEVEL(L3) | P(SNOOP, HITM), /* 0x09: L3 miss snoop hitm*/
 	OP_LH | P(LVL, LOC_RAM) | LEVEL(RAM) | P(SNOOP, HIT), /* 0x0a: L3 miss, shared */
>>> I think https://perfmon-events.intel.com/tigerLake.html is also
>>> outdated, or at least it uses ambiguous grammar: The
>>> MEM_LOAD_L3_HIT_RETIRED.XSNP_FWD event (EventSel=D2H UMask=04H) is
>>> documented as "Counts retired load instructions where a cross-core
>>> snoop hit in another cores caches on this socket, the data was
>>> forwarded back to the requesting core as the data was modified
>>> (SNOOP_HITM) or the L3 did not have the data(SNOOP_HIT_WITH_FWD)" -
>>> from what I understand, a "cross-core FWD" should be a case where the
>>> L3 does have the data, unless L3 has become non-inclusive on Ice Lake?
For the event, the BriefDescription in the event list json file gives a
more accurate description.
"BriefDescription": "Snoop hit a modified(HITM) or clean line(HIT_W_FWD)
in another on-pkg core which forwarded the data back due to a retired
load instruction.",
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/pmu-events/arch/x86/tigerlake/cache.json#n286
Thanks,
Kan
>>> [...]
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [BUG] perf/x86/intel: HitM false-positives on Ice Lake / Tiger Lake (I think?)
2024-02-22 20:05 ` Liang, Kan
@ 2024-02-22 20:07 ` Jann Horn
2024-02-23 15:51 ` Liang, Kan
0 siblings, 1 reply; 7+ messages in thread
From: Jann Horn @ 2024-02-22 20:07 UTC (permalink / raw)
To: Liang, Kan
Cc: Arnaldo Carvalho de Melo, Ian Rogers, Joe Mario, Jiri Olsa,
Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Adrian Hunter, Feng Tang, Andi Kleen,
the arch/x86 maintainers, kernel list, linux-perf-users,
Stephane Eranian, Taylor, Perry, Alt, Samantha, Biggers, Caleb,
Wang, Weilin
On Thu, Feb 22, 2024 at 9:05 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> [...]
> Right, there is no way to distinguish HITM/HITF on Tiger Lake.
Aah, okay, thank you very much for the clarification!
> I think what we can do is to add both HITM and HITF for the 0x07 to
> match the SDM description.
>
> How about the below patch (not tested yet)?
> [...]
(I'm not familiar enough with the perf semantics to know how the event
encoding works, maybe someone else can have a look?)
> For the event, the BriefDescription in the event list json file gives a
> more accurate description.
> "BriefDescription": "Snoop hit a modified(HITM) or clean line(HIT_W_FWD)
> in another on-pkg core which forwarded the data back due to a retired
> load instruction.",
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/pmu-events/arch/x86/tigerlake/cache.json#n286
Ah, right, that's clearer.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] perf/x86/intel: HitM false-positives on Ice Lake / Tiger Lake (I think?)
2024-02-22 20:07 ` Jann Horn
@ 2024-02-23 15:51 ` Liang, Kan
2024-02-23 19:37 ` Jann Horn
0 siblings, 1 reply; 7+ messages in thread
From: Liang, Kan @ 2024-02-23 15:51 UTC (permalink / raw)
To: Jann Horn
Cc: Arnaldo Carvalho de Melo, Ian Rogers, Joe Mario, Jiri Olsa,
Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Adrian Hunter, Feng Tang, Andi Kleen,
the arch/x86 maintainers, kernel list, linux-perf-users,
Stephane Eranian, Taylor, Perry, Alt, Samantha, Biggers, Caleb,
Wang, Weilin
On 2024-02-22 3:07 p.m., Jann Horn wrote:
> On Thu, Feb 22, 2024 at 9:05 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
>> [...]
>> How about the below patch (not tested yet)?
>> [...]
>
> (I'm not familiar enough with the perf semantics to know how the event
> encoding works, maybe someone else can have a look?)
>
I can do the test to verify the settings and perf c2c. But I don't have
a benchmark. Could you please share your benchmark with me?
For example, the data you used in your example.
# perf record -e mem_load_l3_hit_retired.xsnp_fwd:ppp --all-kernel -c 100 --data
Thanks,
Kan
>> [...]
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [BUG] perf/x86/intel: HitM false-positives on Ice Lake / Tiger Lake (I think?)
2024-02-23 15:51 ` Liang, Kan
@ 2024-02-23 19:37 ` Jann Horn
0 siblings, 0 replies; 7+ messages in thread
From: Jann Horn @ 2024-02-23 19:37 UTC (permalink / raw)
To: Liang, Kan
Cc: Arnaldo Carvalho de Melo, Ian Rogers, Joe Mario, Jiri Olsa,
Peter Zijlstra, Ingo Molnar, Namhyung Kim, Mark Rutland,
Alexander Shishkin, Adrian Hunter, Feng Tang, Andi Kleen,
the arch/x86 maintainers, kernel list, linux-perf-users,
Stephane Eranian, Taylor, Perry, Alt, Samantha, Biggers, Caleb,
Wang, Weilin
On Fri, Feb 23, 2024 at 4:52 PM Liang, Kan <kan.liang@linux.intel.com> wrote:
> [...]
> I can do the test to verify the settings and perf c2c. But I don't have
> a benchmark. Could you please share your benchmark with me?
> For example, the data you used in your example.
> # perf record -e mem_load_l3_hit_retired.xsnp_fwd:ppp --all-kernel -c
> 100 --data
It seems to be happening at a low rate in the background when I'm just
clicking around on websites or such; but it seems like compiling the
kernel with "make -j8" (where 8 is the number of hyperthreads my Tiger
Lake laptop has) causes it to happen at a somewhat higher rate, a few
times per second.
Sorry, I don't really have a particularly good microbenchmark or such
that makes this happen at an abnormally high rate...
^ permalink raw reply [flat|nested] 7+ messages in thread
> >>>>>
> >>>>> I don't see anything about this in arch/x86/events/intel/ds.c - if I
> >>>>> understand correctly, the kernel's PEBS data source decoding assumes
> >>>>> that 0x07 means "L3 hit, snoop hitm" on these CPUs. I think this needs
> >>>>> to be adjusted somehow - and maybe it just isn't possible to actually
> >>>>> distinguish between HitM and cross-core FWD in PEBS events on these
> >>>>> CPUs (without big-hammer chicken bit trickery)? Maybe someone from
> >>>>> Intel can clarify?
> >>>>>
> >>>>> (The SDM describes that E-cores on the newer 12th Gen have more
> >>>>> precise PEBS encodings that distinguish between "L3 HITM" and "L3
> >>>>> HITF"; but I guess the P-cores there maybe still don't let you
> >>>>> distinguish HITM/HITF?)
> >>
> >> Right, there is no way to distinguish HITM/HITF on Tiger Lake.
> >
> > Aah, okay, thank you very much for the clarification!
> >
> >> I think what we can do is to add both HITM and HITF for the 0x07 to
> >> match the SDM description.
> >>
> >> How about the below patch (not tested yet)?
> >> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> >> index d49d661ec0a7..8c966b5b23cb 100644
> >> --- a/arch/x86/events/intel/ds.c
> >> +++ b/arch/x86/events/intel/ds.c
> >> @@ -84,7 +84,7 @@ static u64 pebs_data_source[] = {
> >>   OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, NONE), /* 0x04: L3 hit */
> >>   OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, MISS), /* 0x05: L3 hit,
> >> snoop miss */
> >>   OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HIT), /* 0x06: L3 hit,
> >> snoop hit */
> >> - OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HITM), /* 0x07: L3 hit,
> >> snoop hitm */
> >> + OP_LH | P(LVL, L3) | LEVEL(L3) | P(SNOOP, HITM) | P(SNOOPX, FWD), /*
> >> 0x07: L3 hit, snoop hitm & fwd */
> >>   OP_LH | P(LVL, REM_CCE1) | REM | LEVEL(L3) | P(SNOOP, HIT), /* 0x08:
> >> L3 miss snoop hit */
> >>   OP_LH | P(LVL, REM_CCE1) | REM | LEVEL(L3) | P(SNOOP, HITM), /* 0x09:
> >> L3 miss snoop hitm*/
> >>   OP_LH | P(LVL, LOC_RAM) | LEVEL(RAM) | P(SNOOP, HIT), /* 0x0a:
> >> L3 miss, shared */
> >
> > (I'm not familiar enough with the perf semantics to know how the event
> > encoding works, maybe someone else can have a look?)
> >
>
> I can do the test to verify the settings and perf c2c. But I don't have
> a benchmark. Could you please share your benchmark with me?
> For example, the data you used in your example.
> # perf record -e mem_load_l3_hit_retired.xsnp_fwd:ppp --all-kernel -c
> 100 --data

It seems to be happening at a low rate in the background when I'm just
clicking around on websites or such; but it seems like compiling the
kernel with "make -j8" (where 8 is the number of hyperthreads my Tiger
Lake laptop has) causes it to happen at a somewhat higher rate, a few
times per second. Sorry, I don't really have a particularly good
microbenchmark or such that makes this happen at an abnormally high
rate...

^ permalink raw reply [flat|nested] 7+ messages in thread
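To make the encoding change in the quoted patch concrete, here is a
minimal C sketch of how a consumer of the sample data source word might
tell an unambiguous HitM apart from the combined "HitM or clean forward"
value. It is not from the thread. The MEM_* constants are local copies
that are assumed to match the PERF_MEM_* values in
include/uapi/linux/perf_event.h (check your own headers), and
make_data_src(), classify_snoop() and enum xsnp_class are hypothetical
names. Treating "HITM plus SNOOPX FWD" as "cannot be distinguished" is
only one reading of the proposed, untested patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Local copies of the snoop fields of perf_mem_data_src.
 * ASSUMPTION: values mirror PERF_MEM_SNOOP_* / PERF_MEM_SNOOPX_* in
 * include/uapi/linux/perf_event.h; verify against your kernel headers.
 */
#define MEM_SNOOP_SHIFT   19      /* 5-bit mem_snoop field */
#define MEM_SNOOP_HIT     0x04ULL
#define MEM_SNOOP_HITM    0x10ULL
#define MEM_SNOOPX_SHIFT  38      /* 2-bit mem_snoopx field */
#define MEM_SNOOPX_FWD    0x01ULL

/* The two fields must not overlap, or the classification below is meaningless. */
static_assert(((0x1fULL << MEM_SNOOP_SHIFT) & (0x3ULL << MEM_SNOOPX_SHIFT)) == 0,
	      "snoop and snoopx fields overlap");

/* Hypothetical classification of a sample's snoop information. */
enum xsnp_class {
	XSNP_NONE,        /* neither HITM nor FWD reported */
	XSNP_HITM,        /* modified line forwarded (true sharing candidate) */
	XSNP_FWD_CLEAN,   /* clean line forwarded */
	XSNP_HITM_OR_FWD, /* hardware reported both: cannot tell which */
};

/* Build only the snoop-related bits of a data source word (illustrative). */
static inline uint64_t make_data_src(uint64_t snoop, uint64_t snoopx)
{
	return (snoop << MEM_SNOOP_SHIFT) | (snoopx << MEM_SNOOPX_SHIFT);
}

/* Decode the snoop bits and decide how much the sample really tells us. */
static inline enum xsnp_class classify_snoop(uint64_t data_src)
{
	bool hitm = (data_src >> MEM_SNOOP_SHIFT) & MEM_SNOOP_HITM;
	bool fwd = (data_src >> MEM_SNOOPX_SHIFT) & MEM_SNOOPX_FWD;

	if (hitm && fwd)
		return XSNP_HITM_OR_FWD;
	if (hitm)
		return XSNP_HITM;
	if (fwd)
		return XSNP_FWD_CLEAN;
	return XSNP_NONE;
}
```

With a decoding along these lines, a report tool could count
XSNP_HITM_OR_FWD samples separately instead of presenting them as
definite HitM events, which is the false-positive case described at the
start of this thread.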
end of thread, other threads:[~2024-02-23 19:38 UTC | newest]

Thread overview: 7+ messages
2024-02-19 13:00 [BUG] perf/x86/intel: HitM false-positives on Ice Lake / Tiger Lake (I think?) Jann Horn
2024-02-19 23:20 ` Ian Rogers
2024-02-20 15:42 ` Arnaldo Carvalho de Melo
2024-02-22 20:05 ` Liang, Kan
2024-02-22 20:07 ` Jann Horn
2024-02-23 15:51 ` Liang, Kan
2024-02-23 19:37 ` Jann Horn