linux-perf-users.vger.kernel.org archive mirror
* Remote Cache Accesses (1 Hop)
@ 2014-01-17 13:24 Manuel Selva
  2014-01-23 16:02 ` Manuel Selva
  0 siblings, 1 reply; 4+ messages in thread
From: Manuel Selva @ 2014-01-17 13:24 UTC (permalink / raw)
  To: linux-perf-users

Hi all,

I wrote a benchmarking program in order to play with perf_event_open's 
memory sampling capabilities (as discussed earlier on this list).

My benchmark allocates a large array of memory (120 MB by default). I 
then start sampling with perf_event_open and the 
MEM_INST_RETIRED_LATENCY_ABOVE_THRESHOLD event (with threshold = 3, the 
minimum according to Intel's documentation), along with memory access 
counting (uncore event QMC_NORMAL_READ.ANY), and I access all the 
allocated memory either sequentially or randomly. The benchmark is 
single-threaded, pinned to a given core, and the memory is allocated on 
the NUMA node associated with this core using the numa_alloc functions. 
perf_event_open is thus called to monitor only this core/NUMA node. The 
code is compiled without any optimization, and I tried as much as 
possible to maximize the ratio of memory-access instructions over 
branching and other code.
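
To make this concrete, here is a simplified sketch of the kind of attr 
setup I am using for the sampled event. RAW_LOAD_LATENCY_CONFIG is a 
placeholder for the raw encoding of the event on this CPU (I obtain it 
through libpfm4), and the field carrying the latency threshold (config1 
on my kernel) may differ depending on kernel/CPU:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>

/* perf_event_open has no glibc wrapper */
static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
{
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static int open_load_latency_event(int cpu)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;
        attr.config = RAW_LOAD_LATENCY_CONFIG; /* placeholder, CPU specific */
        attr.config1 = 3;                      /* latency threshold (cycles) */
        attr.sample_period = 1000;
        attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
                           PERF_SAMPLE_DATA_SRC;
        attr.precise_ip = 2;                   /* PEBS */
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        /* pid = -1, cpu = the core the benchmark is pinned to */
        return sys_perf_event_open(&attr, -1, cpu, -1, 0);
}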

I can clearly see from the QMC_NORMAL_READ.ANY event count that my 
random access test case generates far more memory accesses than the 
sequential one (I guess the prefetcher and the sequential access 
pattern are responsible for that).
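
As a side note, for the counting events I simply read the counter value 
around the access loop, roughly like this (count_event and run are just 
illustrative helpers; the counter is opened without any read_format 
flags, so a single u64 comes back):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <unistd.h>

static uint64_t count_event(int fd, void (*run)(void))
{
        uint64_t count = 0;

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        run();                          /* sequential or random access pass */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        read(fd, &count, sizeof(count));
        return count;
}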

Regarding the sampled event, I successfully mmap the fd returned by 
perf_event_open and I am able to read the samples. My problem is that 
even in the random access test case over the 120 MB, I don't get any 
samples served from RAM (I am looking at the PERF_SAMPLE_DATA_SRC 
field); the sampling period is 1000 events and I only get ~700 samples. 
Furthermore, in the sequential case 0.01% of my samples are remote 
cache accesses (1 hop), whereas this percentage is ~20% in the random 
case.
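
For reference, this is roughly how I classify the data source of each 
sample; as far as I can tell from the perf_event.h flags, the "remote 
cache (1 hop)" category corresponds to the PERF_MEM_LVL_REM_CCE1 bit 
(data_src_level is just an illustrative helper):

#include <linux/perf_event.h>

static const char *data_src_level(__u64 val)
{
        union perf_mem_data_src src;

        src.val = val;
        if (src.mem_lvl & PERF_MEM_LVL_L1)       return "L1";
        if (src.mem_lvl & PERF_MEM_LVL_LFB)      return "LFB";
        if (src.mem_lvl & PERF_MEM_LVL_L2)       return "L2";
        if (src.mem_lvl & PERF_MEM_LVL_L3)       return "L3";
        if (src.mem_lvl & PERF_MEM_LVL_LOC_RAM)  return "Local RAM";
        if (src.mem_lvl & PERF_MEM_LVL_REM_CCE1) return "Remote cache (1 hop)";
        if (src.mem_lvl & PERF_MEM_LVL_REM_CCE2) return "Remote cache (2 hops)";
        if (src.mem_lvl & PERF_MEM_LVL_REM_RAM1) return "Remote RAM (1 hop)";
        if (src.mem_lvl & PERF_MEM_LVL_REM_RAM2) return "Remote RAM (2 hops)";
        return "Unknown";
}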

So my questions are:

- What exactly is a remote cache (1 hop) access? Data found in another 
core's private cache (L1 or L2 in my case) on the same processor?

- How can I interpret my results? I was expecting the local memory 
access samples to increase in the random case, not the remote cache 
(1 hop) ones. How can I get remote cache accesses for malloc'ed data 
that is not shared and is only used by a thread pinned to a given core?

- I haven't looked at this in detail yet, but ~700 samples for 
accessing 120 MB while sampling every 1000 events seems low. I am going 
to check the generated assembly code and also count the total number of 
memory requests (a core event) and compare that to 700 samples * 1000 
events.

Thanks in advance for any suggestions you may have to help me 
understand what's happening here.

-- 
Manu


* Re: Remote Cache Accesses (1 Hop)
  2014-01-17 13:24 Remote Cache Accesses (1 Hop) Manuel Selva
@ 2014-01-23 16:02 ` Manuel Selva
  2014-01-29 13:47   ` Manuel Selva
  0 siblings, 1 reply; 4+ messages in thread
From: Manuel Selva @ 2014-01-23 16:02 UTC (permalink / raw)
  To: linux-perf-users

Hi all,

Today I followed up on my investigation of this subject.

Concerning the first question, about the remote cache samples, I am
still not able to understand why I get this kind of sample instead of
local RAM accesses. I modified my memory allocation function to use the
numa_alloc functions, to be sure the memory is physically allocated
where I want. When the memory is allocated on the same node as the
thread running my program, I still get remote cache accesses and no
RAM accesses. When it is on another node, I get remote RAM accesses as
expected.
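
For what it's worth, the allocation/pinning I describe boils down to
something like the following sketch (alloc_local is just an
illustrative helper, error handling omitted, link with -lnuma):

#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stddef.h>

static void *alloc_local(size_t size, int cpu)
{
        cpu_set_t set;
        int node;

        /* pin the calling thread to the chosen core */
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof(set), &set);

        /* allocate the array physically on that core's NUMA node */
        node = numa_node_of_cpu(cpu);
        return numa_alloc_onnode(size, node);
}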

Does anybody have an explanation of the exact meaning of these remote
cache events that could explain why I get them?

Regarding the number of samples, I first checked the number of
memory loads generated by the sampled function (I mean statically,
by looking at the code generated by gcc). I then compared this number
with the core event MEM_INST_RETIRED.LOADS, and they are very close
(measured = 15001804, expected = 15001804). The difference must come
from the loads generated by the start/stop ioctl calls and maybe other
things happening behind the scenes that I am missing; anyway, this is
consistent. With a sampling period of 1000 events, I only get 455
samples. For period = 5000 I get 91 samples, and for period = 10000,
45 samples.
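
Spelling out the arithmetic (naively assuming that every retired load
counts towards the period, which the latency threshold probably
invalidates, see question 1 below):

period = 1000:   expected ~ 15001804 / 1000  ~ 15000 samples, observed 455
period = 5000:   expected ~ 15001804 / 5000  ~  3000 samples, observed  91
period = 10000:  expected ~ 15001804 / 10000 ~  1500 samples, observed  45

In all three cases I observe roughly 33 times fewer samples than this
naive expectation.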

So my question is: what is the reason for this low number of samples?

1- Do some memory accesses have a latency below the 3-cycle threshold
and are thus not counted at all?

2- Is it related to the time taken by the PMI handler being too large?
My benchmarking code is only doing loads and no computation at all.

3- Any other idea?

Manu

PS: Because this list is called perf-users, maybe I should ask this
kind of question on another list. If that is the case, please let me
know where. Thanks.



* Re: Remote Cache Accesses (1 Hop)
  2014-01-23 16:02 ` Manuel Selva
@ 2014-01-29 13:47   ` Manuel Selva
  2014-02-03 18:29     ` Manuel Selva
  0 siblings, 1 reply; 4+ messages in thread
From: Manuel Selva @ 2014-01-29 13:47 UTC (permalink / raw)
  To: linux-perf-users

Hi all,

Can anyone help with this subject, namely the low number of samples and
the meaning of the remote cache samples?

Thanks,

Manu



* Re: Remote Cache Accesses (1 Hop)
  2014-01-29 13:47   ` Manuel Selva
@ 2014-02-03 18:29     ` Manuel Selva
  0 siblings, 0 replies; 4+ messages in thread
From: Manuel Selva @ 2014-02-03 18:29 UTC (permalink / raw)
  To: linux-perf-users; +Cc: Manuel Selva

Hi,

I just read about the Intel MESIF cache coherency protocol, and I 
learned that, for performance reasons, the protocol tries as much as 
possible to satisfy memory requests from remote caches before going to 
memory.

As a consequence, I am wondering whether the hardware may be able to 
identify that my second (remote) L3 cache is unused and thus use it for 
performance purposes, even though it is on another node than the memory 
I am using. This could explain why I am getting remote cache samples 
while sampling a single-threaded application that uses only local 
memory (allocated with numa_alloc_onnode()).

Is this hypothesis plausible?

----
Manu



Thread overview: 4+ messages
2014-01-17 13:24 Remote Cache Accesses (1 Hop) Manuel Selva
2014-01-23 16:02 ` Manuel Selva
2014-01-29 13:47   ` Manuel Selva
2014-02-03 18:29     ` Manuel Selva
