* Some troubles with perf and measuring flops
From: Alen Stojanov @ 2014-03-06 0:55 UTC (permalink / raw)
To: linux-perf-users
[-- Attachment #1: Type: text/plain, Size: 1677 bytes --]
Dear Linux Perf Users Community,
I have noticed some inconsistencies with the perf tool. I would like to
determine whether I am doing something wrong, or whether there is a
problem in the perf tool itself. Here is the issue:
I would like to count the flops of a simple matrix-matrix multiplication
algorithm. The code is attached as mmmtest.c. To obtain flop counts, I
run the perf tool with raw counters. For matrices with sizes below
150x150, I obtain accurate results. Example (anticipated flops: 100 *
100 * 100 * 2 = 2'000'000):
perf stat -e r538010 ./mmmtest 100
Performance counter stats for './mmmtest 100':
2,078,775 r538010
0.003889544 seconds time elapsed
However, whenever I try to run matrices of bigger size, the reported
flops are not even close to the flops that I am supposed to obtain
(anticipated results: 600 * 600 * 600 * 2 = 432'000'000):
perf stat -e r538010 ./mmmtest 600
Performance counter stats for './mmmtest 600':
2,348,148,851 r538010
0.955511968 seconds time elapsed
To give you more info to replicate the problem, I provide you with the
following:
CPU: Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz, 8 cores
Linux Kernel: 3.11.0-12-generic
GCC Version: gcc version 4.8.1 (Ubuntu/Linaro 4.8.1-10ubuntu8)
Monitored events: FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE - Raw event:
0x538010 (converted using libpfm4)
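(For reference: the raw code is just the Sandy Bridge perfmon encoding of
this event, i.e. event select 0x10 and umask 0x80, with the leading 0x53
being the usr/os/int/enable flag bits. Assuming a perf that supports the
named-field PMU syntax, a sketch of an equivalent, more readable
invocation would be:

perf stat -e cpu/event=0x10,umask=0x80/ ./mmmtest 100

The encoding can also be cross-checked against the check_events example
program that ships with libpfm4.)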
I compiled mmmtest.c using: gcc -O3 -march=corei7-avx -o mmmtest
mmmtest.c. You can also find the generated assembly, mmmtest.s, in the
attachment.
Do you know why this happens? How can I instruct perf to obtain
accurate results?
Greetings,
Alen
[-- Attachment #2: mmmtest.c --]
[-- Type: text/plain, Size: 479 bytes --]
#include <stdlib.h>

int m, n, k;
double *A, *B, *C;

/* Naive triple-loop matrix-matrix multiplication, C += A * B.
 * The innermost statement performs one multiply and one add, so the
 * anticipated scalar double-precision flop count is 2 * m * n * k. */
void compute() {
    int i, j, h;
    for (i = 0; i < m; ++i) {
        for (j = 0; j < n; ++j) {
            for (h = 0; h < k; ++h) {
                C[i*n+j] += A[i*k+h] * B[h*n+j];
            }
        }
    }
}

int main(int argc, char **argv)
{
    m = atoi(argv[1]); n = m; k = m;
    A = (double *) malloc (m * k * sizeof(double));
    B = (double *) malloc (k * n * sizeof(double));
    /* Note: C is accumulated into without being zeroed first; for a
     * correctness check it should be allocated with calloc instead. */
    C = (double *) malloc (m * n * sizeof(double));
    compute ();
    free(A);
    free(B);
    free(C);
}
[-- Attachment #3: mmmtest.s --]
[-- Type: text/plain, Size: 2423 bytes --]
        .file   "mmmtest.c"
        .text
        .p2align 4,,15
        .globl  compute
        .type   compute, @function
compute:
.LFB14:
        .cfi_startproc
        pushq   %r15
        .cfi_def_cfa_offset 16
        .cfi_offset 15, -16
        pushq   %r14
        .cfi_def_cfa_offset 24
        .cfi_offset 14, -24
        pushq   %r13
        .cfi_def_cfa_offset 32
        .cfi_offset 13, -32
        pushq   %r12
        .cfi_def_cfa_offset 40
        .cfi_offset 12, -40
        movl    m(%rip), %r12d
        pushq   %rbp
        .cfi_def_cfa_offset 48
        .cfi_offset 6, -48
        pushq   %rbx
        .cfi_def_cfa_offset 56
        .cfi_offset 3, -56
        testl   %r12d, %r12d
        jle     .L9
        movl    n(%rip), %ebp
        xorl    %ebx, %ebx
        movl    k(%rip), %esi
        movq    B(%rip), %r15
        movq    A(%rip), %rdi
        movq    C(%rip), %r11
        leal    -1(%rbp), %eax
        movslq  %ebp, %r8
        leaq    8(,%rax,8), %r13
        movslq  %esi, %r14
        salq    $3, %r8
        salq    $3, %r14
.L3:
        testl   %ebp, %ebp
        jle     .L5
        leaq    0(%r13,%r11), %r10
        movq    %r15, %r9
        movq    %r11, %rcx
        .p2align 4,,10
        .p2align 3
.L8:
        testl   %esi, %esi
        jle     .L6
        vmovsd  (%rcx), %xmm0
        movq    %r9, %rdx
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
# innermost loop over h: one VEX-encoded scalar vmulsd and one vaddsd
# per iteration, i.e. two scalar double flops
.L7:
        vmovsd  (%rdi,%rax,8), %xmm1
        addq    $1, %rax
        vmulsd  (%rdx), %xmm1, %xmm1
        addq    %r8, %rdx
        cmpl    %eax, %esi
        vaddsd  %xmm1, %xmm0, %xmm0
        vmovsd  %xmm0, (%rcx)
        jg      .L7
.L6:
        addq    $8, %rcx
        addq    $8, %r9
        cmpq    %r10, %rcx
        jne     .L8
.L5:
        addl    $1, %ebx
        addq    %r14, %rdi
        addq    %r8, %r11
        cmpl    %r12d, %ebx
        jne     .L3
.L9:
        popq    %rbx
        .cfi_def_cfa_offset 48
        popq    %rbp
        .cfi_def_cfa_offset 40
        popq    %r12
        .cfi_def_cfa_offset 32
        popq    %r13
        .cfi_def_cfa_offset 24
        popq    %r14
        .cfi_def_cfa_offset 16
        popq    %r15
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
.LFE14:
        .size   compute, .-compute
        .section .text.startup,"ax",@progbits
        .p2align 4,,15
        .globl  main
        .type   main, @function
main:
.LFB15:
        .cfi_startproc
        pushq   %rbx
        .cfi_def_cfa_offset 16
        .cfi_offset 3, -16
        movl    $10, %edx
        movq    8(%rsi), %rdi
        xorl    %esi, %esi
        call    strtol
        movl    %eax, m(%rip)
        movl    %eax, n(%rip)
        movl    %eax, k(%rip)
        imull   %eax, %eax
        movslq  %eax, %rbx
        salq    $3, %rbx
        movq    %rbx, %rdi
        call    malloc
        movq    %rbx, %rdi
        movq    %rax, A(%rip)
        call    malloc
        movq    %rbx, %rdi
        movq    %rax, B(%rip)
        call    malloc
        movq    %rax, C(%rip)
        xorl    %eax, %eax
        call    compute
        movq    A(%rip), %rdi
        call    free
        movq    B(%rip), %rdi
        call    free
        movq    C(%rip), %rdi
        call    free
        popq    %rbx
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
.LFE15:
        .size   main, .-main
        .comm   C,8,8
        .comm   B,8,8
        .comm   A,8,8
        .comm   k,4,4
        .comm   n,4,4
        .comm   m,4,4
        .ident  "GCC: (Ubuntu/Linaro 4.8.1-10ubuntu8) 4.8.1"
        .section .note.GNU-stack,"",@progbits
* Re: Some troubles with perf and measuring flops
From: Vince Weaver @ 2014-03-06 1:40 UTC (permalink / raw)
To: Alen Stojanov; +Cc: linux-perf-users
On Thu, 6 Mar 2014, Alen Stojanov wrote:
> However, whenever I try to run matrices of bigger size, the reported flops are
> not even close to the flops that I am supposed to obtain (anticipated results:
> 600 * 600 * 600 * 2 = 432'000'000):
>
> perf stat -e r538010 ./mmmtest 600
>
> Performance counter stats for './mmmtest 600':
>
> 2,348,148,851 r538010
>
> 0.955511968 seconds time elapsed
>
...
> CPU: Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz, 8 cores
> Linux Kernel: 3.11.0-12-generic
> GCC Version: gcc version 4.8.1 (Ubuntu/Linaro 4.8.1-10ubuntu8)
> Monitored events: FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE - Raw event: 0x538010
> (converted using libpfm4)
...
> Do you know why this happens? How can I instruct perf to obtain accurate
> results?
One thing you might want to do is put :u on your event name, so you are
only measuring user-space accesses, not kernel ones too.
Floating point events are notoriously unreliable on modern Intel
processors.
The event might also be counting speculative events or uops, and it gets
more complicated with AVX in the mix. What does the Intel documentation
say about the event on your architecture?
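(One way to check, assuming the libpfm4 example programs are built: its
showevtinfo tool dumps the per-architecture event and umask
descriptions, e.g.

./examples/showevtinfo | grep -B2 -A12 FP_COMP_OPS_EXE

The exact output format may vary between libpfm4 versions.)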
Vince
* Re: Some troubles with perf and measuring flops
From: Alen Stojanov @ 2014-03-06 1:53 UTC (permalink / raw)
To: Vince Weaver; +Cc: linux-perf-users
On 06/03/14 02:40, Vince Weaver wrote:
> On Thu, 6 Mar 2014, Alen Stojanov wrote:
>
>> However, whenever I try to run matrices of bigger size, the reported flops are
>> not even close to the flops that I am supposed to obtain (anticipated results:
>> 600 * 600 * 600 * 2 = 432'000'000):
>>
>> perf stat -e r538010 ./mmmtest 600
>>
>> Performance counter stats for './mmmtest 600':
>>
>> 2,348,148,851 r538010
>>
>> 0.955511968 seconds time elapsed
>>
> ...
>> CPU: Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz, 8 cores
>> Linux Kernel: 3.11.0-12-generic
>> GCC Version: gcc version 4.8.1 (Ubuntu/Linaro 4.8.1-10ubuntu8)
>> Monitored events: FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE - Raw event: 0x538010
>> (converted using libpfm4)
> ...
>> Do you know why this happens? How can I instruct perf to obtain accurate
>> results?
> One thing you might want to do is put :u on your event name, so you are
> only measuring user-space accesses, not kernel ones too.
Well, even if perf is measuring kernel events, I really doubt that the
kernel is doing any double-precision floating-point operations.
Nevertheless, I tried the :u option, and it does not change anything:
perf stat -e r538010:u ./mmmtest 100
Performance counter stats for './mmmtest 100':
2,079,002 r538010:u
0.003887873 seconds time elapsed
perf stat -e r538010:u ./mmmtest 600
Performance counter stats for './mmmtest 600':
2,349,426,507 r538010:u
0.956538237 seconds time elapsed
>
> Floating point events are notoriously unreliable on modern Intel
> processors.
>
> The event might also be counting speculative events or uops, and it gets
> more complicated with AVX in the mix. What does the Intel documentation
> say about the event on your architecture?
I agree on this. However, if you look at the .s file, you can see that
it does not contain any AVX instructions. And if I monitor any other
event on the CPU that counts flop operations, I get 0s. It seems that
FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE is the only one that fires. I don't
think that FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE counts speculative events.
>
> Vince
* Re: Some troubles with perf and measuring flops
From: Vince Weaver @ 2014-03-06 18:25 UTC (permalink / raw)
To: Alen Stojanov; +Cc: linux-perf-users
On Thu, 6 Mar 2014, Alen Stojanov wrote:
> > more complicated with AVX in the mix. What does the Intel documentation
> > say about the event on your architecture?
> I agree on this. However, if you look at the .s file, you can see that
> it does not contain any AVX instructions.
I'm pretty sure vmovsd and vmulsd are AVX instructions.
> And if I monitor any other
> event on the CPU that counts flop operations, I get 0s. It seems that
> FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE is the only one that fires. I don't
> think that FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE counts speculative events.
are you sure?
See http://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops
about FP events on SNB and IVB at least.
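(The overcounting documented there appears to correlate with cache
misses: FP uops that stall waiting on memory can be re-issued and
counted more than once. As a sketch, one way to test that hypothesis on
this workload is to measure both events together:

perf stat -e r538010:u,cache-misses:u ./mmmtest 600

If the inflation grows with the miss count as the matrices outgrow the
cache, re-issued uops are the likely culprit.)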
Vince
* Re: Some troubles with perf and measuring flops
From: Alen Stojanov @ 2014-03-06 19:41 UTC (permalink / raw)
To: Vince Weaver; +Cc: linux-perf-users
On 06/03/14 19:25, Vince Weaver wrote:
> On Thu, 6 Mar 2014, Alen Stojanov wrote:
>
>>> more complicated with AVX in the mix. What does the Intel documentation
>>> say about the event on your architecture?
>> I agree on this. However, if you look at the .s file, you can see that
>> it does not contain any AVX instructions.
> I'm pretty sure vmovsd and vmulsd are AVX instructions.
Yes, you are absolutely right, that was a wrong statement on my part.
What I really meant was that there are no AVX instructions on packed
doubles, since vmovsd and vmulsd operate on scalar doubles. This is also
why I get zeros whenever I do:
perf stat -e r530211 ./mmmtest 600
Performance counter stats for './mmmtest 600':
0 r530211
0.952037328 seconds time elapsed
What I really wanted to show is that I don't have to combine several
counters to obtain results, as FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE should
be the only flop event that this code triggers.
>> And if I monitor any other
>> event on the CPU that counts flop operations, I get 0s. It seems that
>> FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE is the only one that fires. I don't
>> think that FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE counts speculative events.
> are you sure?
>
> See http://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops
> about FP events on SNB and IVB at least.
Thank you for the link. I only assumed that we do not have speculative
events because, in a previous project done in my research group, we were
able to get accurate flops using Intel PCM:
https://github.com/GeorgOfenbeck/perfplot/ (we were able to get correct
flops for an MMM of size 1600x1600x1600).
Nevertheless, as far as I understand, the PAPI page discusses count
deviations whenever several counters are combined. In the use case that
I sent you before, I always use one single raw counter to obtain counts.
Yet the deviations I obtain grow as the matrix size grows. I made a list
to show how much the flops deviate.
List format:
(mmm size) (anticipated_flops) (obtained_flops)
(anticipated_flops / obtained_flops * 100.0, i.e. accuracy in %)
10 2000 2061 97.040
20 16000 16692 95.854
30 54000 58097 92.948
40 128000 132457 96.635
50 250000 257482 97.094
60 432000 452624 95.443
70 686000 730299 93.934
80 1024000 1098453 93.222
90 1458000 1573331 92.670
100 2000000 2138014 93.545
110 2662000 2852239 93.330
120 3456000 3626028 95.311
130 4394000 4783638 91.855
140 5488000 5979236 91.784
150 6750000 7349358 91.845
160 8192000 11324521 72.339
170 9826000 11000354 89.324
180 11664000 13191288 88.422
190 13718000 16492253 83.178
200 16000000 20253599 78.998
210 18522000 23839202 77.696
220 21296000 27832906 76.514
230 24334000 32056213 75.910
240 27648000 40026709 69.074
250 31250000 41837527 74.694
260 35152000 47291908 74.330
270 39366000 53534225 73.534
280 43904000 60193718 72.938
290 48778000 67230702 72.553
300 54000000 74451165 72.531
310 59582000 82773965 71.982
320 65536000 129974914 50.422
330 71874000 99894238 71.950
340 78608000 108421806 72.502
350 85750000 118870753 72.137
360 93312000 129058036 72.302
370 101306000 141901053 71.392
380 109744000 152138340 72.134
390 118638000 170393279 69.626
400 128000000 225637046 56.728
410 137842000 208174503 66.215
420 148176000 205434911 72.128
430 159014000 231594232 68.661
440 170368000 235422186 72.367
450 182250000 280728129 64.920
460 194672000 282586911 68.889
470 207646000 310944304 66.779
480 221184000 409532779 54.009
490 235298000 381057200 61.749
500 250000000 413099959 60.518
510 265302000 393498007 67.421
520 281216000 675607105 41.624
530 297754000 988906780 30.109
540 314928000 1228529787 25.635
550 332750000 1396858866 23.821
560 351232000 2144144283 16.381
570 370386000 2712975462 13.652
580 390224000 3308411489 11.795
590 410758000 2326514544 17.656
And I can't see a pattern from which to derive any conclusion that makes
sense.
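(For completeness, a sketch of a sweep script that would reproduce a
list like the one above, assuming perf's CSV output mode -x, where the
event count is the first field:

#!/bin/sh
for n in $(seq 10 10 590); do
    measured=$(perf stat -x, -e r538010:u ./mmmtest $n 2>&1 | cut -d, -f1)
    echo "$n $((2 * n * n * n)) $measured"
done

The percentage column then follows from the second and third fields.)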
>
> Vince
Alen
* Re: Some troubles with perf and measuring flops
From: Alen Stojanov @ 2014-03-11 23:53 UTC (permalink / raw)
To: Vince Weaver; +Cc: linux-perf-users
So, just to summarize (since I did not get any reply): is the final
conclusion that I cannot simply obtain proper flop counts with Linux
perf because of hardware limitations?
On 06/03/14 20:41, Alen Stojanov wrote:
> [full quote of the previous message, including the deviation table, snipped]
* Re: Some troubles with perf and measuring flops
From: Vince Weaver @ 2014-03-13 20:17 UTC (permalink / raw)
To: Alen Stojanov; +Cc: linux-perf-users
On Wed, 12 Mar 2014, Alen Stojanov wrote:
> So, just to summarize (since I did not get any reply): is the final conclusion
> that I cannot simply obtain proper flop counts with Linux perf because of
> hardware limitations?
Performance counters are tricky things. You shouldn't take my word for
it, you should either run tests or contact people inside Intel.
But yes, "flop count" has always been a tricky quantity to measure (what
constitutes a flop? Is a fused multiply-add one flop or two? etc.)
And on recent Intel processors the floating point events are notoriously
hard to use; for a while Intel even stopped documenting that they
existed, until the HPC community complained enough that Intel brought
them back, with a lot of warnings about accuracy.
This doesn't mean you can't get useful results out of the events; it
just means that it's probably never going to be possible to get "exact
flop counts", whatever that means.
Vince