linux-mm.kvack.org archive mirror
* Transparent Hugepage impact on memcpy
@ 2013-06-04  8:57 Jianguo Wu
  2013-06-04 12:30 ` Wanpeng Li
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Jianguo Wu @ 2013-06-04  8:57 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrea Arcangeli, qiuxishi

Hi all,

I tested memcpy with perf bench, and found that in the prefault case, when Transparent Hugepage is on,
memcpy has worse performance.

With THP on it is 3.672879 GB/Sec (with prefault), while with THP off it is 6.190187 GB/Sec (with prefault).

I thought THP would improve performance, but the test result shows this is obviously not the case.
Andrea mentioned that THP makes "clear_page/copy_page less cache friendly" in
http://events.linuxfoundation.org/slides/2011/lfcs/lfcs2011_hpc_arcangeli.pdf.

I don't quite understand this; could you please give me some comments? Thanks!

I tested on Linux-3.4-stable, and my machine info is:
Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 24567 MB
node 0 free: 23550 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 24576 MB
node 1 free: 23767 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10

Below is test result:
---with THP---
#cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
#./perf bench mem memcpy -l 1gb -o
# Running mem/memcpy benchmark...
# Copying 1gb Bytes ...

       3.672879 GB/Sec (with prefault)

#./perf stat ...
Performance counter stats for './perf bench mem memcpy -l 1gb -o':

          35455940 cache-misses              #   53.504 % of all cache refs     [49.45%]
          66267785 cache-references                                             [49.78%]
              2409 page-faults                                                 
         450768651 dTLB-loads                                                   [50.78%]
             24580 dTLB-misses               #    0.01% of all dTLB cache hits  [51.01%]
        1338974202 dTLB-stores                                                  [50.63%]
             77943 dTLB-misses                                                  [50.24%]
         697404997 iTLB-loads                                                   [49.77%]
               274 iTLB-misses               #    0.00% of all iTLB cache hits  [49.30%]

       0.855041819 seconds time elapsed

---no THP---
#cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

#./perf bench mem memcpy -l 1gb -o
# Running mem/memcpy benchmark...
# Copying 1gb Bytes ...

       6.190187 GB/Sec (with prefault)

#./perf stat ...
Performance counter stats for './perf bench mem memcpy -l 1gb -o':

          16920763 cache-misses              #   98.377 % of all cache refs     [50.01%]
          17200000 cache-references                                             [50.04%]
            524652 page-faults                                                 
         734365659 dTLB-loads                                                   [50.04%]
           4986387 dTLB-misses               #    0.68% of all dTLB cache hits  [50.04%]
        1013408298 dTLB-stores                                                  [50.04%]
           8180817 dTLB-misses                                                  [49.97%]
        1526642351 iTLB-loads                                                   [50.41%]
                56 iTLB-misses               #    0.00% of all iTLB cache hits  [50.21%]

       1.025425847 seconds time elapsed

Thanks,
Jianguo Wu.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Transparent Hugepage impact on memcpy
  2013-06-04  8:57 Transparent Hugepage impact on memcpy Jianguo Wu
  2013-06-04 12:30 ` Wanpeng Li
@ 2013-06-04 12:30 ` Wanpeng Li
  2013-06-04 20:20   ` Andrea Arcangeli
       [not found] ` <51adde12.e6b2320a.610d.ffff96f3SMTPIN_ADDED_BROKEN@mx.google.com>
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Wanpeng Li @ 2013-06-04 12:30 UTC (permalink / raw)
  To: Jianguo Wu; +Cc: linux-mm, Andrea Arcangeli, qiuxishi

[-- Attachment #1: Type: text/plain, Size: 3777 bytes --]

On Tue, Jun 04, 2013 at 04:57:57PM +0800, Jianguo Wu wrote:
>Hi all,
>
>I tested memcpy with perf bench, and found that in prefault case, When Transparent Hugepage is on,
>memcpy has worse performance.
>
>When THP on is 3.672879 GB/Sec (with prefault), while THP off is 6.190187 GB/Sec (with prefault).
>

I get a similar result against 3.10-rc4; see the attachment. This is
due to the characteristic of THP that it takes a single page fault for
each 2MB virtual region touched by userland.
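
A minimal sketch (not code from this thread; it assumes a 1GB anonymous
mapping and the transparent_hugepage/enabled switch used above) that makes
the fault-count difference visible: touch one byte per 4K page and read the
minor-fault count from getrusage(), once with the sysfs setting at [always]
and once at [never].

#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

int main(void)
{
	size_t len = 1UL << 30;		/* 1GB of anonymous memory */
	size_t off;
	struct rusage before, after;
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	getrusage(RUSAGE_SELF, &before);
	for (off = 0; off < len; off += 4096)	/* touch one byte per 4K page */
		p[off] = 1;
	getrusage(RUSAGE_SELF, &after);

	/* with THP on, one fault covers a 2MB region (~512 faults per GB);
	   with THP off, one fault per 4K page (~262144 per GB) */
	printf("minor faults: %ld\n", after.ru_minflt - before.ru_minflt);

	munmap(p, len);
	return 0;
}

This is consistent with the page-faults counters above (2409 with THP on
vs 524652 with THP off for the two 1GB buffers that perf bench touches).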

>I think THP will improve performance, but the test result obviously not the case. 
>Andrea mentioned THP cause "clear_page/copy_page less cache friendly" in
>http://events.linuxfoundation.org/slides/2011/lfcs/lfcs2011_hpc_arcangeli.pdf.
>
>I am not quite understand this, could you please give me some comments, Thanks!
>
>I test in Linux-3.4-stable, and my machine info is:
>Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz
>
>available: 2 nodes (0-1)
>node 0 cpus: 0 1 2 3 8 9 10 11
>node 0 size: 24567 MB
>node 0 free: 23550 MB
>node 1 cpus: 4 5 6 7 12 13 14 15
>node 1 size: 24576 MB
>node 1 free: 23767 MB
>node distances:
>node   0   1 
>  0:  10  20 
>  1:  20  10
>
>Below is test result:
>---with THP---
>#cat /sys/kernel/mm/transparent_hugepage/enabled
>[always] madvise never
>#./perf bench mem memcpy -l 1gb -o
># Running mem/memcpy benchmark...
># Copying 1gb Bytes ...
>
>       3.672879 GB/Sec (with prefault)
>
>#./perf stat ...
>Performance counter stats for './perf bench mem memcpy -l 1gb -o':
>
>          35455940 cache-misses              #   53.504 % of all cache refs     [49.45%]
>          66267785 cache-references                                             [49.78%]
>              2409 page-faults                                                 
>         450768651 dTLB-loads
>                                                  [50.78%]
>             24580 dTLB-misses
>              #    0.01% of all dTLB cache hits  [51.01%]
>        1338974202 dTLB-stores
>                                                 [50.63%]
>             77943 dTLB-misses
>                                                 [50.24%]
>         697404997 iTLB-loads
>                                                  [49.77%]
>               274 iTLB-misses
>              #    0.00% of all iTLB cache hits  [49.30%]
>
>       0.855041819 seconds time elapsed
>
>---no THP---
>#cat /sys/kernel/mm/transparent_hugepage/enabled
>always madvise [never]
>
>#./perf bench mem memcpy -l 1gb -o
># Running mem/memcpy benchmark...
># Copying 1gb Bytes ...
>
>       6.190187 GB/Sec (with prefault)
>
>#./perf stat ...
>Performance counter stats for './perf bench mem memcpy -l 1gb -o':
>
>          16920763 cache-misses              #   98.377 % of all cache refs     [50.01%]
>          17200000 cache-references                                             [50.04%]
>            524652 page-faults                                                 
>         734365659 dTLB-loads
>                                                  [50.04%]
>           4986387 dTLB-misses
>              #    0.68% of all dTLB cache hits  [50.04%]
>        1013408298 dTLB-stores
>                                                 [50.04%]
>           8180817 dTLB-misses
>                                                 [49.97%]
>        1526642351 iTLB-loads
>                                                  [50.41%]
>                56 iTLB-misses
>              #    0.00% of all iTLB cache hits  [50.21%]
>
>       1.025425847 seconds time elapsed
>
>Thanks,
>Jianguo Wu.
>

[-- Attachment #2: thp --]
[-- Type: text/plain, Size: 2004 bytes --]

---with THP---
#cat  /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
# Running mem/memcpy benchmark...
# Copying 1gb Bytes ...

      12.208522 GB/Sec (with prefault)

 Performance counter stats for './perf bench mem memcpy -l 1gb -o':

        26,453,696 cache-misses              #   35.411 % of all cache refs     [57.66%]
        74,704,531 cache-references                                             [58.40%]
             2,297 page-faults                                                 
       146,567,960 dTLB-loads                                                   [58.64%]
       211,648,685 dTLB-stores                                                  [58.63%]
            14,533 dTLB-load-misses          #    0.01% of all dTLB cache hits  [57.46%]
               640 iTLB-loads                                                   [55.74%]
           270,881 iTLB-load-misses          #  42325.16% of all iTLB cache hits  [55.17%]

       0.232425109 seconds time elapsed

---no THP---
#cat  /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

# Running mem/memcpy benchmark...
# Copying 1gb Bytes ...

      18.325087 GB/Sec (with prefault)

 Performance counter stats for './perf bench mem memcpy -l 1gb -o':

        28,498,544 cache-misses              #   86.167 % of all cache refs     [57.35%]
        33,073,611 cache-references                                             [57.71%]
           524,540 page-faults                                                 
       453,500,641 dTLB-loads                                                   [57.99%]
       409,255,606 dTLB-stores                                                  [57.99%]
         2,033,985 dTLB-load-misses          #    0.45% of all dTLB cache hits  [57.52%]
             1,180 iTLB-loads                                                   [56.69%]
           539,056 iTLB-load-misses          #  45682.71% of all iTLB cache hits  [56.02%]

       0.485932214 seconds time elapsed

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Transparent Hugepage impact on memcpy
       [not found] ` <51adde12.e6b2320a.610d.ffff96f3SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2013-06-04 12:55   ` Jianguo Wu
  0 siblings, 0 replies; 12+ messages in thread
From: Jianguo Wu @ 2013-06-04 12:55 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: linux-mm, Andrea Arcangeli, qiuxishi

On 2013/6/4 20:30, Wanpeng Li wrote:

> On Tue, Jun 04, 2013 at 04:57:57PM +0800, Jianguo Wu wrote:
>> Hi all,
>>
>> I tested memcpy with perf bench, and found that in prefault case, When Transparent Hugepage is on,
>> memcpy has worse performance.
>>
>> When THP on is 3.672879 GB/Sec (with prefault), while THP off is 6.190187 GB/Sec (with prefault).
>>
> 
> I get similar result as you against 3.10-rc4 in the attachment. This
> dues to the characteristic of thp takes a single page fault for each 
> 2MB virtual region touched by userland.
>

Hi Wanpeng,
Thanks for your reply :).

This test is with prefault, so it does not count page fault time, and I think fewer page
faults should improve memcpy performance, right?

The perf stat results show a significant reduction in cache-references and cache-misses
when THP is off; do you have any idea about this?

Thanks,
Jianguo Wu.

>> I think THP will improve performance, but the test result obviously not the case. 
>> Andrea mentioned THP cause "clear_page/copy_page less cache friendly" in
>> http://events.linuxfoundation.org/slides/2011/lfcs/lfcs2011_hpc_arcangeli.pdf.
>>
>> I am not quite understand this, could you please give me some comments, Thanks!
>>
>> I test in Linux-3.4-stable, and my machine info is:
>> Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz
>>
>> available: 2 nodes (0-1)
>> node 0 cpus: 0 1 2 3 8 9 10 11
>> node 0 size: 24567 MB
>> node 0 free: 23550 MB
>> node 1 cpus: 4 5 6 7 12 13 14 15
>> node 1 size: 24576 MB
>> node 1 free: 23767 MB
>> node distances:
>> node   0   1 
>>  0:  10  20 
>>  1:  20  10
>>
>> Below is test result:
>> ---with THP---
>> #cat /sys/kernel/mm/transparent_hugepage/enabled
>> [always] madvise never
>> #./perf bench mem memcpy -l 1gb -o
>> # Running mem/memcpy benchmark...
>> # Copying 1gb Bytes ...
>>
>>       3.672879 GB/Sec (with prefault)
>>
>> #./perf stat ...
>> Performance counter stats for './perf bench mem memcpy -l 1gb -o':
>>
>>          35455940 cache-misses              #   53.504 % of all cache refs     [49.45%]
>>          66267785 cache-references                                             [49.78%]
>>              2409 page-faults                                                 
>>         450768651 dTLB-loads
>>                                                  [50.78%]
>>             24580 dTLB-misses
>>              #    0.01% of all dTLB cache hits  [51.01%]
>>        1338974202 dTLB-stores
>>                                                 [50.63%]
>>             77943 dTLB-misses
>>                                                 [50.24%]
>>         697404997 iTLB-loads
>>                                                  [49.77%]
>>               274 iTLB-misses
>>              #    0.00% of all iTLB cache hits  [49.30%]
>>
>>       0.855041819 seconds time elapsed
>>
>> ---no THP---
>> #cat /sys/kernel/mm/transparent_hugepage/enabled
>> always madvise [never]
>>
>> #./perf bench mem memcpy -l 1gb -o
>> # Running mem/memcpy benchmark...
>> # Copying 1gb Bytes ...
>>
>>       6.190187 GB/Sec (with prefault)
>>
>> #./perf stat ...
>> Performance counter stats for './perf bench mem memcpy -l 1gb -o':
>>
>>          16920763 cache-misses              #   98.377 % of all cache refs     [50.01%]
>>          17200000 cache-references                                             [50.04%]
>>            524652 page-faults                                                 
>>         734365659 dTLB-loads
>>                                                  [50.04%]
>>           4986387 dTLB-misses
>>              #    0.68% of all dTLB cache hits  [50.04%]
>>        1013408298 dTLB-stores
>>                                                 [50.04%]
>>           8180817 dTLB-misses
>>                                                 [49.97%]
>>        1526642351 iTLB-loads
>>                                                  [50.41%]
>>                56 iTLB-misses
>>              #    0.00% of all iTLB cache hits  [50.21%]
>>
>>       1.025425847 seconds time elapsed
>>
>> Thanks,
>> Jianguo Wu.
>>




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Transparent Hugepage impact on memcpy
  2013-06-04  8:57 Transparent Hugepage impact on memcpy Jianguo Wu
                   ` (2 preceding siblings ...)
       [not found] ` <51adde12.e6b2320a.610d.ffff96f3SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2013-06-04 14:10 ` Hush Bensen
  2013-06-05  3:26 ` Jianguo Wu
  4 siblings, 0 replies; 12+ messages in thread
From: Hush Bensen @ 2013-06-04 14:10 UTC (permalink / raw)
  To: Jianguo Wu, linux-mm, Kirill A. Shutemov, Hugh Dickins,
	Dave Hansen, Andi Kleen
  Cc: Andrea Arcangeli, qiuxishi

Cc thp guys.

On 2013/6/4 16:57, Jianguo Wu wrote:
> Hi all,
>
> I tested memcpy with perf bench, and found that in prefault case, When Transparent Hugepage is on,
> memcpy has worse performance.
>
> When THP on is 3.672879 GB/Sec (with prefault), while THP off is 6.190187 GB/Sec (with prefault).
>
> I think THP will improve performance, but the test result obviously not the case.
> Andrea mentioned THP cause "clear_page/copy_page less cache friendly" in
> http://events.linuxfoundation.org/slides/2011/lfcs/lfcs2011_hpc_arcangeli.pdf.
>
> I am not quite understand this, could you please give me some comments, Thanks!
>
> I test in Linux-3.4-stable, and my machine info is:
> Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz
>
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 8 9 10 11
> node 0 size: 24567 MB
> node 0 free: 23550 MB
> node 1 cpus: 4 5 6 7 12 13 14 15
> node 1 size: 24576 MB
> node 1 free: 23767 MB
> node distances:
> node   0   1
>    0:  10  20
>    1:  20  10
>
> Below is test result:
> ---with THP---
> #cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
> #./perf bench mem memcpy -l 1gb -o
> # Running mem/memcpy benchmark...
> # Copying 1gb Bytes ...
>
>         3.672879 GB/Sec (with prefault)
>
> #./perf stat ...
> Performance counter stats for './perf bench mem memcpy -l 1gb -o':
>
>            35455940 cache-misses              #   53.504 % of all cache refs     [49.45%]
>            66267785 cache-references                                             [49.78%]
>                2409 page-faults
>           450768651 dTLB-loads
>                                                    [50.78%]
>               24580 dTLB-misses
>                #    0.01% of all dTLB cache hits  [51.01%]
>          1338974202 dTLB-stores
>                                                   [50.63%]
>               77943 dTLB-misses
>                                                   [50.24%]
>           697404997 iTLB-loads
>                                                    [49.77%]
>                 274 iTLB-misses
>                #    0.00% of all iTLB cache hits  [49.30%]
>
>         0.855041819 seconds time elapsed
>
> ---no THP---
> #cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
>
> #./perf bench mem memcpy -l 1gb -o
> # Running mem/memcpy benchmark...
> # Copying 1gb Bytes ...
>
>         6.190187 GB/Sec (with prefault)
>
> #./perf stat ...
> Performance counter stats for './perf bench mem memcpy -l 1gb -o':
>
>            16920763 cache-misses              #   98.377 % of all cache refs     [50.01%]
>            17200000 cache-references                                             [50.04%]
>              524652 page-faults
>           734365659 dTLB-loads
>                                                    [50.04%]
>             4986387 dTLB-misses
>                #    0.68% of all dTLB cache hits  [50.04%]
>          1013408298 dTLB-stores
>                                                   [50.04%]
>             8180817 dTLB-misses
>                                                   [49.97%]
>          1526642351 iTLB-loads
>                                                    [50.41%]
>                  56 iTLB-misses
>                #    0.00% of all iTLB cache hits  [50.21%]
>
>         1.025425847 seconds time elapsed
>
> Thanks,
> Jianguo Wu.
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Transparent Hugepage impact on memcpy
  2013-06-04 12:30 ` Wanpeng Li
@ 2013-06-04 20:20   ` Andrea Arcangeli
  2013-06-05  2:49     ` Jianguo Wu
  0 siblings, 1 reply; 12+ messages in thread
From: Andrea Arcangeli @ 2013-06-04 20:20 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: Jianguo Wu, linux-mm, qiuxishi

Hello everyone,

On Tue, Jun 04, 2013 at 08:30:51PM +0800, Wanpeng Li wrote:
> On Tue, Jun 04, 2013 at 04:57:57PM +0800, Jianguo Wu wrote:
> >Hi all,
> >
> >I tested memcpy with perf bench, and found that in prefault case, When Transparent Hugepage is on,
> >memcpy has worse performance.
> >
> >When THP on is 3.672879 GB/Sec (with prefault), while THP off is 6.190187 GB/Sec (with prefault).
> >
> 
> I get similar result as you against 3.10-rc4 in the attachment. This
> dues to the characteristic of thp takes a single page fault for each 
> 2MB virtual region touched by userland.

I had a look at what prefault does, and page faults should not be
involved in the measurement of GB/sec. The "stats" also include the
page faults, but the page faults are not part of the printed GB/sec if
"-o" is used.

If the perf test is correct, it looks more like a hardware issue with
memcpy and large TLBs than a software one. memset doesn't exhibit it;
if this were something fundamental, memset should also exhibit it. It
shall be possible to reproduce this with hugetlbfs in fact... if you
want to be 100% sure it's not software, you should try that.
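
One possible way to run that hugetlbfs comparison, as a sketch only (it
assumes 2MB hugepages have already been reserved, e.g. via
/proc/sys/vm/nr_hugepages, and that <sys/mman.h> exposes MAP_HUGETLB; this
is not code from the thread): map src and dst either as regular anonymous
memory or with MAP_HUGETLB, prefault, then time a second memcpy, mirroring
what perf bench does with -o.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>

static void *map_buf(size_t len, int huge)
{
	int flags = MAP_PRIVATE | MAP_ANONYMOUS | (huge ? MAP_HUGETLB : 0);
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}

static void bench(size_t len, int huge)
{
	struct timeval s, e, d;
	void *src = map_buf(len, huge);
	void *dst = map_buf(len, huge);

	if (!src || !dst) {
		fprintf(stderr, "mmap failed (huge=%d)\n", huge);
		return;
	}
	memcpy(dst, src, len);			/* prefault pass */
	gettimeofday(&s, NULL);
	memcpy(dst, src, len);			/* measured copy */
	gettimeofday(&e, NULL);
	timersub(&e, &s, &d);
	printf("%s: %f GB/Sec\n", huge ? "hugetlb" : "4K pages",
	       (double)len / (d.tv_sec + d.tv_usec / 1e6) / (1 << 30));
	munmap(src, len);
	munmap(dst, len);
}

int main(void)
{
	bench(1UL << 30, 0);	/* regular 4K-backed mapping */
	bench(1UL << 30, 1);	/* explicit 2MB hugepages */
	return 0;
}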

Chances are there's enough pre-fetching going on in the CPU to
optimize for those 4k tlb loads in streaming copies, and the
pagetables are also cached very nicely with streaming copies. Maybe
large TLBs somewhere are less optimized for streaming copies. Only
something smarter happening in the CPU optimized for 4k and not yet
for 2M TLBs can explain this: if the CPU was equally intelligent it
should definitely be faster with THP on even with "-o".

Overall I doubt there's anything in software to fix here.

Also note, this is not related to the additional cache usage during page
faults that I mentioned in the pdf. Page faults and cache effects in
the page faults are completely removed from the equation because of
"-o". The prefault pass eliminates the page faults and trashes away
all the cache (regardless of whether the page fault uses non-temporal
stores or not) before the "measured" memcpy load starts.

I don't think this is a major concern; as a rule of thumb you just
need to prefix the "perf" command with "time" to see it: the THP
version still completes much faster even though the prefault part of it
is slightly slower with THP on.

THP pays off the most during computations that are accessing randomly,
and not sequentially.

Thanks,
Andrea


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Transparent Hugepage impact on memcpy
  2013-06-04 20:20   ` Andrea Arcangeli
@ 2013-06-05  2:49     ` Jianguo Wu
  0 siblings, 0 replies; 12+ messages in thread
From: Jianguo Wu @ 2013-06-05  2:49 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Wanpeng Li, linux-mm, qiuxishi, Hush Bensen

Hi Andrea,

Thanks for your patient explanation:). Please see below.

On 2013/6/5 4:20, Andrea Arcangeli wrote:

> Hello everyone,
> 
> On Tue, Jun 04, 2013 at 08:30:51PM +0800, Wanpeng Li wrote:
>> On Tue, Jun 04, 2013 at 04:57:57PM +0800, Jianguo Wu wrote:
>>> Hi all,
>>>
>>> I tested memcpy with perf bench, and found that in prefault case, When Transparent Hugepage is on,
>>> memcpy has worse performance.
>>>
>>> When THP on is 3.672879 GB/Sec (with prefault), while THP off is 6.190187 GB/Sec (with prefault).
>>>
>>
>> I get similar result as you against 3.10-rc4 in the attachment. This
>> dues to the characteristic of thp takes a single page fault for each 
>> 2MB virtual region touched by userland.
> 
> I had a look at what prefault does and page faults should not be
> involved in the measurement of GB/sec. The "stats" also include the
> page faults but the page fault is not part of the printed GB/sec, if
> "-o" is used.

Agreed.

> 
> If the perf test is correct, it looks more an hardware issue with
> memcpy and large TLBs than a software one. memset doesn't exibith it,
> if this was something fundamental memset should also exibith it. It

Yes, I tested memset with perf bench; it's a little faster with THP:
THP:    6.458863 GB/Sec (with prefault)
NO-THP: 6.393698 GB/Sec (with prefault)

> shall be possible to reproduce this with hugetlbfs in fact... if you
> want to be 100% sure it's not software, you should try that.
> 

Yes, I got the following results:
hugetlb:    2.518822 GB/Sec	(with prefault)
no-hugetlb: 3.688322 GB/Sec	(with prefault)

> Chances are there's enough pre-fetching going on in the CPU to
> optimize for those 4k tlb loads in streaming copies, and the
> pagetables are also cached very nicely with streaming copies. Maybe
> large TLBs somewhere are less optimized for streaming copies. Only
> something smarter happening in the CPU optimized for 4k and not yet
> for 2M TLBs can explain this: if the CPU was equally intelligent it
> should definitely be faster with THP on even with "-o".
> 
> Overall I doubt there's anything in software to fix here.
> 
> Also note, this is not related to additional cache usage during page
> faults that I mentioned in the pdf. Page faults or cache effects in
> the page faults are completely removed from the equation because of
> "-o". The prefault pass, eliminates the page faults and trashes away
> all the cache (regardless if the page fault uses non-temporal stores
> or not) before the "measured" memcpy load starts.
> 

The perf stat results show a significant reduction in cache-references and cache-misses
when THP is off; how can this be explained?
	cache-misses	cache-references
THP:	35455940	66267785
NO-THP: 16920763	17200000

> I don't think this is a major concern, as a proof of thumb you just
> need to prefix the "perf" command with "time" to see it: the THP

I test with "time ./perf bench mem memcpy -l 1gb -o", and the result is
consistent with your expect.

THP:
       3.629896 GB/Sec (with prefault)

real	0m0.849s
user	0m0.472s
sys	0m0.372s

NO-THP:
       6.169184 GB/Sec (with prefault)

real	0m1.013s
user	0m0.412s
sys	0m0.596s

> version still completes much faster despite the prefault part of it
> is slightly slower with THP on.
> 

Why is the prefault part slower with THP on?
perf bench shows that without prefault, THP on is much faster:

# ./perf bench mem memcpy -l 1gb -n
THP:    1.759009 GB/Sec
NO-THP: 1.291761 GB/Sec

Thanks again for your explanation.

Jianguo Wu.

> THP pays off the most during computations that are accessing randomly,
> and not sequentially.
> 
> Thanks,
> Andrea
> 
> .
> 




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Transparent Hugepage impact on memcpy
  2013-06-04  8:57 Transparent Hugepage impact on memcpy Jianguo Wu
                   ` (3 preceding siblings ...)
  2013-06-04 14:10 ` Hush Bensen
@ 2013-06-05  3:26 ` Jianguo Wu
  2013-06-06 13:54   ` Hitoshi Mitake
  4 siblings, 1 reply; 12+ messages in thread
From: Jianguo Wu @ 2013-06-05  3:26 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrea Arcangeli, qiuxishi, Wanpeng Li, Hush Bensen, mitake

Hi,
One more question: I wrote a memcpy test program, mostly the same as perf bench memcpy.
But the test result isn't consistent with perf bench when THP is off.

	my program				perf bench
THP:	3.628368 GB/Sec (with prefault)		3.672879 GB/Sec (with prefault)
NO-THP:	3.612743 GB/Sec (with prefault)		6.190187 GB/Sec (with prefault)

Below is my code:
	src = calloc(1, len);
	dst = calloc(1, len);

	if (prefault)
		memcpy(dst, src, len);
	gettimeofday(&tv_start, NULL);
	memcpy(dst, src, len);
	gettimeofday(&tv_end, NULL);

	timersub(&tv_end, &tv_start, &tv_diff);
	free(src);
	free(dst);

	speed = (double)((double)len / timeval2double(&tv_diff));
	print_bps(speed);

This is weird; is it possible that perf bench does some build optimization?

Thanks,
Jianguo Wu.

On 2013/6/4 16:57, Jianguo Wu wrote:

> Hi all,
> 
> I tested memcpy with perf bench, and found that in prefault case, When Transparent Hugepage is on,
> memcpy has worse performance.
> 
> When THP on is 3.672879 GB/Sec (with prefault), while THP off is 6.190187 GB/Sec (with prefault).
> 
> I think THP will improve performance, but the test result obviously not the case. 
> Andrea mentioned THP cause "clear_page/copy_page less cache friendly" in
> http://events.linuxfoundation.org/slides/2011/lfcs/lfcs2011_hpc_arcangeli.pdf.
> 
> I am not quite understand this, could you please give me some comments, Thanks!
> 
> I test in Linux-3.4-stable, and my machine info is:
> Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz
> 
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 8 9 10 11
> node 0 size: 24567 MB
> node 0 free: 23550 MB
> node 1 cpus: 4 5 6 7 12 13 14 15
> node 1 size: 24576 MB
> node 1 free: 23767 MB
> node distances:
> node   0   1 
>   0:  10  20 
>   1:  20  10
> 
> Below is test result:
> ---with THP---
> #cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
> #./perf bench mem memcpy -l 1gb -o
> # Running mem/memcpy benchmark...
> # Copying 1gb Bytes ...
> 
>        3.672879 GB/Sec (with prefault)
> 
> #./perf stat ...
> Performance counter stats for './perf bench mem memcpy -l 1gb -o':
> 
>           35455940 cache-misses              #   53.504 % of all cache refs     [49.45%]
>           66267785 cache-references                                             [49.78%]
>               2409 page-faults                                                 
>          450768651 dTLB-loads
>                                                   [50.78%]
>              24580 dTLB-misses
>               #    0.01% of all dTLB cache hits  [51.01%]
>         1338974202 dTLB-stores
>                                                  [50.63%]
>              77943 dTLB-misses
>                                                  [50.24%]
>          697404997 iTLB-loads
>                                                   [49.77%]
>                274 iTLB-misses
>               #    0.00% of all iTLB cache hits  [49.30%]
> 
>        0.855041819 seconds time elapsed
> 
> ---no THP---
> #cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
> 
> #./perf bench mem memcpy -l 1gb -o
> # Running mem/memcpy benchmark...
> # Copying 1gb Bytes ...
> 
>        6.190187 GB/Sec (with prefault)
> 
> #./perf stat ...
> Performance counter stats for './perf bench mem memcpy -l 1gb -o':
> 
>           16920763 cache-misses              #   98.377 % of all cache refs     [50.01%]
>           17200000 cache-references                                             [50.04%]
>             524652 page-faults                                                 
>          734365659 dTLB-loads
>                                                   [50.04%]
>            4986387 dTLB-misses
>               #    0.68% of all dTLB cache hits  [50.04%]
>         1013408298 dTLB-stores
>                                                  [50.04%]
>            8180817 dTLB-misses
>                                                  [49.97%]
>         1526642351 iTLB-loads
>                                                   [50.41%]
>                 56 iTLB-misses
>               #    0.00% of all iTLB cache hits  [50.21%]
> 
>        1.025425847 seconds time elapsed
> 
> Thanks,
> Jianguo Wu.




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Transparent Hugepage impact on memcpy
  2013-06-05  3:26 ` Jianguo Wu
@ 2013-06-06 13:54   ` Hitoshi Mitake
  2013-06-07  1:26     ` Jianguo Wu
  0 siblings, 1 reply; 12+ messages in thread
From: Hitoshi Mitake @ 2013-06-06 13:54 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: linux-mm, Andrea Arcangeli, qiuxishi, Wanpeng Li, Hush Bensen,
	mitake.hitoshi

Hi Jianguo,

On Wed, Jun 5, 2013 at 12:26 PM, Jianguo Wu <wujianguo@huawei.com> wrote:
> Hi,
> One more question, I wrote a memcpy test program, mostly the same as with perf bench memcpy.
> But test result isn't consistent with perf bench when THP is off.
>
>         my program                              perf bench
> THP:    3.628368 GB/Sec (with prefault)         3.672879 GB/Sec (with prefault)
> NO-THP: 3.612743 GB/Sec (with prefault)         6.190187 GB/Sec (with prefault)
>
> Below is my code:
>         src = calloc(1, len);
>         dst = calloc(1, len);
>
>         if (prefault)
>                 memcpy(dst, src, len);
>         gettimeofday(&tv_start, NULL);
>         memcpy(dst, src, len);
>         gettimeofday(&tv_end, NULL);
>
>         timersub(&tv_end, &tv_start, &tv_diff);
>         free(src);
>         free(dst);
>
>         speed = (double)((double)len / timeval2double(&tv_diff));
>         print_bps(speed);
>
> This is weird, is it possible that perf bench do some build optimize?
>
> Thansk,
> Jianguo Wu.

perf bench mem memcpy is built with -O6. This is the compile command
line (you can get this with make V=1):
gcc -o bench/mem-memcpy-x86-64-asm.o -c -fno-omit-frame-pointer -ggdb3
-funwind-tables -Wall -Wextra -std=gnu99 -Werror -O6 .... # omitted

Can I see the compile options for your test program and the actual
command line used to execute perf bench mem memcpy?

Thanks,
Hitoshi

>
> On 2013/6/4 16:57, Jianguo Wu wrote:
>
>> Hi all,
>>
>> I tested memcpy with perf bench, and found that in prefault case, When Transparent Hugepage is on,
>> memcpy has worse performance.
>>
>> When THP on is 3.672879 GB/Sec (with prefault), while THP off is 6.190187 GB/Sec (with prefault).
>>
>> I think THP will improve performance, but the test result obviously not the case.
>> Andrea mentioned THP cause "clear_page/copy_page less cache friendly" in
>> http://events.linuxfoundation.org/slides/2011/lfcs/lfcs2011_hpc_arcangeli.pdf.
>>
>> I am not quite understand this, could you please give me some comments, Thanks!
>>
>> I test in Linux-3.4-stable, and my machine info is:
>> Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz
>>
>> available: 2 nodes (0-1)
>> node 0 cpus: 0 1 2 3 8 9 10 11
>> node 0 size: 24567 MB
>> node 0 free: 23550 MB
>> node 1 cpus: 4 5 6 7 12 13 14 15
>> node 1 size: 24576 MB
>> node 1 free: 23767 MB
>> node distances:
>> node   0   1
>>   0:  10  20
>>   1:  20  10
>>
>> Below is test result:
>> ---with THP---
>> #cat /sys/kernel/mm/transparent_hugepage/enabled
>> [always] madvise never
>> #./perf bench mem memcpy -l 1gb -o
>> # Running mem/memcpy benchmark...
>> # Copying 1gb Bytes ...
>>
>>        3.672879 GB/Sec (with prefault)
>>
>> #./perf stat ...
>> Performance counter stats for './perf bench mem memcpy -l 1gb -o':
>>
>>           35455940 cache-misses              #   53.504 % of all cache refs     [49.45%]
>>           66267785 cache-references                                             [49.78%]
>>               2409 page-faults
>>          450768651 dTLB-loads
>>                                                   [50.78%]
>>              24580 dTLB-misses
>>               #    0.01% of all dTLB cache hits  [51.01%]
>>         1338974202 dTLB-stores
>>                                                  [50.63%]
>>              77943 dTLB-misses
>>                                                  [50.24%]
>>          697404997 iTLB-loads
>>                                                   [49.77%]
>>                274 iTLB-misses
>>               #    0.00% of all iTLB cache hits  [49.30%]
>>
>>        0.855041819 seconds time elapsed
>>
>> ---no THP---
>> #cat /sys/kernel/mm/transparent_hugepage/enabled
>> always madvise [never]
>>
>> #./perf bench mem memcpy -l 1gb -o
>> # Running mem/memcpy benchmark...
>> # Copying 1gb Bytes ...
>>
>>        6.190187 GB/Sec (with prefault)
>>
>> #./perf stat ...
>> Performance counter stats for './perf bench mem memcpy -l 1gb -o':
>>
>>           16920763 cache-misses              #   98.377 % of all cache refs     [50.01%]
>>           17200000 cache-references                                             [50.04%]
>>             524652 page-faults
>>          734365659 dTLB-loads
>>                                                   [50.04%]
>>            4986387 dTLB-misses
>>               #    0.68% of all dTLB cache hits  [50.04%]
>>         1013408298 dTLB-stores
>>                                                  [50.04%]
>>            8180817 dTLB-misses
>>                                                  [49.97%]
>>         1526642351 iTLB-loads
>>                                                   [50.41%]
>>                 56 iTLB-misses
>>               #    0.00% of all iTLB cache hits  [50.21%]
>>
>>        1.025425847 seconds time elapsed
>>
>> Thanks,
>> Jianguo Wu.
>
>
>
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Transparent Hugepage impact on memcpy
  2013-06-06 13:54   ` Hitoshi Mitake
@ 2013-06-07  1:26     ` Jianguo Wu
  2013-06-07 13:50       ` Hitoshi Mitake
  0 siblings, 1 reply; 12+ messages in thread
From: Jianguo Wu @ 2013-06-07  1:26 UTC (permalink / raw)
  To: Hitoshi Mitake
  Cc: linux-mm, Andrea Arcangeli, qiuxishi, Wanpeng Li, Hush Bensen,
	mitake.hitoshi

Hi Hitoshi,

Thanks for your reply! please see below.

On 2013/6/6 21:54, Hitoshi Mitake wrote:

> Hi Jianguo,
> 
> On Wed, Jun 5, 2013 at 12:26 PM, Jianguo Wu <wujianguo@huawei.com> wrote:
>> Hi,
>> One more question, I wrote a memcpy test program, mostly the same as with perf bench memcpy.
>> But test result isn't consistent with perf bench when THP is off.
>>
>>         my program                              perf bench
>> THP:    3.628368 GB/Sec (with prefault)         3.672879 GB/Sec (with prefault)
>> NO-THP: 3.612743 GB/Sec (with prefault)         6.190187 GB/Sec (with prefault)
>>
>> Below is my code:
>>         src = calloc(1, len);
>>         dst = calloc(1, len);
>>
>>         if (prefault)
>>                 memcpy(dst, src, len);
>>         gettimeofday(&tv_start, NULL);
>>         memcpy(dst, src, len);
>>         gettimeofday(&tv_end, NULL);
>>
>>         timersub(&tv_end, &tv_start, &tv_diff);
>>         free(src);
>>         free(dst);
>>
>>         speed = (double)((double)len / timeval2double(&tv_diff));
>>         print_bps(speed);
>>
>> This is weird, is it possible that perf bench do some build optimize?
>>
>> Thansk,
>> Jianguo Wu.
> 
> perf bench mem memcpy is build with -O6. This is the compile command
> line (you can get this with make V=1):
> gcc -o bench/mem-memcpy-x86-64-asm.o -c -fno-omit-frame-pointer -ggdb3
> -funwind-tables -Wall -Wextra -std=gnu99 -Werror -O6 .... # ommited
> 
> Can I see your compile option for your test program and the actual
> command line executing perf bench mem memcpy?
> 

I just compiled my test program with gcc -o memcpy-test memcpy-test.c.
I tried using the same compile options as perf bench mem memcpy, and
the test result showed no difference.

My command line for executing perf bench mem memcpy:
#./perf bench mem memcpy -l 1gb -o

Thanks,
Jianguo Wu

> Thanks,
> Hitoshi
> 
>>
>> On 2013/6/4 16:57, Jianguo Wu wrote:
>>
>>> Hi all,
>>>
>>> I tested memcpy with perf bench, and found that in prefault case, When Transparent Hugepage is on,
>>> memcpy has worse performance.
>>>
>>> When THP on is 3.672879 GB/Sec (with prefault), while THP off is 6.190187 GB/Sec (with prefault).
>>>
>>> I think THP will improve performance, but the test result obviously not the case.
>>> Andrea mentioned THP cause "clear_page/copy_page less cache friendly" in
>>> http://events.linuxfoundation.org/slides/2011/lfcs/lfcs2011_hpc_arcangeli.pdf.
>>>
>>> I am not quite understand this, could you please give me some comments, Thanks!
>>>
>>> I test in Linux-3.4-stable, and my machine info is:
>>> Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz
>>>
>>> available: 2 nodes (0-1)
>>> node 0 cpus: 0 1 2 3 8 9 10 11
>>> node 0 size: 24567 MB
>>> node 0 free: 23550 MB
>>> node 1 cpus: 4 5 6 7 12 13 14 15
>>> node 1 size: 24576 MB
>>> node 1 free: 23767 MB
>>> node distances:
>>> node   0   1
>>>   0:  10  20
>>>   1:  20  10
>>>
>>> Below is test result:
>>> ---with THP---
>>> #cat /sys/kernel/mm/transparent_hugepage/enabled
>>> [always] madvise never
>>> #./perf bench mem memcpy -l 1gb -o
>>> # Running mem/memcpy benchmark...
>>> # Copying 1gb Bytes ...
>>>
>>>        3.672879 GB/Sec (with prefault)
>>>
>>> #./perf stat ...
>>> Performance counter stats for './perf bench mem memcpy -l 1gb -o':
>>>
>>>           35455940 cache-misses              #   53.504 % of all cache refs     [49.45%]
>>>           66267785 cache-references                                             [49.78%]
>>>               2409 page-faults
>>>          450768651 dTLB-loads
>>>                                                   [50.78%]
>>>              24580 dTLB-misses
>>>               #    0.01% of all dTLB cache hits  [51.01%]
>>>         1338974202 dTLB-stores
>>>                                                  [50.63%]
>>>              77943 dTLB-misses
>>>                                                  [50.24%]
>>>          697404997 iTLB-loads
>>>                                                   [49.77%]
>>>                274 iTLB-misses
>>>               #    0.00% of all iTLB cache hits  [49.30%]
>>>
>>>        0.855041819 seconds time elapsed
>>>
>>> ---no THP---
>>> #cat /sys/kernel/mm/transparent_hugepage/enabled
>>> always madvise [never]
>>>
>>> #./perf bench mem memcpy -l 1gb -o
>>> # Running mem/memcpy benchmark...
>>> # Copying 1gb Bytes ...
>>>
>>>        6.190187 GB/Sec (with prefault)
>>>
>>> #./perf stat ...
>>> Performance counter stats for './perf bench mem memcpy -l 1gb -o':
>>>
>>>           16920763 cache-misses              #   98.377 % of all cache refs     [50.01%]
>>>           17200000 cache-references                                             [50.04%]
>>>             524652 page-faults
>>>          734365659 dTLB-loads
>>>                                                   [50.04%]
>>>            4986387 dTLB-misses
>>>               #    0.68% of all dTLB cache hits  [50.04%]
>>>         1013408298 dTLB-stores
>>>                                                  [50.04%]
>>>            8180817 dTLB-misses
>>>                                                  [49.97%]
>>>         1526642351 iTLB-loads
>>>                                                   [50.41%]
>>>                 56 iTLB-misses
>>>               #    0.00% of all iTLB cache hits  [50.21%]
>>>
>>>        1.025425847 seconds time elapsed
>>>
>>> Thanks,
>>> Jianguo Wu.
>>
>>
>>
>>
> 
> .
> 




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Transparent Hugepage impact on memcpy
  2013-06-07  1:26     ` Jianguo Wu
@ 2013-06-07 13:50       ` Hitoshi Mitake
  2013-06-08  1:13         ` Jianguo Wu
  0 siblings, 1 reply; 12+ messages in thread
From: Hitoshi Mitake @ 2013-06-07 13:50 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: Hitoshi Mitake, linux-mm, Andrea Arcangeli, qiuxishi, Wanpeng Li,
	Hush Bensen, mitake.hitoshi

At Fri, 7 Jun 2013 09:26:58 +0800,
Jianguo Wu wrote:
> 
> Hi Hitoshi,
> 
> Thanks for your reply! please see below.
> 
> On 2013/6/6 21:54, Hitoshi Mitake wrote:
> 
> > Hi Jianguo,
> > 
> > On Wed, Jun 5, 2013 at 12:26 PM, Jianguo Wu <wujianguo@huawei.com> wrote:
> >> Hi,
> >> One more question, I wrote a memcpy test program, mostly the same as with perf bench memcpy.
> >> But test result isn't consistent with perf bench when THP is off.
> >>
> >>         my program                              perf bench
> >> THP:    3.628368 GB/Sec (with prefault)         3.672879 GB/Sec (with prefault)
> >> NO-THP: 3.612743 GB/Sec (with prefault)         6.190187 GB/Sec (with prefault)
> >>
> >> Below is my code:
> >>         src = calloc(1, len);
> >>         dst = calloc(1, len);
> >>
> >>         if (prefault)
> >>                 memcpy(dst, src, len);
> >>         gettimeofday(&tv_start, NULL);
> >>         memcpy(dst, src, len);
> >>         gettimeofday(&tv_end, NULL);
> >>
> >>         timersub(&tv_end, &tv_start, &tv_diff);
> >>         free(src);
> >>         free(dst);
> >>
> >>         speed = (double)((double)len / timeval2double(&tv_diff));
> >>         print_bps(speed);
> >>
> >> This is weird, is it possible that perf bench do some build optimize?
> >>
> >> Thansk,
> >> Jianguo Wu.
> > 
> > perf bench mem memcpy is build with -O6. This is the compile command
> > line (you can get this with make V=1):
> > gcc -o bench/mem-memcpy-x86-64-asm.o -c -fno-omit-frame-pointer -ggdb3
> > -funwind-tables -Wall -Wextra -std=gnu99 -Werror -O6 .... # ommited
> > 
> > Can I see your compile option for your test program and the actual
> > command line executing perf bench mem memcpy?
> > 
> 
> I just compiled my test program with gcc -o memcpy-test memcpy-test.c.
> I tried to use the same compile option with perf bench mem memcpy, and
> the test result showed no difference.
> 
> My execute command line for perf bench mem memcpy:
> #./perf bench mem memcpy -l 1gb -o

Thanks for your information. I have three more requests for
reproducing the problem:

1. the entire source code of your program
2. your gcc version
3. your glibc version

I should've requested it first, sorry :(

Thanks,
Hitoshi


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Transparent Hugepage impact on memcpy
  2013-06-07 13:50       ` Hitoshi Mitake
@ 2013-06-08  1:13         ` Jianguo Wu
  0 siblings, 0 replies; 12+ messages in thread
From: Jianguo Wu @ 2013-06-08  1:13 UTC (permalink / raw)
  To: Hitoshi Mitake
  Cc: Hitoshi Mitake, linux-mm, Andrea Arcangeli, qiuxishi, Wanpeng Li,
	Hush Bensen

[-- Attachment #1: Type: text/plain, Size: 2394 bytes --]

Hi Hitoshi,

On 2013/6/7 21:50, Hitoshi Mitake wrote:

> At Fri, 7 Jun 2013 09:26:58 +0800,
> Jianguo Wu wrote:
>>
>> Hi Hitoshi,
>>
>> Thanks for your reply! please see below.
>>
>> On 2013/6/6 21:54, Hitoshi Mitake wrote:
>>
>>> Hi Jianguo,
>>>
>>> On Wed, Jun 5, 2013 at 12:26 PM, Jianguo Wu <wujianguo@huawei.com> wrote:
>>>> Hi,
>>>> One more question, I wrote a memcpy test program, mostly the same as with perf bench memcpy.
>>>> But test result isn't consistent with perf bench when THP is off.
>>>>
>>>>         my program                              perf bench
>>>> THP:    3.628368 GB/Sec (with prefault)         3.672879 GB/Sec (with prefault)
>>>> NO-THP: 3.612743 GB/Sec (with prefault)         6.190187 GB/Sec (with prefault)
>>>>
>>>> Below is my code:
>>>>         src = calloc(1, len);
>>>>         dst = calloc(1, len);
>>>>
>>>>         if (prefault)
>>>>                 memcpy(dst, src, len);
>>>>         gettimeofday(&tv_start, NULL);
>>>>         memcpy(dst, src, len);
>>>>         gettimeofday(&tv_end, NULL);
>>>>
>>>>         timersub(&tv_end, &tv_start, &tv_diff);
>>>>         free(src);
>>>>         free(dst);
>>>>
>>>>         speed = (double)((double)len / timeval2double(&tv_diff));
>>>>         print_bps(speed);
>>>>
>>>> This is weird, is it possible that perf bench do some build optimize?
>>>>
>>>> Thansk,
>>>> Jianguo Wu.
>>>
>>> perf bench mem memcpy is build with -O6. This is the compile command
>>> line (you can get this with make V=1):
>>> gcc -o bench/mem-memcpy-x86-64-asm.o -c -fno-omit-frame-pointer -ggdb3
>>> -funwind-tables -Wall -Wextra -std=gnu99 -Werror -O6 .... # ommited
>>>
>>> Can I see your compile option for your test program and the actual
>>> command line executing perf bench mem memcpy?
>>>
>>
>> I just compiled my test program with gcc -o memcpy-test memcpy-test.c.
>> I tried to use the same compile option with perf bench mem memcpy, and
>> the test result showed no difference.
>>
>> My execute command line for perf bench mem memcpy:
>> #./perf bench mem memcpy -l 1gb -o
> 
> Thanks for your information. I have three more requests for
> reproducing the problem:
> 
> 1. the entire source code of your program

Please see the attachment.

> 2. your gcc version

4.3.4

> 3. your glibc version

glibc-2.11.1-0.17.4

Thanks,
Jianguo Wu

> 
> I should've requested it first, sorry :(
> 
> Thanks,
> Hitoshi
> 
> .
> 



[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: memcpy-prefault.c --]
[-- Type: text/plain; charset="gb18030"; name="memcpy-prefault.c", Size: 2580 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>	/* isdigit() */
#include <sys/time.h>
#include <unistd.h>

#define K 1024LL
#define print_bps(x) do {					\
		if (x < K)					\
			printf(" %14lf B/Sec", x);		\
		else if (x < K * K)				\
			printf(" %14lfd KB/Sec", x / K);	\
		else if (x < K * K * K)				\
			printf(" %14lf MB/Sec", x / K / K);	\
		else						\
			printf(" %14lf GB/Sec", x / K / K / K); \
	} while (0)

long long local_atoll(const char *str)
{
	unsigned int i;
	long long length = -1, unit = 1;

	if (!isdigit(str[0]))
		goto out_err;

	for (i = 1; i < strlen(str); i++) {
		switch (str[i]) {
		case 'B':
		case 'b':
			break;
		case 'K':
			if (str[i + 1] != 'B')
				goto out_err;
			else
				goto kilo;
		case 'k':
			if (str[i + 1] != 'b')
				goto out_err;
kilo:
			unit = K;
			break;
		case 'M':
			if (str[i + 1] != 'B')
				goto out_err;
			else
				goto mega;
		case 'm':
			if (str[i + 1] != 'b')
				goto out_err;
mega:
			unit = K * K;
			break;
		case 'G':
			if (str[i + 1] != 'B')
				goto out_err;
			else
				goto giga;
		case 'g':
			if (str[i + 1] != 'b')
				goto out_err;
giga:
			unit = K * K * K;
			break;
		case 'T':
			if (str[i + 1] != 'B')
				goto out_err;
			else
				goto tera;
		case 't':
			if (str[i + 1] != 'b')
				goto out_err;
tera:
			unit = K * K * K * K;
			break;
		case '\0':	/* only specified figures */
			unit = 1;
			break;
		default:
			if (!isdigit(str[i]))
				goto out_err;
			break;
		}
	}

	length = atoll(str) * unit;
	goto out;

out_err:
	length = -1;
out:
	return length;
}

static double timeval2double(struct timeval *ts)
{
	return (double)ts->tv_sec +
			(double)ts->tv_usec / (double)1000000;
}

void do_memcpy(long long len, int prefault)
{
	void *src, *dst;
	struct timeval tv_start, tv_end, tv_diff;
	double res;

	src = calloc(1, len);
	dst = calloc(1, len);

	if (prefault)
		memcpy(dst, src, len);
	gettimeofday(&tv_start, NULL);
	memcpy(dst, src, len);
	gettimeofday(&tv_end, NULL);

	timersub(&tv_end, &tv_start, &tv_diff);
	free(src);
	free(dst);

	res = (double)((double)len / timeval2double(&tv_diff));
	print_bps(res);
	if (prefault)
		printf("\t(with prefault)");
	printf("\n");

}

int main(int argc, char *argv[])
{
	long long len = -1;
	int ch;			/* getopt() returns int, not char */
	int prefault = 1;	/* the measured copy is preceded by a prefault pass */

	while ((ch = getopt(argc, argv, "l:")) != -1) {
		switch (ch) {
		case 'l':
			len = local_atoll(optarg);
			if (len < 0) {
				printf("Invalid size\n");
				return 1;
			}
			printf("# Copying %s Byte ...\n", optarg);
			break;
		default:
			return 1;
		}
	}

	if (len < 0) {
		printf("Usage: %s -l <length>\n", argv[0]);
		return 1;
	}

	do_memcpy(len, prefault);

	return 0;
}
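
For reference, the attached program can be built and run like this (the
earlier mail in this thread compiled it with a plain "gcc -o memcpy-test
memcpy-test.c"; the file name below is simply the attachment name):

gcc -o memcpy-test memcpy-prefault.c
./memcpy-test -l 1gb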

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2013-06-08  1:13 UTC | newest]

Thread overview: 12+ messages
2013-06-04  8:57 Transparent Hugepage impact on memcpy Jianguo Wu
2013-06-04 12:30 ` Wanpeng Li
2013-06-04 12:30 ` Wanpeng Li
2013-06-04 20:20   ` Andrea Arcangeli
2013-06-05  2:49     ` Jianguo Wu
     [not found] ` <51adde12.e6b2320a.610d.ffff96f3SMTPIN_ADDED_BROKEN@mx.google.com>
2013-06-04 12:55   ` Jianguo Wu
2013-06-04 14:10 ` Hush Bensen
2013-06-05  3:26 ` Jianguo Wu
2013-06-06 13:54   ` Hitoshi Mitake
2013-06-07  1:26     ` Jianguo Wu
2013-06-07 13:50       ` Hitoshi Mitake
2013-06-08  1:13         ` Jianguo Wu
