From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx119.postini.com [74.125.245.119]) by kanga.kvack.org (Postfix) with SMTP id 9041F6B0072 for ; Tue, 4 Jun 2013 04:58:22 -0400 (EDT)
Message-ID: <51ADAC15.1050103@huawei.com>
Date: Tue, 4 Jun 2013 16:57:57 +0800
From: Jianguo Wu
MIME-Version: 1.0
Subject: Transparent Hugepage impact on memcpy
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: Andrea Arcangeli , qiuxishi

Hi all,

I tested memcpy with perf bench and found that, in the prefault case, memcpy performs worse when Transparent Hugepage (THP) is on.

With THP on the result is 3.672879 GB/Sec (with prefault), while with THP off it is 6.190187 GB/Sec (with prefault).

I expected THP to improve performance, but the test result obviously shows otherwise.
Andrea mentioned that THP makes "clear_page/copy_page less cache friendly" in
http://events.linuxfoundation.org/slides/2011/lfcs/lfcs2011_hpc_arcangeli.pdf.

I do not quite understand this. Could you please give me some comments? Thanks!

I tested on Linux-3.4-stable, and my machine info is:
Intel(R) Xeon(R) CPU E5520 @ 2.27GHz

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 24567 MB
node 0 free: 23550 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 24576 MB
node 1 free: 23767 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

Below is the test result:
---with THP---
#cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
#./perf bench mem memcpy -l 1gb -o
# Running mem/memcpy benchmark...
# Copying 1gb Bytes ...

       3.672879 GB/Sec (with prefault)

#./perf stat ...
 Performance counter stats for './perf bench mem memcpy -l 1gb -o':

          35455940 cache-misses              #   53.504 % of all cache refs     [49.45%]
          66267785 cache-references                                             [49.78%]
              2409 page-faults
         450768651 dTLB-loads                                                   [50.78%]
             24580 dTLB-misses               #    0.01% of all dTLB cache hits  [51.01%]
        1338974202 dTLB-stores                                                  [50.63%]
             77943 dTLB-misses                                                  [50.24%]
         697404997 iTLB-loads                                                   [49.77%]
               274 iTLB-misses               #    0.00% of all iTLB cache hits  [49.30%]

       0.855041819 seconds time elapsed

---no THP---
#cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

#./perf bench mem memcpy -l 1gb -o
# Running mem/memcpy benchmark...
# Copying 1gb Bytes ...

       6.190187 GB/Sec (with prefault)

#./perf stat ...
 Performance counter stats for './perf bench mem memcpy -l 1gb -o':

          16920763 cache-misses              #   98.377 % of all cache refs     [50.01%]
          17200000 cache-references                                             [50.04%]
            524652 page-faults
         734365659 dTLB-loads                                                   [50.04%]
           4986387 dTLB-misses               #    0.68% of all dTLB cache hits  [50.04%]
        1013408298 dTLB-stores                                                  [50.04%]
           8180817 dTLB-misses                                                  [49.97%]
        1526642351 iTLB-loads                                                   [50.41%]
                56 iTLB-misses               #    0.00% of all iTLB cache hits  [50.21%]

       1.025425847 seconds time elapsed

Thanks,
Jianguo Wu.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx109.postini.com [74.125.245.109]) by kanga.kvack.org (Postfix) with SMTP id D5E156B0093 for ; Tue, 4 Jun 2013 08:31:03 -0400 (EDT)
Received: from /spool/local by e28smtp03.in.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only!
Violators will be prosecuted for from ; Tue, 4 Jun 2013 17:55:30 +0530
Received: from d28relay05.in.ibm.com (d28relay05.in.ibm.com [9.184.220.62]) by d28dlp03.in.ibm.com (Postfix) with ESMTP id 2827C125804F for ; Tue, 4 Jun 2013 18:03:02 +0530 (IST)
Received: from d28av03.in.ibm.com (d28av03.in.ibm.com [9.184.220.65]) by d28relay05.in.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r54CUnZ856361008 for ; Tue, 4 Jun 2013 18:00:49 +0530
Received: from d28av03.in.ibm.com (loopback [127.0.0.1]) by d28av03.in.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r54CUrJF018270 for ; Tue, 4 Jun 2013 22:30:53 +1000
Date: Tue, 4 Jun 2013 20:30:51 +0800
From: Wanpeng Li
Subject: Re: Transparent Hugepage impact on memcpy
Message-ID: <20130604123050.GA32707@hacker.(null)>
Reply-To: Wanpeng Li
References: <51ADAC15.1050103@huawei.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="NzB8fVQJ5HfG6fxh"
Content-Disposition: inline
In-Reply-To: <51ADAC15.1050103@huawei.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jianguo Wu
Cc: linux-mm@kvack.org, Andrea Arcangeli , qiuxishi

--NzB8fVQJ5HfG6fxh
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Tue, Jun 04, 2013 at 04:57:57PM +0800, Jianguo Wu wrote:
>Hi all,
>
>I tested memcpy with perf bench, and found that in prefault case, When Transparent Hugepage is on,
>memcpy has worse performance.
>
>When THP on is 3.672879 GB/Sec (with prefault), while THP off is 6.190187 GB/Sec (with prefault).
>

I get a similar result against 3.10-rc4; see the attachment. This is due to
the fact that THP takes a single page fault for each 2MB virtual region
touched by userland.

>I think THP will improve performance, but the test result obviously not the case.
>Andrea mentioned THP cause "clear_page/copy_page less cache friendly" in
>http://events.linuxfoundation.org/slides/2011/lfcs/lfcs2011_hpc_arcangeli.pdf.
>
>I am not quite understand this, could you please give me some comments, Thanks!
--NzB8fVQJ5HfG6fxh
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=thp

---with THP---
#cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
# Running mem/memcpy benchmark...
# Copying 1gb Bytes ...

      12.208522 GB/Sec (with prefault)

 Performance counter stats for './perf bench mem memcpy -l 1gb -o':

        26,453,696 cache-misses              #   35.411 % of all cache refs     [57.66%]
        74,704,531 cache-references                                             [58.40%]
             2,297 page-faults
       146,567,960 dTLB-loads                                                   [58.64%]
       211,648,685 dTLB-stores                                                  [58.63%]
            14,533 dTLB-load-misses          #    0.01% of all dTLB cache hits  [57.46%]
               640 iTLB-loads                                                   [55.74%]
           270,881 iTLB-load-misses          # 42325.16% of all iTLB cache hits [55.17%]

       0.232425109 seconds time elapsed

---no THP---
#cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
# Running mem/memcpy benchmark...
# Copying 1gb Bytes ...

      18.325087 GB/Sec (with prefault)

 Performance counter stats for './perf bench mem memcpy -l 1gb -o':

        28,498,544 cache-misses              #   86.167 % of all cache refs     [57.35%]
        33,073,611 cache-references                                             [57.71%]
           524,540 page-faults
       453,500,641 dTLB-loads                                                   [57.99%]
       409,255,606 dTLB-stores                                                  [57.99%]
         2,033,985 dTLB-load-misses          #    0.45% of all dTLB cache hits  [57.52%]
             1,180 iTLB-loads                                                   [56.69%]
           539,056 iTLB-load-misses          # 45682.71% of all iTLB cache hits [56.02%]

       0.485932214 seconds time elapsed

--NzB8fVQJ5HfG6fxh--
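Each run above records the THP setting by printing /sys/kernel/mm/transparent_hugepage/enabled, where the bracketed word is the active mode. When collecting many results it can help to log that mode programmatically. The following is a minimal sketch, not part of perf: the function names thp_active_mode and thp_read_active_mode are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Copy the bracketed token of `line` (e.g. "always" from
 * "[always] madvise never") into `out`.
 * Returns 0 on success, -1 if there is no bracketed token or it does
 * not fit into `out`. Hypothetical helper, for illustration only. */
int thp_active_mode(const char *line, char *out, size_t outlen)
{
    assert(line != NULL && out != NULL);

    const char *open = strchr(line, '[');
    if (open == NULL)
        return -1;
    const char *close = strchr(open + 1, ']');
    if (close == NULL)
        return -1;

    size_t n = (size_t)(close - open - 1);
    if (n + 1 > outlen)
        return -1;
    memcpy(out, open + 1, n);
    out[n] = '\0';
    return 0;
}

/* Read the sysfs file shown in the thread and extract the active mode.
 * Returns -1 if the file cannot be read (e.g. THP not available). */
int thp_read_active_mode(char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    return ok != NULL ? thp_active_mode(line, out, outlen) : -1;
}
```

A benchmark wrapper could call thp_read_active_mode() once before the run and print the result next to the GB/Sec figure, so that THP and no-THP numbers are not mixed up later.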
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx201.postini.com [74.125.245.201]) by kanga.kvack.org (Postfix) with SMTP id 457256B003A for ; Tue, 4 Jun 2013 08:55:32 -0400 (EDT)
Message-ID: <51ADE3B7.1070303@huawei.com>
Date: Tue, 4 Jun 2013 20:55:19 +0800
From: Jianguo Wu
MIME-Version: 1.0
Subject: Re: Transparent Hugepage impact on memcpy
References: <51ADAC15.1050103@huawei.com> <51adde12.e6b2320a.610d.ffff96f3SMTPIN_ADDED_BROKEN@mx.google.com>
In-Reply-To: <51adde12.e6b2320a.610d.ffff96f3SMTPIN_ADDED_BROKEN@mx.google.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Wanpeng Li
Cc: linux-mm@kvack.org, Andrea Arcangeli , qiuxishi

On 2013/6/4 20:30, Wanpeng Li wrote:
> On Tue, Jun 04, 2013 at 04:57:57PM +0800, Jianguo Wu wrote:
>> Hi all,
>>
>> I tested memcpy with perf bench, and found that in prefault case, When Transparent Hugepage is on,
>> memcpy has worse performance.
>>
>> When THP on is 3.672879 GB/Sec (with prefault), while THP off is 6.190187 GB/Sec (with prefault).
>>
>
> I get similar result as you against 3.10-rc4 in the attachment. This
> dues to the characteristic of thp takes a single page fault for each
> 2MB virtual region touched by userland.
>

Hi Wanpeng,

Thanks for your reply :).

This test runs with prefault, so page fault time should not be counted. Besides, I would think that fewer page faults improve memcpy performance, right?

The perf stat results show far fewer cache-references and cache-misses when THP is off. Do you have any idea why?

Thanks,
Jianguo Wu.

>> I think THP will improve performance, but the test result obviously not the case.
>> Andrea mentioned THP cause "clear_page/copy_page less cache friendly" in
>> http://events.linuxfoundation.org/slides/2011/lfcs/lfcs2011_hpc_arcangeli.pdf.
>>
>> I am not quite understand this, could you please give me some comments, Thanks!
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx170.postini.com [74.125.245.170]) by kanga.kvack.org (Postfix) with SMTP id B89DC6B0032 for ; Tue, 4 Jun 2013 10:12:15 -0400 (EDT)
Received: by mail-pb0-f48.google.com with SMTP id md4so297611pbc.7 for ; Tue, 04 Jun 2013 07:12:15 -0700 (PDT)
Message-ID: <51ADF56B.2060407@gmail.com>
Date: Tue, 04 Jun 2013 22:10:51 +0800
From: Hush Bensen
MIME-Version: 1.0
Subject: Re: Transparent Hugepage impact on memcpy
References: <51ADAC15.1050103@huawei.com>
In-Reply-To: <51ADAC15.1050103@huawei.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jianguo Wu , linux-mm@kvack.org, "Kirill A. Shutemov" , Hugh Dickins , Dave Hansen , Andi Kleen
Cc: Andrea Arcangeli , qiuxishi

Cc thp guys.

On 2013/6/4 16:57, Jianguo Wu wrote:
> Hi all,
>
> I tested memcpy with perf bench, and found that in prefault case, When Transparent Hugepage is on,
> memcpy has worse performance.
>
> When THP on is 3.672879 GB/Sec (with prefault), while THP off is 6.190187 GB/Sec (with prefault).
>
> I think THP will improve performance, but the test result obviously not the case.
> Andrea mentioned THP cause "clear_page/copy_page less cache friendly" in
> http://events.linuxfoundation.org/slides/2011/lfcs/lfcs2011_hpc_arcangeli.pdf.
>
> I am not quite understand this, could you please give me some comments, Thanks!
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx154.postini.com [74.125.245.154]) by kanga.kvack.org (Postfix) with SMTP id BB81A6B0032 for ; Tue, 4 Jun 2013 16:20:24 -0400 (EDT)
Date: Tue, 4 Jun 2013 22:20:17 +0200
From: Andrea Arcangeli
Subject: Re: Transparent Hugepage impact on memcpy
Message-ID: <20130604202017.GJ3463@redhat.com>
References: <51ADAC15.1050103@huawei.com> <20130604123050.GA32707@hacker.(null)>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130604123050.GA32707@hacker.(null)>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Wanpeng Li
Cc: Jianguo Wu , linux-mm@kvack.org, qiuxishi

Hello everyone,

On Tue, Jun 04, 2013 at 08:30:51PM +0800, Wanpeng Li wrote:
> On Tue, Jun 04, 2013 at 04:57:57PM +0800, Jianguo Wu wrote:
> >Hi all,
> >
> >I tested memcpy with perf bench, and found that in prefault case, When Transparent Hugepage is on,
> >memcpy has worse performance.
> >
> >When THP on is 3.672879 GB/Sec (with prefault), while THP off is 6.190187 GB/Sec (with prefault).
> >
>
> I get similar result as you against 3.10-rc4 in the attachment. This
> dues to the characteristic of thp takes a single page fault for each
> 2MB virtual region touched by userland.

I had a look at what prefault does, and page faults should not be
involved in the measurement of GB/sec. The "stats" also include the
page faults, but the page faults are not part of the printed GB/sec if
"-o" is used.

If the perf test is correct, it looks more like a hardware issue with
memcpy and large TLBs than a software one. memset doesn't exhibit it;
if this were something fundamental, memset should also exhibit it. It
should be possible to reproduce this with hugetlbfs, in fact... if you
want to be 100% sure it's not software, you should try that.

Chances are there's enough prefetching going on in the CPU to
optimize for those 4k TLB loads in streaming copies, and the
pagetables are also cached very nicely with streaming copies. Maybe
large TLBs somewhere are less optimized for streaming copies. Only
something smarter happening in the CPU, optimized for 4k and not yet
for 2M TLBs, can explain this: if the CPU were equally intelligent it
should definitely be faster with THP on, even with "-o".

Overall I doubt there's anything in software to fix here.

Also note, this is not related to the additional cache usage during page
faults that I mentioned in the pdf. Page faults, and cache effects in
the page faults, are completely removed from the equation because of
"-o". The prefault pass eliminates the page faults and trashes
all the cache (regardless of whether the page fault uses non-temporal
stores or not) before the "measured" memcpy load starts.

I don't think this is a major concern. As a rule of thumb, you just
need to prefix the "perf" command with "time" to see it: the THP
version still completes much faster, even though the prefault part of it
is slightly slower with THP on.

THP pays off the most during computations that access memory randomly
rather than sequentially.

Thanks,
Andrea
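For the hugetlbfs cross-check suggested above, one possible approach (an assumption on my side, the thread does not say how the hugetlb run was set up) is to take the benchmark buffers from the hugetlb pool with an anonymous mmap() and the MAP_HUGETLB flag, which is independent of the THP setting. The names bench_alloc and bench_free below are hypothetical, and a 2MB huge page size is assumed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map `len` bytes of zeroed anonymous memory.
 * With use_hugetlb != 0 the mapping is requested from the hugetlb pool
 * (MAP_HUGETLB); `len` should then be a multiple of the huge page size
 * (2MB is assumed here). Returns NULL on failure, for example when no
 * huge pages have been reserved. Hypothetical helper. */
void *bench_alloc(size_t len, int use_hugetlb)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;

    assert(len > 0);
    if (use_hugetlb)
        flags |= MAP_HUGETLB;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Release a buffer obtained from bench_alloc(). */
void bench_free(void *p, size_t len)
{
    if (p != NULL)
        munmap(p, len);
}
```

The hugetlb variant only succeeds if huge pages were reserved beforehand (for instance through /proc/sys/vm/nr_hugepages). Running the same prefaulted copy once on buffers from bench_alloc(len, 1) and once on bench_alloc(len, 0) with THP set to "never" gives a comparison in which THP itself is not involved.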
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx199.postini.com [74.125.245.199]) by kanga.kvack.org (Postfix) with SMTP id 929426B0031 for ; Tue, 4 Jun 2013 22:50:07 -0400 (EDT)
Message-ID: <51AEA72B.5070707@huawei.com>
Date: Wed, 5 Jun 2013 10:49:15 +0800
From: Jianguo Wu
MIME-Version: 1.0
Subject: Re: Transparent Hugepage impact on memcpy
References: <51ADAC15.1050103@huawei.com> <20130604123050.GA32707@hacker.(null)> <20130604202017.GJ3463@redhat.com>
In-Reply-To: <20130604202017.GJ3463@redhat.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Andrea Arcangeli
Cc: Wanpeng Li , linux-mm@kvack.org, qiuxishi , Hush Bensen

Hi Andrea,

Thanks for your patient explanation :). Please see below.

On 2013/6/5 4:20, Andrea Arcangeli wrote:
> Hello everyone,
>
> On Tue, Jun 04, 2013 at 08:30:51PM +0800, Wanpeng Li wrote:
>> On Tue, Jun 04, 2013 at 04:57:57PM +0800, Jianguo Wu wrote:
>>> Hi all,
>>>
>>> I tested memcpy with perf bench, and found that in prefault case, When Transparent Hugepage is on,
>>> memcpy has worse performance.
>>>
>>> When THP on is 3.672879 GB/Sec (with prefault), while THP off is 6.190187 GB/Sec (with prefault).
>>>
>>
>> I get similar result as you against 3.10-rc4 in the attachment. This
>> dues to the characteristic of thp takes a single page fault for each
>> 2MB virtual region touched by userland.
>
> I had a look at what prefault does and page faults should not be
> involved in the measurement of GB/sec. The "stats" also include the
> page faults but the page fault is not part of the printed GB/sec, if
> "-o" is used.

Agreed.

>
> If the perf test is correct, it looks more an hardware issue with
> memcpy and large TLBs than a software one. memset doesn't exhibit it,
> if this was something fundamental memset should also exhibit it. It

Yes, I tested memset with perf bench; it is a little faster with THP:
THP:    6.458863 GB/Sec (with prefault)
NO-THP: 6.393698 GB/Sec (with prefault)

> shall be possible to reproduce this with hugetlbfs in fact... if you
> want to be 100% sure it's not software, you should try that.
>

Yes, I got the following result:
hugetlb:    2.518822 GB/Sec (with prefault)
no-hugetlb: 3.688322 GB/Sec (with prefault)

> Chances are there's enough pre-fetching going on in the CPU to
> optimize for those 4k tlb loads in streaming copies, and the
> pagetables are also cached very nicely with streaming copies. Maybe
> large TLBs somewhere are less optimized for streaming copies. Only
> something smarter happening in the CPU optimized for 4k and not yet
> for 2M TLBs can explain this: if the CPU was equally intelligent it
> should definitely be faster with THP on even with "-o".
>
> Overall I doubt there's anything in software to fix here.
>
> Also note, this is not related to additional cache usage during page
> faults that I mentioned in the pdf. Page faults or cache effects in
> the page faults are completely removed from the equation because of
> "-o". The prefault pass, eliminates the page faults and trashes away
> all the cache (regardless if the page fault uses non-temporal stores
> or not) before the "measured" memcpy load starts.
>

The perf stat results show far fewer cache-references and cache-misses when THP is off. How can this be explained?

          cache-misses    cache-references
THP:      35455940        66267785
NO-THP:   16920763        17200000

> I don't think this is a major concern, as a proof of thumb you just
> need to prefix the "perf" command with "time" to see it: the THP

I tested with "time ./perf bench mem memcpy -l 1gb -o", and the result is consistent with what you expected.

THP:
       3.629896 GB/Sec (with prefault)

real    0m0.849s
user    0m0.472s
sys     0m0.372s

NO-THP:
       6.169184 GB/Sec (with prefault)

real    0m1.013s
user    0m0.412s
sys     0m0.596s

> version still completes much faster despite the prefault part of it
> is slightly slower with THP on.
>

Why is the prefault part slower with THP on?
perf bench shows that without prefault, the run with THP on is much faster:

# ./perf bench mem memcpy -l 1gb -n
THP:    1.759009 GB/Sec
NO-THP: 1.291761 GB/Sec

Thanks again for your explanation.

Jianguo Wu.

> THP pays off the most during computations that are accessing randomly,
> and not sequentially.
>
> Thanks,
> Andrea
>
> .
>

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx129.postini.com [74.125.245.129]) by kanga.kvack.org (Postfix) with SMTP id 1BC126B0031 for ; Tue, 4 Jun 2013 23:26:48 -0400 (EDT)
Message-ID: <51AEAFD8.305@huawei.com>
Date: Wed, 5 Jun 2013 11:26:16 +0800
From: Jianguo Wu
MIME-Version: 1.0
Subject: Re: Transparent Hugepage impact on memcpy
References: <51ADAC15.1050103@huawei.com>
In-Reply-To: <51ADAC15.1050103@huawei.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: linux-mm@kvack.org
Cc: Andrea Arcangeli , qiuxishi , Wanpeng Li , Hush Bensen , mitake@dcl.info.waseda.ac.jp

Hi,

One more question: I wrote a memcpy test program that is mostly the same as perf bench memcpy,
but its result isn't consistent with perf bench when THP is off.

          my program                          perf bench
THP:      3.628368 GB/Sec (with prefault)     3.672879 GB/Sec (with prefault)
NO-THP:   3.612743 GB/Sec (with prefault)     6.190187 GB/Sec (with prefault)

Below is my code:
	src = calloc(1, len);
	dst = calloc(1, len);

	if (prefault)
		memcpy(dst, src, len);
	gettimeofday(&tv_start, NULL);
	memcpy(dst, src, len);
	gettimeofday(&tv_end, NULL);

	timersub(&tv_end, &tv_start, &tv_diff);
	free(src);
	free(dst);

	speed = (double)((double)len / timeval2double(&tv_diff));
	print_bps(speed);

This is weird. Is it possible that perf bench is built with some optimization?

Thanks,
Jianguo Wu.
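For anyone who wants to repeat the comparison, the fragment posted above can be completed roughly as follows. This is only a sketch under stated assumptions: the bodies of timeval2double() and print_bps() were not shown in the thread, so the versions here are guesses (GB is taken as 2^30 bytes), measure_memcpy() is a hypothetical wrapper name, and the volatile function pointer is an addition that is not in the posted code. It is there so that an optimizing build cannot drop or merge the two copies.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* Assumed helper: convert a struct timeval to seconds. */
double timeval2double(const struct timeval *tv)
{
    return (double)tv->tv_sec + (double)tv->tv_usec / 1000000.0;
}

/* Assumed helper: print bytes/sec as GB/Sec, with 1 GB = 2^30 bytes. */
void print_bps(double bytes_per_sec)
{
    printf("%f GB/Sec\n", bytes_per_sec / (1024.0 * 1024.0 * 1024.0));
}

/* Time one memcpy of `len` bytes, optionally after an untimed prefault
 * copy. Returns bytes per second, or -1.0 if allocation fails. */
double measure_memcpy(size_t len, int prefault)
{
    /* Calling through a volatile pointer keeps the compiler from
     * eliding the copies into buffers that are never read back. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;
    struct timeval tv_start, tv_end, tv_diff;
    char *src, *dst;

    assert(len > 0);
    src = calloc(1, len);
    dst = calloc(1, len);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    if (prefault)
        copy(dst, src, len);    /* fault the pages in before timing */

    gettimeofday(&tv_start, NULL);
    copy(dst, src, len);        /* the measured pass */
    gettimeofday(&tv_end, NULL);
    timersub(&tv_end, &tv_start, &tv_diff);

    free(src);
    free(dst);
    return (double)len / timeval2double(&tv_diff);
}
```

A call such as print_bps(measure_memcpy((size_t)1 << 30, 1)) would correspond to the "1gb, with prefault" case. Since the copy routine and the build flags may differ from those of perf bench, it seems worth recording both alongside any numbers.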
From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx181.postini.com [74.125.245.181]) by kanga.kvack.org (Postfix) with SMTP id 6DCEF6B0031 for ; Thu, 6 Jun 2013 09:54:11 -0400 (EDT)
Received: by mail-la0-f54.google.com with SMTP id ec20so2607849lab.13 for ; Thu, 06 Jun 2013 06:54:09 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <51AEAFD8.305@huawei.com>
References: <51ADAC15.1050103@huawei.com> <51AEAFD8.305@huawei.com>
Date: Thu, 6 Jun 2013 22:54:09 +0900
Message-ID:
Subject: Re: Transparent Hugepage impact on memcpy
From: Hitoshi Mitake
Content-Type: text/plain; charset=ISO-8859-1
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jianguo Wu
Cc: linux-mm@kvack.org, Andrea Arcangeli , qiuxishi , Wanpeng Li , Hush Bensen , mitake.hitoshi@gmail.com

Hi Jianguo,

On Wed, Jun 5, 2013 at 12:26 PM, Jianguo Wu wrote:
> Hi,
> One more question, I wrote a memcpy test program, mostly the same as with perf bench memcpy.
> But test result isn't consistent with perf bench when THP is off.
>
>           my program                          perf bench
> THP:      3.628368 GB/Sec (with prefault)     3.672879 GB/Sec (with prefault)
> NO-THP:   3.612743 GB/Sec (with prefault)     6.190187 GB/Sec (with prefault)
>
> Below is my code:
>         src = calloc(1, len);
>         dst = calloc(1, len);
>
>         if (prefault)
>                 memcpy(dst, src, len);
>         gettimeofday(&tv_start, NULL);
>         memcpy(dst, src, len);
>         gettimeofday(&tv_end, NULL);
>
>         timersub(&tv_end, &tv_start, &tv_diff);
>         free(src);
>         free(dst);
>
>         speed = (double)((double)len / timeval2double(&tv_diff));
>         print_bps(speed);
>
> This is weird, is it possible that perf bench do some build optimize?
>
> Thansk,
> Jianguo Wu.

perf bench mem memcpy is built with -O6. This is the compile command
line (you can get it with make V=1):
gcc -o bench/mem-memcpy-x86-64-asm.o -c -fno-omit-frame-pointer -ggdb3
-funwind-tables -Wall -Wextra -std=gnu99 -Werror -O6 .... # omitted

Could I see the compile options for your test program and the actual
command line you used to run perf bench mem memcpy?

Thanks,
Hitoshi
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx119.postini.com [74.125.245.119]) by kanga.kvack.org (Postfix) with SMTP id 5BEF36B0032 for ; Thu, 6 Jun 2013 21:27:24 -0400 (EDT)
Message-ID: <51B136E2.4010606@huawei.com>
Date: Fri, 7 Jun 2013 09:26:58 +0800
From: Jianguo Wu
MIME-Version: 1.0
Subject: Re: Transparent Hugepage impact on memcpy
References: <51ADAC15.1050103@huawei.com> <51AEAFD8.305@huawei.com>
In-Reply-To:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
List-ID:
To: Hitoshi Mitake
Cc: linux-mm@kvack.org, Andrea Arcangeli , qiuxishi , Wanpeng Li , Hush Bensen , mitake.hitoshi@gmail.com

Hi Hitoshi,

Thanks for your reply! Please see below.

On 2013/6/6 21:54, Hitoshi Mitake wrote:

> perf bench mem memcpy is built with -O6. This is the compile command
> line (you can get this with make V=1):
> gcc -o bench/mem-memcpy-x86-64-asm.o -c -fno-omit-frame-pointer -ggdb3
> -funwind-tables -Wall -Wextra -std=gnu99 -Werror -O6 .... # omitted
>
> Can I see your compile option for your test program and the actual
> command line executing perf bench mem memcpy?
>

I just compiled my test program with gcc -o memcpy-test memcpy-test.c.
I also tried the same compile options as perf bench mem memcpy, and
the test result showed no difference.

The command line I used for perf bench mem memcpy:
#./perf bench mem memcpy -l 1gb -o

Thanks,
Jianguo Wu

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx168.postini.com [74.125.245.168]) by kanga.kvack.org (Postfix) with SMTP id 98B796B0032 for ; Fri, 7 Jun 2013 09:50:19 -0400 (EDT)
Received: by mail-pd0-f182.google.com with SMTP id g10so4810453pdj.41 for ; Fri, 07 Jun 2013 06:50:18 -0700 (PDT)
Date: Fri, 07 Jun 2013 22:50:09 +0900
Message-ID: <87txlado8e.wl%mitake.hitoshi@gmail.com>
From: Hitoshi Mitake
Subject: Re: Transparent Hugepage impact on memcpy
In-Reply-To: <51B136E2.4010606@huawei.com>
References: <51ADAC15.1050103@huawei.com> <51AEAFD8.305@huawei.com> <51B136E2.4010606@huawei.com>
MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka")
Content-Type: text/plain; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jianguo Wu
Cc: Hitoshi Mitake , linux-mm@kvack.org, Andrea Arcangeli , qiuxishi , Wanpeng Li , Hush Bensen , mitake.hitoshi@gmail.com

At Fri, 7 Jun 2013 09:26:58 +0800,
Jianguo Wu wrote:
>
> I just compiled my test program with gcc -o memcpy-test memcpy-test.c.
> I tried to use the same compile option with perf bench mem memcpy, and
> the test result showed no difference.
>
> My execute command line for perf bench mem memcpy:
> #./perf bench mem memcpy -l 1gb -o

Thanks for your information. I have three more requests for
reproducing the problem:

1. the entire source code of your program
2. your gcc version
3. your glibc version

I should've requested it first, sorry :(

Thanks,
Hitoshi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from psmtp.com (na3sys010amx192.postini.com [74.125.245.192]) by kanga.kvack.org (Postfix) with SMTP id 6E9DC6B0031 for ; Fri, 7 Jun 2013 21:13:51 -0400 (EDT)
Message-ID: <51B28531.2050403@huawei.com>
Date: Sat, 8 Jun 2013 09:13:21 +0800
From: Jianguo Wu
MIME-Version: 1.0
Subject: Re: Transparent Hugepage impact on memcpy
References: <51ADAC15.1050103@huawei.com> <51AEAFD8.305@huawei.com> <51B136E2.4010606@huawei.com> <87txlado8e.wl%mitake.hitoshi@gmail.com>
In-Reply-To: <87txlado8e.wl%mitake.hitoshi@gmail.com>
Content-Type: multipart/mixed; boundary="------------040503040505000904030603"
Sender: owner-linux-mm@kvack.org
List-ID:
To: Hitoshi Mitake
Cc: Hitoshi Mitake , linux-mm@kvack.org, Andrea Arcangeli , qiuxishi , Wanpeng Li , Hush Bensen

--------------040503040505000904030603
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit

Hi Hitoshi,

On 2013/6/7 21:50, Hitoshi Mitake wrote:

> Thanks for your information. I have three more requests for
> reproducing the problem:
>
> 1. the entire source code of your program

Please see the attachment.

> 2. your gcc version

4.3.4

> 3. your glibc version

glibc-2.11.1-0.17.4

Thanks,
Jianguo Wu

>
> I should've requested it first, sorry :(
>
> Thanks,
> Hitoshi
>
> .
> --------------040503040505000904030603 Content-Type: text/plain; charset="gb18030"; name="memcpy-prefault.c" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="memcpy-prefault.c" I2luY2x1ZGUgPHN0ZGlvLmg+CiNpbmNsdWRlIDxzdGRsaWIuaD4KI2luY2x1ZGUgPHN0cmlu Zy5oPgojaW5jbHVkZSA8c3lzL3RpbWUuaD4KI2luY2x1ZGUgPHVuaXN0ZC5oPgoKI2RlZmlu ZSBLIDEwMjRMTAojZGVmaW5lIHByaW50X2Jwcyh4KSBkbyB7CQkJCQlcCgkJaWYgKHggPCBL KQkJCQkJXAoJCQlwcmludGYoIiAlMTRsZiBCL1NlYyIsIHgpOwkJXAoJCWVsc2UgaWYgKHgg PCBLICogSykJCQkJXAoJCQlwcmludGYoIiAlMTRsZmQgS0IvU2VjIiwgeCAvIEspOwlcCgkJ ZWxzZSBpZiAoeCA8IEsgKiBLICogSykJCQkJXAoJCQlwcmludGYoIiAlMTRsZiBNQi9TZWMi LCB4IC8gSyAvIEspOwlcCgkJZWxzZQkJCQkJCVwKCQkJcHJpbnRmKCIgJTE0bGYgR0IvU2Vj IiwgeCAvIEsgLyBLIC8gSyk7IFwKCX0gd2hpbGUgKDApCgpsb25nIGxvbmcgbG9jYWxfYXRv bGwoY29uc3QgY2hhciAqc3RyKQp7Cgl1bnNpZ25lZCBpbnQgaTsKCWxvbmcgbG9uZyBsZW5n dGggPSAtMSwgdW5pdCA9IDE7CgoJaWYgKCFpc2RpZ2l0KHN0clswXSkpCgkJZ290byBvdXRf ZXJyOwoKCWZvciAoaSA9IDE7IGkgPCBzdHJsZW4oc3RyKTsgaSsrKSB7CgkJc3dpdGNoIChz dHJbaV0pIHsKCQljYXNlICdCJzoKCQljYXNlICdiJzoKCQkJYnJlYWs7CgkJY2FzZSAnSyc6 CgkJCWlmIChzdHJbaSArIDFdICE9ICdCJykKCQkJCWdvdG8gb3V0X2VycjsKCQkJZWxzZQoJ CQkJZ290byBraWxvOwoJCWNhc2UgJ2snOgoJCQlpZiAoc3RyW2kgKyAxXSAhPSAnYicpCgkJ CQlnb3RvIG91dF9lcnI7CmtpbG86CgkJCXVuaXQgPSBLOwoJCQlicmVhazsKCQljYXNlICdN JzoKCQkJaWYgKHN0cltpICsgMV0gIT0gJ0InKQoJCQkJZ290byBvdXRfZXJyOwoJCQllbHNl CgkJCQlnb3RvIG1lZ2E7CgkJY2FzZSAnbSc6CgkJCWlmIChzdHJbaSArIDFdICE9ICdiJykK CQkJCWdvdG8gb3V0X2VycjsKbWVnYToKCQkJdW5pdCA9IEsgKiBLOwoJCQlicmVhazsKCQlj YXNlICdHJzoKCQkJaWYgKHN0cltpICsgMV0gIT0gJ0InKQoJCQkJZ290byBvdXRfZXJyOwoJ CQllbHNlCgkJCQlnb3RvIGdpZ2E7CgkJY2FzZSAnZyc6CgkJCWlmIChzdHJbaSArIDFdICE9 ICdiJykKCQkJCWdvdG8gb3V0X2VycjsKZ2lnYToKCQkJdW5pdCA9IEsgKiBLICogSzsKCQkJ YnJlYWs7CgkJY2FzZSAnVCc6CgkJCWlmIChzdHJbaSArIDFdICE9ICdCJykKCQkJCWdvdG8g b3V0X2VycjsKCQkJZWxzZQoJCQkJZ290byB0ZXJhOwoJCWNhc2UgJ3QnOgoJCQlpZiAoc3Ry W2kgKyAxXSAhPSAnYicpCgkJCQlnb3RvIG91dF9lcnI7CnRlcmE6CgkJCXVuaXQgPSBLICog 
SyAqIEsgKiBLOwoJCQlicmVhazsKCQljYXNlICdcMCc6CS8qIG9ubHkgc3BlY2lmaWVkIGZp Z3VyZXMgKi8KCQkJdW5pdCA9IDE7CgkJCWJyZWFrOwoJCWRlZmF1bHQ6CgkJCWlmICghaXNk aWdpdChzdHJbaV0pKQoJCQkJZ290byBvdXRfZXJyOwoJCQlicmVhazsKCQl9Cgl9CgoJbGVu Z3RoID0gYXRvbGwoc3RyKSAqIHVuaXQ7Cglnb3RvIG91dDsKCm91dF9lcnI6CglsZW5ndGgg PSAtMTsKb3V0OgoJcmV0dXJuIGxlbmd0aDsKfQoKc3RhdGljIGRvdWJsZSB0aW1ldmFsMmRv dWJsZShzdHJ1Y3QgdGltZXZhbCAqdHMpCnsKCXJldHVybiAoZG91YmxlKXRzLT50dl9zZWMg KwoJCQkoZG91YmxlKXRzLT50dl91c2VjIC8gKGRvdWJsZSkxMDAwMDAwOwp9Cgp2b2lkIGRv X21lbWNweShsb25nIGxvbmcgbGVuLCBpbnQgcHJlZmF1bHQpCnsKCXZvaWQgKnNyYywgKmRz dDsKCXN0cnVjdCB0aW1ldmFsIHR2X3N0YXJ0LCB0dl9lbmQsIHR2X2RpZmY7Cglkb3VibGUg cmVzOwoKCXNyYyA9IGNhbGxvYygxLCBsZW4pOwoJZHN0ID0gY2FsbG9jKDEsIGxlbik7CgoJ aWYgKHByZWZhdWx0KQoJCW1lbWNweShkc3QsIHNyYywgbGVuKTsKCWdldHRpbWVvZmRheSgm dHZfc3RhcnQsIE5VTEwpOwoJbWVtY3B5KGRzdCwgc3JjLCBsZW4pOwoJZ2V0dGltZW9mZGF5 KCZ0dl9lbmQsIE5VTEwpOwoKCXRpbWVyc3ViKCZ0dl9lbmQsICZ0dl9zdGFydCwgJnR2X2Rp ZmYpOwoJZnJlZShzcmMpOwoJZnJlZShkc3QpOwoKCXJlcyA9IChkb3VibGUpKChkb3VibGUp bGVuIC8gdGltZXZhbDJkb3VibGUoJnR2X2RpZmYpKTsKCXByaW50X2JwcyhyZXMpOwoJaWYg KHByZWZhdWx0KQoJCXByaW50ZigiXHQod2l0aCBwcmVmYXVsdCkiKTsKCXByaW50ZigiXG4i KTsKCn0KCmludCBtYWluKGludCBhcmdjLCBjaGFyICphcmd2W10pCnsKCWxvbmcgbG9uZyBs ZW4gPSAtMTsgCgljaGFyIGNoOwoJaW50IHByZWZhdWx0ID0gMDsKCgl3aGlsZSggKGNoPWdl dG9wdChhcmdjLCBhcmd2LCAibDoiKSApICE9IC0xICkgIAoJeyAgCgkJc3dpdGNoKGNoKSAg CgkJeyAgCgkJCWNhc2UgJ2wnOgoJCQkJbGVuID0gbG9jYWxfYXRvbGwob3B0YXJnKTsKCQkJ CWlmIChsZW4gPCAwKSB7CgkJCQkJcHJpbnRmKCJJbnZhbGlkIHNpemVcbiIpOwoJCQkJCXJl dHVybiAwOwoJCQkJfSBlbHNlCQkJCQoJCQkJCXByaW50ZigiIyBDb3B5aW5nICVzIEJ5dGUg Li4uXG4iLCBvcHRhcmcpOwoJCQkJYnJlYWs7CgkJCWRlZmF1bHQ6CgkJCQlyZXR1cm47CgkJ fQoJfQoKCWRvX21lbWNweShsZW4sIDEpOwkKCQoJcmV0dXJuIDA7Cn0K --------------040503040505000904030603-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wanpeng Li
Subject: Re: Transparent Hugepage impact on memcpy
Date: Tue, 4 Jun 2013 20:30:51 +0800
Message-ID: <2109.09282691336$1370349073@news.gmane.org>
References: <51ADAC15.1050103@huawei.com>
Reply-To: Wanpeng Li
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="NzB8fVQJ5HfG6fxh"
Return-path:
Received: from kanga.kvack.org ([205.233.56.17]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1UjqOI-0003Ap-FJ for glkm-linux-mm-2@m.gmane.org; Tue, 04 Jun 2013 14:31:06 +0200
Received: from psmtp.com (na3sys010amx109.postini.com [74.125.245.109]) by kanga.kvack.org (Postfix) with SMTP id D5E156B0093 for ; Tue, 4 Jun 2013 08:31:03 -0400 (EDT)
Received: from /spool/local by e28smtp03.in.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Tue, 4 Jun 2013 17:55:30 +0530
Received: from d28relay05.in.ibm.com (d28relay05.in.ibm.com [9.184.220.62]) by d28dlp03.in.ibm.com (Postfix) with ESMTP id 2827C125804F for ; Tue, 4 Jun 2013 18:03:02 +0530 (IST)
Received: from d28av03.in.ibm.com (d28av03.in.ibm.com [9.184.220.65]) by d28relay05.in.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r54CUnZ856361008 for ; Tue, 4 Jun 2013 18:00:49 +0530
Received: from d28av03.in.ibm.com (loopback [127.0.0.1]) by d28av03.in.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r54CUrJF018270 for ; Tue, 4 Jun 2013 22:30:53 +1000
Content-Disposition: inline
In-Reply-To: <51ADAC15.1050103@huawei.com>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Jianguo Wu
Cc: linux-mm@kvack.org, Andrea Arcangeli , qiuxishi

--NzB8fVQJ5HfG6fxh
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Tue, Jun 04, 2013 at 04:57:57PM +0800, Jianguo Wu wrote:
>Hi all,
>
>I tested memcpy with perf bench, and found that in prefault case, When Transparent Hugepage is on,
>memcpy has worse performance.
>
>When THP on is 3.672879 GB/Sec (with prefault), while THP off is 6.190187 GB/Sec (with prefault).
>

I get a similar result to yours on 3.10-rc4; see the attachment. This is
due to a characteristic of THP: it takes a single page fault for each 2MB
virtual region touched by userland.

--NzB8fVQJ5HfG6fxh
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=thp

---with THP---
#cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
# Running mem/memcpy benchmark...
# Copying 1gb Bytes ...

      12.208522 GB/Sec (with prefault)

 Performance counter stats for './perf bench mem memcpy -l 1gb -o':

        26,453,696 cache-misses        # 35.411 % of all cache refs        [57.66%]
        74,704,531 cache-references                                        [58.40%]
             2,297 page-faults
       146,567,960 dTLB-loads                                              [58.64%]
       211,648,685 dTLB-stores                                             [58.63%]
            14,533 dTLB-load-misses    # 0.01% of all dTLB cache hits      [57.46%]
               640 iTLB-loads                                              [55.74%]
           270,881 iTLB-load-misses    # 42325.16% of all iTLB cache hits  [55.17%]

       0.232425109 seconds time elapsed

---no THP---
#cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
# Running mem/memcpy benchmark...
# Copying 1gb Bytes ...

      18.325087 GB/Sec (with prefault)

 Performance counter stats for './perf bench mem memcpy -l 1gb -o':

        28,498,544 cache-misses        # 86.167 % of all cache refs        [57.35%]
        33,073,611 cache-references                                        [57.71%]
           524,540 page-faults
       453,500,641 dTLB-loads                                              [57.99%]
       409,255,606 dTLB-stores                                             [57.99%]
         2,033,985 dTLB-load-misses    # 0.45% of all dTLB cache hits      [57.52%]
             1,180 iTLB-loads                                              [56.69%]
           539,056 iTLB-load-misses    # 45682.71% of all iTLB cache hits  [56.02%]

       0.485932214 seconds time elapsed

--NzB8fVQJ5HfG6fxh--

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org