From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754515AbbJSRrw (ORCPT ); Mon, 19 Oct 2015 13:47:52 -0400
Received: from mail-wi0-f182.google.com ([209.85.212.182]:35183 "EHLO
	mail-wi0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751705AbbJSRru (ORCPT );
	Mon, 19 Oct 2015 13:47:50 -0400
Date: Mon, 19 Oct 2015 19:47:44 +0200
From: Ingo Molnar
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Peter Zijlstra, Arnaldo Carvalho de Melo,
	Namhyung Kim, David Ahern, Jiri Olsa, Hitoshi Mitake, Thomas Gleixner
Subject: Re: [PATCH 02/14] perf/bench: Default to all routines in 'perf bench mem'
Message-ID: <20151019174744.GA2031@gmail.com>
References: <1445241870-24854-1-git-send-email-mingo@kernel.org>
	<1445241870-24854-3-git-send-email-mingo@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

* Linus Torvalds wrote:

> On Mon, Oct 19, 2015 at 1:04 AM, Ingo Molnar wrote:
> >
> >   triton:~> perf bench mem all
> >   # Running mem/memcpy benchmark...
> >   Routine default (Default memcpy() provided by glibc)
> >          4.957170 GB/Sec (with prefault)
> >   Routine x86-64-unrolled (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
> >          4.379204 GB/Sec (with prefault)
> >   Routine x86-64-movsq (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
> >          4.264465 GB/Sec (with prefault)
> >   Routine x86-64-movsb (movsb-based memcpy() in arch/x86/lib/memcpy_64.S)
> >          6.554111 GB/Sec (with prefault)
>
> Is this skylake? And why are the numbers so low? Even on my laptop
> (Haswell), I get ~21GB/s (when setting cpufreq to performance).

No, this was on my desktop, which is a water cooled IvyBridge running at 3.6GHz:

 processor	: 11
 vendor_id	: GenuineIntel
 cpu family	: 6
 model		: 62
 model name	: Intel(R) Core(TM) i7-4960X CPU @ 3.60GHz
 stepping	: 4
 microcode	: 0x416
 cpu MHz		: 1303.031
 cache size	: 15360 KB

and I didn't really think about the validity of the numbers when I made the
changelog, as I rarely benchmark on this box, due to it having various desktop
loads running all the time.

As you noticed the results are highly variable with default settings:

 triton:~/tip> taskset 1 perf stat --null --repeat 10 perf bench mem memcpy -f x86-64-movsb 2>&1 | grep GB
        5.580357 GB/sec
        5.580357 GB/sec
       16.551907 GB/sec
       16.551907 GB/sec
       15.258789 GB/sec
       16.837284 GB/sec
       16.837284 GB/sec
       16.837284 GB/sec
       16.551907 GB/sec
       16.837284 GB/sec

They get more reliable with '-l 10000' (10,000 loops instead of the default 1):

 triton:~/tip> taskset 1 perf stat --null --repeat 10 perf bench mem memcpy -f x86-64-movsb -l 10000 2>&1 | grep GB
       15.483591 GB/sec
       16.975429 GB/sec
       17.088396 GB/sec
       20.920407 GB/sec
       21.346655 GB/sec
       21.322372 GB/sec
       21.338306 GB/sec
       21.342130 GB/sec
       21.339984 GB/sec
       21.373145 GB/sec

that's purely cached. Also note how after a few seconds it gets faster, due to
cpufreq as you suspected.
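
A quick way to spot this condition before benchmarking is to look at the
cpufreq governor of each CPU. The following is a minimal Python sketch, not
part of perf: it assumes the standard Linux sysfs layout
(/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor), and the function name
non_performance_cpus() is made up for illustration. A governor other than
'performance' is only a heuristic for a variable clock, not proof of one.

```python
import glob
import os


def non_performance_cpus(sysfs_root="/sys/devices/system/cpu"):
    """Map CPU name -> governor for every CPU not using 'performance'.

    Assumption: each CPU exposes cpuN/cpufreq/scaling_governor under
    sysfs_root. CPUs without a readable governor file are skipped.
    """
    result = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        # Path layout: <root>/cpuN/cpufreq/scaling_governor
        cpu = path.split(os.sep)[-3]
        try:
            with open(path) as f:
                governor = f.read().strip()
        except OSError:
            continue
        if governor != "performance":
            result[cpu] = governor
    return result


if __name__ == "__main__":
    for cpu, governor in non_performance_cpus().items():
        print(f"warning: {cpu} uses the '{governor}' governor; results may vary")
```

The sysfs root is a parameter only so that the check can be exercised against
a fake directory tree; on a real system the default path applies.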

So once I fix the frequency of all cores to the max, I get much more reliable
results:

 triton:~/tip> taskset 1 perf stat --null --repeat 10 perf bench mem memcpy -f x86-64-movsb -l 10000 2>&1 | grep -E 'GB|elaps'
       21.356879 GB/sec
       21.378526 GB/sec
       21.351976 GB/sec
       21.375203 GB/sec
       21.369824 GB/sec
       21.353236 GB/sec
       21.283708 GB/sec
       21.380679 GB/sec
       21.347915 GB/sec
       21.378572 GB/sec
       0.459286278 seconds time elapsed ( +- 0.04% )

I'll add a debug check to 'perf bench' to warn about systems that have variable
cpufreq running - this is too easy a mistake to make :-/

So with the benchmark stabilized, I get the following results:

 triton:~/tip> taskset 1 perf bench mem memcpy -f all -l 10000
 # Running 'mem/memcpy' benchmark:
 # function 'default' (Default memcpy() provided by glibc)
 # Copying 1MB bytes ...

       18.356783 GB/sec
 # function 'x86-64-unrolled' (unrolled memcpy() in arch/x86/lib/memcpy_64.S)
 # Copying 1MB bytes ...

       16.294889 GB/sec
 # function 'x86-64-movsq' (movsq-based memcpy() in arch/x86/lib/memcpy_64.S)
 # Copying 1MB bytes ...

       15.760032 GB/sec
 # function 'x86-64-movsb' (movsb-based memcpy() in arch/x86/lib/memcpy_64.S)
 # Copying 1MB bytes ...

       21.145818 GB/sec

which matches your observations:

> It's interesting that 'movsb' for you is so much better. It's been
> promising before, and it *should* be able to do better than manual
> copying, but it's not been that noticeable on the machines I've
> tested. But I haven't used Skylake or Broadwell yet.
>
> cpufreq might be making a difference too. Maybe it's just ramping up
> the CPU? Or is that really repeatable?

So modulo the cpufreq multiplier it seems repeatable on this IB system - will
try it on SkyLake as well.

Before relying on it I also wanted to implement the following 'perf bench'
improvements:

 - make it more representative of kernel usage by benchmarking a list of
   characteristic lengths, not just the single stupid 1MB buffer.
   At smaller buffer sizes I'd expect MOVSB to have even more of a fundamental
   advantage (due to having all the differentiation in hardware) - but we
   don't know the latencies of those cases, some of which are in microcode I
   suspect.

 - measure aligned/unaligned buffer address and length effects as well

 - measure cache-cold numbers as well. This is pretty hard but not impossible.

With that we could start validating our fundamental memory op routines in
user-space.

Thanks,

	Ingo
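
The buffer-length and alignment items in the list above could be prototyped in
user-space along the following lines. This is a rough Python sketch under
stated assumptions, not the planned 'perf bench' change: the size list and
offsets are hypothetical (the mail names no specific lengths), the copy goes
through ctypes.memmove (libc memmove, not the kernel routines), and GB/sec is
taken as 1e9 bytes per second. Interpreter loop overhead dominates at small
sizes, so the numbers only illustrate the measurement matrix. Cache-cold runs
are not covered.

```python
import ctypes
import time

# Hypothetical lengths; the mail does not name specific sizes.
SIZES = [16, 64, 256, 4096, 65536, 1 << 20]
# Byte offsets from a 64-byte boundary: 0 = aligned, 1 = misaligned.
OFFSETS = [0, 1]


def aligned_address(buf, alignment=64):
    """Return the first address inside buf that is a multiple of alignment."""
    addr = ctypes.addressof(buf)
    return (addr + alignment - 1) & ~(alignment - 1)


def bench_memmove(size, src_off, dst_off, loops=200):
    """Copy size bytes loops times via libc memmove; return GB/sec (1e9 bytes)."""
    # Over-allocate so that alignment padding plus the offset still fit.
    src = ctypes.create_string_buffer(size + 128)
    dst = ctypes.create_string_buffer(size + 128)
    src_addr = aligned_address(src) + src_off
    dst_addr = aligned_address(dst) + dst_off
    start = time.perf_counter()
    for _ in range(loops):
        ctypes.memmove(dst_addr, src_addr, size)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return size * loops / elapsed / 1e9


def run_matrix(sizes=SIZES, offsets=OFFSETS, loops=200):
    """Measure every (size, source offset, destination offset) combination."""
    rows = []
    for size in sizes:
        for src_off in offsets:
            for dst_off in offsets:
                gbps = bench_memmove(size, src_off, dst_off, loops)
                rows.append((size, src_off, dst_off, gbps))
    return rows


if __name__ == "__main__":
    for size, src_off, dst_off, gbps in run_matrix():
        print(f"{size:>8} bytes  src+{src_off} dst+{dst_off}  {gbps:10.3f} GB/sec")
```

As with the runs earlier in this mail, pinning the process with taskset and
fixing the CPU frequency first should make the output steadier.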