From: tip-bot for Borislav Petkov <tipbot@zytor.com>
To: linux-tip-commits@vger.kernel.org
Cc: tglx@linutronix.de, peterz@infradead.org, luto@kernel.org,
	brgerst@gmail.com, bp@alien8.de, dvlasenk@redhat.com,
	fengguang.wu@intel.com, torvalds@linux-foundation.org,
	bp@suse.de, jpoimboe@redhat.com, hpa@zytor.com,
	linux-kernel@vger.kernel.org, mingo@kernel.org
Subject: [tip:x86/asm] x86/asm: Optimize clear_page()
Date: Wed, 1 Mar 2017 01:47:09 -0800
Message-ID: <tip-49ca7bb328c630dd43be626534b49e19513296fd@git.kernel.org>
In-Reply-To: <20170215111927.emdgxf2pide3kwro@pd.tnic>

Commit-ID:  49ca7bb328c630dd43be626534b49e19513296fd
Gitweb:     http://git.kernel.org/tip/49ca7bb328c630dd43be626534b49e19513296fd
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Thu, 9 Feb 2017 01:34:49 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 1 Mar 2017 10:18:32 +0100

x86/asm: Optimize clear_page()

Currently, we CALL clear_page() which then JMPs to the proper function
chosen by the alternatives.

What we should do instead is CALL the proper function directly. (This
was something Ingo suggested a while ago.) So let's do that.
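
As a rough illustration of the selection order described above, here is a
user-space C sketch. It is not the kernel mechanism: the kernel patches the
CALL target once at boot via the alternatives framework, whereas this sketch
picks a function pointer at run time. All *_sketch names and
pick_clear_page() are hypothetical and exist only for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE_SKETCH 4096

typedef void (*clear_fn)(void *page);

/*
 * Stand-ins for the three assembly variants. In this sketch they all
 * zero the page with memset(); only the selection logic is of interest.
 */
static void clear_page_orig_sketch(void *page)
{
	memset(page, 0, PAGE_SIZE_SKETCH);
}

static void clear_page_rep_sketch(void *page)
{
	memset(page, 0, PAGE_SIZE_SKETCH);
}

static void clear_page_erms_sketch(void *page)
{
	memset(page, 0, PAGE_SIZE_SKETCH);
}

/*
 * Hypothetical helper modelling the priority of the two features:
 * ERMS (feature2) wins over REP_GOOD (feature1); with neither feature
 * the original function is used.
 */
static clear_fn pick_clear_page(bool has_rep_good, bool has_erms)
{
	if (has_erms)
		return clear_page_erms_sketch;
	if (has_rep_good)
		return clear_page_rep_sketch;
	return clear_page_orig_sketch;
}
```

A caller would do pick_clear_page(rep_good, erms)(page). The patch gets the
same effect without any branch or indirect call, because the call site
itself is rewritten to target the chosen variant.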

Measuring our favourite kernel build workload shows that there are no
significant changes in performance.

AMD
===
  -- /tmp/before 2017-02-09 18:01:46.451961188 +0100
  ++ /tmp/after  2017-02-09 18:01:54.883961175 +0100
  @@ -1,15 +1,15 @@
    Performance counter stats for 'system wide' (5 runs):

  -    1028960.373643      cpu-clock (msec)          #    6.000 CPUs utilized            ( +-  1.41% )
  +    1023086.018961      cpu-clock (msec)          #    6.000 CPUs utilized            ( +-  1.20% )
  -           518,744      context-switches          #    0.504 K/sec                    ( +-  1.04% )
  +           518,254      context-switches          #    0.507 K/sec                    ( +-  1.01% )
  -            38,112      cpu-migrations            #    0.037 K/sec                    ( +-  1.95% )
  +            37,917      cpu-migrations            #    0.037 K/sec                    ( +-  1.02% )
  -        20,874,266      page-faults               #    0.020 M/sec                    ( +-  0.07% )
  +        20,918,897      page-faults               #    0.020 M/sec                    ( +-  0.18% )
  - 2,043,646,230,667      cycles                    #    1.986 GHz                      ( +-  0.14% )  (66.67%)
  + 2,045,305,584,032      cycles                    #    1.999 GHz                      ( +-  0.16% )  (66.67%)
  -   553,698,855,431      stalled-cycles-frontend   #   27.09% frontend cycles idle     ( +-  0.07% )  (66.67%)
  +   555,099,401,413      stalled-cycles-frontend   #   27.14% frontend cycles idle     ( +-  0.13% )  (66.67%)
  -   621,544,286,390      stalled-cycles-backend    #   30.41% backend cycles idle      ( +-  0.39% )  (66.67%)
  +   621,371,430,254      stalled-cycles-backend    #   30.38% backend cycles idle      ( +-  0.32% )  (66.67%)
  - 1,738,364,431,659      instructions              #    0.85  insn per cycle
  + 1,739,895,771,901      instructions              #    0.85  insn per cycle
  -                                                  #    0.36  stalled cycles per insn  ( +-  0.11% )  (66.67%)
  +                                                  #    0.36  stalled cycles per insn  ( +-  0.13% )  (66.67%)
  -   391,170,943,850      branches                  #  380.161 M/sec                    ( +-  0.13% )  (66.67%)
  +   391,398,551,757      branches                  #  382.567 M/sec                    ( +-  0.13% )  (66.67%)
  -    22,567,810,411      branch-misses             #    5.77% of all branches          ( +-  0.11% )  (66.67%)
  +    22,574,726,683      branch-misses             #    5.77% of all branches          ( +-  0.13% )  (66.67%)

  -     171.480741921 seconds time elapsed                                          ( +-  1.41% )
  +     170.509229451 seconds time elapsed                                          ( +-  1.20% )

Intel
=====

  -- /tmp/before 2017-02-09 20:36:19.851947473 +0100
  ++ /tmp/after  2017-02-09 20:36:30.151947458 +0100
  @@ -1,15 +1,15 @@
    Performance counter stats for 'system wide' (5 runs):

  -    2207248.598126      cpu-clock (msec)          #    8.000 CPUs utilized            ( +-  0.69% )
  +    2213300.106631      cpu-clock (msec)          #    8.000 CPUs utilized            ( +-  0.73% )
  -           899,342      context-switches          #    0.407 K/sec                    ( +-  0.68% )
  +           898,381      context-switches          #    0.406 K/sec                    ( +-  0.79% )
  -            80,553      cpu-migrations            #    0.036 K/sec                    ( +-  1.13% )
  +            80,979      cpu-migrations            #    0.037 K/sec                    ( +-  1.11% )
  -        36,171,148      page-faults               #    0.016 M/sec                    ( +-  0.02% )
  +        36,179,791      page-faults               #    0.016 M/sec                    ( +-  0.02% )
  - 6,665,288,826,484      cycles                    #    3.020 GHz                      ( +-  0.07% )  (83.33%)
  + 6,671,638,410,799      cycles                    #    3.014 GHz                      ( +-  0.06% )  (83.33%)
  - 5,065,975,115,197      stalled-cycles-frontend   #   76.01% frontend cycles idle     ( +-  0.11% )  (83.33%)
  + 5,076,835,183,223      stalled-cycles-frontend   #   76.10% frontend cycles idle     ( +-  0.11% )  (83.33%)
  - 3,841,556,350,614      stalled-cycles-backend    #   57.64% backend cycles idle      ( +-  0.13% )  (66.67%)
  + 3,852,823,974,333      stalled-cycles-backend    #   57.75% backend cycles idle      ( +-  0.12% )  (66.67%)
  - 4,148,398,171,079      instructions              #    0.62  insn per cycle
  + 4,148,997,156,059      instructions              #    0.62  insn per cycle
  -                                                  #    1.22  stalled cycles per insn  ( +-  0.10% )  (83.33%)
  +                                                  #    1.22  stalled cycles per insn  ( +-  0.11% )  (83.33%)
  -   887,187,118,591      branches                  #  401.943 M/sec                    ( +-  0.09% )  (83.33%)
  +   887,271,341,121      branches                  #  400.882 M/sec                    ( +-  0.11% )  (83.33%)
  -    30,139,439,034      branch-misses             #    3.40% of all branches          ( +-  0.09% )  (83.33%)
  +    30,134,864,997      branch-misses             #    3.40% of all branches          ( +-  0.06% )  (83.33%)

  -     275.904405540 seconds time elapsed                                          ( +-  0.69% )
  +     276.660352016 seconds time elapsed                                          ( +-  0.73% )

The allmodconfig vmlinux size grows by ~1 KB (1021 bytes of text), but
that's fine: we optimize our calling of the clear_page variants.

     text     data      bss      dec     hex filename
  9051979 23067670 27009024 59128673 3863b61 vmlinux
  9053000 23067670 27009024 59129694 3863f5e vmlinux.clear_page

Reported-by: kernel test robot <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170215111927.emdgxf2pide3kwro@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/alternative.h | 17 +++++++++++++++++
 arch/x86/include/asm/page_64.h     | 15 ++++++++++++++-
 arch/x86/lib/clear_page_64.S       | 17 +++++++----------
 3 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/alternative.h b/arch/x86/include/asm/alternative.h
index 1b02038..12e3d8d 100644
--- a/arch/x86/include/asm/alternative.h
+++ b/arch/x86/include/asm/alternative.h
@@ -227,6 +227,23 @@ static inline int alternatives_text_reserved(void *start, void *end)
 }
 
 /*
+ * Like alternative_call(), but there are two features and respective functions.
+ * If CPU has feature2, function2 is used.
+ * Otherwise, if CPU has feature1, function1 is used.
+ * Otherwise, old function is used.
+ */
+#define alternative_void_call_2(oldfunc, newfunc1, feature1, newfunc2,		\
+				feature2, input...)				\
+{										\
+	register void *__sp asm(_ASM_SP);					\
+	asm volatile (ALTERNATIVE_2("call %P[old]", "call %P[new1]", feature1,	\
+		"call %P[new2]", feature2)					\
+		: "+r" (__sp)							\
+		: [old] "i" (oldfunc), [new1] "i" (newfunc1),			\
+		  [new2] "i" (newfunc2), ## input);				\
+}
+
+/*
  * use this macro(s) if you need more than one output parameter
  * in alternative_io
  */
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index b3bebf9..254abce 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -4,6 +4,7 @@
 #include <asm/page_64_types.h>
 
 #ifndef __ASSEMBLY__
+#include <asm/alternative.h>
 
 /* duplicated to the one in bootmem.h */
 extern unsigned long max_pfn;
@@ -34,7 +35,19 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 #define pfn_valid(pfn)          ((pfn) < max_pfn)
 #endif
 
-void clear_page(void *page);
+void clear_page_orig(void *page);
+void clear_page_rep(void *page);
+void clear_page_erms(void *page);
+
+static inline void clear_page(void *page)
+{
+	alternative_void_call_2(clear_page_orig,
+				clear_page_rep, X86_FEATURE_REP_GOOD,
+				clear_page_erms, X86_FEATURE_ERMS,
+				"D" (page)
+				: "memory", "rax", "rcx");
+}
+
 void copy_page(void *to, void *from);
 
 #endif	/* !__ASSEMBLY__ */
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 5e2af3a..81b1635 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -14,20 +14,15 @@
  * Zero a page.
  * %rdi	- page
  */
-ENTRY(clear_page)
-
-	ALTERNATIVE_2 "jmp clear_page_orig", "", X86_FEATURE_REP_GOOD, \
-		      "jmp clear_page_c_e", X86_FEATURE_ERMS
-
+ENTRY(clear_page_rep)
 	movl $4096/8,%ecx
 	xorl %eax,%eax
 	rep stosq
 	ret
-ENDPROC(clear_page)
-EXPORT_SYMBOL(clear_page)
+ENDPROC(clear_page_rep)
+EXPORT_SYMBOL_GPL(clear_page_rep)
 
 ENTRY(clear_page_orig)
-
 	xorl   %eax,%eax
 	movl   $4096/64,%ecx
 	.p2align 4
@@ -47,10 +42,12 @@ ENTRY(clear_page_orig)
 	nop
 	ret
 ENDPROC(clear_page_orig)
+EXPORT_SYMBOL_GPL(clear_page_orig)
 
-ENTRY(clear_page_c_e)
+ENTRY(clear_page_erms)
 	movl $4096,%ecx
 	xorl %eax,%eax
 	rep stosb
 	ret
-ENDPROC(clear_page_c_e)
+ENDPROC(clear_page_erms)
+EXPORT_SYMBOL_GPL(clear_page_erms)
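
For readers less familiar with the string instructions, the work done by the
three variants in the diff above can be approximated in portable C. This is
only a sketch: the sketch_* names are hypothetical, the real routines are the
assembly shown in the patch, and the loop body of clear_page_orig is not
visible in this diff, so the 64 bytes per iteration are inferred from its
4096/64 loop counter.

```c
#include <stddef.h>
#include <stdint.h>

enum { SKETCH_PAGE_SIZE = 4096 };

/* Models clear_page_rep: "rep stosq" with a count of 4096/8 = 512. */
static void sketch_clear_rep(void *page)
{
	uint64_t *p = page;

	for (size_t i = 0; i < SKETCH_PAGE_SIZE / 8; i++)
		p[i] = 0;
}

/*
 * Models clear_page_orig: 4096/64 = 64 loop iterations, each assumed
 * to zero 64 bytes (eight 8-byte stores).
 */
static void sketch_clear_orig(void *page)
{
	uint64_t *p = page;

	for (size_t i = 0; i < SKETCH_PAGE_SIZE / 64; i++) {
		for (size_t j = 0; j < 8; j++)
			p[j] = 0;
		p += 8;
	}
}

/* Models clear_page_erms: "rep stosb" with a count of 4096. */
static void sketch_clear_erms(void *page)
{
	unsigned char *p = page;

	for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++)
		p[i] = 0;
}
```

All three leave the page in the same state. They differ only in store
granularity and loop structure, which is why the choice between them can be
made per CPU feature.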
