[PATCH 0/6] Macrofying inline assembly for better compilation

linux-sparse.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/6] Macrofying inline assembly for better compilation
@ 2018-05-17 16:13 Nadav Amit
  2018-05-17 16:13 ` [PATCH 1/6] x86: objtool: use asm macro for better compiler decisions Nadav Amit
  2018-05-18  9:20 ` [PATCH 0/6] Macrofying inline assembly for better compilation David Laight
  0 siblings, 2 replies; 4+ messages in thread
From: Nadav Amit @ 2018-05-17 16:13 UTC (permalink / raw)
  To: linux-kernel, x86
  Cc: nadav.amit, Nadav Amit, Alok Kataria, Christopher Li,
	H. Peter Anvin, Ingo Molnar, Jan Beulich, Jonathan Corbet,
	Josh Poimboeuf, Juergen Gross, Kees Cook, linux-sparse,
	Peter Zijlstra, Randy Dunlap, Thomas Gleixner, virtualization

This patch-set deals with an interesting yet stupid problem: kernel code
that does not get inlined despite its simplicity. There are several
causes for this behavior: "cold" attribute on __init, different function
optimization levels; conditional constant computations based on
__builtin_constant_p(); and finally large inline assembly blocks.

This patch-set deals with the inline assembly problem. I separated these
patches from the others (that were sent in the RFC) for easier
inclusion.

The problem with inline assembly is that inline assembly is often used
by the kernel for things that are other than code - for example,
assembly directives and data. GCC however is oblivious to the content of
the blocks and assumes their cost in space and time is proportional to
the number of the perceived assembly "instruction", according to the
number of newlines and semicolons. Alternatives, paravirt and other
mechanisms are affected, causing code not to be inlined, and degrading
compilation quality in general.

The solution that this patch-set carries for this problem is to create
an assembly macro, and then call it from the inline assembly block.  As
a result, the compiler sees a single "instruction" and assigns the more
appropriate cost to the code. In addition, this patch-set removes
unneeded new-lines from common x86 inline asm's, which "confuse" GCC
heuristics.

Overall this patch-set slightly increases the kernel size (my build was
done using my localmodconfig + localyesconfig for the record):

   text    data     bss     dec     hex filename
18126699 10066728 2936832 31130259 1db0293 ./vmlinux before
18148888 10064016 2936832 31149736 1db4ea8 ./vmlinux after (+0.06%)

The patch-set eliminates many of the static text symbols:
Before: 40033
After:	39650	(-1%)

A microbenchmark with a loop of page-fault and MADV_DONTNEED show 2%
performance improvement with this patch-set (when PTI is off).

Changes from RFC:
- Better formatting [Jan]
- i386 build problems [0-day]
- Inline comments
- Separating __builtin_constant_p() into a different future patch-set

Cc: Alok Kataria <akataria@vmware.com>
Cc: Christopher Li <sparse@chrisli.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-sparse@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: virtualization@lists.linux-foundation.org
Cc: x86@kernel.org

Nadav Amit (6):
  x86: objtool: use asm macro for better compiler decisions
  x86: bug: prevent gcc distortions
  x86: alternative: macrofy locks for better inlining
  x86: prevent inline distortion by paravirt ops
  x86: refcount: prevent gcc distortions
  x86: removing unneeded new-lines

 arch/x86/include/asm/alternative.h    | 34 +++++++++++----
 arch/x86/include/asm/asm.h            |  4 +-
 arch/x86/include/asm/bug.h            | 56 ++++++++++++++++--------
 arch/x86/include/asm/cmpxchg.h        | 10 ++---
 arch/x86/include/asm/paravirt_types.h | 63 +++++++++++++++++----------
 arch/x86/include/asm/refcount.h       | 62 ++++++++++++++++----------
 arch/x86/include/asm/special_insns.h  | 12 ++---
 include/linux/compiler.h              | 37 ++++++++++++----
 8 files changed, 183 insertions(+), 95 deletions(-)

-- 
2.17.0

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH 1/6] x86: objtool: use asm macro for better compiler decisions
  2018-05-17 16:13 [PATCH 0/6] Macrofying inline assembly for better compilation Nadav Amit
@ 2018-05-17 16:13 ` Nadav Amit
  2018-05-18  9:20 ` [PATCH 0/6] Macrofying inline assembly for better compilation David Laight
  1 sibling, 0 replies; 4+ messages in thread
From: Nadav Amit @ 2018-05-17 16:13 UTC (permalink / raw)
  To: linux-kernel, x86; +Cc: nadav.amit, Nadav Amit, Christopher Li, linux-sparse

GCC considers the number of statements in inlined assembly blocks,
according to new-lines and semicolons, as an indication to the cost of
the block in time and space. This data is distorted by the kernel code,
which puts information in alternative sections. As a result, the
compiler may perform incorrect inlining and branch optimizations.

In the case of objtool, this distortion is extreme, since anyhow the
annotations of objtool are discarded during linkage.

The solution is to set an assembly macro and call it from the inlinedv
assembly block. As a result GCC considers the inline assembly block as
a single instruction.

This patch slightly increases the kernel size.

   text    data     bss     dec     hex filename
18126699 10066728 2936832 31130259 1db0293 ./vmlinux before
18126824 10067268 2936832 31130924 1db052c ./vmlinux after (+665)

But allows more aggressive inlining. Static text symbols:
Before: 40033
After: 40015 (-18)

Cc: Christopher Li <sparse@chrisli.org>
Cc: linux-sparse@vger.kernel.org

Signed-off-by: Nadav Amit <namit@vmware.com>
---
 include/linux/compiler.h | 37 +++++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index ab4711c63601..6cbabc6b195a 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -97,19 +97,40 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
  * These macros help objtool understand GCC code flow for unreachable code.
  * The __COUNTER__ based labels are a hack to make each instance of the macros
  * unique, to convince GCC not to merge duplicate inline asm statements.
+ *
+ * The annotation logic is encapsulated within assembly macros, which are then
+ * called on each annotation. This hack is necessary to prevent GCC from
+ * considering the inline assembly blocks as costly in time and space, which can
+ * prevent function inlining and lead to other bad compilation decisions. GCC
+ * computes inline assembly cost according to the number of perceived number of
+ * assembly instruction, based on the number of new-lines and semicolons in the
+ * assembly block. Since the annotations will be discarded during linkage, the
+ * macros make the annotations to be considered "cheap" and let GCC to emit
+ * better code.
  */
+asm(".macro __annotate_reachable counter:req\n"
+    "\\counter:\n\t"
+    ".pushsection .discard.reachable\n\t"
+    ".long \\counter\\()b -.\n\t"
+    ".popsection\n\t"
+    ".endm");
+
 #define annotate_reachable() ({						\
-	asm volatile("%c0:\n\t"						\
-		     ".pushsection .discard.reachable\n\t"		\
-		     ".long %c0b - .\n\t"				\
-		     ".popsection\n\t" : : "i" (__COUNTER__));		\
+	asm volatile("__annotate_reachable %c0" : : "i" (__COUNTER__));	\
 })
+
+asm(".macro __annotate_unreachable counter:req\n"
+    "\\counter:\n\t"
+    ".pushsection .discard.unreachable\n\t"
+    ".long \\counter\\()b -.\n\t"
+    ".popsection\n\t"
+    ".endm");
+
 #define annotate_unreachable() ({					\
-	asm volatile("%c0:\n\t"						\
-		     ".pushsection .discard.unreachable\n\t"		\
-		     ".long %c0b - .\n\t"				\
-		     ".popsection\n\t" : : "i" (__COUNTER__));		\
+	asm volatile("__annotate_unreachable %c0" : :			\
+		     "i" (__COUNTER__));				\
 })
+
 #define ASM_UNREACHABLE							\
 	"999:\n\t"							\
 	".pushsection .discard.unreachable\n\t"				\
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* RE: [PATCH 0/6] Macrofying inline assembly for better compilation
  2018-05-17 16:13 [PATCH 0/6] Macrofying inline assembly for better compilation Nadav Amit
  2018-05-17 16:13 ` [PATCH 1/6] x86: objtool: use asm macro for better compiler decisions Nadav Amit
@ 2018-05-18  9:20 ` David Laight
  2018-05-18 14:15   ` Nadav Amit
  1 sibling, 1 reply; 4+ messages in thread
From: David Laight @ 2018-05-18  9:20 UTC (permalink / raw)
  To: 'Nadav Amit', linux-kernel@vger.kernel.org,
	x86@kernel.org
  Cc: nadav.amit@gmail.com, Alok Kataria, Christopher Li,
	H. Peter Anvin, Ingo Molnar, Jan Beulich, Jonathan Corbet,
	Josh Poimboeuf, Juergen Gross, Kees Cook,
	linux-sparse@vger.kernel.org, Peter Zijlstra, Randy Dunlap,
	Thomas Gleixner, virtualization@lists.linux-foundation.org

From: Nadav Amit
> Sent: 17 May 2018 17:14
> This patch-set deals with an interesting yet stupid problem: kernel code
> that does not get inlined despite its simplicity. There are several
> causes for this behavior: "cold" attribute on __init, different function
> optimization levels; conditional constant computations based on
> __builtin_constant_p(); and finally large inline assembly blocks.
> 
> This patch-set deals with the inline assembly problem. I separated these
> patches from the others (that were sent in the RFC) for easier
> inclusion.
> 
> The problem with inline assembly is that inline assembly is often used
> by the kernel for things that are other than code - for example,
> assembly directives and data. GCC however is oblivious to the content of
> the blocks and assumes their cost in space and time is proportional to
> the number of the perceived assembly "instruction", according to the
> number of newlines and semicolons. Alternatives, paravirt and other
> mechanisms are affected, causing code not to be inlined, and degrading
> compilation quality in general.
> 
> The solution that this patch-set carries for this problem is to create
> an assembly macro, and then call it from the inline assembly block.  As
> a result, the compiler sees a single "instruction" and assigns the more
> appropriate cost to the code. In addition, this patch-set removes
> unneeded new-lines from common x86 inline asm's, which "confuse" GCC
> heuristics.

Can't you get the same effect by using always_inline ?

	David

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 0/6] Macrofying inline assembly for better compilation
  2018-05-18  9:20 ` [PATCH 0/6] Macrofying inline assembly for better compilation David Laight
@ 2018-05-18 14:15   ` Nadav Amit
  0 siblings, 0 replies; 4+ messages in thread
From: Nadav Amit @ 2018-05-18 14:15 UTC (permalink / raw)
  To: David Laight
  Cc: linux-kernel@vger.kernel.org, x86@kernel.org, Alok Kataria,
	Christopher Li, H. Peter Anvin, Ingo Molnar, Jan Beulich,
	Jonathan Corbet, Josh Poimboeuf, Juergen Gross, Kees Cook,
	linux-sparse@vger.kernel.org, Peter Zijlstra, Randy Dunlap,
	Thomas Gleixner, virtualization@lists.linux-foundation.org

David Laight <David.Laight@ACULAB.COM> wrote:

> From: Nadav Amit
>> Sent: 17 May 2018 17:14
>> This patch-set deals with an interesting yet stupid problem: kernel code
>> that does not get inlined despite its simplicity. There are several
>> causes for this behavior: "cold" attribute on __init, different function
>> optimization levels; conditional constant computations based on
>> __builtin_constant_p(); and finally large inline assembly blocks.
>> 
>> This patch-set deals with the inline assembly problem. I separated these
>> patches from the others (that were sent in the RFC) for easier
>> inclusion.
>> 
>> The problem with inline assembly is that inline assembly is often used
>> by the kernel for things that are other than code - for example,
>> assembly directives and data. GCC however is oblivious to the content of
>> the blocks and assumes their cost in space and time is proportional to
>> the number of the perceived assembly "instruction", according to the
>> number of newlines and semicolons. Alternatives, paravirt and other
>> mechanisms are affected, causing code not to be inlined, and degrading
>> compilation quality in general.
>> 
>> The solution that this patch-set carries for this problem is to create
>> an assembly macro, and then call it from the inline assembly block.  As
>> a result, the compiler sees a single "instruction" and assigns the more
>> appropriate cost to the code. In addition, this patch-set removes
>> unneeded new-lines from common x86 inline asm's, which "confuse" GCC
>> heuristics.
> 
> Can't you get the same effect by using always_inline ?

I wanted and forgot to mention in the cover-letter why always_inline is not
a proper solution:

1. It is not easy to go over 400 functions and mark them as __always_inline.
   Maintaining it afterwards (i.e., removing the __always_inline if the
   function is changed and becomes “heavier") is even harder.

2. The kernel can be configured in a many ways, which would make
   functions more “cheaper” or more “expensive”, so you cannot always
   predetermine whether a function should be inlined.

3. If you mark a function __always_inline you can just cause the calling
   function not to be inlined (when it should be inlined as well). It becomes
   a whack-a-mole.

4. It is not only about inlining. The compiler also makes branch decisions
   based on the perceived cost of the code, including inlined function.

Regards,
Nadav


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-05-18 14:15 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-05-17 16:13 [PATCH 0/6] Macrofying inline assembly for better compilation Nadav Amit
2018-05-17 16:13 ` [PATCH 1/6] x86: objtool: use asm macro for better compiler decisions Nadav Amit
2018-05-18  9:20 ` [PATCH 0/6] Macrofying inline assembly for better compilation David Laight
2018-05-18 14:15   ` Nadav Amit

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).