* [PATCH] Tell the world we gave up on pushing CC_OPTIMIZE_FOR_SIZE
@ 2012-11-02 11:41 Kirill Smelkov
2012-11-22 8:10 ` Kirill Smelkov
0 siblings, 1 reply; 4+ messages in thread
From: Kirill Smelkov @ 2012-11-02 11:41 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, Kirill Smelkov
[continuing 281dc5c5 "Give up on pushing CC_OPTIMIZE_FOR_SIZE"]
Recently I've been beaten hard by CC_OPTIMIZE_FOR_SIZE=y on X86
performance-wise. The problem turned out to be for -Os gcc wants to
inline __builtin_memcpy, to which x86 memcpy directly refers,
---- 8< ---- arch/x86/include/asm/string_32.h
#if (__GNUC__ >= 4)
#define memcpy(t, f, n) __builtin_memcpy(t, f, n)
to "rep; movsb" which is several times slower compared to "rep; movsl".
For me this turned out in vivi driver, where memcpy is used to copy
lines with colorbars, and this is one of the most significant parts of
the workload:
---- 8< ---- drivers/media/platform/vivi.c
static void vivi_fillbuff(struct vivi_dev *dev, struct vivi_buffer *buf)
{
...
for (h = 0; h < hmax; h++)
memcpy(vbuf + h * wmax * dev->pixelsize,
dev->line + (dev->mv_count % wmax) * dev->pixelsize,
wmax * dev->pixelsize);
Gcc insists on using movb, even if it knows dest and src alignment. For
example with gcc-4.4, -4.7 and yesterday's gcc trunk, for following function
---- 8< ----
void doit(unsigned long *dst, unsigned long *src, unsigned n)
{
void *__d = __builtin_assume_aligned(dst, 4);
void *__s = __builtin_assume_aligned(src, 4);
__builtin_memcpy(__d, __s, n);
}
it still wants to use movsb with -Os:
00000000 <doit>:
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 57 push %edi
4: 8b 4d 10 mov 0x10(%ebp),%ecx
7: 56 push %esi
8: 8b 7d 08 mov 0x8(%ebp),%edi
b: 8b 75 0c mov 0xc(%ebp),%esi
e: f3 a4 rep movsb %ds:(%esi),%es:(%edi)
10: 5e pop %esi
11: 5f pop %edi
12: 5d pop %ebp
13: c3 ret
and even if I change "n" to "4*n"...
On the other hand, with -O2, it generates call to memcpy, which at least
has rep; movsl inside it, and things works several times faster.
So tell people to not enable CC_OPTIMIZE_FOR_SIZE by default.
Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
init/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/init/Kconfig b/init/Kconfig
index 6fdd6e3..6a448d5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1119,7 +1119,7 @@ config CC_OPTIMIZE_FOR_SIZE
Enabling this option will pass "-Os" instead of "-O2" to gcc
resulting in a smaller kernel.
- If unsure, say Y.
+ If unsure, say N.
config SYSCTL
bool
--
1.8.0.316.g291341c
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH] Tell the world we gave up on pushing CC_OPTIMIZE_FOR_SIZE
2012-11-02 11:41 [PATCH] Tell the world we gave up on pushing CC_OPTIMIZE_FOR_SIZE Kirill Smelkov
@ 2012-11-22 8:10 ` Kirill Smelkov
2012-11-24 0:10 ` Dave Jones
0 siblings, 1 reply; 4+ messages in thread
From: Kirill Smelkov @ 2012-11-22 8:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
On Fri, Nov 02, 2012 at 03:41:01PM +0400, Kirill Smelkov wrote:
> [continuing 281dc5c5 "Give up on pushing CC_OPTIMIZE_FOR_SIZE"]
>
> Recently I've been beaten hard by CC_OPTIMIZE_FOR_SIZE=y on X86
> performance-wise. The problem turned out to be for -Os gcc wants to
> inline __builtin_memcpy, to which x86 memcpy directly refers,
>
> ---- 8< ---- arch/x86/include/asm/string_32.h
> #if (__GNUC__ >= 4)
> #define memcpy(t, f, n) __builtin_memcpy(t, f, n)
>
> to "rep; movsb" which is several times slower compared to "rep; movsl".
>
> For me this turned out in vivi driver, where memcpy is used to copy
> lines with colorbars, and this is one of the most significant parts of
> the workload:
>
> ---- 8< ---- drivers/media/platform/vivi.c
> static void vivi_fillbuff(struct vivi_dev *dev, struct vivi_buffer *buf)
> {
> ...
>
> for (h = 0; h < hmax; h++)
> memcpy(vbuf + h * wmax * dev->pixelsize,
> dev->line + (dev->mv_count % wmax) * dev->pixelsize,
> wmax * dev->pixelsize);
>
> Gcc insists on using movb, even if it knows dest and src alignment. For
> example with gcc-4.4, -4.7 and yesterday's gcc trunk, for following function
>
> ---- 8< ----
> void doit(unsigned long *dst, unsigned long *src, unsigned n)
> {
> void *__d = __builtin_assume_aligned(dst, 4);
> void *__s = __builtin_assume_aligned(src, 4);
>
> __builtin_memcpy(__d, __s, n);
> }
>
> it still wants to use movsb with -Os:
>
> 00000000 <doit>:
> 0: 55 push %ebp
> 1: 89 e5 mov %esp,%ebp
> 3: 57 push %edi
> 4: 8b 4d 10 mov 0x10(%ebp),%ecx
> 7: 56 push %esi
> 8: 8b 7d 08 mov 0x8(%ebp),%edi
> b: 8b 75 0c mov 0xc(%ebp),%esi
> e: f3 a4 rep movsb %ds:(%esi),%es:(%edi)
> 10: 5e pop %esi
> 11: 5f pop %edi
> 12: 5d pop %ebp
> 13: c3 ret
>
> and even if I change "n" to "4*n"...
>
> On the other hand, with -O2, it generates call to memcpy, which at least
> has rep; movsl inside it, and things works several times faster.
>
> So tell people to not enable CC_OPTIMIZE_FOR_SIZE by default.
>
> Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
> ---
> init/Kconfig | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 6fdd6e3..6a448d5 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1119,7 +1119,7 @@ config CC_OPTIMIZE_FOR_SIZE
> Enabling this option will pass "-Os" instead of "-O2" to gcc
> resulting in a smaller kernel.
>
> - If unsure, say Y.
> + If unsure, say N.
>
> config SYSCTL
> bool
Linus, maybe you missed this. What is the reason telling people
CC_OPTIMIZE_FOR_SIZE should be Y if we actually do otherwise?
commit 281dc5c5ec0fb299514567cbc358562649c1af95
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sun May 22 14:30:36 2011 -0700
Give up on pushing CC_OPTIMIZE_FOR_SIZE
I still happen to believe that I$ miss costs are a major thing, but
sadly, -Os doesn't seem to be the solution. With or without it, gcc
will miss some obvious code size improvements, and with it enabled gcc
will sometimes make choices that aren't good even with high I$ miss
ratios.
For example, with -Os, gcc on x86 will turn a 20-byte constant memcpy
into a "rep movsl". While I sincerely hope that x86 CPU's will some day
do a good job at that, they certainly don't do it yet, and the cost is
higher than a L1 I$ miss would be.
Some day I hope we can re-enable this.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff --git a/init/Kconfig b/init/Kconfig
index 4986ecc..ffcdad7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -908,7 +908,6 @@ endif
config CC_OPTIMIZE_FOR_SIZE
bool "Optimize for size"
- default y
help
Enabling this option will pass "-Os" instead of "-O2" to gcc
resulting in a smaller kernel.
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH] Tell the world we gave up on pushing CC_OPTIMIZE_FOR_SIZE
2012-11-22 8:10 ` Kirill Smelkov
@ 2012-11-24 0:10 ` Dave Jones
2013-01-17 8:58 ` Borislav Petkov
0 siblings, 1 reply; 4+ messages in thread
From: Dave Jones @ 2012-11-24 0:10 UTC (permalink / raw)
To: Kirill Smelkov; +Cc: Linus Torvalds, linux-kernel
On Thu, Nov 22, 2012 at 12:10:41PM +0400, Kirill Smelkov wrote:
> Linus, maybe you missed this. What is the reason telling people
> CC_OPTIMIZE_FOR_SIZE should be Y if we actually do otherwise?
>
>
> commit 281dc5c5ec0fb299514567cbc358562649c1af95
> Author: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Sun May 22 14:30:36 2011 -0700
>
> Give up on pushing CC_OPTIMIZE_FOR_SIZE
>
> I still happen to believe that I$ miss costs are a major thing, but
> sadly, -Os doesn't seem to be the solution. With or without it, gcc
> will miss some obvious code size improvements, and with it enabled gcc
> will sometimes make choices that aren't good even with high I$ miss
> ratios.
>
> For example, with -Os, gcc on x86 will turn a 20-byte constant memcpy
> into a "rep movsl". While I sincerely hope that x86 CPU's will some day
> do a good job at that, they certainly don't do it yet, and the cost is
> higher than a L1 I$ miss would be.
>
> Some day I hope we can re-enable this.
>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 4986ecc..ffcdad7 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -908,7 +908,6 @@ endif
>
> config CC_OPTIMIZE_FOR_SIZE
> bool "Optimize for size"
> - default y
> help
> Enabling this option will pass "-Os" instead of "-O2" to gcc
> resulting in a smaller kernel.
FWIW, we gave up using this in Fedora quite a few releases back because we
kept finding workloads that it truly sucked on.
For code that is quite bloaty and not particularly performance critical,
maybe we could move -Os to individual directories.
Dave
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] Tell the world we gave up on pushing CC_OPTIMIZE_FOR_SIZE
2012-11-24 0:10 ` Dave Jones
@ 2013-01-17 8:58 ` Borislav Petkov
0 siblings, 0 replies; 4+ messages in thread
From: Borislav Petkov @ 2013-01-17 8:58 UTC (permalink / raw)
To: Dave Jones, Kirill Smelkov, Linus Torvalds, linux-kernel
On Fri, Nov 23, 2012 at 07:10:40PM -0500, Dave Jones wrote:
> FWIW, we gave up using this in Fedora quite a few releases back
> because we kept finding workloads that it truly sucked on. For code
> that is quite bloaty and not particularly performance critical, maybe
> we could move -Os to individual directories.
FWIW, one more minor thing speaking against -Os: it warns worngly about
an unitialized variable while -O2 doesn't:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55759
Maybe there's more stuff wrong with it...
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2013-01-17 8:58 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-11-02 11:41 [PATCH] Tell the world we gave up on pushing CC_OPTIMIZE_FOR_SIZE Kirill Smelkov
2012-11-22 8:10 ` Kirill Smelkov
2012-11-24 0:10 ` Dave Jones
2013-01-17 8:58 ` Borislav Petkov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox