* Re: [patch] i386: use C code for current_thread_info()
@ 2006-06-12 17:14 Chuck Ebbert
2006-06-12 17:55 ` Adrian Bunk
2006-06-12 18:48 ` Andreas Mohr
0 siblings, 2 replies; 15+ messages in thread
From: Chuck Ebbert @ 2006-06-12 17:14 UTC (permalink / raw)
To: Emmanuel Fleury; +Cc: linux-kernel
In-Reply-To: <448C85B7.1010902@labri.fr>
On Sun, 11 Jun 2006 23:05:59 +0200, Emmanuel Fleury wrote:
> > Looking at the generated code, it seems the compiler just makes dumb
> > choices and tends to recompute current_thread_info() in unlikely code
> > paths even when there is no register pressure. 4.0.2 makes better
> > choices.
>
> What size with gcc 4.1.2 ? (just curiosity)
The 3.3 vs 4.0 comparisons were with two different configs, so only
relative gain/loss with asm vs. C could be compared.
I downloaded gcc 4.1.1 and compared to 4.0.2 with the exact same config,
since I was curious how much better it might be overall.
gcc 4.0.2:
text data bss dec hex filename
3645212 555556 312024 4512792 44dc18 2.6.17-rc6-nb-C/vmlinux
3647276 555556 312024 4514856 44e428 2.6.17-rc6-nb-asm/vmlinux
-2064
gcc 4.1.1:
text data bss dec hex filename
3614686 520416 311672 4446774 43da36 2.6.17-rc6-nb-C/vmlinux
3616942 520416 311672 4449030 43e306 2.6.17-rc6-nb-asm/vmlinux
-2256
Kernel code starts out ~30K bytes smaller with gcc 4.1 and using C
for current_thread_info() helps even more than with 4.0. Nice...
Maybe a patch that enables C code for gcc 4.0+ would work, since
on 3.3 the asm code is better?
--
Chuck
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch] i386: use C code for current_thread_info()
2006-06-12 17:14 [patch] i386: use C code for current_thread_info() Chuck Ebbert
@ 2006-06-12 17:55 ` Adrian Bunk
2006-06-12 18:48 ` Andreas Mohr
1 sibling, 0 replies; 15+ messages in thread
From: Adrian Bunk @ 2006-06-12 17:55 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: Emmanuel Fleury, linux-kernel
On Mon, Jun 12, 2006 at 01:14:34PM -0400, Chuck Ebbert wrote:
> In-Reply-To: <448C85B7.1010902@labri.fr>
>
> On Sun, 11 Jun 2006 23:05:59 +0200, Emmanuel Fleury wrote:
>
> > > Looking at the generated code, it seems the compiler just makes dumb
> > > choices and tends to recompute current_thread_info() in unlikely code
> > > paths even when there is no register pressure. 4.0.2 makes better
> > > choices.
> >
> > What size with gcc 4.1.2 ? (just curiosity)
>
> The 3.3 vs 4.0 comparisons were with two different configs, so only
> relative gain/loss with asm vs. C could be compared.
>
> I downloaded gcc 4.1.1 and compared to 4.0.2 with the exact same config,
> since I was curious how much better it might be overall.
>
> gcc 4.0.2:
> text data bss dec hex filename
> 3645212 555556 312024 4512792 44dc18 2.6.17-rc6-nb-C/vmlinux
> 3647276 555556 312024 4514856 44e428 2.6.17-rc6-nb-asm/vmlinux
> -2064
>
> gcc 4.1.1:
> text data bss dec hex filename
> 3614686 520416 311672 4446774 43da36 2.6.17-rc6-nb-C/vmlinux
> 3616942 520416 311672 4449030 43e306 2.6.17-rc6-nb-asm/vmlinux
> -2256
>
> Kernel code starts out ~30K bytes smaller with gcc 4.1 and using C
> for current_thread_info() helps even more than with 4.0. Nice...
>
> Maybe a patch that enables C code for gcc 4.0+ would work, since
> on 3.3 the asm code is better?
No.
gcc 3.3 is still supported, and it's important that the kernel runs
correctly when compiled with this compiler (implying workarounds for
compiler bugs).
But for best performance, people should always use the latest gcc.
(And if there was a case where usage of the latest gcc would give a
worse performance, that would be a bug in gcc and/or the kernel that
should be reported.)
So unless there's a _really_ significant regression with older gcc
versions, adding #ifdef's for such micro optimizations is not worth it.
> Chuck
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [patch] i386: use C code for current_thread_info()
2006-06-12 17:14 [patch] i386: use C code for current_thread_info() Chuck Ebbert
2006-06-12 17:55 ` Adrian Bunk
@ 2006-06-12 18:48 ` Andreas Mohr
1 sibling, 0 replies; 15+ messages in thread
From: Andreas Mohr @ 2006-06-12 18:48 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: Emmanuel Fleury, linux-kernel
Hi,
On Mon, Jun 12, 2006 at 01:14:34PM -0400, Chuck Ebbert wrote:
> Kernel code starts out ~30K bytes smaller with gcc 4.1 and using C
> for current_thread_info() helps even more than with 4.0. Nice...
Especially since current_thread_info() often has an AGI stall (read:
severe pipeline stall) since it often cannot properly intermingle
with nearby opcodes due to lack of suitable ones, e.g. at a
function prologue.
mov $0xffffe000,%eax
and %esp,%eax
are fundamentally incompatible due to having to wait for the address
generation before the "and" can be executed.
This shows up during profiling quite noticeably (IIRC 8 hits vs. 1 to 2
hits on other places), which really hurts since this function is used
basically *everywhere*.
As such whatever optimization we can gain (e.g. the compiler merging
multiple current_thread_info() invocations) is very worthwhile.
I've come up with this C variant recently, too, but I didn't think
it would make too much of a difference, however it seems it's really
useful when going towards newer gcc versions.
Andreas Mohr
--
No programming skills!? Why not help translate many Linux applications!
https://launchpad.ubuntu.com/rosetta
(or alternatively buy nicely packaged Linux distros/OSS software to help
support Linux developers creating shiny new things for you?)
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch] i386: use C code for current_thread_info()
@ 2006-06-13 5:26 Albert Cahalan
0 siblings, 0 replies; 15+ messages in thread
From: Albert Cahalan @ 2006-06-13 5:26 UTC (permalink / raw)
To: linux-kernel, emmanuel.fleury, torvalds, s0348365, jengelh,
76306.1226, akpm
Chuck Ebbert writes:
> Using C code for current_thread_info() lets the compiler optimize it.
> With gcc 4.0.2, kernel is smaller:
The often-forgotten __attribute__((const)) might do the job.
The function is indeed const as far as gcc can see, except
perhaps near the schedular code that switches stacks.
This applies to many similar functions, like the ones used
to get current. It should work for the C code too.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch] i386: use C code for current_thread_info()
@ 2006-06-13 1:50 Chuck Ebbert
2006-06-13 6:43 ` Andreas Mohr
0 siblings, 1 reply; 15+ messages in thread
From: Chuck Ebbert @ 2006-06-13 1:50 UTC (permalink / raw)
To: Andreas Mohr; +Cc: linux-kernel, Emmanuel Fleury, Linus Torvalds
In-Reply-To: <20060612184833.GA29177@rhlx01.fht-esslingen.de>
On Mon, 12 Jun 2006 20:48:33 +0200, Andreas Mohr wrote:
> > Kernel code starts out ~30K bytes smaller with gcc 4.1 and using C
> > for current_thread_info() helps even more than with 4.0. Nice...
>
> Especially since current_thread_info() often has an AGI stall (read:
> severe pipeline stall) since it often cannot properly intermingle
> with nearby opcodes due to lack of suitable ones, e.g. at a
> function prologue.
> mov $0xffffe000,%eax
> and %esp,%eax
> are fundamentally incompatible due to having to wait for the address
> generation before the "and" can be executed.
> This shows up during profiling quite noticeably (IIRC 8 hits vs. 1 to 2
> hits on other places), which really hurts since this function is used
> basically *everywhere*.
Hmmm. The compiler does it this way:
mov %esp,%eax
and $0xffffe000,%eax
which could be faster because esp can be moved to eax while the mask
is being fetched.
--
Chuck
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch] i386: use C code for current_thread_info()
2006-06-13 1:50 Chuck Ebbert
@ 2006-06-13 6:43 ` Andreas Mohr
2006-06-13 9:27 ` Avi Kivity
0 siblings, 1 reply; 15+ messages in thread
From: Andreas Mohr @ 2006-06-13 6:43 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: linux-kernel, Emmanuel Fleury, Linus Torvalds
Hi,
On Mon, Jun 12, 2006 at 09:50:17PM -0400, Chuck Ebbert wrote:
> In-Reply-To: <20060612184833.GA29177@rhlx01.fht-esslingen.de>
>
> On Mon, 12 Jun 2006 20:48:33 +0200, Andreas Mohr wrote:
>
> > > Kernel code starts out ~30K bytes smaller with gcc 4.1 and using C
> > > for current_thread_info() helps even more than with 4.0. Nice...
> >
> > Especially since current_thread_info() often has an AGI stall (read:
> > severe pipeline stall) since it often cannot properly intermingle
> > with nearby opcodes due to lack of suitable ones, e.g. at a
> > function prologue.
> > mov $0xffffe000,%eax
> > and %esp,%eax
> > are fundamentally incompatible due to having to wait for the address
> > generation before the "and" can be executed.
> > This shows up during profiling quite noticeably (IIRC 8 hits vs. 1 to 2
> > hits on other places), which really hurts since this function is used
> > basically *everywhere*.
>
> Hmmm. The compiler does it this way:
>
> mov %esp,%eax
> and $0xffffe000,%eax
Well, not here, since mine was a live example of the old asm version
compiled with gcc 3.2.3 (but I doubt the gcc version matters in this case
since a compiler wouldn't decide to create completely different asm
opcodes from a hand-coded asm version...).
> which could be faster because esp can be moved to eax while the mask
> is being fetched.
That might be true, but it could easily happen to not be (an AGI stall
stalls both pipelines, IIRC, so in one case you can have good instruction
parallelism before the 0xffffe000 address and in the other case after it).
It would be very interesting to run some well-known kernel benchmarks,
since given the very wide-spread use of the function this could easily make
a noticeable difference.
Hmm, in fact I'm afraid that your version may be slower since it has the esp
operation first which doesn't mix at all with function prologues due to
heavy esp modification (i.e. waiting for esp to settle down) in the function
stack frame generation. And as said before, current_thread_info() is heavily
used near prologues since in many functions we're basically doing
"which cpu are we on? get cpu data! now get to work...".
This could be so bad as to disallow any and all attempt at making the compiler
guess it since a hand-coded version would account for this (but a compiler
certainly should, too!).
Come to think of it, did you check whether the compiler also did those opcodes
the other way around at some places depending on environment?
Since this important function sucks on x86 (due to braindead lack of registers
on this architecture, to possibly permanently store the calculated stack
base value inside), I'm thinking of investigating it again.
An entirely different way would be to store the stack base value in a
global variable and update that on each context switch, but it would increase
context switch overhead and have >= 2 cycles access time for L1 cache (which
would be the best memory access case!), which would most likely be more
combined overhead than an AGI stall (I was mistaken in declaring the stall
a pipeline flush - it's only a stall for a couple cycles, not a full flush
wasting ~ 15 cycles).
And I wouldn't be too astonished to learn that this has exactly been the
previous solution and then deprecated for modernization or performance
reasons...
Andreas Mohr
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch] i386: use C code for current_thread_info()
2006-06-13 6:43 ` Andreas Mohr
@ 2006-06-13 9:27 ` Avi Kivity
0 siblings, 0 replies; 15+ messages in thread
From: Avi Kivity @ 2006-06-13 9:27 UTC (permalink / raw)
To: Andreas Mohr; +Cc: Chuck Ebbert, linux-kernel, Emmanuel Fleury, Linus Torvalds
Andreas Mohr wrote:
>
> An entirely different way would be to store the stack base value in a
> global variable and update that on each context switch, but it would
> increase
> context switch overhead and have >= 2 cycles access time for L1 cache
> (which
> would be the best memory access case!), which would most likely be more
> combined overhead than an AGI stall (I was mistaken in declaring the
> stall
> a pipeline flush - it's only a stall for a couple cycles, not a full
> flush
> wasting ~ 15 cycles).
>
That wouldn't work on SMP. You'd need per-cpu variables, which are
likely even slower.
[One way around that would be to use a segment register for the per-cpu
areas]
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch] i386: use C code for current_thread_info()
@ 2006-06-11 20:43 Chuck Ebbert
2006-06-11 21:05 ` Emmanuel Fleury
0 siblings, 1 reply; 15+ messages in thread
From: Chuck Ebbert @ 2006-06-11 20:43 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Andrew Morton, linux-kernel
In-Reply-To: <Pine.LNX.4.64.0606111225380.5498@g5.osdl.org>
On Sun, 11 Jun 2006 12:33:10 -0700, Linus Torbalds wrote:
> On Sun, 11 Jun 2006, Chuck Ebbert wrote:
> >
> > Using C code for current_thread_info() lets the compiler optimize it.
>
> Ok, me likee. I just worry that this might break some older gcc version.
> Have you checked with gcc-3.2 or something?
I just tried gcc 3.3.3 and the kernel gets a little bigger but it boots
and runs OK. That's the oldest compiler I can find.
text data bss dec hex filename
3593627 559864 342728 4496219 449b5b 2.6.17-rc6-32-post/vmlinux
3591371 559864 342728 4493963 44928b 2.6.17-rc6-32/vmlinux
+2256
Looking at the generated code, it seems the compiler just makes dumb
choices and tends to recompute current_thread_info() in unlikely code
paths even when there is no register pressure. 4.0.2 makes better
choices.
--
Chuck
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch] i386: use C code for current_thread_info()
2006-06-11 20:43 Chuck Ebbert
@ 2006-06-11 21:05 ` Emmanuel Fleury
0 siblings, 0 replies; 15+ messages in thread
From: Emmanuel Fleury @ 2006-06-11 21:05 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: linux-kernel
Chuck Ebbert wrote:
>
> I just tried gcc 3.3.3 and the kernel gets a little bigger but it boots
> and runs OK. That's the oldest compiler I can find.
>
> text data bss dec hex filename
> 3593627 559864 342728 4496219 449b5b 2.6.17-rc6-32-post/vmlinux
> 3591371 559864 342728 4493963 44928b 2.6.17-rc6-32/vmlinux
> +2256
>
> Looking at the generated code, it seems the compiler just makes dumb
> choices and tends to recompute current_thread_info() in unlikely code
> paths even when there is no register pressure. 4.0.2 makes better
> choices.
What size with gcc 4.1.2 ? (just curiosity)
Regards
--
Emmanuel Fleury | Office: 211
Associate Professor, | Phone: +33 (0)5 40 00 35 24
LaBRI, Domaine Universitaire | Fax: +33 (0)5 40 00 66 69
351, Cours de la Libération | email: fleury@labri.fr
33405 Talence Cedex, France | URL: http://www.labri.fr/~fleury
^ permalink raw reply [flat|nested] 15+ messages in thread
* [patch] i386: use C code for current_thread_info()
@ 2006-06-11 19:07 Chuck Ebbert
2006-06-11 19:33 ` Linus Torvalds
2006-06-11 19:42 ` Jan Engelhardt
0 siblings, 2 replies; 15+ messages in thread
From: Chuck Ebbert @ 2006-06-11 19:07 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Linus Torvalds
Using C code for current_thread_info() lets the compiler optimize it.
With gcc 4.0.2, kernel is smaller:
text data bss dec hex filename
3645212 555556 312024 4512792 44dc18 2.6.17-rc6-nb-post/vmlinux
3647276 555556 312024 4514856 44e428 2.6.17-rc6-nb/vmlinux
-------
-2064
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
--- 2.6.17-rc6-32.orig/include/asm-i386/thread_info.h
+++ 2.6.17-rc6-32/include/asm-i386/thread_info.h
@@ -84,17 +84,15 @@ struct thread_info {
#define init_stack (init_thread_union.stack)
+/* how to get the current stack pointer from C */
+register unsigned long current_stack_pointer asm("esp") __attribute_used__;
+
/* how to get the thread information struct from C */
static inline struct thread_info *current_thread_info(void)
{
- struct thread_info *ti;
- __asm__("andl %%esp,%0; ":"=r" (ti) : "0" (~(THREAD_SIZE - 1)));
- return ti;
+ return (struct thread_info *)(current_stack_pointer & ~(THREAD_SIZE - 1));
}
-/* how to get the current stack pointer from C */
-register unsigned long current_stack_pointer asm("esp") __attribute_used__;
-
/* thread information allocation */
#ifdef CONFIG_DEBUG_STACK_USAGE
#define alloc_thread_info(tsk) \
--
Chuck
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [patch] i386: use C code for current_thread_info()
2006-06-11 19:07 Chuck Ebbert
@ 2006-06-11 19:33 ` Linus Torvalds
2006-06-11 19:42 ` Jan Engelhardt
1 sibling, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 2006-06-11 19:33 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: linux-kernel, Andrew Morton
On Sun, 11 Jun 2006, Chuck Ebbert wrote:
>
> Using C code for current_thread_info() lets the compiler optimize it.
Ok, me likee. I just worry that this might break some older gcc version.
Have you checked with gcc-3.2 or something?
We've had that "current_stack_pointer" thing for a long while, but we only
use it in the irq-stack path (where it's really very incidental: the whole
use of the "previous_esp" thing is only for the oops tracing. Maybe there
was some reason (like gcc generating bad code) for not using it earlier?
(Admittedly, a more likely reason is probably just nobody looking at it,
but it sounds like we should check it out anyway, just in case)
Linus
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch] i386: use C code for current_thread_info()
2006-06-11 19:07 Chuck Ebbert
2006-06-11 19:33 ` Linus Torvalds
@ 2006-06-11 19:42 ` Jan Engelhardt
2006-06-11 20:33 ` Jan Engelhardt
2006-06-11 20:44 ` Alistair John Strachan
1 sibling, 2 replies; 15+ messages in thread
From: Jan Engelhardt @ 2006-06-11 19:42 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: linux-kernel, Andrew Morton, Linus Torvalds
>
>Using C code for current_thread_info() lets the compiler optimize it.
>With gcc 4.0.2, kernel is smaller:
>
> text data bss dec hex filename
> 3645212 555556 312024 4512792 44dc18 2.6.17-rc6-nb-post/vmlinux
> 3647276 555556 312024 4514856 44e428 2.6.17-rc6-nb/vmlinux
> -------
> -2064
>
If possible, can you or someone post the results for x86_64?
Jan Engelhardt
--
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch] i386: use C code for current_thread_info()
2006-06-11 19:42 ` Jan Engelhardt
@ 2006-06-11 20:33 ` Jan Engelhardt
2006-06-11 20:44 ` Alistair John Strachan
1 sibling, 0 replies; 15+ messages in thread
From: Jan Engelhardt @ 2006-06-11 20:33 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: linux-kernel, Andrew Morton, Linus Torvalds
>
>If possible, can you or someone post the results for x86_64?
>
Ignore this request - 'was a asm-i386 change ^_^
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch] i386: use C code for current_thread_info()
2006-06-11 19:42 ` Jan Engelhardt
2006-06-11 20:33 ` Jan Engelhardt
@ 2006-06-11 20:44 ` Alistair John Strachan
2006-06-12 8:10 ` Andi Kleen
1 sibling, 1 reply; 15+ messages in thread
From: Alistair John Strachan @ 2006-06-11 20:44 UTC (permalink / raw)
To: Jan Engelhardt; +Cc: Chuck Ebbert, linux-kernel, Andrew Morton, Linus Torvalds
On Sunday 11 June 2006 20:42, Jan Engelhardt wrote:
> >Using C code for current_thread_info() lets the compiler optimize it.
> >With gcc 4.0.2, kernel is smaller:
> >
> > text data bss dec hex filename
> > 3645212 555556 312024 4512792 44dc18 2.6.17-rc6-nb-post/vmlinux
> > 3647276 555556 312024 4514856 44e428 2.6.17-rc6-nb/vmlinux
> > -------
> > -2064
>
> If possible, can you or someone post the results for x86_64?
Patch is for i386, x86_64's current_thread_info() is already C.
--
Cheers,
Alistair.
Third year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch] i386: use C code for current_thread_info()
2006-06-11 20:44 ` Alistair John Strachan
@ 2006-06-12 8:10 ` Andi Kleen
0 siblings, 0 replies; 15+ messages in thread
From: Andi Kleen @ 2006-06-12 8:10 UTC (permalink / raw)
To: Alistair John Strachan
Cc: Chuck Ebbert, linux-kernel, Andrew Morton, Linus Torvalds
Alistair John Strachan <s0348365@sms.ed.ac.uk> writes:
> On Sunday 11 June 2006 20:42, Jan Engelhardt wrote:
> > >Using C code for current_thread_info() lets the compiler optimize it.
> > >With gcc 4.0.2, kernel is smaller:
> > >
> > > text data bss dec hex filename
> > > 3645212 555556 312024 4512792 44dc18 2.6.17-rc6-nb-post/vmlinux
> > > 3647276 555556 312024 4514856 44e428 2.6.17-rc6-nb/vmlinux
> > > -------
> > > -2064
> >
> > If possible, can you or someone post the results for x86_64?
>
> Patch is for i386, x86_64's current_thread_info() is already C.
Actually read_pda() is inline assembly. But so far gcc/binutils don't support
%gs references in the way the kernel needs it directly, so it has
to stay this way.
-Andi
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2006-06-13 9:27 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-06-12 17:14 [patch] i386: use C code for current_thread_info() Chuck Ebbert
2006-06-12 17:55 ` Adrian Bunk
2006-06-12 18:48 ` Andreas Mohr
-- strict thread matches above, loose matches on Subject: below --
2006-06-13 5:26 Albert Cahalan
2006-06-13 1:50 Chuck Ebbert
2006-06-13 6:43 ` Andreas Mohr
2006-06-13 9:27 ` Avi Kivity
2006-06-11 20:43 Chuck Ebbert
2006-06-11 21:05 ` Emmanuel Fleury
2006-06-11 19:07 Chuck Ebbert
2006-06-11 19:33 ` Linus Torvalds
2006-06-11 19:42 ` Jan Engelhardt
2006-06-11 20:33 ` Jan Engelhardt
2006-06-11 20:44 ` Alistair John Strachan
2006-06-12 8:10 ` Andi Kleen
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox