* [RFC] exit_thread() speedups in x86 process.c
@ 2005-06-14 1:16 cutaway
2005-06-14 3:08 ` Coywolf Qi Hunt
0 siblings, 1 reply; 11+ messages in thread
From: cutaway @ 2005-06-14 1:16 UTC (permalink / raw)
To: linux-kernel
In the current exit_thread() implementation, it appears that including the
I/O port map teardown code inline generates enough automatic-variable data
that the compiler needs to spill 4 registers to the stack, resulting in
(4) PUSHes on entry and (4) POPs on exit.
When I tried extracting the map teardown into a separate function, the
situation changed dramatically to where NO REGISTERS were being
pushed/popped in the normal path entry/exit.
Below are the original generated code, the code my proposal generated, and
the #ifdef'd change that produced this elimination of the PUSH/POPs.
Unless I'm on drugs, this looks like a solid winner in a fairly important
code path :)
--------- Original exit_thread() -------
615 .globl exit_thread
616 .type exit_thread,@function
617 exit_thread:
618 02cc 55 pushl %ebp
619 02cd 57 pushl %edi
620 02ce 56 pushl %esi
621 02cf 53 pushl %ebx
622 02d0 B800E0FF movl $-8192,%eax
blah, blah...
629 02e5 85C0 testl %eax,%eax
630 02e7 7507 jne .L1675
631 .L1657:
632 02e9 5B popl %ebx
633 02ea 5E popl %esi
634 02eb 5F popl %edi
635 02ec 5D popl %ebp
636 02ed C3 ret
637 02ee 89F6 .p2align 2
638 .L1675:
639 02f0 50 pushl %eax
640 02f1 E8FCFFFF call kfree
...Lots of stuff here to tear down port maps...
--------- Proposed exit_thread() -------
655 .globl exit_thread
656 .type exit_thread,@function
657 exit_thread:
///////////////////////////////////////
// Note how all PUSH/POP's are
// gone from the mainline code now
///////////////////////////////////////
658 0340 B800E0FF movl $-8192,%eax
658 FF
659
660 0345 21E0 andl %esp,%eax
661
662 0347 8B00 movl (%eax),%eax
663 0349 05C00100 addl $448,%eax
663 00
664 034e 8B907C02 movl 636(%eax),%edx
664 0000
665 0354 85D2 testl %edx,%edx
666 0356 7504 jne .L1676
667 0358 C3 ret
668 0359 8D7600 .p2align 2
669 .L1676:
670 035c 50 pushl %eax
671 035d E86AFFFF call NukePortMap
671 FF
672 0362 58 popl %eax
673 0363 C3 ret
---- This is the change that eliminates the PUSH/POP's ---
#ifdef __TONYI__
static void NukePortMap(struct thread_struct *t)
{
	int cpu = get_cpu();
	struct tss_struct *tss = &per_cpu(init_tss, cpu);

	kfree(t->io_bitmap_ptr);
	t->io_bitmap_ptr = NULL;
	/*
	 * Careful, clear this in the TSS too:
	 */
	memset(tss->io_bitmap, 0xff, tss->io_bitmap_max);
	t->io_bitmap_max = 0;
	tss->io_bitmap_owner = NULL;
	tss->io_bitmap_max = 0;
	tss->io_bitmap_base = INVALID_IO_BITMAP_OFFSET;
	put_cpu();
}
#endif

/*
 * Free current thread data structures etc..
 */
void exit_thread(void)
{
	struct task_struct *tsk = current;
	struct thread_struct *t = &tsk->thread;

	/* The process may have allocated an io port bitmap... nuke it. */
	if (unlikely(NULL != t->io_bitmap_ptr)) {
#ifdef __TONYI__
		NukePortMap(t);
#else
		int cpu = get_cpu();
		struct tss_struct *tss = &per_cpu(init_tss, cpu);

		kfree(t->io_bitmap_ptr);
		t->io_bitmap_ptr = NULL;
		/*
		 * Careful, clear this in the TSS too:
		 */
		memset(tss->io_bitmap, 0xff, tss->io_bitmap_max);
		t->io_bitmap_max = 0;
		tss->io_bitmap_owner = NULL;
		tss->io_bitmap_max = 0;
		tss->io_bitmap_base = INVALID_IO_BITMAP_OFFSET;
		put_cpu();
#endif
	}
}
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] exit_thread() speedups in x86 process.c
2005-06-14 1:16 [RFC] exit_thread() speedups in x86 process.c cutaway
@ 2005-06-14 3:08 ` Coywolf Qi Hunt
2005-06-14 4:26 ` cutaway
From: Coywolf Qi Hunt @ 2005-06-14 3:08 UTC (permalink / raw)
To: cutaway@bellsouth.net; +Cc: linux-kernel
On 6/14/05, cutaway@bellsouth.net <cutaway@bellsouth.net> wrote:
> In the current exit_thread() implementation, it appears including the I/O
> port map tear down code within the exit_thread() generates enough autovar
> data that the compiler needs to spill 4 registers to the stack resulting in
> (4) PUSH on entry and (4) POP on exit.
>
> When I tried extracting the map teardown into a separate function, the
> situation changed dramatically to where NO REGISTERS were being
> pushed/popped in the normal path entry/exit.
>
> Below is the original generated code, code my proposal generated, and an
> #ifdef'd change that produced this elimination of the PUSH/POP's.
>
> Unless I'm on drugs, this looks like a solid winner in a fairly important
> code path :)
I see the effect, but I think it would be better to leave the job to
gcc to generate better code for unlikely(), imho.
--
Coywolf Qi Hunt
http://ahbl.org/~coywolf/
* Re: [RFC] exit_thread() speedups in x86 process.c
2005-06-14 3:08 ` Coywolf Qi Hunt
@ 2005-06-14 4:26 ` cutaway
2005-06-14 7:43 ` Denis Vlasenko
From: cutaway @ 2005-06-14 4:26 UTC (permalink / raw)
To: Coywolf Qi Hunt; +Cc: linux-kernel
The problem with that approach is GCC would still just relocate the push/pop
block to the bottom of the function. This means you won't be likely to pick
up anything useful in L1 or L2 as the function exits normally - in fact
you'd typically be guaranteed to be picking up a partial line of gorp that
is completely worthless later on.
This is one of my issues with the notion of unlikely() being smoothed on
everywhere like Bondo<g> - it also makes it "unlikely" that you'll get any
serendipitous L1/L2 advantages that could be had by locating related
functions next to each other.
When you take the unlikely stuff completely out of line into separate
functions located elsewhere, the mainline code can make better use of the
caches. The Intel parts thrive on L1 hits and die if they're not getting
them.
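The out-of-line pattern being argued for here can be sketched in user-space
C (the names thread_state, fast_exit, and teardown_bitmap are illustrative,
not the kernel's):

```c
#include <stdlib.h>
#include <assert.h>

/* Hypothetical stand-in for the thread's io bitmap pointer. */
struct thread_state {
	unsigned long *io_bitmap_ptr;
};

/*
 * Marking the rare teardown path noinline keeps its autovar setup and
 * register spills out of the caller's prologue, so the common path
 * stays a short, push/pop-free sequence.
 */
__attribute__((noinline)) static void teardown_bitmap(struct thread_state *t)
{
	free(t->io_bitmap_ptr);
	t->io_bitmap_ptr = NULL;
}

static void fast_exit(struct thread_state *t)
{
	/* Common case: no bitmap allocated; return with no stack traffic. */
	if (__builtin_expect(t->io_bitmap_ptr != NULL, 0))
		teardown_bitmap(t);
}
```

Compiling this at -O2 and inspecting the assembly shows the same effect as
the listings above: the hot path of fast_exit needs no saved registers.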
----- Original Message -----
From: "Coywolf Qi Hunt" <coywolf@gmail.com>
I see the effect, but I think it would be better to leave the job to
gcc to generate better code for unlikely(), imho.
--
Coywolf Qi Hunt
http://ahbl.org/~coywolf/
* Re: [RFC] exit_thread() speedups in x86 process.c
2005-06-14 4:26 ` cutaway
@ 2005-06-14 7:43 ` Denis Vlasenko
From: Denis Vlasenko @ 2005-06-14 7:43 UTC (permalink / raw)
To: cutaway, Coywolf Qi Hunt; +Cc: linux-kernel
On Tuesday 14 June 2005 07:26, cutaway@bellsouth.net wrote:
> The problem with that approach is GCC would still just relocate the push/pop
> block to the bottom of the function. This means you won't be likely to pick
> up anything useful in L1 or L2 as the function exits normally - in fact
> you'd typically be guaranteed to be picking up a partial line of gorp that
> is completely worthless later on.
>
> This is one of my issues with the notion of unlikely() being smoothed on
> everywhere like Bondo<g> - it also makes it "unlikely" that you'll get any
> serendipitous L1/L2 advantages that could be had by locating related
> functions next to each other.
>
> When you take the unlikely stuff completely out of line into separate
> functions located elsewhere, the mainline code can make better use of the
> caches. The Intel parts thrive on L1 hits and die if they're not getting
> them.
That's exactly what the compiler can do by itself. The fact that it
currently isn't smart enough to do it means that it has to be improved.
You propose that people do the compiler's job.
--
vda
* Re: [RFC] exit_thread() speedups in x86 process.c
@ 2005-06-22 5:48 Chuck Ebbert
2005-06-22 8:41 ` cutaway
From: Chuck Ebbert @ 2005-06-22 5:48 UTC (permalink / raw)
To: Denis Vlasenko; +Cc: cutaway, Coywolf Qi Hunt, linux-kernel
On Tue, 14 Jun 2005 10:43:03 +0300, Denis Vlasenko wrote:
> On Tuesday 14 June 2005 07:26, cutaway@bellsouth.net wrote:
> > The problem with that approach is GCC would still just relocate the push/pop
> > block to the bottom of the function. This means you won't be likely to pick
> > up anything useful in L1 or L2 as the function exits normally - in fact
> > you'd typically be guaranteed to be picking up a partial line of gorp that
> > is completely worthless later on.
> >
> > This is one of my issues with the notion of unlikely() being smoothed on
> > everywhere like Bondo<g> - it also makes it "unlikely" that you'll get any
> > serendipitous L1/L2 advantages that could be had by locating related
> > functions next to each other.
> >
> > When you take the unlikely stuff completely out of line into separate
> > functions located elsewhere, the mainline code can make better use of the
> > caches. The Intel parts thrive on L1 hits and die if they're not getting
> > them.
>
> That's exactly what compiler can do by itself. The fact that currently
> it isn't smart enough to od it means that it has to be improved.
> You propose that people have to do compiler's job.
Not just the compiler -- the linker would need to be a lot smarter too:
/* foo.c */
extern int bar1(void), bar2(void), bar3(void);
int main(void)
{
	if (likely(bar1()))
		bar2();
	else
		bar3();
}
/* bar.c */
int bar1(void) { return 1; }
int bar2(void) { whatever; }
int bar3(void) { whatever; }
When you compile bar.c the compiler has _no_ idea which functions are likely
to be called.
And doing this manually is trivial:
1. Add two sections to vmlinux.lds.S: .fast.text and .slow.text
2. Define __fast as __attribute__ ((__section__(".fast.text")))
3. Define __slow similarly
4. Start tagging functions with __fast and __slow as needed
Very little work for much potential gain, AFAICS.
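A user-space sketch of steps 2-4 (the macro names follow the proposal
above; where these sections end up is of course decided by the linker
script):

```c
/*
 * Hypothetical macros per the proposal. On a compiler without section
 * attributes these would fall back to empty definitions.
 */
#define __fast __attribute__((__section__(".fast.text")))
#define __slow __attribute__((__section__(".slow.text")))

/* Hot-path function, grouped with other frequently executed code. */
__fast int common_case(int x)
{
	return x + 1;
}

/* Rarely executed error path, parked away from the hot text. */
__slow int rare_error_path(int x)
{
	return -x;
}
```

With gcc and a default linker script the orphan sections are simply
appended to the image; in the kernel, the vmlinux.lds.S entries would
control their placement relative to .text.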
--
Chuck
* Re: [RFC] exit_thread() speedups in x86 process.c
2005-06-22 5:48 Chuck Ebbert
@ 2005-06-22 8:41 ` cutaway
From: cutaway @ 2005-06-22 8:41 UTC (permalink / raw)
To: Chuck Ebbert, Denis Vlasenko; +Cc: Coywolf Qi Hunt, linux-kernel
FWIW Chuck, this is precisely how OS/2 Warp 3 got page-tuned when I was with
IBM Boca many years ago. The compilers were tweaked to be able to emit
function code to different text sections, and a massive system-wide code
triage was undertaken based on "common usage scenario" profiling data
from the perf analysis group. The results spoke for themselves compared to
the previous 2.x releases - this plan can work, and it DID work and paid
off very well.
----- Original Message -----
From: "Chuck Ebbert" <76306.1226@compuserve.com>
To: "Denis Vlasenko" <vda@ilport.com.ua>
Cc: <cutaway@bellsouth.net>; "Coywolf Qi Hunt" <coywolf@gmail.com>;
"linux-kernel" <linux-kernel@vger.kernel.org>
Sent: Wednesday, June 22, 2005 01:48
Subject: Re: [RFC] exit_thread() speedups in x86 process.c
>
> And doing this manually is trivial:
>
> 1. Add two sections to vmlinux.lds.S: .fast.text and .slow.text
> 2. Define __fast as __attribute__ ((__section__(".fast.text")))
> 3. Define __slow similarly
> 4. Start tagging functions with __fast and __slow as needed
>
> Very little work for much potential gain, AFAICS.
>
>
> --
> Chuck
* Re: [RFC] exit_thread() speedups in x86 process.c
@ 2005-07-02 2:57 Chuck Ebbert
2005-07-02 11:56 ` Denis Vlasenko
From: Chuck Ebbert @ 2005-07-02 2:57 UTC (permalink / raw)
To: cutaway@bellsouth.net, Denis Vlasenko; +Cc: linux-kernel, Coywolf Qi Hunt
On Wed, 22 Jun 2005 at 04:41:47 -0400, cutaway wrote:
> The compilers got tweaked to be able to emit
> function code to different text sections and a massive system wide code
> triage was undertaken based on "common usage scenario" profiling run data
> from the perf analysis group.
Linux scheduler code is in its own text section already, but
that might be for profiling the code instead of for performance.
(Look for "__sched" in the source code.)
The gains may not be as much as you think since on X86 and at least
some other archs the entire kernel is in one large page. Still, it's
got to make some kind of sense to put infrequently-used code in its
own section just to reduce cache pollution.
I came up with this but only the "__slow" part really makes sense:
--- 2.6.12.1/arch/i386/kernel/vmlinux.lds.S 2004-09-03 19:55:27.000000000 -0400
+++ 2.6.12.1-ce1/arch/i386/kernel/vmlinux.lds.S 2005-06-26 01:48:23.770212000 -0400
@@ -16,9 +16,11 @@ SECTIONS
/* read-only */
_text = .; /* Text and read-only data */
.text : {
+ *(.fast.text)
*(.text)
SCHED_TEXT
LOCK_TEXT
+ *(.slow.text)
*(.fixup)
*(.gnu.warning)
} = 0x9090
--- 2.6.12.1/arch/x86_64/kernel/vmlinux.lds.S 2005-06-24 00:50:21.180212000 -0400
+++ 2.6.12.1-ce1/arch/x86_64/kernel/vmlinux.lds.S 2005-06-26 01:50:09.100212000 -0400
@@ -15,9 +15,11 @@ SECTIONS
phys_startup_64 = startup_64 - LOAD_OFFSET;
_text = .; /* Text and read-only data */
.text : {
+ *(.fast.text)
*(.text)
SCHED_TEXT
LOCK_TEXT
+ *(.slow.text)
*(.fixup)
*(.gnu.warning)
} = 0x9090
--- 2.6.12.1/include/linux/init.h 2005-01-04 21:48:02.000000000 -0500
+++ 2.6.12.1-ce1/include/linux/init.h 2005-06-26 01:59:29.580212000 -0400
@@ -46,6 +46,17 @@
#define __exitdata __attribute__ ((__section__(".exit.data")))
#define __exit_call __attribute_used__ __attribute__ ((__section__ (".exitcall.exit")))
+/*
+ * Probably belongs in some other header (compiler.h?)
+ */
+#ifdef CONFIG_X86
+#define __fast __attribute__ ((__section__(".fast.text")))
+#define __slow __attribute__ ((__section__(".slow.text")))
+#else
+#define __fast
+#define __slow
+#endif
+
#ifdef MODULE
#define __exit __attribute__ ((__section__(".exit.text")))
#else
--
Chuck
* Re: [RFC] exit_thread() speedups in x86 process.c
2005-07-02 2:57 Chuck Ebbert
@ 2005-07-02 11:56 ` Denis Vlasenko
2005-07-03 19:59 ` cutaway
From: Denis Vlasenko @ 2005-07-02 11:56 UTC (permalink / raw)
To: Chuck Ebbert, cutaway@bellsouth.net; +Cc: linux-kernel, Coywolf Qi Hunt
On Saturday 02 July 2005 05:57, Chuck Ebbert wrote:
> On Wed, 22 Jun 2005 at 04:41:47 -0400, cutaway wrote:
>
> > The compilers got tweaked to be able to emit
> > function code to different text sections and a massive system wide code
> > triage was undertaken based on "common usage scenario" profiling run data
> > from the perf analysis group.
>
> Linux scheduler code is in its own text section already, but
> that might be for profiling the code instead of for performance.
> (Look for "__sched" in the source code.)
>
> The gains may not be as much as you think since on X86 and at least
> some other archs the entire kernel is in one large page. Still, it's
> got to make some kind of sense to put infrequently-used code in its
> own section just to reduce cache pollution.
>
> I came up with this
Nice.
> but only the "__slow" part really makes sense:
80/20 rule says that 80% of code runs 20% of time,
thus we need only __fast. Everything else will be by default __slow.
(IOW: normal .text section is __slow, no need to add another one).
If gcc someday gets per-function support for -O2 / -Os style
optimizations, they could be added to the __fast macro.
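A sketch of what folding gcc's per-function optimize attribute into such a
__fast macro might look like (the optimize attribute is a gcc extension;
the version guard and all names here are illustrative, following the
thread's proposal rather than any existing kernel macro):

```c
/*
 * Sketch: __fast places the function in a hot-grouping section and, on
 * compilers that support it, requests -O2-style codegen for just this
 * function; elsewhere it degrades to a plain section tag.
 */
#if defined(__GNUC__) && (__GNUC__ >= 5)
#define __fast __attribute__((__section__(".fast.text"), __optimize__("O2")))
#else
#define __fast __attribute__((__section__(".fast.text")))
#endif

__fast int hot_add(int a, int b)
{
	return a + b;
}
```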
> --- 2.6.12.1/arch/i386/kernel/vmlinux.lds.S 2004-09-03 19:55:27.000000000 -0400
> +++ 2.6.12.1-ce1/arch/i386/kernel/vmlinux.lds.S 2005-06-26 01:48:23.770212000 -0400
> @@ -16,9 +16,11 @@ SECTIONS
> /* read-only */
> _text = .; /* Text and read-only data */
> .text : {
> + *(.fast.text)
> *(.text)
> SCHED_TEXT
> LOCK_TEXT
> + *(.slow.text)
> *(.fixup)
> *(.gnu.warning)
> } = 0x9090
> --- 2.6.12.1/arch/x86_64/kernel/vmlinux.lds.S 2005-06-24 00:50:21.180212000 -0400
> +++ 2.6.12.1-ce1/arch/x86_64/kernel/vmlinux.lds.S 2005-06-26 01:50:09.100212000 -0400
> @@ -15,9 +15,11 @@ SECTIONS
> phys_startup_64 = startup_64 - LOAD_OFFSET;
> _text = .; /* Text and read-only data */
> .text : {
> + *(.fast.text)
> *(.text)
> SCHED_TEXT
> LOCK_TEXT
> + *(.slow.text)
> *(.fixup)
> *(.gnu.warning)
> } = 0x9090
> --- 2.6.12.1/include/linux/init.h 2005-01-04 21:48:02.000000000 -0500
> +++ 2.6.12.1-ce1/include/linux/init.h 2005-06-26 01:59:29.580212000 -0400
> @@ -46,6 +46,17 @@
> #define __exitdata __attribute__ ((__section__(".exit.data")))
> #define __exit_call __attribute_used__ __attribute__ ((__section__ (".exitcall.exit")))
>
> +/*
> + * Probably belongs in some other header (compiler.h?)
> + */
> +#ifdef CONFIG_X86
> +#define __fast __attribute__ ((__section__(".fast.text")))
> +#define __slow __attribute__ ((__section__(".slow.text")))
> +#else
> +#define __fast
> +#define __slow
> +#endif
> +
> #ifdef MODULE
> #define __exit __attribute__ ((__section__(".exit.text")))
> #else
--
vda
* Re: [RFC] exit_thread() speedups in x86 process.c
[not found] ` <200507021456.40667.vda@ilport.com.ua.suse.lists.linux.kernel>
@ 2005-07-03 16:05 ` Andi Kleen
From: Andi Kleen @ 2005-07-03 16:05 UTC (permalink / raw)
To: Denis Vlasenko; +Cc: linux-kernel
Denis Vlasenko <vda@ilport.com.ua> writes:
>
> 80/20 rule says that 80% of code runs 20% of time,
> thus we need only __fast. Everything else will be by default __slow.
> (IOW: normal .text section is __slow, no need to add another one).
__slow could include noinline. With unit-at-a-time, gcc otherwise tends
to inline too aggressively. With __fast that would not be possible.
-Andi
P.S.: gcc 4.x already supports .cold even on the basic block level.
However I believe it's only active with profile feedback, which is not
practical for kernel builds.
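Folding the noinline suggestion into the earlier macro might look like this
(a sketch: the section name follows Chuck's patch, the noinline addition
follows Andi's note, and cold_fallback is an illustrative name):

```c
/*
 * __slow both moves the function to the cold section and forbids
 * inlining, so unit-at-a-time gcc cannot pull the cold body back into
 * a hot caller.
 */
#ifdef __GNUC__
#define __slow __attribute__((__section__(".slow.text"), __noinline__))
#else
#define __slow
#endif

__slow int cold_fallback(int x)
{
	return x * 2;
}
```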
* Re: [RFC] exit_thread() speedups in x86 process.c
2005-07-02 11:56 ` Denis Vlasenko
@ 2005-07-03 19:59 ` cutaway
From: cutaway @ 2005-07-03 19:59 UTC (permalink / raw)
To: Denis Vlasenko; +Cc: linux-kernel
----- Original Message -----
From: "Denis Vlasenko" <vda@ilport.com.ua>
To: "Chuck Ebbert" <76306.1226@compuserve.com>; <cutaway@bellsouth.net>
Cc: "linux-kernel" <linux-kernel@vger.kernel.org>; "Coywolf Qi Hunt"
<coywolf@gmail.com>
Sent: Saturday, July 02, 2005 07:56
Subject: Re: [RFC] exit_thread() speedups in x86 process.c
>
> 80/20 rule says that 80% of code runs 20% of time,
> thus we need only __fast. Everything else will be by default __slow.
> (IOW: normal .text section is __slow, no need to add another one).
What makes you think __fast implies everything else should necessarily be
"slow"?
You might want to entertain the idea that some systems employ several
different speeds of memory, where the penalty for making such assumptions
could be extreme if the bootstrap were someday to measure available memory
regions for response time and locate portions of the system accordingly.
* Re: [RFC] exit_thread() speedups in x86 process.c
@ 2005-07-05 18:30 William Cohen
From: William Cohen @ 2005-07-05 18:30 UTC (permalink / raw)
To: linux-kernel
One of the difficulties with function reordering is getting useful data to
figure out a reasonable order for the functions. People do guess wrong
about the frequency of particular functions, and naive ordering techniques
like simply ordering functions by frequency do not work well.
A North Carolina State University senior project was done to generate
ordering to improve TLB and cache hit ratios using the information
generated by valgrind for user-space programs:
http://www.bclennox.com/cgi-bin/show.cgi?page=home
Xen is running the domains (and kernels) in user-space. Would it be
possible to adapt valgrind/xen to run a domain so that similar
information could be collected about a particular kernel? Being able to
use valgrind on the kernel could allow other analysis of the kernel
code, e.g. find unwanted TLB flushes.
-Will
I am not subscribed to the list, so please cc me on responses.