public inbox for linux-kernel@vger.kernel.org
* [RFC]  exit_thread() speedups in x86 process.c
@ 2005-06-14  1:16 cutaway
  2005-06-14  3:08 ` Coywolf Qi Hunt
  0 siblings, 1 reply; 11+ messages in thread
From: cutaway @ 2005-06-14  1:16 UTC (permalink / raw)
  To: linux-kernel

In the current exit_thread() implementation, it appears that including the I/O
port map teardown code within exit_thread() generates enough autovar
data that the compiler needs to spill 4 registers to the stack, resulting in
four PUSHes on entry and four POPs on exit.

When I tried extracting the map teardown into a separate function, the
situation changed dramatically: NO REGISTERS were being
pushed/popped on the normal-path entry/exit.

Below are the original generated code, the code my proposal generates, and the
#ifdef'd change that produced this elimination of the PUSH/POPs.

Unless I'm on drugs, this looks like a solid winner in a fairly important
code path :)

--------- Original exit_thread() -------
 615               .globl exit_thread
 616               .type exit_thread,@function
 617               exit_thread:
 618 02cc 55        pushl %ebp
 619 02cd 57        pushl %edi
 620 02ce 56        pushl %esi
 621 02cf 53        pushl %ebx
 622 02d0 B800E0FF  movl $-8192,%eax

    blah, blah...

 629 02e5 85C0      testl %eax,%eax
 630 02e7 7507      jne .L1675
 631               .L1657:
 632 02e9 5B        popl %ebx
 633 02ea 5E        popl %esi
 634 02eb 5F        popl %edi
 635 02ec 5D        popl %ebp
 636 02ed C3        ret
 637 02ee 89F6      .p2align 2
 638               .L1675:
 639 02f0 50        pushl %eax
 640 02f1 E8FCFFFF  call kfree


    ...Lots of stuff here to tear down port maps...

--------- Proposed exit_thread() -------

 655               .globl exit_thread
 656               .type exit_thread,@function
 657               exit_thread:
///////////////////////////////////////
// Note how all PUSH/POP's are
// gone from the mainline code now
///////////////////////////////////////
 658 0340 B800E0FF  movl $-8192,%eax
 658      FF
 659
 660 0345 21E0      andl %esp,%eax
 661
 662 0347 8B00      movl (%eax),%eax
 663 0349 05C00100  addl $448,%eax
 663      00
 664 034e 8B907C02  movl 636(%eax),%edx
 664      0000
 665 0354 85D2      testl %edx,%edx
 666 0356 7504      jne .L1676
 667 0358 C3        ret
 668 0359 8D7600    .p2align 2
 669               .L1676:
 670 035c 50        pushl %eax
 671 035d E86AFFFF  call NukePortMap
 671      FF
 672 0362 58        popl %eax
 673 0363 C3        ret

---- This is the change that eliminates the PUSH/POP's ---

#ifdef __TONYI__
static void NukePortMap(struct thread_struct *t)
{
 int cpu = get_cpu();
 struct tss_struct *tss = &per_cpu(init_tss, cpu);

 kfree(t->io_bitmap_ptr);
 t->io_bitmap_ptr = NULL;
 /*
  * Careful, clear this in the TSS too:
  */
 memset(tss->io_bitmap, 0xff, tss->io_bitmap_max);
 t->io_bitmap_max = 0;
 tss->io_bitmap_owner = NULL;
 tss->io_bitmap_max = 0;
 tss->io_bitmap_base = INVALID_IO_BITMAP_OFFSET;
 put_cpu();
}
#endif

/*
 * Free current thread data structures etc..
 */
void exit_thread(void)
{
 struct task_struct *tsk = current;
 struct thread_struct *t = &tsk->thread;

 /* The process may have allocated an io port bitmap... nuke it. */
 if (unlikely(NULL != t->io_bitmap_ptr)) {
#ifdef __TONYI__
  NukePortMap(t);
#else
  int cpu = get_cpu();
  struct tss_struct *tss = &per_cpu(init_tss, cpu);

  kfree(t->io_bitmap_ptr);
  t->io_bitmap_ptr = NULL;
  /*
   * Careful, clear this in the TSS too:
   */
  memset(tss->io_bitmap, 0xff, tss->io_bitmap_max);
  t->io_bitmap_max = 0;
  tss->io_bitmap_owner = NULL;
  tss->io_bitmap_max = 0;
  tss->io_bitmap_base = INVALID_IO_BITMAP_OFFSET;
  put_cpu();
#endif
 }
}


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] exit_thread() speedups in x86 process.c
  2005-06-14  1:16 [RFC] exit_thread() speedups in x86 process.c cutaway
@ 2005-06-14  3:08 ` Coywolf Qi Hunt
  2005-06-14  4:26   ` cutaway
  0 siblings, 1 reply; 11+ messages in thread
From: Coywolf Qi Hunt @ 2005-06-14  3:08 UTC (permalink / raw)
  To: cutaway@bellsouth.net; +Cc: linux-kernel

On 6/14/05, cutaway@bellsouth.net <cutaway@bellsouth.net> wrote:
> In the current exit_thread() implementation, it appears that including the I/O
> port map teardown code within exit_thread() generates enough autovar
> data that the compiler needs to spill 4 registers to the stack, resulting in
> four PUSHes on entry and four POPs on exit.
> 
> When I tried extracting the map teardown into a separate function, the
> situation changed dramatically: NO REGISTERS were being
> pushed/popped on the normal-path entry/exit.
> 
> Below are the original generated code, the code my proposal generates, and the
> #ifdef'd change that produced this elimination of the PUSH/POPs.
> 
> Unless I'm on drugs, this looks like a solid winner in a fairly important
> code path :)


I see the effect, but I think it would be better to leave the job to
gcc to generate better code for unlikely(), imho.

-- 
Coywolf Qi Hunt
http://ahbl.org/~coywolf/


* Re: [RFC] exit_thread() speedups in x86 process.c
  2005-06-14  3:08 ` Coywolf Qi Hunt
@ 2005-06-14  4:26   ` cutaway
  2005-06-14  7:43     ` Denis Vlasenko
  0 siblings, 1 reply; 11+ messages in thread
From: cutaway @ 2005-06-14  4:26 UTC (permalink / raw)
  To: Coywolf Qi Hunt; +Cc: linux-kernel

The problem with that approach is that GCC would still just relocate the push/pop
block to the bottom of the function.  This means you won't be likely to pick
up anything useful in L1 or L2 as the function exits normally - in fact
you'd typically be guaranteed to be picking up a partial line of gorp that
is completely worthless later on.

This is one of my issues with the notion of unlikely() being smoothed on
everywhere like Bondo<g> - it also makes it "unlikely" that you'll get any
serendipitous L1/L2 advantages that could be had by locating related
functions next to each other.

When you take the unlikely stuff completely out of line into separate
functions located elsewhere, the mainline code can make better use of the
caches.  The Intel parts thrive on L1 hits and die if they're not getting
them.

----- Original Message ----- 
From: "Coywolf Qi Hunt" <coywolf@gmail.com>


I see the effect, but I think it would be better to leave the job to
gcc to generate better code for unlikely(), imho.

-- 
Coywolf Qi Hunt
http://ahbl.org/~coywolf/



* Re: [RFC] exit_thread() speedups in x86 process.c
  2005-06-14  4:26   ` cutaway
@ 2005-06-14  7:43     ` Denis Vlasenko
  0 siblings, 0 replies; 11+ messages in thread
From: Denis Vlasenko @ 2005-06-14  7:43 UTC (permalink / raw)
  To: cutaway, Coywolf Qi Hunt; +Cc: linux-kernel

On Tuesday 14 June 2005 07:26, cutaway@bellsouth.net wrote:
> The problem with that approach is that GCC would still just relocate the push/pop
> block to the bottom of the function.  This means you won't be likely to pick
> up anything useful in L1 or L2 as the function exits normally - in fact
> you'd typically be guaranteed to be picking up a partial line of gorp that
> is completely worthless later on.
> 
> This is one of my issues with the notion of unlikely() being smoothed on
> everywhere like Bondo<g> - it also makes it "unlikely" that you'll get any
> serendipitous L1/L2 advantages that could be had by locating related
> functions next to each other.
> 
> When you take the unlikely stuff completely out of line into separate
> functions located elsewhere, the mainline code can make better use of the
> caches.  The Intel parts thrive on L1 hits and die if they're not getting
> them.

That's exactly what the compiler can do by itself. The fact that it currently
isn't smart enough to do it means that it has to be improved.
You propose that people do the compiler's job.
--
vda



* Re: [RFC] exit_thread() speedups in x86 process.c
@ 2005-06-22  5:48 Chuck Ebbert
  2005-06-22  8:41 ` cutaway
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Ebbert @ 2005-06-22  5:48 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: cutaway, Coywolf Qi Hunt, linux-kernel

On Tue, 14 Jun 2005 10:43:03 +0300, Denis Vlasenko wrote:

> On Tuesday 14 June 2005 07:26, cutaway@bellsouth.net wrote:
> > The problem with that approach is that GCC would still just relocate the push/pop
> > block to the bottom of the function.  This means you won't be likely to pick
> > up anything useful in L1 or L2 as the function exits normally - in fact
> > you'd typically be guaranteed to be picking up a partial line of gorp that
> > is completely worthless later on.
> > 
> > This is one of my issues with the notion of unlikely() being smoothed on
> > everywhere like Bondo<g> - it also makes it "unlikely" that you'll get any
> > serendipitous L1/L2 advantages that could be had by locating related
> > functions next to each other.
> > 
> > When you take the unlikely stuff completely out of line into separate
> > functions located elsewhere, the mainline code can make better use of the
> > caches.  The Intel parts thrive on L1 hits and die if they're not getting
> > them.
> 
> That's exactly what the compiler can do by itself. The fact that it currently
> isn't smart enough to do it means that it has to be improved.
> You propose that people do the compiler's job.

  Not just the compiler -- the linker would need to be a lot smarter too:


/* foo.c */

extern int bar1(void), bar2(void), bar3(void);

int main(void) {
        if (likely(bar1()))
                bar2();
        else
                bar3();
}


/* bar.c */

int bar1(void) { return 1; }
int bar2(void) { whatever; }
int bar3(void) { whatever; }


  When you compile bar.c the compiler has _no_ idea which functions are likely
to be called.

  And doing this manually is trivial:

        1. Add two sections to vmlinux.lds.S: .fast.text and .slow.text
        2. Define __fast as __attribute__((__section__(".fast.text")))
        3. Define __slow similarly
        4. Start tagging functions with __fast and __slow as needed

  Very little work for much potential gain, AFAICS.


--
Chuck


* Re: [RFC] exit_thread() speedups in x86 process.c
  2005-06-22  5:48 Chuck Ebbert
@ 2005-06-22  8:41 ` cutaway
  0 siblings, 0 replies; 11+ messages in thread
From: cutaway @ 2005-06-22  8:41 UTC (permalink / raw)
  To: Chuck Ebbert, Denis Vlasenko; +Cc: Coywolf Qi Hunt, linux-kernel

FWIW Chuck, this is precisely how OS/2 Warp 3 got page tuned when I was with
IBM Boca many years ago.  The compilers got tweaked to be able to emit
function code to different text sections and a massive system wide code
triage was undertaken based on "common usage scenario" profiling run data
from the perf analysis group.  The results spoke for themselves compared to
the previous 2.x releases - this plan can work and pay off very well.  It
DID work and paid off very well.


----- Original Message ----- 
From: "Chuck Ebbert" <76306.1226@compuserve.com>
To: "Denis Vlasenko" <vda@ilport.com.ua>
Cc: <cutaway@bellsouth.net>; "Coywolf Qi Hunt" <coywolf@gmail.com>;
"linux-kernel" <linux-kernel@vger.kernel.org>
Sent: Wednesday, June 22, 2005 01:48
Subject: Re: [RFC] exit_thread() speedups in x86 process.c


>
>   And doing this manually is trivial:
>
>         1. Add two sections to vmlinux.lds.S: .fast.text and .slow.text
>         2. Define __fast as __attribute__((__section__(".fast.text")))
>         3. Define __slow similarly
>         4. Start tagging functions with __fast and __slow as needed
>
>   Very little work for much potential gain, AFAICS.
>
>
> --
> Chuck



* Re: [RFC] exit_thread() speedups in x86 process.c
@ 2005-07-02  2:57 Chuck Ebbert
  2005-07-02 11:56 ` Denis Vlasenko
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Ebbert @ 2005-07-02  2:57 UTC (permalink / raw)
  To: cutaway@bellsouth.net, Denis Vlasenko; +Cc: linux-kernel, Coywolf Qi Hunt

On Wed, 22 Jun 2005 at 04:41:47 -0400, cutaway wrote:

> The compilers got tweaked to be able to emit
> function code to different text sections and a massive system wide code
> triage was undertaken based on "common usage scenario" profiling run data
> from the perf analysis group.

  Linux scheduler code is in its own text section already, but
that might be for profiling the code instead of for performance.
(Look for "__sched" in the source code.)

  The gains may not be as much as you think since on X86 and at least
some other archs the entire kernel is in one large page.  Still, it's
got to make some kind of sense to put infrequently-used code in its
own section just to reduce cache pollution.

  I came up with this but only the "__slow" part really makes sense:


--- 2.6.12.1/arch/i386/kernel/vmlinux.lds.S     2004-09-03 19:55:27.000000000 -0400
+++ 2.6.12.1-ce1/arch/i386/kernel/vmlinux.lds.S 2005-06-26 01:48:23.770212000 -0400
@@ -16,9 +16,11 @@ SECTIONS
   /* read-only */
   _text = .;                   /* Text and read-only data */
   .text : {
+       *(.fast.text)
        *(.text)
        SCHED_TEXT
        LOCK_TEXT
+       *(.slow.text)
        *(.fixup)
        *(.gnu.warning)
        } = 0x9090
--- 2.6.12.1/arch/x86_64/kernel/vmlinux.lds.S   2005-06-24 00:50:21.180212000 -0400
+++ 2.6.12.1-ce1/arch/x86_64/kernel/vmlinux.lds.S       2005-06-26 01:50:09.100212000 -0400
@@ -15,9 +15,11 @@ SECTIONS
   phys_startup_64 = startup_64 - LOAD_OFFSET;
   _text = .;                   /* Text and read-only data */
   .text : {
+       *(.fast.text)
        *(.text)
        SCHED_TEXT
        LOCK_TEXT
+       *(.slow.text)
        *(.fixup)
        *(.gnu.warning)
        } = 0x9090
--- 2.6.12.1/include/linux/init.h       2005-01-04 21:48:02.000000000 -0500
+++ 2.6.12.1-ce1/include/linux/init.h   2005-06-26 01:59:29.580212000 -0400
@@ -46,6 +46,17 @@
 #define __exitdata     __attribute__ ((__section__(".exit.data")))
 #define __exit_call    __attribute_used__ __attribute__ ((__section__ (".exitcall.exit")))
 
+/*
+ * Probably belongs in some other header (compiler.h?)
+ */
+#ifdef CONFIG_X86
+#define __fast         __attribute__ ((__section__(".fast.text")))
+#define __slow         __attribute__ ((__section__(".slow.text")))
+#else
+#define __fast
+#define __slow
+#endif
+
 #ifdef MODULE
 #define __exit         __attribute__ ((__section__(".exit.text")))
 #else

--
Chuck


* Re: [RFC] exit_thread() speedups in x86 process.c
  2005-07-02  2:57 Chuck Ebbert
@ 2005-07-02 11:56 ` Denis Vlasenko
  2005-07-03 19:59   ` cutaway
  0 siblings, 1 reply; 11+ messages in thread
From: Denis Vlasenko @ 2005-07-02 11:56 UTC (permalink / raw)
  To: Chuck Ebbert, cutaway@bellsouth.net; +Cc: linux-kernel, Coywolf Qi Hunt

On Saturday 02 July 2005 05:57, Chuck Ebbert wrote:
> On Wed, 22 Jun 2005 at 04:41:47 -0400, cutaway wrote:
> 
> > The compilers got tweaked to be able to emit
> > function code to different text sections and a massive system wide code
> > triage was undertaken based on "common usage scenario" profiling run data
> > from the perf analysis group.
> 
>   Linux scheduler code is in its own text section already, but
> that might be for profiling the code instead of for performance.
> (Look for "__sched" in the source code.)
> 
>   The gains may not be as much as you think since on X86 and at least
> some other archs the entire kernel is in one large page.  Still, it's
> got to make some kind of sense to put infrequently-used code in its
> own section just to reduce cache pollution.
> 
>   I came up with this

Nice.

> but only the "__slow" part really makes sense:

The 80/20 rule says that 80% of the code runs only 20% of the time,
thus we need only __fast. Everything else will be __slow by default.
(IOW: the normal .text section is __slow, no need to add another one).

If gcc someday gets per-function support for -O2 / -Os style
optimizations, they could be added to the __fast macro.

> --- 2.6.12.1/arch/i386/kernel/vmlinux.lds.S     2004-09-03 19:55:27.000000000 -0400
> +++ 2.6.12.1-ce1/arch/i386/kernel/vmlinux.lds.S 2005-06-26 01:48:23.770212000 -0400
> @@ -16,9 +16,11 @@ SECTIONS
>    /* read-only */
>    _text = .;                   /* Text and read-only data */
>    .text : {
> +       *(.fast.text)
>         *(.text)
>         SCHED_TEXT
>         LOCK_TEXT
> +       *(.slow.text)
>         *(.fixup)
>         *(.gnu.warning)
>         } = 0x9090
> --- 2.6.12.1/arch/x86_64/kernel/vmlinux.lds.S   2005-06-24 00:50:21.180212000 -0400
> +++ 2.6.12.1-ce1/arch/x86_64/kernel/vmlinux.lds.S       2005-06-26 01:50:09.100212000 -0400
> @@ -15,9 +15,11 @@ SECTIONS
>    phys_startup_64 = startup_64 - LOAD_OFFSET;
>    _text = .;                   /* Text and read-only data */
>    .text : {
> +       *(.fast.text)
>         *(.text)
>         SCHED_TEXT
>         LOCK_TEXT
> +       *(.slow.text)
>         *(.fixup)
>         *(.gnu.warning)
>         } = 0x9090
> --- 2.6.12.1/include/linux/init.h       2005-01-04 21:48:02.000000000 -0500
> +++ 2.6.12.1-ce1/include/linux/init.h   2005-06-26 01:59:29.580212000 -0400
> @@ -46,6 +46,17 @@
>  #define __exitdata     __attribute__ ((__section__(".exit.data")))
>  #define __exit_call    __attribute_used__ __attribute__ ((__section__ (".exitcall.exit")))
>  
> +/*
> + * Probably belongs in some other header (compiler.h?)
> + */
> +#ifdef CONFIG_X86
> +#define __fast         __attribute__ ((__section__(".fast.text")))
> +#define __slow         __attribute__ ((__section__(".slow.text")))
> +#else
> +#define __fast
> +#define __slow
> +#endif
> +
>  #ifdef MODULE
>  #define __exit         __attribute__ ((__section__(".exit.text")))
>  #else
--
vda



* Re: [RFC] exit_thread() speedups in x86 process.c
       [not found] ` <200507021456.40667.vda@ilport.com.ua.suse.lists.linux.kernel>
@ 2005-07-03 16:05   ` Andi Kleen
  0 siblings, 0 replies; 11+ messages in thread
From: Andi Kleen @ 2005-07-03 16:05 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: linux-kernel

Denis Vlasenko <vda@ilport.com.ua> writes:
> 
> The 80/20 rule says that 80% of the code runs only 20% of the time,
> thus we need only __fast. Everything else will be __slow by default.
> (IOW: the normal .text section is __slow, no need to add another one).

__slow could include noinline.  With unit-at-a-time, gcc otherwise tends
to inline too aggressively. With __fast that would not be
possible.


-Andi

P.S.: gcc 4.x already supports .cold, even at the basic-block level.
However, I believe it's only active with profile feedback, which is not
practical for kernel builds.


* Re: [RFC] exit_thread() speedups in x86 process.c
  2005-07-02 11:56 ` Denis Vlasenko
@ 2005-07-03 19:59   ` cutaway
  0 siblings, 0 replies; 11+ messages in thread
From: cutaway @ 2005-07-03 19:59 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: linux-kernel

----- Original Message ----- 
From: "Denis Vlasenko" <vda@ilport.com.ua>
To: "Chuck Ebbert" <76306.1226@compuserve.com>; <cutaway@bellsouth.net>
Cc: "linux-kernel" <linux-kernel@vger.kernel.org>; "Coywolf Qi Hunt"
<coywolf@gmail.com>
Sent: Saturday, July 02, 2005 07:56
Subject: Re: [RFC] exit_thread() speedups in x86 process.c


>
> The 80/20 rule says that 80% of the code runs only 20% of the time,
> thus we need only __fast. Everything else will be __slow by default.
> (IOW: the normal .text section is __slow, no need to add another one).

What makes you think __fast implies everything else should necessarily be
"slow"?

You might want to entertain the idea that some systems employ several
different speeds of memory.  If the bootstrap were someday to measure the
available memory regions for response time and place portions of the system
accordingly, the penalty for assuming everything outside __fast is "slow"
could be extreme.



* Re: [RFC] exit_thread() speedups in x86 process.c
@ 2005-07-05 18:30 William Cohen
  0 siblings, 0 replies; 11+ messages in thread
From: William Cohen @ 2005-07-05 18:30 UTC (permalink / raw)
  To: linux-kernel

One of the difficulties with function reordering is getting useful data to 
figure out a reasonable order for the functions. People do guess wrong 
about the frequency of particular functions, and naive ordering techniques 
like simply sorting functions by frequency do not work well.

A North Carolina State University senior project was done to generate 
ordering to improve TLB and cache hit ratios using the information 
generated by valgrind for user-space programs:

http://www.bclennox.com/cgi-bin/show.cgi?page=home

Xen runs the domains (and kernels) in user space. Would it be 
possible to adapt valgrind/Xen to run a domain so that similar 
information could be collected about a particular kernel?  Being able to 
use valgrind on the kernel could allow other analyses of the kernel 
code, e.g. finding unwanted TLB flushes.

-Will

I am not subscribed to the list, so please cc me on responses.

