public inbox for linux-kernel@vger.kernel.org
* Re: [RFC] exit_thread() speedups in x86 process.c
@ 2005-07-05 18:30 William Cohen
  0 siblings, 0 replies; 11+ messages in thread
From: William Cohen @ 2005-07-05 18:30 UTC (permalink / raw)
  To: linux-kernel

One of the difficulties with function reordering is getting useful data to 
figure out a reasonable order for the functions. People do guess wrong 
on the frequency of particular functions. Also naive ordering techniques 
like just ordering functions based on frequency do not work well.

A North Carolina State University senior project generated function 
orderings that improve TLB and cache hit ratios, using information 
collected by valgrind for user-space programs:

http://www.bclennox.com/cgi-bin/show.cgi?page=home

Xen is running the domains (and kernels) in user-space. Would it be 
possible to adapt valgrind/xen to run a domain so that similar 
information could be collected about a particular kernel?  Being able to 
use valgrind on the kernel could allow other analysis of the kernel 
code, e.g. find unwanted TLB flushes.

-Will

I am not subscribed to the list, so please cc me on responses.

[parent not found: <200507012258_MC3-1-A340-3A81@compuserve.com.suse.lists.linux.kernel>]
* Re: [RFC] exit_thread() speedups in x86 process.c
@ 2005-07-02  2:57 Chuck Ebbert
  2005-07-02 11:56 ` Denis Vlasenko
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Ebbert @ 2005-07-02  2:57 UTC (permalink / raw)
  To: cutaway@bellsouth.net, Denis Vlasenko; +Cc: linux-kernel, Coywolf Qi Hunt

On Wed, 22 Jun 2005 at 04:41:47 -0400, cutaway wrote:

> The compilers got tweaked to be able to emit
> function code to different text sections and a massive system wide code
> triage was undertaken based on "common usage scenario" profiling run data
> from the perf analysis group.

  Linux scheduler code is in its own text section already, but
that might be for profiling the code instead of for performance.
(Look for "__sched" in the source code.)
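
  As far as the mainline source shows, the profiling guess is right: the
comment on __sched says it marks functions to be ignored in wchan output,
and the linker script brackets the section with start/end symbols so that
an address can be tested against it.  A minimal user-space sketch of that
range check, assuming GCC and GNU ld on an ELF target.  The section name
demo_sched_text and every function name below are invented for
illustration, and the sketch relies on the linker's automatic
__start_/__stop_ symbols rather than on the kernel's linker script:

```c
#include <stdint.h>

/* Hypothetical tag modelled on __sched; noinline keeps a real body in the section. */
#define __demo_sched \
        __attribute__((__noinline__, __section__("demo_sched_text")))

/*
 * GNU ld defines these two symbols automatically for any section whose
 * name is a valid C identifier.
 */
extern const char __start_demo_sched_text[];
extern const char __stop_demo_sched_text[];

/* Tagged function: emitted into demo_sched_text. */
int __demo_sched demo_schedule(int x) { return x + 1; }

/* Untagged function: stays in the default .text section. */
int demo_ordinary(int x) { return x - 1; }

/* Nonzero if addr lies inside the tagged section. */
int in_demo_sched_functions(uintptr_t addr)
{
        return addr >= (uintptr_t)__start_demo_sched_text &&
               addr <  (uintptr_t)__stop_demo_sched_text;
}
```

  A profiler can use a check of this kind to skip over tagged functions
when it walks a stack; it says nothing about cache behaviour.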

  The gains may not be as much as you think since on X86 and at least
some other archs the entire kernel is in one large page.  Still, it's
got to make some kind of sense to put infrequently-used code in its
own section just to reduce cache pollution.

  I came up with this but only the "__slow" part really makes sense:


--- 2.6.12.1/arch/i386/kernel/vmlinux.lds.S     2004-09-03 19:55:27.000000000 -0400
+++ 2.6.12.1-ce1/arch/i386/kernel/vmlinux.lds.S 2005-06-26 01:48:23.770212000 -0400
@@ -16,9 +16,11 @@ SECTIONS
   /* read-only */
   _text = .;                   /* Text and read-only data */
   .text : {
+       *(.fast.text)
        *(.text)
        SCHED_TEXT
        LOCK_TEXT
+       *(.slow.text)
        *(.fixup)
        *(.gnu.warning)
        } = 0x9090
--- 2.6.12.1/arch/x86_64/kernel/vmlinux.lds.S   2005-06-24 00:50:21.180212000 -0400
+++ 2.6.12.1-ce1/arch/x86_64/kernel/vmlinux.lds.S       2005-06-26 01:50:09.100212000 -0400
@@ -15,9 +15,11 @@ SECTIONS
   phys_startup_64 = startup_64 - LOAD_OFFSET;
   _text = .;                   /* Text and read-only data */
   .text : {
+       *(.fast.text)
        *(.text)
        SCHED_TEXT
        LOCK_TEXT
+       *(.slow.text)
        *(.fixup)
        *(.gnu.warning)
        } = 0x9090
--- 2.6.12.1/include/linux/init.h       2005-01-04 21:48:02.000000000 -0500
+++ 2.6.12.1-ce1/include/linux/init.h   2005-06-26 01:59:29.580212000 -0400
@@ -46,6 +46,17 @@
 #define __exitdata     __attribute__ ((__section__(".exit.data")))
 #define __exit_call    __attribute_used__ __attribute__ ((__section__ (".exitcall.exit")))
 
+/*
+ * Probably belongs in some other header (compiler.h?)
+ */
+#ifdef CONFIG_X86
+#define __fast         __attribute__ ((__section__(".fast.text")))
+#define __slow         __attribute__ ((__section__(".slow.text")))
+#else
+#define __fast
+#define __slow
+#endif
+
 #ifdef MODULE
 #define __exit         __attribute__ ((__section__(".exit.text")))
 #else
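
  With macros like these, tagging is a one-word change per function.  A
sketch of what that might look like, written as a stand-alone user-space
file so that it builds outside the kernel.  The __ELF__ test stands in
for CONFIG_X86, the noinline on __slow is an addition that is not in the
patch, and parse_packet() and report_bad_packet() are invented names, not
kernel functions:

```c
#include <stddef.h>

/* Same idea as the patch, keyed on the object format instead of CONFIG_X86. */
#if defined(__GNUC__) && defined(__ELF__)
#define __fast  __attribute__ ((__section__(".fast.text")))
/* noinline (not in the patch) keeps the body from being merged into its caller. */
#define __slow  __attribute__ ((__noinline__, __section__(".slow.text")))
#else
#define __fast
#define __slow
#endif

/* Rare error path: emitted into .slow.text, away from the hot code. */
static int __slow report_bad_packet(size_t len)
{
        return -(int)len - 1;   /* stand-in for logging and cleanup */
}

/* Hot path: emitted into .fast.text. */
int __fast parse_packet(const unsigned char *buf, size_t len)
{
        if (buf == NULL || len == 0)
                return report_bad_packet(len);
        return buf[0];
}
```

  Without a matching linker script change the two sections end up
wherever the linker places orphan sections, so this only demonstrates the
annotation, not the final layout.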

--
Chuck

* Re: [RFC] exit_thread() speedups in x86 process.c
@ 2005-06-22  5:48 Chuck Ebbert
  2005-06-22  8:41 ` cutaway
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Ebbert @ 2005-06-22  5:48 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: cutaway, Coywolf Qi Hunt, linux-kernel

On Tue, 14 Jun 2005 10:43:03 +0300, Denis Vlasenko wrote:

> On Tuesday 14 June 2005 07:26, cutaway@bellsouth.net wrote:
> > The problem with that approach is GCC would still just relocate the push/pop
> > block to the bottom of the function.  This means you won't be likely to pick
> > up anything useful in L1 or L2 as the function exits normally - in fact
> > you'd typically be guaranteed to be picking up a partial line of gorp that
> > is completely worthless later on.
> > 
> > This is one of my issues with the notion of unlikely() being smoothed on
> > everywhere like Bondo<g> - it also makes it "unlikely" that you'll get any
> > serendipitous L1/L2 advantages that could be had by locating related
> > functions next to each other.
> > 
> > When you take the unlikely stuff completely out of line in separate
> > functions located elsewhere, the mainline code can make better use of the
> > caches.  The Intel parts thrive on L1 hits and die if they're not getting
> > them.
> 
> That's exactly what the compiler can do by itself. The fact that currently
> it isn't smart enough to do it means that it has to be improved.
> You propose that people have to do the compiler's job.

  Not just the compiler -- the linker would need to be a lot smarter too:


/* foo.c */

#define likely(x)       __builtin_expect(!!(x), 1)

extern int bar1(void), bar2(void), bar3(void);

int main(void) {
        if (likely(bar1()))
                bar2();
        else
                bar3();
        return 0;
}


/* bar.c */

int bar1(void) { return 1; }
int bar2(void) { return 0; /* whatever */ }
int bar3(void) { return 0; /* whatever */ }


  When you compile bar.c the compiler has _no_ idea which functions are likely
to be called.

  And doing this manually is trivial:

        1. Add two sections to vmlinux.lds.S: .fast.text and .slow.text
        2. Define __fast as __attribute__((__section__(".fast.text")))
        3. Define __slow similarly
        4. Start tagging functions with __fast and __slow as needed

  Very little work for much potential gain, AFAICS.


--
Chuck

* [RFC]  exit_thread() speedups in x86 process.c
@ 2005-06-14  1:16 cutaway
  2005-06-14  3:08 ` Coywolf Qi Hunt
  0 siblings, 1 reply; 11+ messages in thread
From: cutaway @ 2005-06-14  1:16 UTC (permalink / raw)
  To: linux-kernel

In the current exit_thread() implementation, it appears that including the
I/O port map teardown code within exit_thread() generates enough autovar
data that the compiler needs to spill 4 registers to the stack, resulting
in 4 PUSHes on entry and 4 POPs on exit.

When I tried extracting the map teardown into a separate function, the
situation changed dramatically to where NO REGISTERS were being
pushed/popped in the normal path entry/exit.

Below is the original generated code, code my proposal generated, and an
#ifdef'd change that produced this elimination of the PUSH/POP's.

Unless I'm on drugs, this looks like a solid winner in a fairly important
code path :)

--------- Original exit_thread() -------
 615               .globl exit_thread
 616               .type exit_thread,@function
 617               exit_thread:
 618 02cc 55        pushl %ebp
 619 02cd 57        pushl %edi
 620 02ce 56        pushl %esi
 621 02cf 53        pushl %ebx
 622 02d0 B800E0FF  movl $-8192,%eax

    blah, blah...

 629 02e5 85C0      testl %eax,%eax
 630 02e7 7507      jne .L1675
 631               .L1657:
 632 02e9 5B        popl %ebx
 633 02ea 5E        popl %esi
 634 02eb 5F        popl %edi
 635 02ec 5D        popl %ebp
 636 02ed C3        ret
 637 02ee 89F6      .p2align 2
 638               .L1675:
 639 02f0 50        pushl %eax
 640 02f1 E8FCFFFF  call kfree


    ...Lots of stuff here to tear down port maps...

--------- Proposed exit_thread() -------

 655               .globl exit_thread
 656               .type exit_thread,@function
 657               exit_thread:
///////////////////////////////////////
// Note how all PUSH/POP's are
// gone from the mainline code now
///////////////////////////////////////
 658 0340 B800E0FF  movl $-8192,%eax
 658      FF
 659
 660 0345 21E0      andl %esp,%eax
 661
 662 0347 8B00      movl (%eax),%eax
 663 0349 05C00100  addl $448,%eax
 663      00
 664 034e 8B907C02  movl 636(%eax),%edx
 664      0000
 665 0354 85D2      testl %edx,%edx
 666 0356 7504      jne .L1676
 667 0358 C3        ret
 668 0359 8D7600    .p2align 2
 669               .L1676:
 670 035c 50        pushl %eax
 671 035d E86AFFFF  call NukePortMap
 671      FF
 672 0362 58        popl %eax
 673 0363 C3        ret

---- This is the change that eliminates the PUSH/POP's ---

#ifdef __TONYI__
static void NukePortMap(struct thread_struct *t)
{
 int cpu = get_cpu();
 struct tss_struct *tss = &per_cpu(init_tss, cpu);

 kfree(t->io_bitmap_ptr);
 t->io_bitmap_ptr = NULL;
 /*
  * Careful, clear this in the TSS too:
  */
 memset(tss->io_bitmap, 0xff, tss->io_bitmap_max);
 t->io_bitmap_max = 0;
 tss->io_bitmap_owner = NULL;
 tss->io_bitmap_max = 0;
 tss->io_bitmap_base = INVALID_IO_BITMAP_OFFSET;
 put_cpu();
}
#endif

/*
 * Free current thread data structures etc..
 */
void exit_thread(void)
{
 struct task_struct *tsk = current;
 struct thread_struct *t = &tsk->thread;

 /* The process may have allocated an io port bitmap... nuke it. */
 if (unlikely(NULL != t->io_bitmap_ptr)) {
#ifdef __TONYI__
  NukePortMap(t);
#else
  int cpu = get_cpu();
  struct tss_struct *tss = &per_cpu(init_tss, cpu);

  kfree(t->io_bitmap_ptr);
  t->io_bitmap_ptr = NULL;
  /*
   * Careful, clear this in the TSS too:
   */
  memset(tss->io_bitmap, 0xff, tss->io_bitmap_max);
  t->io_bitmap_max = 0;
  tss->io_bitmap_owner = NULL;
  tss->io_bitmap_max = 0;
  tss->io_bitmap_base = INVALID_IO_BITMAP_OFFSET;
  put_cpu();
#endif
 }
}
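
The same split can be tried outside the kernel.  A minimal user-space
sketch of the pattern, assuming GCC; struct conn, conn_teardown_bitmap()
and conn_exit() are invented stand-ins for thread_struct, NukePortMap()
and exit_thread().  The noinline attribute is an addition that is not in
the change above: without it, an optimizing compiler is free to inline the
helper back into its only caller, which may bring the register spills back.

```c
#include <stdlib.h>
#include <string.h>

#define unlikely(x)     __builtin_expect(!!(x), 0)

struct conn {
        unsigned char *bitmap;          /* usually NULL */
        size_t bitmap_max;
        unsigned char shadow[32];       /* plays the role of tss->io_bitmap */
};

/* Rare path: the locals and calls that tend to need callee-saved registers. */
static __attribute__((__noinline__)) void conn_teardown_bitmap(struct conn *c)
{
        size_t n = c->bitmap_max;

        if (n > sizeof(c->shadow))
                n = sizeof(c->shadow);
        free(c->bitmap);
        c->bitmap = NULL;
        memset(c->shadow, 0xff, n);
        c->bitmap_max = 0;
}

/* Common path: one load, one test, and a return. */
void conn_exit(struct conn *c)
{
        if (unlikely(c->bitmap != NULL))
                conn_teardown_bitmap(c);
}
```

Compiling this with gcc -O2 -S, once as written and once with the helper
body pasted into conn_exit(), and comparing the two prologues shows whether
the pushes go away with a given compiler version.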

