public inbox for linux-kernel@vger.kernel.org
* sched_test_yield benchmark
@ 2001-01-19 14:30 Bill Hartner
  2001-01-19 14:52 ` [Lse-tech] " Andi Kleen
  2001-01-19 16:35 ` Davide Libenzi
  0 siblings, 2 replies; 3+ messages in thread
From: Bill Hartner @ 2001-01-19 14:30 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Andrea Arcangeli, Mike Kravetz, lse-tech, linux-kernel

Just a couple of notes on the sched_test_yield benchmark.
I posted it to the mailing list in December.  Finding a permanent home
for it is still on my todo list.  There are some issues though; see below.

(1) Beware of the changes in sys_sched_yield() for 2.4.0.  Depending
    on how many processors are on the test system and how many threads
    are created, schedule() may or may not be called when sched_yield()
    is called.

    I included sys_sched_yield() at the end of this note.

Warning: from this point on I wander in and out of gray space,
so correct me if I am wrong.

(2) The benchmark uses Student's t-distribution (0.95) with a 1%
    interval width.  On 2.4.0, convergence is pretty good, so I feel
    comfortable with the results it reports.  But be aware of (3)
    below.

    Run the benchmark multiple times for a given number of threads on a
    specified number of CPUs, and look for the behavior described in (3).

(3) For the i386 arch :

    My observations were made on an 8-way 550 MHz PIII Xeon with a 2MB L2 cache.

    The task structures are page aligned.  So when running the benchmark
    you may see what I *suspect* are L1/L2 cache effects.  The set of
    yielding threads will read the same page offsets in the task struct
    and will dirty the same page offsets on its kernel stack.  So
    depending on the number of threads, the locations of their task
    structs in physical memory, and the associativity of the caches, you
    may see (for example) results like:

                **       **             **
    50 50 50 50 75 50 50 35 50 50 50 50 75

    Also, the number of threads, the order of the task structs on the
    run_queue, thread migration from cpu to cpu, and how many times
    recalculate is done may vary the results from run to run.

    I am looking into this, though not very actively; I am busy with
    other stuff.  I hope to get back to it soon.

    What I may do to address this is to allocate more threads than
    requested, examine the physical address of each task structure, and
    then select a subset of the task structs (threads) to use in the
    run.  The benchmark might then produce more consistent results from
    run to run (assuming what I suspect is going on is really the case).

    Your mileage may vary.

Thoughts?

Bill Hartner
IBM Linux Technology Center - kernel performance
bhartner@us.ibm.com

------from kernel/sched.c 2.4.0 -------

asmlinkage long sys_sched_yield(void)
{
     /*
      * Trick. sched_yield() first counts the number of truly
      * 'pending' runnable processes, then returns if it's
      * only the current process. (This test does not have
      * to be atomic.) In threaded applications this optimization
      * gets triggered quite often.
      */

     int nr_pending = nr_running;

#if CONFIG_SMP
     int i;

     // Subtract non-idle processes running on other CPUs.
     for (i = 0; i < smp_num_cpus; i++)
          if (aligned_data[i].schedule_data.curr != idle_task(i))
               nr_pending--;
#else
     // on UP this process is on the runqueue as well
     nr_pending--;
#endif
     if (nr_pending) {
          /*
           * This process can only be rescheduled by us,
           * so this is safe without any locking.
           */
          if (current->policy == SCHED_OTHER)
               current->policy |= SCHED_YIELD;
          current->need_resched = 1;
     }
     return 0;
}



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/


* Re: [Lse-tech] sched_test_yield benchmark
  2001-01-19 14:30 sched_test_yield benchmark Bill Hartner
@ 2001-01-19 14:52 ` Andi Kleen
  2001-01-19 16:35 ` Davide Libenzi
  1 sibling, 0 replies; 3+ messages in thread
From: Andi Kleen @ 2001-01-19 14:52 UTC (permalink / raw)
  To: Bill Hartner
  Cc: Davide Libenzi, Andrea Arcangeli, Mike Kravetz, lse-tech,
	linux-kernel

On Fri, Jan 19, 2001 at 09:30:55AM -0500, Bill Hartner wrote:
> (3) For the i386 arch :
> 
>     My observations were made on an 8-way 550 Mhz PIII Xeon 2MB L2 cache.
> 
>     The task structures are page aligned.  So when running the benchmark
>     you may see what I *suspect* are L1/L2 cache effects.  The set of
>     yielding threads will read the same page offsets in the task struct
>     and will dirty the same page offsets on its kernel stack.  So
>     depending on the number of threads, the locations of their task
>     structs in physical memory, and the associativity of the caches, you
>     may see (for example) results like:

This is a known problem. It is caused by the way 2.2+ "current" works on i386.
It is at the bottom of the kernel stack and is computed from the stack pointer
using an AND.  The kernel stack needs to be page aligned because of the
allocator and the way the AND mask works.

The reason current cannot be put into a normal global variable is that there is
usually no easy way to find out which CPU you're running on, and the global
variable would need to be indexed by the CPU. Also, using a real global on UP
is a real loser in terms of code size on i386 (several KB difference).

This unfortunately means that task_structs end up on similar cache
colours and, depending on the CPU, may not use the caches very well.
One hope for i386 is that future CPUs will have better caches, so that
too-aggressive cache colouring may not be worth it (actually I think that
was already hoped for the P2).

E.g. on IA64, m68k, x86-64, and other architectures that have enough registers,
the kernel can afford to just put current into a global register variable.
This means a cache-colouring allocation could be used for the task_struct,
because it does not need to be at a maskable address [the ports currently do
not do that, but they could].

It would be interesting if you could also rerun your tests with prefetching
in the scheduler loop enabled.  I can supply a patch for that.

-Andi 




* Re: sched_test_yield benchmark
  2001-01-19 14:30 sched_test_yield benchmark Bill Hartner
  2001-01-19 14:52 ` [Lse-tech] " Andi Kleen
@ 2001-01-19 16:35 ` Davide Libenzi
  1 sibling, 0 replies; 3+ messages in thread
From: Davide Libenzi @ 2001-01-19 16:35 UTC (permalink / raw)
  To: Bill Hartner; +Cc: Andrea Arcangeli, Mike Kravetz, lse-tech, linux-kernel

On Friday 19 January 2001 07:59, Bill Hartner wrote:
> Just a couple of notes on the sched_test_yield benchmark.
> I posted it to the mailing list in Dec.  I have a todo to get
> a home for it.  There are some issues though.  See below.
>
> (1) Beware of the changes in sys_sched_yield() for 2.4.0.  Depending
>     on how many processors on the test system and how many threads
>     created, schedule() may or may not be called when calling
>     sched_yield().

In your test you are using at least 16 tasks on an 8-way SMP, so schedule()
should always be called (if you are using my test suite, tasks are always
running).



>
> (3) For the i386 arch :
>
>     My observations were made on an 8-way 550 Mhz PIII Xeon 2MB L2 cache.

Hey, this should be the machine I lost two days ago :^)


>
>     The task structures are page aligned.  So when running the benchmark
>     you may see what I *suspect* are L1/L2 cache effects.  The set of
>     yielding threads will read the same page offsets in the task struct
>     and will dirty the same page offsets on its kernel stack.  So
>     depending on the number of threads, the locations of their task
>     structs in physical memory, and the associativity of the caches, you
>     may see (for example) results like:
>
>                 **       **             **
>     50 50 50 50 75 50 50 35 50 50 50 50 75
>
>     Also, the number of threads, the order of the task structs on the
>     run_queue, thread migration from cpu to cpu, and how many times
>     recalculate is done may vary the results from run to run.

Yep, this is the issue.
Why not move the scheduling fields into a separate structure with a
different alignment:

struct s_sched_fields {
 ...
} sched_fields[];

inline struct s_sched_fields *get_sched_fields_ptr(struct task_struct *tsk)
{
 ...
}

This will reduce the probability that scheduling fields will fall onto the 
same cache line.



- Davide
- Davide

