public inbox for linux-kernel@vger.kernel.org
* [PATCH] - Reduce overhead of calc_load
@ 2006-03-17 14:57 Jack Steiner
  2006-03-17 14:59 ` Ingo Molnar
  0 siblings, 1 reply; 10+ messages in thread
From: Jack Steiner @ 2006-03-17 14:57 UTC (permalink / raw)
  To: mingo, akpm; +Cc: linux-kernel

Currently, count_active_tasks() calls both nr_running() & nr_uninterruptible().
Each of these functions does a "for_each_cpu" & reads values from the
runqueue of each cpu. Although this is not a lot of instructions, each
runqueue may be located on a different node. Depending on the architecture,
a unique TLB entry may be required to access each runqueue.

Since there may be more runqueues than cpu TLB entries, a scan of all
runqueues can thrash the TLB. Each memory reference then incurs a TLB
miss & refill.

In addition, the runqueue cacheline that contains nr_running &
nr_uninterruptible may be evicted from the cache between the two passes.
This causes unnecessary cache misses.

Combining nr_running() & nr_uninterruptible() into a single function
substantially reduces the TLB & cache misses on large systems. This
should have no measurable effect on smaller systems.

On a 128p IA64 system running a memory stress workload, the new function
reduced the overhead of calc_load() from 605 usec/call to 324 usec/call.


	Signed-off-by: Jack Steiner <steiner@sgi.com>

 include/linux/sched.h |    1 +
 kernel/sched.c        |   16 ++++++++++++++++
 kernel/timer.c        |    8 +++++++-
 3 files changed, 24 insertions(+), 1 deletion(-)



Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h	2006-03-16 16:17:55.982340489 -0600
+++ linux/include/linux/sched.h	2006-03-16 16:26:01.137096980 -0600
@@ -98,6 +98,7 @@ extern int last_pid;
 DECLARE_PER_CPU(unsigned long, process_counts);
 extern int nr_processes(void);
 extern unsigned long nr_running(void);
+extern unsigned long nr_active(void);
 extern unsigned long nr_uninterruptible(void);
 extern unsigned long nr_iowait(void);
 
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c	2006-03-16 16:17:56.376832371 -0600
+++ linux/kernel/sched.c	2006-03-16 16:26:01.141002841 -0600
@@ -1655,6 +1655,22 @@ unsigned long nr_iowait(void)
 	return sum;
 }
 
+unsigned long nr_active(void)
+{
+	unsigned long i, running = 0, uninterruptible = 0;
+
+	for_each_online_cpu(i) {
+		running += cpu_rq(i)->nr_running;
+		uninterruptible += cpu_rq(i)->nr_uninterruptible;
+	}
+
+	if (unlikely((long)uninterruptible < 0))
+		uninterruptible = 0;
+
+	return running + uninterruptible;
+}
+
+
 #ifdef CONFIG_SMP
 
 /*
Index: linux/kernel/timer.c
===================================================================
--- linux.orig/kernel/timer.c	2006-03-16 16:17:56.385620556 -0600
+++ linux/kernel/timer.c	2006-03-16 16:26:01.141979306 -0600
@@ -849,7 +849,7 @@ void update_process_times(int user_tick)
  */
 static unsigned long count_active_tasks(void)
 {
-	return (nr_running() + nr_uninterruptible()) * FIXED_1;
+	return nr_active() * FIXED_1;
 }
 
 /*

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] - Reduce overhead of calc_load
  2006-03-17 14:57 [PATCH] - Reduce overhead of calc_load Jack Steiner
@ 2006-03-17 14:59 ` Ingo Molnar
  2006-03-17 15:26   ` Jack Steiner
  0 siblings, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2006-03-17 14:59 UTC (permalink / raw)
  To: Jack Steiner; +Cc: akpm, linux-kernel


* Jack Steiner <steiner@sgi.com> wrote:

> 	Signed-off-by: Jack Steiner <steiner@sgi.com>

Acked-by: Ingo Molnar <mingo@elte.hu>

>  extern int nr_processes(void);
>  extern unsigned long nr_running(void);
> +extern unsigned long nr_active(void);
>  extern unsigned long nr_uninterruptible(void);
>  extern unsigned long nr_iowait(void);

ob'nit, i'd make it:

>  extern int nr_processes(void);
>  extern unsigned long nr_running(void);
>  extern unsigned long nr_uninterruptible(void);
> +extern unsigned long nr_active(void);
>  extern unsigned long nr_iowait(void);

just to keep the logical order.

	Ingo


* Re: [PATCH] - Reduce overhead of calc_load
  2006-03-17 14:59 ` Ingo Molnar
@ 2006-03-17 15:26   ` Jack Steiner
  2006-03-18  1:15     ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Jack Steiner @ 2006-03-17 15:26 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: akpm, linux-kernel

Currently, count_active_tasks() calls both nr_running() & nr_uninterruptible().
Each of these functions does a "for_each_cpu" & reads values from the
runqueue of each cpu. Although this is not a lot of instructions, each
runqueue may be located on a different node. Depending on the architecture,
a unique TLB entry may be required to access each runqueue.

Since there may be more runqueues than cpu TLB entries, a scan of all
runqueues can thrash the TLB. Each memory reference then incurs a TLB
miss & refill.

In addition, the runqueue cacheline that contains nr_running &
nr_uninterruptible may be evicted from the cache between the two passes.
This causes unnecessary cache misses.

Combining nr_running() & nr_uninterruptible() into a single function
substantially reduces the TLB & cache misses on large systems. This
should have no measurable effect on smaller systems.

On a 128p IA64 system running a memory stress workload, the new function
reduced the overhead of calc_load() from 605 usec/call to 324 usec/call.

(Same as the previous patch, except the extern declarations are reordered as requested by Ingo.)


	Signed-off-by: Jack Steiner <steiner@sgi.com>
	Acked-by: Ingo Molnar <mingo@elte.hu>

 include/linux/sched.h |    1 +
 kernel/sched.c        |   16 ++++++++++++++++
 kernel/timer.c        |    8 +++++++-
 3 files changed, 24 insertions(+), 1 deletion(-)



Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h	2006-03-16 16:17:55.982340489 -0600
+++ linux/include/linux/sched.h	2006-03-17 09:23:44.847612520 -0600
@@ -99,6 +99,7 @@ DECLARE_PER_CPU(unsigned long, process_c
 extern int nr_processes(void);
 extern unsigned long nr_running(void);
 extern unsigned long nr_uninterruptible(void);
+extern unsigned long nr_active(void);
 extern unsigned long nr_iowait(void);
 
 #include <linux/time.h>
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c	2006-03-16 16:17:56.376832371 -0600
+++ linux/kernel/sched.c	2006-03-16 16:26:01.141002841 -0600
@@ -1655,6 +1655,22 @@ unsigned long nr_iowait(void)
 	return sum;
 }
 
+unsigned long nr_active(void)
+{
+	unsigned long i, running = 0, uninterruptible = 0;
+
+	for_each_online_cpu(i) {
+		running += cpu_rq(i)->nr_running;
+		uninterruptible += cpu_rq(i)->nr_uninterruptible;
+	}
+
+	if (unlikely((long)uninterruptible < 0))
+		uninterruptible = 0;
+
+	return running + uninterruptible;
+}
+
+
 #ifdef CONFIG_SMP
 
 /*
Index: linux/kernel/timer.c
===================================================================
--- linux.orig/kernel/timer.c	2006-03-16 16:17:56.385620556 -0600
+++ linux/kernel/timer.c	2006-03-16 16:26:01.141979306 -0600
@@ -849,7 +849,7 @@ void update_process_times(int user_tick)
  */
 static unsigned long count_active_tasks(void)
 {
-	return (nr_running() + nr_uninterruptible()) * FIXED_1;
+	return nr_active() * FIXED_1;
 }
 
 /*


* Re: [PATCH] - Reduce overhead of calc_load
  2006-03-17 15:26   ` Jack Steiner
@ 2006-03-18  1:15     ` Andrew Morton
  2006-03-18  2:09       ` Nick Piggin
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2006-03-18  1:15 UTC (permalink / raw)
  To: Jack Steiner; +Cc: mingo, linux-kernel

Jack Steiner <steiner@sgi.com> wrote:
>
> +unsigned long nr_active(void)
> +{
> +	unsigned long i, running = 0, uninterruptible = 0;
> +
> +	for_each_online_cpu(i) {
> +		running += cpu_rq(i)->nr_running;
> +		uninterruptible += cpu_rq(i)->nr_uninterruptible;
> +	}
> +
> +	if (unlikely((long)uninterruptible < 0))
> +		uninterruptible = 0;
> +
> +	return running + uninterruptible;
> +}

Is that check for (uninterruptible < 0) (copied from nr_uninterruptible)
really needed?  Can rq->nr_uninterruptible actually go negative?

Perhaps nr_context_switches() and nr_iowait() should also go into this
function, then we rename it all to

	struct sched_stuff {
		unsigned nr_uninterruptible;
		unsigned nr_running;
		unsigned nr_active;
		unsigned long nr_context_switches;
	};

	void get_sched_stuff(struct sched_stuff *);

and then convert all those random little counter-upper-callers we have.

And then give get_sched_stuff() a hotplug handler (probably unneeded) and
then scratch our heads over why nr_uninterruptible() iterates across all
possible CPUs while this new nr_active() iterates over all online CPUs like
nr_running() and unlike nr_context_switches().


IOW: this code's an inefficient mess and needs some caring for.


* Re: [PATCH] - Reduce overhead of calc_load
  2006-03-18  1:15     ` Andrew Morton
@ 2006-03-18  2:09       ` Nick Piggin
  2006-03-18  2:37         ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2006-03-18  2:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jack Steiner, mingo, linux-kernel

Andrew Morton wrote:
> Jack Steiner <steiner@sgi.com> wrote:
> 
>>+unsigned long nr_active(void)
>>+{
>>+	unsigned long i, running = 0, uninterruptible = 0;
>>+
>>+	for_each_online_cpu(i) {
>>+		running += cpu_rq(i)->nr_running;
>>+		uninterruptible += cpu_rq(i)->nr_uninterruptible;
>>+	}
>>+
>>+	if (unlikely((long)uninterruptible < 0))
>>+		uninterruptible = 0;
>>+
>>+	return running + uninterruptible;
>>+}
> 
> 
> Is that check for (uninterruptible < 0) (copied from nr_uninterruptible)
> really needed?  Can rq->nr_uninterruptible actually go negative?
> 

The sum cannot go negative if there are no concurrent updates; when
there are concurrent updates, however, it can.

rq->nr_uninterruptible itself is meaningless because it can be
incremented on one rq and decremented on another.

> Perhaps nr_context_switches() and nr_iowait() should also go into this
> function, then we rename it all to
> 
> 	struct sched_stuff {
> 		unsigned nr_uninterruptible;
> 		unsigned nr_running;
> 		unsigned nr_active;
> 		unsigned long nr_context_switches;
> 	};
> 
> 	void get_sched_stuff(struct sched_stuff *);
> 
> and then convert all those random little counter-upper-callers we have.
> 

Is there a need? Do they (except calc_load) use multiple values at
the same time?

> And then give get_sched_stuff() a hotplug handler (probably unneeded) and

What would the hotplug handler do?

> then scratch our heads over why nr_uninterruptible() iterates across all
> possible CPUs while this new nr_active() iterates over all online CPUs like
> nr_running() and unlike nr_context_switches().
> 

I think it need only iterate over possible CPUs.

> 
> IOW: this code's an inefficient mess and needs some caring for.

What are the performance critical places that call the nr_blah() functions?

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


* Re: [PATCH] - Reduce overhead of calc_load
  2006-03-18  2:09       ` Nick Piggin
@ 2006-03-18  2:37         ` Andrew Morton
  2006-03-18  2:46           ` Nick Piggin
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2006-03-18  2:37 UTC (permalink / raw)
  To: Nick Piggin; +Cc: steiner, mingo, linux-kernel

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> > Perhaps nr_context_switches() and nr_iowait() should also go into this
> > function, then we rename it all to
> > 
> > 	struct sched_stuff {
> > 		unsigned nr_uninterruptible;
> > 		unsigned nr_running;
> > 		unsigned nr_active;
> > 		unsigned long nr_context_switches;
> > 	};
> > 
> > 	void get_sched_stuff(struct sched_stuff *);
> > 
> > and then convert all those random little counter-upper-callers we have.
> > 
> 
> Is there a need? Do they (except calc_load) use multiple values at
> the same time?

Don't know.  It might happen in the future.  And the additional cost is
practically zero.

> > And then give get_sched_stuff() a hotplug handler (probably unneeded) and
> 
> What would the hotplug handler do?

Move the stats from the going-away CPU into the current CPU's runqueue.

> > then scratch our heads over why nr_uninterruptible() iterates across all
> > possible CPUs while this new nr_active() iterates over all online CPUs like
> > nr_running() and unlike nr_context_switches().
> > 
> 
> I think it need only iterate over possible CPUs.

Someone who has four online CPUs, sixteen present CPUs and 128 possible
CPUs would be justifiably disappointed, no?

> > 
> > IOW: this code's an inefficient mess and needs some caring for.
> 
> What are the performance critical places that call the nr_blah() functions?
> 

That depends upon the frequency with which userspace reads /proc/loadavg,
/proc/stat or /proc/future-stuff.



* Re: [PATCH] - Reduce overhead of calc_load
  2006-03-18  2:37         ` Andrew Morton
@ 2006-03-18  2:46           ` Nick Piggin
  2006-03-18  5:13             ` Andrew Morton
  0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2006-03-18  2:46 UTC (permalink / raw)
  To: Andrew Morton; +Cc: steiner, mingo, linux-kernel

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>>Is there a need? Do they (except calc_load) use multiple values at
>>the same time?
> 
> 
> Don't know.  It might happen in the future.  And the additional cost is
> practically zero.
> 

Unless it happens to hit another cacheline (cachelines for all other
CPUs but our own will most likely be invalid on this cpu), in which
case the cost could quite easily double.

> 
>>>And then give get_sched_stuff() a hotplug handler (probably unneeded) and
>>
>>What would the hotplug handler do?
> 
> 
> Move the stats from the going-away CPU into the current CPU's runqueue.
> 

Oh, that. Yeah that is handled already for nr_uninterruptible (although,
ironically, it isn't needed because of the for_each_cpu loop there!)

>>I think it need only iterate over possible CPUs.
> 
> 
> Someone who has four online CPUs, sixteen present CPUs and 128 possible
> CPUs would be justifiably disappointed, no?
> 

Yes. Ingo? This can be changed, right?

> 
>>>IOW: this code's an inefficient mess and needs some caring for.
>>
>>What are the performance critical places that call the nr_blah() functions?
>>
> 
> 
> That depends upon the frequency with which userspace reads /proc/loadavg,
> /proc/stat or /proc/future-stuff.
> 
> 

I think it might be better to leave it for the moment. If something comes
up we can always take a look at it then (it isn't particularly tricky code).

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 


* Re: [PATCH] - Reduce overhead of calc_load
  2006-03-18  2:46           ` Nick Piggin
@ 2006-03-18  5:13             ` Andrew Morton
  2006-03-18  5:38               ` Nick Piggin
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2006-03-18  5:13 UTC (permalink / raw)
  To: Nick Piggin; +Cc: steiner, mingo, linux-kernel

Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> Andrew Morton wrote:
> > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> >>Is there a need? Do they (except calc_load) use multiple values at
> >>the same time?
> > 
> > 
> > Don't know.  It might happen in the future.  And the additional cost is
> > practically zero.
> > 
> 
> Unless it happens to hit another cacheline (cachelines for all other
> CPUs but our own will most likely be invalid on this cpu). In which
> case the cost could double quite easily.
> 

That would be an inefficient implementation.  Let's not implement it
inefficiently.

> I think it might be better to leave it for the moment. If something comes
> up we can always take a look at it then (it isn't particularly tricky code).

What we're seeing here is a proliferation of little functions, all of which
do the same thing, some of them in different ways.

Take a look at (for example) nr_iowait.  We forget to spill the count out
of the departing CPU's runqueue and hence we have to sum it across all
possible CPUs, and we don't bother accounting for the possibility of the
sum going negative because we happen to dink with the runqueue of a
now-possibly-downed CPU.  It's inefficient and it's inconsistent, and some
of it is, or will become, incorrect.  The other counters there probably
have various combinations of these problems, but I can't be bothered
checking them all because they're all implemented differently.

Better to do them all in the one place and do them all the same way.  I'd
suggest a cacheline-aligned struct of local_t's which can be queried into a
struct of ulongs.

That query should only look at online CPUs, which becomes rather necessary
if we're to allocate runqueues only for online CPUs (desirable - the thing
is huge).


* Re: [PATCH] - Reduce overhead of calc_load
  2006-03-18  5:13             ` Andrew Morton
@ 2006-03-18  5:38               ` Nick Piggin
  2006-03-18  6:10                 ` Nick Piggin
  0 siblings, 1 reply; 10+ messages in thread
From: Nick Piggin @ 2006-03-18  5:38 UTC (permalink / raw)
  To: Andrew Morton; +Cc: steiner, mingo, linux-kernel

Andrew Morton wrote:
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>>Andrew Morton wrote:
>>
>>>Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>>
>>>>Is there a need? Do they (except calc_load) use multiple values at
>>>>the same time?
>>>
>>>
>>>Don't know.  It might happen in the future.  And the additional cost is
>>>practically zero.
>>>
>>
>>Unless it happens to hit another cacheline (cachelines for all other
>>CPUs but our own will most likely be invalid on this cpu). In which
>>case the cost could double quite easily.
>>
> 
> 
> That would be an inefficient implementation.  Let's not implement it
> inefficiently.
> 

Unconditionally adding up n fields in the runqueue versus 1 field?
It is inevitable that they will cross cacheline boundaries on some
CPU architectures and with some per-cpu implementations, isn't it?

> 
>>I think it might be better to leave it for the moment. If something comes
>>up we can always take a look at it then (it isn't particularly tricky code).
> 
> 
> What we're seeing here is a proliferation of little functions, all of which
> do the same thing, some of them in different ways.
> 

Of course they should be made consistent where it makes sense.

> Take a look at (for example) nr_iowait.  We forget to spill the count out
> of the departing CPU's runqueue and hence we have to sum it across all

I don't think a departing runqueue should have any iowaiters on it, should it?

> possible CPUs and we don't bother accounting for the possibility of the sum
> going negative because we happen to dink with the runqueue of a
> now-possibly-downed CPU.  It's inefficient and it's inconsistent and some
> of it is, or will become incorrect.  The other counters there probably have
> various combinations of these problems but I can't be bothered checking
> them all because they're all implemented differently.
> 

Maybe (they're also used in different ways, and with different races to
be careful of, so in some respects that is inevitable). But that doesn't
mean we should introduce this new get_sched_stats thing.

> Better to do them all in the one place and do them all the same way.  I'd
> suggest a cacheline-aligned struct of local_t's which can be queried into a
> struct of ulongs.
> 

Now common scheduler operations have to access both the runqueue
cachelines and this disjoint "stats" structure cacheline, so basic
operations will get slower. Not to mention that all but one of the
counters are protected by the runqueue lock, so local_t isn't needed
for the rest of them.

> That query should only look at online CPUs, which becomes rather necessary
> if we're to allocate runqueues only for online CPUs (desirable - the thing
> is huge).
> 

Sure, those consistency and efficiency changes should be made now, with
the current structure.

-- 
SUSE Labs, Novell Inc.


* Re: [PATCH] - Reduce overhead of calc_load
  2006-03-18  5:38               ` Nick Piggin
@ 2006-03-18  6:10                 ` Nick Piggin
  0 siblings, 0 replies; 10+ messages in thread
From: Nick Piggin @ 2006-03-18  6:10 UTC (permalink / raw)
  To: Andrew Morton; +Cc: steiner, mingo, linux-kernel

Nick Piggin wrote:
> Andrew Morton wrote:

>> Take a look at (for example) nr_iowait.  We forget to spill the count out
>> of the departing CPU's runqueue and hence we have to sum it across all
> 
> 
> I don't think a departing runqueue should have any iowaiters on it, 
> should it?
> 

No, I'm wrong there. Instead of doing all these migrations and having the
ugly nr_iowait and nr_uninterruptible special cases, I would much prefer
to migrate the stat when the actual task is migrated. That should take
care of cpu hotplug, and it also makes everything consistent.

I'll try doing a patch for that sometime.

-- 
SUSE Labs, Novell Inc.


end of thread, other threads:[~2006-03-18  6:10 UTC | newest]

Thread overview: 10+ messages
2006-03-17 14:57 [PATCH] - Reduce overhead of calc_load Jack Steiner
2006-03-17 14:59 ` Ingo Molnar
2006-03-17 15:26   ` Jack Steiner
2006-03-18  1:15     ` Andrew Morton
2006-03-18  2:09       ` Nick Piggin
2006-03-18  2:37         ` Andrew Morton
2006-03-18  2:46           ` Nick Piggin
2006-03-18  5:13             ` Andrew Morton
2006-03-18  5:38               ` Nick Piggin
2006-03-18  6:10                 ` Nick Piggin
