* Re: [PATCH 1/13] timestamp fixes
[not found] <42235517.5070504@us.ibm.com>
@ 2005-02-28 18:11 ` Andrew Theurer
2005-03-01 8:09 ` Nick Piggin
0 siblings, 1 reply; 7+ messages in thread
From: Andrew Theurer @ 2005-02-28 18:11 UTC (permalink / raw)
To: mingo, nickpiggin, linux-kernel
Nick, can you describe the system you run the DB tests on? Do you have
any cpu idle time stats and hopefully some context switch rate stats?
I think I understand the concern [patch 6] about stealing a task from one
node to an idle cpu in another node, but I wonder if we can have some
sort of check for idle balance: if the domain/node to steal from has
an idle cpu somewhere, we do not steal, period. To do this we would
have a cpu_idle bitmask, updated as cpus go idle/busy, and we check
cpu_idle & sd->cpu_mask to see if there's at least one cpu that's idle.
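A minimal standalone sketch in C of the check described above, assuming a hypothetical cpu_idle_mask global and a domain span passed as a plain bitmask (neither is the actual kernel API):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the idle check described above: one bit per
 * CPU, updated as CPUs go idle/busy, tested against a sched-domain's
 * span before stealing.  cpu_idle_mask and domain_span are
 * illustrative names, not kernel interfaces. */

uint64_t cpu_idle_mask;                   /* bit N set => CPU N is idle */

void set_cpu_idle(int cpu, int idle)
{
	if (idle)
		cpu_idle_mask |= (uint64_t)1 << cpu;
	else
		cpu_idle_mask &= ~((uint64_t)1 << cpu);
}

/* Do not steal from a domain if any CPU in its span is already idle:
 * that idle CPU should pick up the load locally instead. */
int may_steal_from(uint64_t domain_span)
{
	return (cpu_idle_mask & domain_span) == 0;
}
```

The attraction is that the steal-side test is a single AND against the domain's span; the cost Nick raises below is that every idle/busy transition now writes a shared mask.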
> Ingo wrote:
>
> But i expect fork/clone balancing to be almost certainly a problem. (We
> didn't get it right for all workloads in 2.6.7, and i think it cannot be
> gotten right currently either, without userspace API help - but i'd be
> happy to be proven wrong.)
Perhaps initially one could balance on fork up to the domain level which
has task_hot_time=0, up to a shared cache by default. Anything above
that could require a numactl-like preference from userspace.
-Andrew Theurer
^ permalink raw reply [flat|nested] 7+ messages in thread

* Re: [PATCH 1/13] timestamp fixes
2005-02-28 18:11 ` [PATCH 1/13] timestamp fixes Andrew Theurer
@ 2005-03-01 8:09 ` Nick Piggin
2005-03-01 9:03 ` Nick Piggin
0 siblings, 1 reply; 7+ messages in thread
From: Nick Piggin @ 2005-03-01 8:09 UTC (permalink / raw)
To: Andrew Theurer; +Cc: mingo, linux-kernel
Andrew Theurer wrote:
> Nick, can you describe the system you run the DB tests on? Do you have
> any cpu idle time stats and hopefully some context switch rate stats?
Yeah, it is dbt3-pgsql on OSDL's 8-way STP machines. I think they're
PIII Xeons with 2MB L2 cache.
I had been having some difficulty running them recently, but I might
have just worked out what the problem was, so hopefully I can benchmark
the patchset I just sent to Andrew (plus fixes from Ingo, etc).
Basically what would happen is that with seemingly small changes, the
"throughput" figure would go from about 260 tps to under 100. I don't
know exactly why this benchmark is particularly sensitive to the
problem, but it has been a good canary for too-much-idle-time
regressions in the past.
> I think I understand the concern [patch 6] about stealing a task from one
> node to an idle cpu in another node, but I wonder if we can have some
> sort of check for idle balance: if the domain/node to steal from has
> an idle cpu somewhere, we do not steal, period. To do this we would
> have a cpu_idle bitmask, updated as cpus go idle/busy, and we check
> cpu_idle & sd->cpu_mask to see if there's at least one cpu that's idle.
>
I think this could become a scalability problem... and I'd prefer to
initially try to address problems of too much idle time by tuning them
out rather than use things like wake-to-idle-cpu.
One of my main problems with wake to idle is that it is explicitly
introducing a bias so that wakers repel their wakees. With all else
being equal, we want exactly the opposite effect (hence all the affine
wakeups and wake balancing stuff).
The regular periodic balancing may be dumb, but at least it doesn't
have that kind of bias.
The other thing of course is that we want to reduce internode task
movement at almost all costs, and the periodic balancer can be very
careful about this - something more difficult for wake_idle unless
we introduce more complexity there (a very time-critical path).
Anyway, I'm not saying no to anything at this stage... let's just see
how we go.
>> Ingo wrote:
>>
>> But i expect fork/clone balancing to be almost certainly a problem. (We
>> didn't get it right for all workloads in 2.6.7, and i think it cannot be
>> gotten right currently either, without userspace API help - but i'd be
>> happy to be proven wrong.)
>
>
> Perhaps initially one could balance on fork up to the domain level which
> has task_hot_time=0, up to a shared cache by default. Anything above
> that could require a numactl-like preference from userspace.
>
That may end up being what we have to do. Probably doesn't make
much sense to do it at all if we aren't doing it for NUMA.
Nick
* Re: [PATCH 1/13] timestamp fixes
2005-03-01 8:09 ` Nick Piggin
@ 2005-03-01 9:03 ` Nick Piggin
0 siblings, 0 replies; 7+ messages in thread
From: Nick Piggin @ 2005-03-01 9:03 UTC (permalink / raw)
To: Andrew Theurer; +Cc: mingo, linux-kernel
Nick Piggin wrote:
> Andrew Theurer wrote:
>
>> Nick, can you describe the system you run the DB tests on? Do you
>> have any cpu idle time stats and hopefully some context switch rate
>> stats?
>
>
> Yeah, it is dbt3-pgsql on OSDL's 8-way STP machines. I think they're
> PIII Xeons with 2MB L2 cache.
>
> I had been having some difficulty running them recently, but I might
> have just worked out what the problem was, so hopefully I can benchmark
> the patchset I just sent to Andrew (plus fixes from Ingo, etc).
>
Nope. Still not working :P
* [PATCH 0/13] Multiprocessor CPU scheduler patches
@ 2005-02-24 7:14 Nick Piggin
2005-02-24 7:16 ` [PATCH 1/13] timestamp fixes Nick Piggin
0 siblings, 1 reply; 7+ messages in thread
From: Nick Piggin @ 2005-02-24 7:14 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
Hi,
I hope that you can include the following set of CPU scheduler
patches in -mm soon, if you have no other significant performance
work going on.
There are some fairly significant changes, with a few basic aims:
* Improve SMT behaviour
* Improve CMP behaviour, CMP/NUMA scheduling (i.e. Opteron)
* Reduce task movement, especially over NUMA nodes.
They are not going to be very well tuned for most usages at the
moment (unfortunately dbt2/3-pgsql on OSDL isn't working, and it
is a good benchmark for this), so hopefully I can address
regressions as they come up.
There are a few problems with the scheduler currently:
Problem #1:
It has _very_ aggressive idle CPU pulling. Not only does it not
really obey imbalances, it is also wrong for e.g. an SMT CPU
whose sibling is not idle. The reason this was done was really to
bring down idle time on some workloads (dbt2-pgsql, other
database stuff).
So I address this in the following ways: reduce special-casing
for idle balancing, and revert some of the recent moves toward
even more aggressive balancing.
Then provide a range of averaging levels for CPU "load averages",
and we choose which to use in which situation on a sched-domain
basis. This allows idle balancing to use a more instantaneous value
for calculating load, so idle CPUs need not wait many timer ticks
for the load averages to catch up. This can hopefully solve our
idle time problems.
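As a rough illustration of that idea (the three-level array, the decay rule, and all names here are assumptions for the sketch, not the patchset's actual code):

```c
#include <assert.h>

/* Per-CPU load tracked at several averaging levels: index 0 is
 * (nearly) instantaneous, higher indices decay more slowly.  An idle
 * CPU consults a responsive low index so it sees new load without
 * waiting many ticks; steady-state balancing can use a slower one. */

#define NR_LOAD_IDX 3

struct rq_load {
	unsigned long cpu_load[NR_LOAD_IDX];
};

/* Called every tick with the current instantaneous load. */
void update_cpu_load(struct rq_load *rq, unsigned long this_load)
{
	int i;

	rq->cpu_load[0] = this_load;              /* instantaneous */
	for (i = 1; i < NR_LOAD_IDX; i++) {
		/* exponential decay toward this_load; higher i is slower */
		unsigned long scale = 1UL << i;   /* 2, 4, ... */
		rq->cpu_load[i] =
			(rq->cpu_load[i] * (scale - 1) + this_load) / scale;
	}
}

/* The sched-domain picks which index a balance operation reads. */
unsigned long source_load(const struct rq_load *rq, int idx)
{
	return rq->cpu_load[idx];
}
```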
Also, further moderate "affine wakeups", which can tend to move
most tasks to one CPU on some workloads and cause idle problems.
Problem #2:
The second problem is that balance-on-exec is not sched-domains
aware. This means it will tend to (for example) fill up two cores
of a CPU on one socket, then fill up two cores on the next socket,
etc. What we want is to try to spread load evenly across memory
controllers.
So make that sched-domains aware following the same pattern as
find_busiest_group / find_busiest_queue.
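A toy model of what that inverted search might look like (the struct layout and all names are invented for illustration; the real code operates on struct sched_group and runqueues):

```c
#include <assert.h>

/* Sched-domains-aware balance-on-exec, sketched: pick the
 * least-loaded group in the domain, then the least-loaded CPU within
 * it - the mirror image of find_busiest_group/find_busiest_queue. */

struct group {
	const unsigned long *cpu_load;  /* per-CPU load, indexed by CPU id */
	const int *cpus;                /* CPU ids in this group */
	int nr_cpus;
};

unsigned long group_load(const struct group *g)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < g->nr_cpus; i++)
		sum += g->cpu_load[g->cpus[i]];
	return sum;
}

int find_idlest_cpu(const struct group *groups, int nr_groups)
{
	const struct group *idlest = &groups[0];
	unsigned long min_load = group_load(idlest);
	int g, i, best;

	/* First level: spread across groups (i.e. memory controllers). */
	for (g = 1; g < nr_groups; g++) {
		unsigned long load = group_load(&groups[g]);
		if (load < min_load) {
			min_load = load;
			idlest = &groups[g];
		}
	}
	/* Second level: least-loaded CPU inside the chosen group. */
	best = idlest->cpus[0];
	for (i = 1; i < idlest->nr_cpus; i++)
		if (idlest->cpu_load[idlest->cpus[i]] < idlest->cpu_load[best])
			best = idlest->cpus[i];
	return best;
}
```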
Problem #3:
Lastly, implement balance-on-fork/clone again. I have come to the
realisation that for NUMA, this is probably the best solution.
Run-cloned-child-last has run out of steam on CMP systems. What
it was supposed to do was provide a period where the child could
be pulled to another CPU before it starts running and allocating
memory. Unfortunately on CMP systems, the CPU it gets pulled to
tends to just be the other sibling.
Also, having such a difference between thread and process creation
was not really ideal, so we balance on all types of fork/clone.
This really helps some things (like STREAM) on CMP Opterons, but
also hurts others, so naturally it is settable per-domain.
Problem #4:
Sched domains isn't very useful to me in its current form. Bring
it up to date with what I've been using. I don't think anyone other
than myself uses it so that should be OK.
Nick
* [PATCH 1/13] timestamp fixes
2005-02-24 7:14 [PATCH 0/13] Multiprocessor CPU scheduler patches Nick Piggin
@ 2005-02-24 7:16 ` Nick Piggin
2005-02-24 7:46 ` Ingo Molnar
0 siblings, 1 reply; 7+ messages in thread
From: Nick Piggin @ 2005-02-24 7:16 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel
[-- Attachment #1: Type: text/plain, Size: 6 bytes --]
1/13
[-- Attachment #2: sched-timestamp-fixes.patch --]
[-- Type: text/x-patch, Size: 1377 bytes --]
Some fixes for unsynchronised TSCs. A task's timestamp may have been set
by another CPU. Although we try to adjust this correctly with the
timestamp_last_tick field, there is no guarantee this will be exactly right.
Signed-off-by: Nick Piggin <nickpiggin@yahoo.com.au>
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c 2005-02-24 17:31:25.384986289 +1100
+++ linux-2.6/kernel/sched.c 2005-02-24 17:43:39.356379395 +1100
@@ -648,6 +648,7 @@
static void recalc_task_prio(task_t *p, unsigned long long now)
{
+ /* Caller must always ensure 'now >= p->timestamp' */
unsigned long long __sleep_time = now - p->timestamp;
unsigned long sleep_time;
@@ -2703,8 +2704,10 @@
schedstat_inc(rq, sched_cnt);
now = sched_clock();
- if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
+ if (likely((long long)now - prev->timestamp < NS_MAX_SLEEP_AVG))
run_time = now - prev->timestamp;
+ if (unlikely((long long)now - prev->timestamp < 0))
+ run_time = 0;
else
run_time = NS_MAX_SLEEP_AVG;
@@ -2782,6 +2785,8 @@
if (!rt_task(next) && next->activated > 0) {
unsigned long long delta = now - next->timestamp;
+ if (unlikely((long long)now - next->timestamp < 0))
+ delta = 0;
if (next->activated == 1)
delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
* Re: [PATCH 1/13] timestamp fixes
2005-02-24 7:16 ` [PATCH 1/13] timestamp fixes Nick Piggin
@ 2005-02-24 7:46 ` Ingo Molnar
2005-02-24 7:56 ` Nick Piggin
0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2005-02-24 7:46 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel
* Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 1/13
>
ugh, has this been tested? It needs the patch below.
Ingo
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -2704,11 +2704,11 @@ need_resched_nonpreemptible:
schedstat_inc(rq, sched_cnt);
now = sched_clock();
- if (likely((long long)now - prev->timestamp < NS_MAX_SLEEP_AVG))
+ if (likely((long long)now - prev->timestamp < NS_MAX_SLEEP_AVG)) {
run_time = now - prev->timestamp;
if (unlikely((long long)now - prev->timestamp < 0))
run_time = 0;
- else
+ } else
run_time = NS_MAX_SLEEP_AVG;
/*
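Taken together, the original patch plus this fix amount to clamping the elapsed time into [0, NS_MAX_SLEEP_AVG]. A standalone sketch of the combined logic (the constant's value and the function name here are made up for illustration):

```c
#include <assert.h>

#define NS_MAX_SLEEP_AVG 10000000ULL   /* illustrative bound, in ns */

/* With unsynchronised TSCs, 'now' (this CPU's sched_clock) can be
 * behind a timestamp written on another CPU, making the raw delta
 * negative when interpreted as signed.  Clamp it into range. */
unsigned long long clamp_run_time(unsigned long long now,
                                  unsigned long long timestamp)
{
	long long delta = (long long)(now - timestamp);

	if (delta < 0)                    /* timestamp set by another CPU */
		return 0;
	if (delta < (long long)NS_MAX_SLEEP_AVG)
		return (unsigned long long)delta;
	return NS_MAX_SLEEP_AVG;
}
```

The bug in the original hunk was that the new `if (delta < 0)` test sat between the `if` and its `else`, so the `else` bound to the wrong condition; Ingo's brace fix restores the structure above.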
* Re: [PATCH 1/13] timestamp fixes
2005-02-24 7:46 ` Ingo Molnar
@ 2005-02-24 7:56 ` Nick Piggin
2005-02-24 8:34 ` Ingo Molnar
0 siblings, 1 reply; 7+ messages in thread
From: Nick Piggin @ 2005-02-24 7:56 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andrew Morton, linux-kernel
On Thu, 2005-02-24 at 08:46 +0100, Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> > 1/13
> >
>
> ugh, has this been tested? It needs the patch below.
>
Yes. Which might also explain why I didn't see -ve intervals :(
Thanks Ingo.
In the context of the whole patchset, testing has mainly been
based around multiprocessor behaviour so this doesn't invalidate
that.
* Re: [PATCH 1/13] timestamp fixes
2005-02-24 7:56 ` Nick Piggin
@ 2005-02-24 8:34 ` Ingo Molnar
0 siblings, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2005-02-24 8:34 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andrew Morton, linux-kernel
* Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> On Thu, 2005-02-24 at 08:46 +0100, Ingo Molnar wrote:
> > * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> >
> > > 1/13
> > >
> >
> > ugh, has this been tested? It needs the patch below.
> >
>
> Yes. Which might also explain why I didn't see -ve intervals :( Thanks
> Ingo.
>
> In the context of the whole patchset, testing has mainly been based
> around multiprocessor behaviour so this doesn't invalidate that.
nono, by 'this' i only meant that patch. The other ones look mainly OK,
but obviously they need a _ton_ of testing.
these:
[PATCH 1/13] timestamp fixes
(+fix)
[PATCH 2/13] improve pinned task handling
[PATCH 3/13] rework schedstats
can go into BK right after 2.6.11 is released as they are fixes or
no-risk improvements. [let's call them 'group A'] These three:
[PATCH 4/13] find_busiest_group fixlets
[PATCH 5/13] find_busiest_group cleanup
[PATCH 7/13] better active balancing heuristic
look pretty fine too and i'd suggest early BK integration too - but in
theory they could impact things negatively so that's where immediate BK
integration has to stop in the first phase, to get some feedback. [let's
call them 'group B']
these:
[PATCH 6/13] no aggressive idle balancing
[PATCH 8/13] generalised CPU load averaging
[PATCH 9/13] less affine wakeups
[PATCH 10/13] remove aggressive idle balancing
[PATCH 11/13] sched-domains aware balance-on-fork
[PATCH 12/13] schedstats additions for sched-balance-fork
[PATCH 13/13] basic tuning
change things radically, and i'm uneasy about them even in the 2.6.12
timeframe. [let's call them 'group C'] I'd suggest we give them a go in
-mm and see how things go, so all of them get:
Acked-by: Ingo Molnar <mingo@elte.hu>
If things don't stabilize quickly then we need to do it piecemeal.
The only possible natural split seems to be to go for the running-task
balancing changes first:
[PATCH 6/13] no aggressive idle balancing
[PATCH 8/13] generalised CPU load averaging
[PATCH 9/13] less affine wakeups
[PATCH 10/13] remove aggressive idle balancing
[PATCH 13/13] basic tuning
perhaps #8 and relevant portions of #13 could be moved from group C into
group B and thus hit BK early, but that would need remerging.
and then for the fork/clone-balancing changes:
[PATCH 11/13] sched-domains aware balance-on-fork
[PATCH 12/13] schedstats additions for sched-balance-fork
a more fine-grained split-up doesn't make much sense, as these groups are
pretty compact conceptually.
But i expect fork/clone balancing to be almost certainly a problem. (We
didn't get it right for all workloads in 2.6.7, and i think it cannot be
gotten right currently either, without userspace API help - but i'd be
happy to be proven wrong.)
(if you agree with my generic analysis then when you regenerate your
patches next time please reorder them according to the flow above, and
please try to insert future fixlets not end-of-stream but according to
the conceptual grouping.)
Ingo
end of thread, other threads:[~2005-03-01 9:03 UTC | newest]
Thread overview: 7+ messages
[not found] <42235517.5070504@us.ibm.com>
2005-02-28 18:11 ` [PATCH 1/13] timestamp fixes Andrew Theurer
2005-03-01 8:09 ` Nick Piggin
2005-03-01 9:03 ` Nick Piggin
2005-02-24 7:14 [PATCH 0/13] Multiprocessor CPU scheduler patches Nick Piggin
2005-02-24 7:16 ` [PATCH 1/13] timestamp fixes Nick Piggin
2005-02-24 7:46 ` Ingo Molnar
2005-02-24 7:56 ` Nick Piggin
2005-02-24 8:34 ` Ingo Molnar