* [RFC PATCH] sched: START_NICE feature (temporarily niced forks) (v4)
@ 2010-09-20 19:19 Mathieu Desnoyers
2010-09-21 12:01 ` Mike Galbraith
2010-09-21 22:01 ` Jake Edge
From: Mathieu Desnoyers @ 2010-09-20 19:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, LKML, Mike Galbraith, Linus Torvalds, Andrew Morton,
Steven Rostedt, Thomas Gleixner, Tony Lindgren
This patch tweaks the fair vruntime calculation of both the parent and the child
after a fork so that their vruntime increments twice as fast, but only during
their first slice after the fork. The goal of this scheme is that a workload
doing many forks (e.g. make -j10) will have a limited impact on
latency-sensitive workloads, because forking tasks temporarily consume their
virtual runtime faster.
This is an alternative to START_DEBIT that does not have START_DEBIT's downside
of moving newly forked threads to the end of the runqueue.
Changelog since v3:
- Take Peter Zijlstra's comments into account.
- Move the timeout check and penalty reset to __update_curr().
Changelog since v2:
- Apply the vruntime penalty even the first time the exec time has moved across
the timeout.
Changelog since v1:
- Moved away from modifying the task weight from within the scheduler, as it is
error-prone: modifying the weight of a queued task leads to cpu weight errors.
For the moment, just tweak the calc_delta_fair() vruntime calculation. We could
revisit the weight-modification approach later if we decide it is worth the
more intrusive changes. I redid the START_NICE benchmark, which did not
change much: it is still appealing.
Latency benchmark:
* wakeup-latency.c (SIGEV_THREAD) with make -j10 on UP 2.0GHz
Kernel used: mainline 2.6.35.2 with smaller min_granularity and check_preempt
vruntime vs runtime comparison patches applied.
- START_DEBIT (vanilla setting)
maximum latency: 26409.0 µs
average latency: 6762.1 µs
missed timer events: 0
- NO_START_DEBIT, NO_START_NICE
maximum latency: 10001.8 µs
average latency: 1618.7 µs
missed timer events: 0
- START_NICE
maximum latency: 8351.2 µs
average latency: 1597.7 µs
missed timer events: 0
On the Xorg interactivity side, I notice a major improvement with START_NICE
compared to the two other settings. I came up with a very simple, repeatable,
low-tech test that takes both input and video-update responsiveness into
account:
Start make -j10 in a gnome-terminal.
In another gnome-terminal, press and hold the space bar.
Use the cursor speed (my cursor is a full rectangle) as a latency indicator.
With low latency, its speed should be constant: no stalls and no sudden
accelerations.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
include/linux/sched.h | 2 +
kernel/sched.c | 2 +
kernel/sched_debug.c | 11 +++++---
kernel/sched_fair.c | 65 ++++++++++++++++++++++++++++++++++++++++++------
kernel/sched_features.h | 6 ++++
5 files changed, 75 insertions(+), 11 deletions(-)
Index: linux-2.6-lttng.git/kernel/sched_features.h
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched_features.h
+++ linux-2.6-lttng.git/kernel/sched_features.h
@@ -12,6 +12,12 @@ SCHED_FEAT(GENTLE_FAIR_SLEEPERS, 1)
SCHED_FEAT(START_DEBIT, 1)
/*
+ * After a fork, ensure both the parent and the child get niced for their
+ * following slice.
+ */
+SCHED_FEAT(START_NICE, 0)
+
+/*
* Should wakeups try to preempt running tasks.
*/
SCHED_FEAT(WAKEUP_PREEMPT, 1)
Index: linux-2.6-lttng.git/include/linux/sched.h
===================================================================
--- linux-2.6-lttng.git.orig/include/linux/sched.h
+++ linux-2.6-lttng.git/include/linux/sched.h
@@ -1132,6 +1132,8 @@ struct sched_entity {
u64 prev_sum_exec_runtime;
u64 nr_migrations;
+ u64 fork_nice_timeout;
+ unsigned int fork_nice_penality;
#ifdef CONFIG_SCHEDSTATS
struct sched_statistics statistics;
Index: linux-2.6-lttng.git/kernel/sched.c
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched.c
+++ linux-2.6-lttng.git/kernel/sched.c
@@ -2421,6 +2421,8 @@ static void __sched_fork(struct task_str
p->se.sum_exec_runtime = 0;
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
+ p->se.fork_nice_timeout = 0;
+ p->se.fork_nice_penality = 0;
#ifdef CONFIG_SCHEDSTATS
memset(&p->se.statistics, 0, sizeof(p->se.statistics));
Index: linux-2.6-lttng.git/kernel/sched_fair.c
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched_fair.c
+++ linux-2.6-lttng.git/kernel/sched_fair.c
@@ -432,6 +432,8 @@ calc_delta_fair(unsigned long delta, str
{
if (unlikely(se->load.weight != NICE_0_LOAD))
delta = calc_delta_mine(delta, NICE_0_LOAD, &se->load);
+ if (se->fork_nice_penality)
+ delta <<= se->fork_nice_penality;
return delta;
}
@@ -481,6 +483,8 @@ static u64 sched_slice(struct cfs_rq *cf
load = &lw;
}
slice = calc_delta_mine(slice, se->load.weight, load);
+ if (se->fork_nice_penality)
+ slice <<= se->fork_nice_penality;
}
return slice;
}
@@ -511,6 +515,13 @@ __update_curr(struct cfs_rq *cfs_rq, str
curr->sum_exec_runtime += delta_exec;
schedstat_add(cfs_rq, exec_clock, delta_exec);
delta_exec_weighted = calc_delta_fair(delta_exec, curr);
+ if (curr->fork_nice_penality) {
+ if ((s64)(curr->sum_exec_runtime
+ - curr->fork_nice_timeout) > 0) {
+ curr->fork_nice_penality = 0;
+ curr->fork_nice_timeout = 0;
+ }
+ }
curr->vruntime += delta_exec_weighted;
update_min_vruntime(cfs_rq);
@@ -830,7 +841,12 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
* update can refer to the ->curr item and we need to reflect this
* movement in our normalized position.
*/
- if (!(flags & DEQUEUE_SLEEP))
+ if (flags & DEQUEUE_SLEEP) {
+ if (se->fork_nice_penality) {
+ se->fork_nice_penality = 0;
+ se->fork_nice_timeout = 0;
+ }
+ } else
se->vruntime -= cfs_rq->min_vruntime;
}
@@ -1576,8 +1592,6 @@ select_task_rq_fair(struct rq *rq, struc
static unsigned long
wakeup_gran(struct sched_entity *curr, struct sched_entity *se)
{
- unsigned long gran = sysctl_sched_wakeup_granularity;
-
/*
* Since its curr running now, convert the gran from real-time
* to virtual-time in his units.
@@ -1591,10 +1605,7 @@ wakeup_gran(struct sched_entity *curr, s
* This is especially important for buddies when the leftmost
* task is higher priority than the buddy.
*/
- if (unlikely(se->load.weight != NICE_0_LOAD))
- gran = calc_delta_fair(gran, se);
-
- return gran;
+ return calc_delta_fair(sysctl_sched_wakeup_granularity, se);
}
/*
@@ -3525,6 +3536,42 @@ static void task_tick_fair(struct rq *rq
}
/*
+ * Set task nice penality at fork. This is a temporary penality set for both
+ * parent and child at fork, which is removed after a slice.
+ */
+static void task_fork_fair_set_penality(struct cfs_rq *cfs_rq,
+ struct sched_entity *curr,
+ struct sched_entity *se)
+{
+ if (!sched_feat(START_NICE))
+ return;
+
+ if (curr->fork_nice_penality && (s64)(curr->sum_exec_runtime
+ - curr->fork_nice_timeout) > 0) {
+ curr->fork_nice_penality = 0;
+ curr->fork_nice_timeout = 0;
+ }
+
+ if (!curr->fork_nice_timeout)
+ curr->fork_nice_timeout = curr->sum_exec_runtime;
+ curr->fork_nice_timeout += sched_slice(cfs_rq, curr);
+ /*
+ * Arbitrarily cap the nice penality to <<= 8, which is 256 times
+ * lighter than the actual task weight. 256 is about 4 times lighter
+ * than the range from nice 0 to nice 19, which is 68 times lighter.
+ * This should be sufficient to gradually penalize fork-happy tasks
+ * without risking to run into shift overflow problems on deltas which
+ * are represented on a 64-bit unsigned integer.
+ */
+ curr->fork_nice_penality = min_t(unsigned int,
+ curr->fork_nice_penality + 1, 8);
+ /* Child sum_exec_runtime starts at 0 */
+ se->fork_nice_timeout = curr->fork_nice_timeout
+ - curr->sum_exec_runtime;
+ se->fork_nice_penality = curr->fork_nice_penality;
+}
+
+/*
* called on fork with the child task as argument from the parent's context
* - child not yet on the tasklist
* - preemption disabled
@@ -3544,8 +3591,10 @@ static void task_fork_fair(struct task_s
update_curr(cfs_rq);
- if (curr)
+ if (curr) {
se->vruntime = curr->vruntime;
+ task_fork_fair_set_penality(cfs_rq, curr, se);
+ }
place_entity(cfs_rq, se, 1);
if (sysctl_sched_child_runs_first && curr && entity_before(curr, se)) {
Index: linux-2.6-lttng.git/kernel/sched_debug.c
===================================================================
--- linux-2.6-lttng.git.orig/kernel/sched_debug.c
+++ linux-2.6-lttng.git/kernel/sched_debug.c
@@ -120,6 +120,10 @@ print_task(struct seq_file *m, struct rq
SEQ_printf(m, " %s", path);
}
#endif
+
+ SEQ_printf(m, " %d", p->se.fork_nice_penality);
+ SEQ_printf(m, " %9Ld.%06ld", SPLIT_NS(p->se.fork_nice_timeout));
+
SEQ_printf(m, "\n");
}
@@ -131,9 +135,10 @@ static void print_rq(struct seq_file *m,
SEQ_printf(m,
"\nrunnable tasks:\n"
" task PID tree-key switches prio"
- " exec-runtime sum-exec sum-sleep\n"
- "------------------------------------------------------"
- "----------------------------------------------------\n");
+ " exec-runtime sum-exec sum-sleep nice-pen"
+ " nice-pen-timeout\n"
+ "---------------------------------------------------------------"
+ "---------------------------------------------------------------\n");
read_lock_irqsave(&tasklist_lock, flags);
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
* Re: [RFC PATCH] sched: START_NICE feature (temporarily niced forks) (v4)
2010-09-20 19:19 [RFC PATCH] sched: START_NICE feature (temporarily niced forks) (v4) Mathieu Desnoyers
@ 2010-09-21 12:01 ` Mike Galbraith
2010-09-21 14:04 ` Mike Galbraith
2010-09-21 22:01 ` Jake Edge
From: Mike Galbraith @ 2010-09-21 12:01 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Peter Zijlstra, Ingo Molnar, LKML, Linus Torvalds, Andrew Morton,
Steven Rostedt, Thomas Gleixner, Tony Lindgren
On Mon, 2010-09-20 at 15:19 -0400, Mathieu Desnoyers wrote:
> Kernel used: mainline 2.6.35.2 with smaller min_granularity and check_preempt
> vruntime vs runtime comparison patches applied.
My test kernel is fresh baked v2.6.36-rc5.
> - START_DEBIT (vanilla setting)
>
> maximum latency: 26409.0 µs
> average latency: 6762.1 µs
> missed timer events: 0
Mine are much worse, as I have 0bf377bb reverted to keep my base test
numbers relevant while tinkering. These are fresh though.
maximum latency: 69261.1 µs 130058.0 µs 106636.9 µs
average latency: 9169.6 µs 9456.4 µs 9281.7 µs
missed timer events: 0 0 0
pert vs make -j3, 30 sec sample time
pert/s: 70 >6963.47us: 857 min: 0.06 max:60738.53 avg:10738.03 sum/s:754884us overhead:75.49%
pert/s: 70 >10471.13us: 847 min: 0.12 max:73405.91 avg:10674.23 sum/s:753245us overhead:75.31%
pert/s: 72 >12703.37us: 790 min: 0.11 max:55287.48 avg:10299.53 sum/s:749463us overhead:74.84%
pert/s: 71 >14825.31us: 740 min: 0.11 max:57264.25 avg:10581.39 sum/s:751984us overhead:75.20%
> - NO_START_DEBIT, NO_START_NICE
>
> maximum latency: 10001.8 µs
> average latency: 1618.7 µs
> missed timer events: 0
maximum latency: 19948.5 µs 19215.4 µs 19526.3 µs
average latency: 5000.1 µs 4712.2 µs 5005.5 µs
missed timer events: 0 0 0
pert vs make -j3, 30 sec sample time
pert/s: 61 >8928.57us: 743 min: 0.15 max:78659.33 avg:12895.03 sum/s:787026us overhead:78.64%
pert/s: 62 >12853.44us: 700 min: 0.12 max:83828.68 avg:12525.78 sum/s:778686us overhead:77.84%
pert/s: 61 >15566.82us: 675 min: 0.11 max:67289.16 avg:12685.47 sum/s:781002us overhead:78.07%
pert/s: 61 >18254.31us: 690 min: 1.40 max:72051.21 avg:12832.17 sum/s:782762us overhead:78.27%
> - START_NICE
>
> maximum latency: 8351.2 µs
> average latency: 1597.7 µs
> missed timer events: 0
maximum latency: 34004.7 µs 34712.5 µs 46956.6 µs
average latency: 7886.9 µs 8099.8 µs 8060.3 µs
missed timer events: 0 0 0
pert vs make -j3, 30 sec sample time
pert/s: 104 >5610.69us: 1036 min: 0.05 max:56740.62 avg:6047.66 sum/s:628957us overhead:62.90%
pert/s: 104 >8617.90us: 884 min: 0.15 max:65410.85 avg:5954.64 sum/s:623253us overhead:62.25%
pert/s: 116 >11005.35us: 837 min: 0.14 max:60020.97 avg:4963.97 sum/s:577641us overhead:57.76%
pert/s: 99 >13632.91us: 863 min: 0.14 max:68019.67 avg:6542.21 sum/s:648987us overhead:64.86%
V4 seems to have lost some effectiveness wrt new thread latency, and to have
tilted the fairness scale considerably further in the 100% hog's favor.
> @@ -481,6 +483,8 @@ static u64 sched_slice(struct cfs_rq *cf
> load = &lw;
> }
> slice = calc_delta_mine(slice, se->load.weight, load);
> + if (se->fork_nice_penality)
> + slice <<= se->fork_nice_penality;
> }
> return slice;
> }
Hm. Parent/child can exec longer (?), but also sit in the penalty box
longer, so pay through the nose. That doesn't look right. Why mess
with slice? Neither effect seems logical.
Since this is mostly about reducing latencies for the non-fork
competition, maybe a kinder gentler START_DEBIT would work. Let the
child inherit parent's vruntime, charge a fraction of the vruntime
equalizer bill _after_ it execs, until the bill has been paid, or
whatnot.
(I've tried a few things, and my bit-bucket runneth over, so this idea
probably sucks rocks too;)
-Mike
* Re: [RFC PATCH] sched: START_NICE feature (temporarily niced forks) (v4)
2010-09-21 12:01 ` Mike Galbraith
@ 2010-09-21 14:04 ` Mike Galbraith
2010-09-22 11:51 ` Mike Galbraith
From: Mike Galbraith @ 2010-09-21 14:04 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Peter Zijlstra, Ingo Molnar, LKML, Linus Torvalds, Andrew Morton,
Steven Rostedt, Thomas Gleixner, Tony Lindgren
On Tue, 2010-09-21 at 14:01 +0200, Mike Galbraith wrote:
> Since this is mostly about reducing latencies for the non-fork
> competition, maybe a kinder gentler START_DEBIT would work. Let the
> child inherit parent's vruntime, charge a fraction of the vruntime
> equalizer bill _after_ it execs, until the bill has been paid, or
> whatnot.
One thing that comes to mind is that a lot of this problem is generated by,
or rather exacerbated by, sleeper fairness. For example, take vfork: the
parent goes to sleep after passing the baton to the child. When the
child exits, the parent wakes, gets credit for sleeping while it was
really running in drag, may preempt as a result, and is free to repeat
the procedure, gaining an edge. START_DEBIT prevents that.
In the naked vfork/exec case, the parent _should_ resume at the relative
vruntime of the child, without further sleeper vruntime adjustment, but
that wouldn't help fix plain fork/clone which can do the same, nor would
it help kbuild's vfork->exec->clone[s]->parallel_stuff->exit unless you
add a lot of complexity...
(...heh, to end up with something remarkably similar to cgroup container
per process. Nope, don't want to re-invent the process scheduler to fix
a task scheduler corner case. That's the core though, same old divide
and conquer problem we've always had. We have a process scheduler
available these days though. Back to our task scheduler problem...)
Perhaps take that vslice penalty, apply it to parent and child, and if
present, deduct sleep time from their debt, and only credit sleep time
if there's anything left over? In any case, seems to me that sleep time
has to be included in the equation while the penalty is in effect.
Guess I'll try that, see what happens.
(but that will likely negate much of the benefit for your testcase,
where parent and child sleep)
-Mike
* Re: [RFC PATCH] sched: START_NICE feature (temporarily niced forks) (v4)
2010-09-20 19:19 [RFC PATCH] sched: START_NICE feature (temporarily niced forks) (v4) Mathieu Desnoyers
2010-09-21 12:01 ` Mike Galbraith
@ 2010-09-21 22:01 ` Jake Edge
From: Jake Edge @ 2010-09-21 22:01 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Peter Zijlstra, Ingo Molnar, LKML, Mike Galbraith, Linus Torvalds,
Andrew Morton, Steven Rostedt, Thomas Gleixner, Tony Lindgren
Hi Mathieu,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> + u64 fork_nice_timeout;
> + unsigned int fork_nice_penality;
A pedantic nit from an editor-type: penalty doesn't have an 'i', so you
probably want 'fork_nice_penalty'.
jake
--
Jake Edge - jake@lwn.net - http://www.lwn.net
* Re: [RFC PATCH] sched: START_NICE feature (temporarily niced forks) (v4)
2010-09-21 14:04 ` Mike Galbraith
@ 2010-09-22 11:51 ` Mike Galbraith
From: Mike Galbraith @ 2010-09-22 11:51 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Peter Zijlstra, Ingo Molnar, LKML, Linus Torvalds, Andrew Morton,
Steven Rostedt, Thomas Gleixner, Tony Lindgren
On Tue, 2010-09-21 at 16:04 +0200, Mike Galbraith wrote:
> Perhaps take that vslice penalty, apply it to parent and child, and if
> present, deduct sleep time from their debt, and only credit sleep time
> if there's anything left over? In any case, seems to me that sleep time
> has to be included in the equation while the penalty is in effect.
>
> Guess I'll try that, see what happens.
Nothing good. Several flavors thereof.
> (but that will likely negate much of the benefit for your testcase,
> where parent and child sleep)
(yup, it did)
Back to the old drawing board.
-Mike