From: Mike Galbraith <efault@gmx.de>
To: linux-ia64@vger.kernel.org
Subject: [Linux-ia64] [case closed] Re: web page on O(1) scheduler
Date: Tue, 27 May 2003 15:16:28 +0000 [thread overview]
Message-ID: <marc-linux-ia64-105590723706043@msgid-missing> (raw)
[-- Attachment #1: Type: text/plain, Size: 2102 bytes --]
At 06:25 PM 5/22/2003 +0200, Mike Galbraith wrote:
>At 11:52 AM 5/22/2003 +0200, Mike Galbraith wrote:
>>At 10:56 AM 5/21/2003 -0700, David Mosberger wrote:
>>> >>>>> On Wed, 21 May 2003 11:26:31 +0200, Mike Galbraith
>>> <efault@gmx.de> said:
>>>
>>> Mike> The page mentions persistent starvation. My own explorations
>>> Mike> of this issue indicate that the primary source is always
>>> Mike> selecting the highest priority queue.
>>>
>>>My working assumption is that the problem is a bug with the dynamic
>>>prioritization. The task receiving the signals calls sleep() after
>>>handling a signal, and hence its dynamic priority should end up higher
>>>than the priority of the task sending signals (since the sender never
>>>relinquishes the CPU voluntarily).
>>>
>>>However, I haven't actually had time to look at the relevant code, so
>>>I may be missing something. If you understand the issue better,
>>>please explain to me why this isn't a dynamic priority issue.
>>
>>You're right, it looks like a corner case.
>
>Out of curiosity, is someone hitting that with a real program?
>
> -Mike
>
>aside:
>if so, I suppose nano-ticks may be needed....
Since I spent a bunch of time fumbling around with this problem, I may as
well post one last time and put it to bed. I bet Ingo has a _lot_ better
way to cure that problem, but...

Yes indeed, pedantic tracking of cpu usage does fix the problem. It also
makes for some interesting changes in contest numbers (well, _I_ find them
interesting;). The numbers are attached for anyone who wants to see them,
and so is the diff for anyone who wants to try it; have fun... the
standard "you get to keep the pieces" lkml warranty applies :) FWIW, it
works just fine here. The only time anyone now manages to steal a tick
(exactly one) is, apparently, once in a long while when someone (whose
initials appear to be IDE:) keeps interrupts disabled for too long.
Nobody has yet managed to steal more than one tick.
The End,
-Mike
P.S. ok, so they're micro-ticks... what's a few orders of magnitude :)
[-- Attachment #2: xx.diff --]
[-- Type: application/octet-stream, Size: 7534 bytes --]
--- linux-2.5.69.virgin/include/linux/sched.h.org Fri May 23 07:14:23 2003
+++ linux-2.5.69.virgin/include/linux/sched.h Tue May 27 09:30:51 2003
@@ -328,7 +328,9 @@
prio_array_t *array;
unsigned long sleep_avg;
- unsigned long last_run;
+ unsigned long long last_run;
+ unsigned int run_nsecs;
+ unsigned int sleep_nsecs;
unsigned long policy;
unsigned long cpus_allowed;
--- linux-2.5.69.virgin/kernel/sched.c.org Sun May 25 06:05:42 2003
+++ linux-2.5.69.virgin/kernel/sched.c Tue May 27 12:58:02 2003
@@ -74,6 +74,12 @@
#define MAX_SLEEP_AVG (10*HZ)
#define STARVATION_LIMIT (10*HZ)
#define NODE_THRESHOLD 125
+#define SCHED_NANOSECOND 1
+#define SCHED_SECOND (1000000000 * SCHED_NANOSECOND)
+#define SCHED_TICK (SCHED_SECOND / HZ)
+#define TICKS_PER_SECOND (SCHED_SECOND / SCHED_TICK)
+
+extern unsigned long long monotonic_clock(void);
/*
* If a task is 'interactive' then we reinsert it in the active
@@ -342,10 +348,23 @@
*/
static inline int activate_task(task_t *p, runqueue_t *rq)
{
- long sleep_time = jiffies - p->last_run - 1;
- int requeue_waker = 0;
+ unsigned long long now = monotonic_clock();
+ long long sleep = now - p->last_run + p->sleep_nsecs;
+ int ticks = 0, requeue_waker = 0;
+
+ if (sleep >= SCHED_TICK) {
+ while (sleep >= SCHED_SECOND) {
+ sleep -= SCHED_SECOND;
+ ticks += TICKS_PER_SECOND;
+ }
+ while (sleep >= SCHED_TICK) {
+ sleep -= SCHED_TICK;
+ ticks++;
+ }
+ p->sleep_nsecs = sleep;
+ } else p->sleep_nsecs += sleep;
- if (sleep_time > 0) {
+ if (ticks > 0) {
int sleep_avg;
/*
@@ -356,7 +375,7 @@
* spends sleeping, the higher the average gets - and the
* higher the priority boost gets as well.
*/
- sleep_avg = p->sleep_avg + sleep_time;
+ sleep_avg = p->sleep_avg + ticks;
/*
* 'Overflow' bonus ticks go to the waker as well, so the
@@ -364,8 +383,10 @@
* boosting tasks that are related to maximum-interactive
* tasks.
*/
- if (sleep_avg > MAX_SLEEP_AVG)
+ if (sleep_avg > MAX_SLEEP_AVG) {
sleep_avg = MAX_SLEEP_AVG;
+ p->sleep_nsecs = 0;
+ }
if (p->sleep_avg != sleep_avg) {
p->sleep_avg = sleep_avg;
p->prio = effective_prio(p);
@@ -571,6 +592,8 @@
current->sleep_avg = current->sleep_avg * PARENT_PENALTY / 100;
p->sleep_avg = p->sleep_avg * CHILD_PENALTY / 100;
p->prio = effective_prio(p);
+ p->run_nsecs = 0;
+ p->sleep_nsecs = 0;
set_task_cpu(p, smp_processor_id());
if (unlikely(!current->array))
@@ -1170,6 +1193,49 @@
(jiffies - (rq)->expired_timestamp >= \
STARVATION_LIMIT * ((rq)->nr_running) + 1)))
+inline void __scheduler_tick(runqueue_t *rq, task_t *p)
+{
+ unsigned long long now = monotonic_clock();
+ prio_array_t *array = rq->active;
+ int ticks;
+
+ p->run_nsecs += now - p->last_run;
+ /* Task might have expired already, but not scheduled off yet */
+ if (p->array != array) {
+ set_tsk_need_resched(p);
+ goto abort;
+ }
+ if (p->run_nsecs < SCHED_TICK || p->policy == SCHED_FIFO )
+ goto abort;
+
+ for (ticks = 0; p->run_nsecs >= SCHED_TICK; ticks++)
+ p->run_nsecs -= SCHED_TICK;
+ if (ticks > p->time_slice)
+ show_task(p);
+ if (p->sleep_avg > ticks)
+ p->sleep_avg -= ticks;
+ else
+ p->sleep_avg = 0;
+ p->time_slice -= ticks;
+
+ if (p->time_slice <= 0) {
+ dequeue_task(p, p->array);
+ p->prio = effective_prio(p);
+ p->time_slice = task_timeslice(p);
+ p->first_time_slice = 0;
+ set_tsk_need_resched(p);
+ if ((EXPIRED_STARVING(rq) && !rt_task(p)) ||
+ !TASK_INTERACTIVE(p)) {
+ array = rq->expired;
+ if (!rq->expired_timestamp)
+ rq->expired_timestamp = jiffies;
+ }
+ enqueue_task(p, array);
+ }
+abort:
+ p->last_run = monotonic_clock();
+}
+
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
@@ -1182,11 +1248,12 @@
int cpu = smp_processor_id();
runqueue_t *rq = this_rq();
task_t *p = current;
+ int idle = p == rq->idle;
if (rcu_pending(cpu))
rcu_check_callbacks(cpu, user_ticks);
- if (p == rq->idle) {
+ if (idle) {
/* note: this timer irq context must be accounted for as well */
if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET)
kstat_cpu(cpu).cpustat.system += sys_ticks;
@@ -1194,8 +1261,7 @@
kstat_cpu(cpu).cpustat.iowait += sys_ticks;
else
kstat_cpu(cpu).cpustat.idle += sys_ticks;
- rebalance_tick(rq, 1);
- return;
+ goto out;
}
if (TASK_NICE(p) > 0)
kstat_cpu(cpu).cpustat.nice += user_ticks;
@@ -1203,11 +1269,6 @@
kstat_cpu(cpu).cpustat.user += user_ticks;
kstat_cpu(cpu).cpustat.system += sys_ticks;
- /* Task might have expired already, but not scheduled off yet */
- if (p->array != rq->active) {
- set_tsk_need_resched(p);
- return;
- }
spin_lock(&rq->lock);
/*
* The task was running during this tick - update the
@@ -1217,41 +1278,10 @@
* it possible for interactive tasks to use up their
* timeslices at their highest priority levels.
*/
- if (p->sleep_avg)
- p->sleep_avg--;
- if (unlikely(rt_task(p))) {
- /*
- * RR tasks need a special form of timeslice management.
- * FIFO tasks have no timeslices.
- */
- if ((p->policy == SCHED_RR) && !--p->time_slice) {
- p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
- set_tsk_need_resched(p);
-
- /* put it at the end of the queue: */
- dequeue_task(p, rq->active);
- enqueue_task(p, rq->active);
- }
- goto out;
- }
- if (!--p->time_slice) {
- dequeue_task(p, rq->active);
- set_tsk_need_resched(p);
- p->prio = effective_prio(p);
- p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
-
- if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
- if (!rq->expired_timestamp)
- rq->expired_timestamp = jiffies;
- enqueue_task(p, rq->expired);
- } else
- enqueue_task(p, rq->active);
- }
-out:
+ __scheduler_tick(rq, p);
spin_unlock(&rq->lock);
- rebalance_tick(rq, 0);
+out:
+ rebalance_tick(rq, idle);
}
void scheduling_functions_start_here(void) { }
@@ -1286,8 +1316,8 @@
rq = this_rq();
release_kernel_lock(prev);
- prev->last_run = jiffies;
spin_lock_irq(&rq->lock);
+ __scheduler_tick(rq, prev);
/*
* if entering off of a kernel preemption go straight
@@ -1342,6 +1372,7 @@
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
+ next->last_run = prev->last_run;
prepare_arch_switch(rq, next);
prev = context_switch(rq, prev, next);
--- linux-2.5.69.virgin/arch/i386/kernel/timers/timer_tsc.c.org Sat May 24 08:55:05 2003
+++ linux-2.5.69.virgin/arch/i386/kernel/timers/timer_tsc.c Sat May 24 09:26:09 2003
@@ -102,12 +102,13 @@
static unsigned long long monotonic_clock_tsc(void)
{
unsigned long long last_offset, this_offset, base;
+ unsigned long flags;
/* atomically read monotonic base & last_offset */
- read_lock_irq(&monotonic_lock);
+ read_lock_irqsave(&monotonic_lock, flags);
last_offset = ((unsigned long long)last_tsc_high<<32)|last_tsc_low;
base = monotonic_base;
- read_unlock_irq(&monotonic_lock);
+ read_unlock_irqrestore(&monotonic_lock, flags);
/* Read the Time Stamp Counter */
rdtscll(this_offset);
--- linux-2.5.69.virgin/kernel/printk.c.org Sun May 25 09:10:38 2003
+++ linux-2.5.69.virgin/kernel/printk.c Sun May 25 09:11:04 2003
@@ -510,8 +510,10 @@
console_may_schedule = 0;
up(&console_sem);
spin_unlock_irqrestore(&logbuf_lock, flags);
+#if 0
if (wake_klogd && !oops_in_progress && waitqueue_active(&log_wait))
wake_up_interruptible(&log_wait);
+#endif
}
/** console_conditional_schedule - yield the CPU if required
[-- Attachment #3: contest.log --]
[-- Type: application/octet-stream, Size: 1557 bytes --]
no_load:
Kernel   [runs]  Time  CPU%  Loads  LCPU%  Ratio
2.5.69        1   153  94.8    0.0    0.0   1.00
2.5.69X       1   154  94.2    0.0    0.0   1.00

cacherun:
Kernel   [runs]  Time  CPU%  Loads  LCPU%  Ratio
2.5.69        1   146  98.6    0.0    0.0   0.95
2.5.69X       1   147  98.6    0.0    0.0   0.95

process_load:
Kernel   [runs]  Time  CPU%  Loads  LCPU%  Ratio
2.5.69        1   331  43.8   90.0   55.3   2.16
2.5.69X       1   344  41.9   94.0   57.0   2.23

ctar_load:
Kernel   [runs]  Time  CPU%  Loads  LCPU%  Ratio
2.5.69        1   190  77.9    0.0    0.0   1.24
2.5.69X       1   187  79.1    0.0    0.0   1.21

xtar_load:
Kernel   [runs]  Time  CPU%  Loads  LCPU%  Ratio
2.5.69        1   196  75.0    0.0    3.1   1.28
2.5.69X       1   198  74.7    0.0    3.5   1.29

io_load:
Kernel   [runs]  Time  CPU%  Loads  LCPU%  Ratio
2.5.69        1   437  34.6   69.1   15.1   2.86
2.5.69X       1   349  43.3   53.3   14.9   2.27

io_other:
Kernel   [runs]  Time  CPU%  Loads  LCPU%  Ratio
2.5.69        1   387  38.8   69.3   17.3   2.53
2.5.69X       1   436  34.9   63.4   14.0   2.83

read_load:
Kernel   [runs]  Time  CPU%  Loads  LCPU%  Ratio
2.5.69        1   221  67.0    9.4    3.6   1.44
2.5.69X       1   208  71.2    9.0    3.4   1.35

list_load:
Kernel   [runs]  Time  CPU%  Loads  LCPU%  Ratio
2.5.69        1   203  71.4   99.0   20.2   1.33
2.5.69X       1   204  71.1  100.0   20.1   1.32

mem_load:
Kernel   [runs]  Time  CPU%  Loads  LCPU%  Ratio
2.5.69        1   256  57.8   34.0    1.2   1.67
2.5.69X       1   244  60.2   36.0    1.2   1.58

dbench_load:
Kernel   [runs]  Time  CPU%  Loads  LCPU%  Ratio
2.5.69        1   517  28.8    5.0   35.6   3.38
2.5.69X       1   444  33.3    3.0   28.8   2.88

ab_load:
Kernel   [runs]  Time  CPU%  Loads  LCPU%  Ratio
2.5.69        1   300  48.3   46.0   21.7   1.96
2.5.69X       1   293  49.5   44.0   21.5   1.90