public inbox for linux-kernel@vger.kernel.org
* [announce] [patch] ultra-scalable O(1) SMP and UP scheduler
@ 2002-01-04  2:19 Ingo Molnar
  2002-01-04  4:27 ` Oliver Xymoron
                   ` (6 more replies)
  0 siblings, 7 replies; 65+ messages in thread
From: Ingo Molnar @ 2002-01-04  2:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: Linus Torvalds, Alan Cox


now that new-year's parties are over, things are getting boring again. For
those who want to see and perhaps even try something more complex, i'm
announcing this patch that is a pretty radical rewrite of the Linux
scheduler for 2.5.2-pre6:

	http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-A0.patch

for 2.4.17:

	http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-A0.patch

Goal
====

The main goal of the new scheduler is to keep all the good things we know
and love about the current Linux scheduler:

 - good interactive performance even during high load: if the user
   types or clicks then the system must react instantly and must execute
   the user tasks smoothly, even during considerable background load.

 - good scheduling/wakeup performance with 1-2 runnable processes.

 - fairness: no process should stay without any timeslice for any
   unreasonable amount of time. No process should get an unjustly high
   amount of CPU time.

 - priorities: less important tasks can be started with lower priority,
   more important tasks with higher priority.

 - SMP efficiency: no CPU should stay idle if there is work to do.

 - SMP affinity: processes which run on one CPU should stay affine to
   that CPU. Processes should not bounce between CPUs too frequently.

 - plus additional scheduler features: RT scheduling, CPU binding.

and the goal is also to add a few new things:

 - fully O(1) scheduling. Are you tired of the recalculation loop
   blowing the L1 cache away every now and then? Do you think the goodness
   loop is taking a bit too long to finish if there are lots of runnable
   processes? This new scheduler takes no prisoners: wakeup(), schedule(),
   the timer interrupt are all O(1) algorithms. There is no recalculation
   loop. There is no goodness loop either.

 - 'perfect' SMP scalability. With the new scheduler there is no 'big'
   runqueue_lock anymore - it's all per-CPU runqueues and locks - two
   tasks on two separate CPUs can wake up, schedule and context-switch
   completely in parallel, without any interlocking. All
   scheduling-relevant data is structured for maximum scalability. (see
   the benchmark section later on.)

 - better SMP affinity. The old scheduler has a particular weakness that
   causes random bouncing of tasks between CPUs when higher-priority or
   interactive tasks are present; this was observed and reported by many
   people. The reason is that the timeslice recalculation loop first needs
   every currently running task to consume its timeslice. But when this
   happens on e.g. an 8-way system, this property starves an increasing
   number of CPUs of any process to execute. Once the last task that still
   has a timeslice left has finished using it up, the recalculation loop
   is triggered and the other CPUs can start executing tasks again - after
   having idled around for a number of timer ticks. The more CPUs, the
   worse this effect.

   Furthermore, this same effect causes the bouncing effect as well:
   whenever there is such a 'timeslice squeeze' of the global runqueue,
   idle processors start executing tasks which are not affine to that CPU.
   (because the affine tasks have finished off their timeslices already.)

   The new scheduler solves this problem by distributing timeslices on a
   per-CPU basis, without having any global synchronization or
   recalculation.

 - batch scheduling. A significant proportion of compute-intensive tasks
   benefit from batch-scheduling, where timeslices are long and processes
   are roundrobin scheduled. The new scheduler does such batch-scheduling
   of the lowest priority tasks - so nice +19 jobs will get
   'batch-scheduled' automatically. With this scheduler, nice +19 jobs are
   in essence SCHED_IDLE, from an interactiveness point of view.

 - handle extreme loads more smoothly, without breakdown and scheduling
   storms.

 - O(1) RT scheduling. For those RT folks who are paranoid about the
   O(nr_running) property of the goodness loop and the recalculation loop.

 - run fork()ed children before the parent. Andrea pointed out the
   advantages of this a few months ago, but patches for this feature
   do not work with the old scheduler as well as they should,
   because idle processes often steal the new child before the fork()ing
   CPU gets to execute it.
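in code terms, the O(1) claim means every scheduler operation touches only
the one task involved plus a fixed-size bitmap, never the whole set of
runnable tasks. A minimal sketch of the enqueue/dequeue paths (names and
layout are illustrative, not the patch's actual code):

```c
#include <stdint.h>

/* illustrative per-CPU queue state: a bitmap with one bit per
 * priority level, plus a count per level standing in for the
 * real per-priority task lists */
struct cpu_queues {
    uint64_t bitmap;       /* bit p set => priority level p non-empty */
    int nr_queued[64];
};

/* wakeup: constant-time enqueue at the task's priority */
void enqueue(struct cpu_queues *q, int prio)
{
    q->nr_queued[prio]++;
    q->bitmap |= (uint64_t)1 << prio;
}

/* timeslice expiry: constant-time dequeue - the task then moves
 * to the expired array; there is no loop over runnable tasks */
void dequeue(struct cpu_queues *q, int prio)
{
    if (--q->nr_queued[prio] == 0)
        q->bitmap &= ~((uint64_t)1 << prio);
}
```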


Design
======

(those who find the following design issues boring can skip to the
'Benchmarks' section below.)

the core of the new scheduler consists of the following mechanisms:

 - *two*, priority-ordered 'priority arrays' per CPU. There is an 'active'
   array and an 'expired' array. The active array contains all tasks that
   are affine to this CPU and have timeslices left. The expired array
   contains all tasks which have used up their timeslices - but this array
   is kept sorted as well. The two arrays are not accessed directly; they
   are reached through two pointers in the per-CPU runqueue
   structure. If all active tasks are used up then we 'switch' the two
   pointers and from now on the ready-to-go (former-) expired array is the
   active array - and the empty active array serves as the new collector
   for expired tasks.

 - there is a 64-bit bitmap cache for array indices. Finding the highest
   priority task is thus a matter of two x86 BSFL bit-search instructions.
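a minimal sketch of this lookup (illustrative names; the patch uses the
real BSFL instruction where this sketch uses a portable loop):

```c
#include <stdint.h>

/* sketch of a per-CPU priority array: a 64-bit bitmap of
 * non-empty queues, split into two 32-bit words */
struct prio_array {
    uint32_t bitmap[2];   /* bit p set => queue for priority p non-empty */
    int nr_active;
    /* the real patch also has one task list per priority level */
};

/* portable stand-in for BSFL: index of the least-significant
 * set bit of a non-zero 32-bit word */
int bsfl(uint32_t w)
{
    int i = 0;
    while (!(w & 1)) { w >>= 1; i++; }
    return i;
}

/* finding the highest-priority runnable task is at most two
 * word scans, independent of the number of runnable tasks
 * (lower numeric priority = more important, so first set bit wins) */
int sched_find_first_bit(const struct prio_array *a)
{
    if (a->bitmap[0])
        return bsfl(a->bitmap[0]);
    if (a->bitmap[1])
        return 32 + bsfl(a->bitmap[1]);
    return -1;            /* no runnable tasks */
}
```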

the split-array solution enables us to have an arbitrary number of active
and expired tasks, and the recalculation of timeslices can be done
immediately when the timeslice expires. Because the arrays are always
accessed through the pointers in the runqueue, switching the two arrays can
be done very quickly.

this is a hybrid priority-list approach coupled with roundrobin
scheduling and the array-switch method of distributing timeslices.
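the array switch itself is just a pointer swap; roughly (field names are
illustrative - the real prio_array also carries the bitmap and the
per-priority queues):

```c
/* simplified stand-in for the priority array */
struct prio_array { int nr_active; };

/* the per-CPU runqueue holds two pointers into its own pair
 * of arrays */
struct runqueue {
    struct prio_array *active;
    struct prio_array *expired;
    struct prio_array arrays[2];
};

/* when the active array drains, one pointer swap makes the
 * already-sorted expired array the new active array - O(1),
 * no matter how many tasks have expired */
void array_switch(struct runqueue *rq)
{
    struct prio_array *tmp = rq->active;
    rq->active = rq->expired;
    rq->expired = tmp;
}
```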

 - there is a per-task 'load estimator'.

one of the toughest things to get right is good interactive feel during
heavy system load. While playing with various scheduler variants i found
that the best interactive feel is achieved not by 'boosting' interactive
tasks, but by 'punishing' tasks that want to use more CPU time than there
is available. This method is also much easier to do in an O(1) fashion.

to establish the actual 'load' the task contributes to the system, a
complex-looking but pretty accurate method is used: there is a 4-entry
'history' ringbuffer of the task's activities during the last 4 seconds.
This ringbuffer is operated without much overhead. The entries tell the
scheduler a pretty accurate load-history of the task: has it used up more
CPU time or less during the past N seconds. [the size '4' and the interval
of 4x 1 seconds was found by lots of experimentation - this part is
flexible and can be changed in both directions.]
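roughly, the bookkeeping looks like this (the patch keeps the
sleep_hist[4]/sleep_idx fields in task_struct, as visible in the sched.h
diff later in this thread; the accounting logic below is a simplified
sketch, not the patch's exact code):

```c
#define SLEEP_HIST_SIZE 4   /* 4 x 1-second intervals, per the mail */

/* per-task load history: one slot per one-second interval */
struct load_hist {
    int hist[SLEEP_HIST_SIZE];  /* CPU ticks used in each interval */
    int idx;                    /* slot for the current second */
};

/* once per second: advance to the next slot, forgetting the
 * oldest interval by clearing it */
void hist_tick_second(struct load_hist *h)
{
    h->idx = (h->idx + 1) % SLEEP_HIST_SIZE;
    h->hist[h->idx] = 0;
}

/* per scheduler tick: charge the running task's current slot */
void hist_account(struct load_hist *h, int ticks)
{
    h->hist[h->idx] += ticks;
}

/* the task's load over the last ~4 seconds: sum of the ring.
 * All of this is constant work per tick - no per-task loops. */
int hist_load(const struct load_hist *h)
{
    int i, sum = 0;
    for (i = 0; i < SLEEP_HIST_SIZE; i++)
        sum += h->hist[i];
    return sum;
}
```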

the penalty a task gets for generating more load than the CPU can handle
is a priority decrease - there is a maximum amount to this penalty
relative to their static priority, so even fully CPU-bound tasks will
observe each other's priorities, and will share the CPU accordingly.
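as a sketch, that priority calculation could be expressed like this (the
constant and the names here are illustrative, not the patch's actual
values):

```c
/* illustrative bound on how far a CPU hog can sink below its
 * static (nice-derived) priority */
#define MAX_PENALTY 5

/* effective priority = static priority plus a load-based penalty,
 * clamped so even fully CPU-bound tasks keep observing each
 * other's relative priorities (higher value = lower priority) */
int effective_prio(int static_prio, int load, int capacity)
{
    int penalty = 0;

    if (load > capacity)            /* wants more CPU than exists */
        penalty = load - capacity;
    if (penalty > MAX_PENALTY)      /* bounded relative to static prio */
        penalty = MAX_PENALTY;
    return static_prio + penalty;
}
```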

I've separated the RT scheduler into a different codebase, while still
keeping some of the scheduling codebase common. This does not look pretty
in certain places such as __sched_tail() or activate_task(), but i don't
think it can be avoided. RT scheduling is different: it uses a global
runqueue (and global spinlock) and it needs global decisions. To make RT
scheduling more instant, i've added a broadcast-reschedule message as
well, to make absolutely sure that RT tasks of the right priority are
scheduled appropriately, even on SMP systems. The RT-scheduling part is
O(1) as well.

the SMP load-balancer can be extended/switched with additional parallel
computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs
can be supported easily by changing the load-balancer. Right now it's
tuned for my SMP systems.

i skipped the prev->mm == next->mm advantage - no workload i know of shows
any sensitivity to this. It could be added back by sacrificing O(1)
schedule() [the current and the one-lower priority list can be searched
for a that->mm == current->mm condition], but that costs a fair number of
cycles during a number of important workloads, so i wanted to avoid it as
much as possible.

- the SMP idle-task startup code was still racy and the new scheduler
triggered this. So i streamlined the idle-setup code a bit. We do not call
into schedule() before all processors have started up fully and all idle
threads are in place.

- the patch also cleans up a number of aspects of sched.c - moves code
into other areas of the kernel where it's appropriate, and simplifies
certain code paths and data constructs. As a result, the new scheduler's
code is smaller than the old one.

(i'm sure there are other details i forgot to explain. I've commented some
of the more important code paths and data constructs. If you think some
aspect of this design is faulty or misses some important issue then please
let me know.)

(the current code is by no means perfect, my main goal right now, besides
fixing bugs, is to make the code cleaner. Any suggestions for
simplifications are welcome.)

Benchmarks
==========

i've performed two major groups of benchmarks: first i've verified the
interactive performance (interactive 'feel') of the new scheduler on both
UP and SMP systems. While this is a pretty subjective thing, i found
that the new scheduler is at least as good as the old one in all areas,
and in a number of high load workloads it feels visibly smoother. I've
tried a number of workloads, such as make -j background compilation or
1000 background processes. Interactive performance can also be verified
via tracing both schedulers, and i've done that and found no areas of
missed wakeups or imperfect SMP scheduling latencies in either of the two
schedulers.

the other group of benchmarks was the actual performance of the scheduler.
I picked the following ones (some were intentionally picked to load the
scheduler, others were picked to make the benchmark spectrum more
complete):

 - compilation benchmarks

 - the chat-server workload simulator written by Bill Hartner

 - the usual components from the lmbench suite

 - a heavily sched_yield()-ing testcode to measure yield() performance.

 [ i can test any other workload too that anyone would find interesting. ]

i ran these benchmarks on a 1-CPU box using a UP kernel, and on 2-CPU and
8-CPU boxes using the SMP kernel.

The chat-server simulator creates a number of processes that are connected
to each other via TCP sockets, the processes send messages to each other
randomly, in a way that simulates actual chat server designs and
workloads.

3 successive runs of './chat_c 127.0.0.1 10 1000' produce the following
message throughput:

vanilla-2.5.2-pre6:

 Average throughput : 110619 messages per second
 Average throughput : 107813 messages per second
 Average throughput : 120558 messages per second

O(1)-schedule-2.5.2-pre6:

 Average throughput : 131250 messages per second
 Average throughput : 116333 messages per second
 Average throughput : 179686 messages per second

this is a roughly 20% improvement.

To get all benefits of the new scheduler, i ran it reniced, which in
essence triggers round-robin batch scheduling for the chat server tasks:

3 successive runs of 'nice -n 19 ./chat_c 127.0.0.1 10 1000' produce the
following throughput:

vanilla-2.5.2-pre6:

 Average throughput : 77719 messages per second
 Average throughput : 83460 messages per second
 Average throughput : 90029 messages per second

O(1)-schedule-2.5.2-pre6:

 Average throughput : 609942 messages per second
 Average throughput : 610221 messages per second
 Average throughput : 609570 messages per second

throughput improved by more than 600%. The UP and 2-way SMP tests show a
similar edge for the new scheduler. Furthermore, during these chat-server
tests the old scheduler doesn't handle interactive tasks very well, and
the system is very jerky (which is a side-effect of the overscheduling
situation the scheduler gets into).

the 1-CPU UP numbers are interesting as well:

vanilla-2.5.2-pre6:

 ./chat_c 127.0.0.1 10 100
 Average throughput : 102885 messages per second
 Average throughput : 95319 messages per second
 Average throughput : 99076 messages per second

 nice -n 19 ./chat_c 127.0.0.1 10 1000
 Average throughput : 161205 messages per second
 Average throughput : 151785 messages per second
 Average throughput : 152951 messages per second

O(1)-schedule-2.5.2-pre6:

 ./chat_c 127.0.0.1 10 100 # NEW
 Average throughput : 128865 messages per second
 Average throughput : 115240 messages per second
 Average throughput : 99034 messages per second

 nice -n 19 ./chat_c 127.0.0.1 10 1000 # NEW
 Average throughput : 163112 messages per second
 Average throughput : 163012 messages per second
 Average throughput : 163652 messages per second

this shows that while on UP we don't have the scalability improvements, the
O(1) scheduler is still slightly ahead.


another benchmark measures sched_yield() performance. (which the pthreads
code relies on pretty heavily.)
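the loop_yield source is not included in this mail; a minimal equivalent
(bounded here so it terminates - the benchmark runs several unbounded
instances and reads the context-switch rate from vmstat) would be:

```c
#include <sched.h>

/* minimal stand-in for the loop_yield test: each instance just
 * re-enters the scheduler as fast as it can via sched_yield().
 * Returns the number of successful yields. */
long yield_loop(long iterations)
{
    long done = 0;

    while (done < iterations) {
        if (sched_yield() != 0)   /* returns 0 on success */
            break;
        done++;
    }
    return done;
}
```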

on a 2-way system, starting 4 instances of ./loop_yield gives the
following context-switch throughput:

vanilla-2.5.2-pre6

  # vmstat 5 | cut -c57-
     system         cpu
   in    cs  us  sy  id
  102 241247   6  94   0
  101 240977   5  95   0
  101 241051   6  94   0
  101 241176   7  93   0

O(1)-schedule-2.5.2-pre6

  # vmstat 5 | cut -c57-
     system         cpu
   in     cs  us  sy  id
  101 977530  31  69   0
  101 977478  28  72   0
  101 977538  27  73   0

the O(1) scheduler is 300% faster, and we do nearly 1 million context
switches per second!

this test is even more interesting on the 8-way system, running 16
instances of loop_yield:

vanilla-2.5.2-pre6:

   vmstat 5 | cut -c57-
      system         cpu
    in     cs  us  sy  id
   106 108498   2  98   0
   101 108333   1  99   0
   102 108437   1  99   0

100K/sec context switches - the overhead of the global runqueue makes the
scheduler slower than the 2-way box!

O(1)-schedule-2.5.2-pre6:

    vmstat 5 | cut -c57-
     system         cpu
    in      cs  us  sy  id
   102 6120358  34  66   0
   101 6117063  33  67   0
   101 6117124  34  66   0

this is more than 6 million context switches per second! (i think this is
a first, no Linux box in existence did so many context switches per second
before.) This is one workload where the per-CPU runqueues and scalability
advantages show up big time.

here are the lat_proc and lat_ctx comparisons (the results quoted here are
the best numbers from a series of tests):

vanilla-2.5.2-pre6:

  ./lat_proc fork
  Process fork+exit: 440.0000 microseconds
  ./lat_proc exec
  Process fork+execve: 491.6364 microseconds
  ./lat_proc shell
  Process fork+/bin/sh -c: 3545.0000 microseconds

O(1)-schedule-2.5.2-pre6:

  ./lat_proc fork
  Process fork+exit: 168.6667 microseconds
  ./lat_proc exec
  Process fork+execve: 279.6500 microseconds
  ./lat_proc shell
  Process fork+/bin/sh -c: 2874.0000 microseconds

the difference is pretty dramatic - it's mostly due to avoiding much of
the COW overhead that comes from fork()+execve(). The fork()+exit()
improvement is mostly due to better CPU affinity - parent and child are
running on the same CPU, while the old scheduler pushes the child to
another, idle CPU, which creates heavy interlocking traffic between the MM
structures.

the compilation benchmarks i ran gave very similar results for both
schedulers. The O(1) scheduler has a small, roughly 2% advantage in make -j
benchmarks (not accounting for statistical noise - it's hard to produce
reliable compilation benchmarks) - probably due to better SMP affinity
again.

Status
======

i've tested the new scheduler under the aforementioned range of systems
and workloads, but it's still experimental code nevertheless. I've
developed it on SMP systems using the 2.5.2-pre kernels, so it has the
most testing there, but i did a fair number of UP and 2.4.17 tests as
well. NOTE! For the 2.5.2-pre6 kernel to be usable you should apply
Andries' latest 2.5.2pre6-kdev_t patch available at:

	http://www.kernel.org/pub/linux/kernel/people/aeb/

i also tested the RT scheduler for various situations such as
sched_yield()-ing of RT tasks, strace-ing RT tasks and other details, and
it's all working as expected. There might be some rough edges though.

Comments, bug reports, suggestions are welcome,

	Ingo


* Re: [announce] [patch] ultra-scalable O(1) SMP and UP scheduler
@ 2002-01-04  3:56 Ed Tomlinson
  0 siblings, 0 replies; 65+ messages in thread
From: Ed Tomlinson @ 2002-01-04  3:56 UTC (permalink / raw)
  To: linux-kernel

Building against 2.4.17 (with rmap-10c applied) I get:

loop.c: In function `loop_thread':
loop.c:574: structure has no member named `nice'
make[2]: *** [loop.o] Error 1
make[2]: Leaving directory `/usr/src/linux/drivers/block'
make[1]: *** [_modsubdir_block] Error 2
make[1]: Leaving directory `/usr/src/linux/drivers'
make: *** [_mod_drivers] Error 2

and

md.c: in function `md_thread':
md.c:2934: structure has no member named `nice'
md.c: In function `md_do_sync_Rdc586d8a':
md.c:3387: structure has no member named `nice'
md.c:3466: structure has no member named `nice'
md.c:3475: structure has no member named `nice'
make[2]: *** [md.o] Error 1
make[2]: Leaving directory `/usr/src/linux/drivers/md'
make[1]: *** [_modsubdir_md] Error 2
make[1]: Leaving directory `/usr/src/linux/drivers'
make: *** [_mod_drivers] Error 2

Changing nice to __nice lets the compile work.  Is this correct?

When installing modules I get:

if [ -r System.map ]; then /sbin/depmod -ae -F System.map  2.4.17rmap10c+O1A0; fi
depmod: *** Unresolved symbols in /lib/modules/2.4.17rmap10c+O1A0/kernel/fs/jbd/jbd.o
depmod:         sys_sched_yield
depmod: *** Unresolved symbols in /lib/modules/2.4.17rmap10c+O1A0/kernel/fs/nfs/nfs.o
depmod:         sys_sched_yield
depmod: *** Unresolved symbols in /lib/modules/2.4.17rmap10c+O1A0/kernel/net/sunrpc/sunrpc.o
depmod:         sys_sched_yield

What needs to be done to fix these symbols?

I used the following patch to fix the rejects in A0 caused
by rmap...

TIA,

Ed Tomlinson

--- x/include/linux/sched.h.orig	Thu Jan  3 21:23:48 2002
+++ x/include/linux/sched.h	Thu Jan  3 21:32:39 2002
@@ -295,34 +295,28 @@
 
 	int lock_depth;		/* Lock depth */
 
-/*
- * offset 32 begins here on 32-bit platforms. We keep
- * all fields in a single cacheline that are needed for
- * the goodness() loop in schedule().
- */
-	long counter;
-	long nice;
-	unsigned long policy;
-	struct mm_struct *mm;
-	int processor;
-	/*
-	 * cpus_runnable is ~0 if the process is not running on any
-	 * CPU. It's (1 << cpu) if it's running on a CPU. This mask
-	 * is updated under the runqueue lock.
-	 *
-	 * To determine whether a process might run on a CPU, this
-	 * mask is AND-ed with cpus_allowed.
-	 */
-	unsigned long cpus_runnable, cpus_allowed;
 	/*
-	 * (only the 'next' pointer fits into the cacheline, but
-	 * that's just fine.)
+	 * offset 32 begins here on 32-bit platforms.
 	 */
-	struct list_head run_list;
-	unsigned long sleep_time;
+	unsigned int cpu;
+	int prio;
+	long __nice;
+	list_t run_list;
+	prio_array_t *array;
+
+	unsigned int time_slice;
+	unsigned long sleep_timestamp, run_timestamp;
+
+	#define SLEEP_HIST_SIZE 4
+	int sleep_hist[SLEEP_HIST_SIZE];
+	int sleep_idx;
+
+	unsigned long policy;
+	unsigned long cpus_allowed;
 
 	struct task_struct *next_task, *prev_task;
-	struct mm_struct *active_mm;
+
+	struct mm_struct *mm, *active_mm;
 
 /* task state */
 	struct linux_binfmt *binfmt;
--- x/mm/page_alloc.c.orig	Thu Jan  3 21:20:57 2002
+++ x/mm/page_alloc.c	Thu Jan  3 21:21:57 2002
@@ -465,9 +465,7 @@
 		 * NFS: we must yield the CPU (to rpciod) to avoid deadlock.
 		 */
 		if (gfp_mask & __GFP_WAIT) {
-			__set_current_state(TASK_RUNNING);
-			current->policy |= SCHED_YIELD;
-			schedule();
+			yield();
 			if (!order || free_high(ALL_ZONES) >= 0) {
 				int progress = try_to_free_pages(gfp_mask);
 				if (progress || (gfp_mask & __GFP_FS))

* Re: [announce] [patch] ultra-scalable O(1) SMP and UP scheduler
@ 2002-01-04  7:42 Dieter Nützel
  2002-01-04  8:02 ` David Lang
  0 siblings, 1 reply; 65+ messages in thread
From: Dieter Nützel @ 2002-01-04  7:42 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linux Kernel List

After the loop.c fix (nfs.o and sunrpc.o waiting) I got it running here, too.

2.4.17
sched-O1-2.4.17-A0.patch
00_nanosleep-5		(Andrea)
bootmem-2.4.17-pre6	(at all IBM)
elevator-fix		(Andrew, worth it for 2.4.18)

plus ReiserFS fixes
linux-2.4.17rc2-KLMN+exp_trunc+3fixes.patch
corrupt_items_checks.diff
kmalloc_cleanup.diff
O-inode-attrs.patch

Too much trouble during the 10_vm-21 (VM taken from 2.4.17rc2aa2) merge,
so I skipped it this time. Much wanted for 2.4.18 (best VM I ever had).

All of the above (with preempt and 10_vm-21 from AA, but without
sched-O1-2.4.17-A0.patch) never crashed before.

One system crash during the first X start up (kdeinit).

ksymoops 2.4.3 on i686 2.4.17-O1.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.17-O1/ (default)
     -m /boot/System.map (specified)

invalid operand: 0000
CPU:    0
EIP:    0010:[<c01194ae>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 00000000   ebx: c1a7ca40   ecx: c028e658   edx: dea3002c
esi: dea30000   edi: 00000000   ebp: bffff19c   esp: dea31fac
ds: 0018   es: 0018   ss: 0018
Process kdeinit (pid: 961, stackpage=dea31000)
Stack: dea30000 40e1ed40 00000000 c01194ee 00000000 c0106d0b 00000000 
00000001 
       40e208c4 40e1ed40 00000000 bffff19c 00000001 0000002b 0000002b 
00000001 
       40da84dd 00000023 00000287 bffff170 0000002b 
Call Trace: [<c01194ee>] [<c0106d0b>] 
Code: 0f 0b e9 74 fe ff ff 8d 74 26 00 8d bc 27 00 00 00 00 8b 44 

>>EIP; c01194ae <do_exit+1ee/200>   <=====
Trace; c01194ee <sys_exit+e/10>
Trace; c0106d0a <system_call+32/38>
Code;  c01194ae <do_exit+1ee/200>
00000000 <_EIP>:
Code;  c01194ae <do_exit+1ee/200>   <=====
   0:   0f 0b                     ud2a      <=====
Code;  c01194b0 <do_exit+1f0/200>
   2:   e9 74 fe ff ff            jmp    fffffe7b <_EIP+0xfffffe7b> c0119328 
<do_exit+68/200>
Code;  c01194b4 <do_exit+1f4/200>
   7:   8d 74 26 00               lea    0x0(%esi,1),%esi
Code;  c01194b8 <do_exit+1f8/200>
   b:   8d bc 27 00 00 00 00      lea    0x0(%edi,1),%edi
Code;  c01194c0 <complete_and_exit+0/20>
  12:   8b 44 00 00               mov    0x0(%eax,%eax,1),%eax

System runs relatively smooth but is under full system load.

  7:57am  up  1:36,  1 user,  load average: 0.18, 0.18, 0.26
91 processes: 90 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  0.7% user, 99.2% system,  0.0% nice,  0.0% idle
Mem:   643064K av,  464212K used,  178852K free,       0K shrd,   89964K buff
Swap: 1028120K av,    3560K used, 1024560K free                  179928K 
cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 1263 root      52   0  138M  42M  1728 S    33.0  6.7  16:46 X
 1669 nuetzel   62   0  7544 7544  4412 S    16.6  1.1  11:57 artsd
10891 nuetzel   46   0  113M  17M 12600 S    12.0  2.7   2:35 kmail
 1756 nuetzel   46   0  105M 9.9M  7288 S    10.8  1.5   7:45 kmix
 1455 nuetzel   45   0  109M  12M 10508 S     7.9  2.0   1:55 kdeinit
 1467 nuetzel   45   0  107M  10M  8456 S     5.5  1.7   0:55 kdeinit
 1414 nuetzel   45   0  105M 8916  7536 S     3.9  1.3   1:59 kdeinit
  814 squid     45   0  6856 6848  1280 S     2.3  1.0   0:52 squid
    6 root      45   0     0    0     0 SW    2.1  0.0   0:42 kupdated
 1458 nuetzel   45   0  109M  13M  9856 S     1.3  2.0   0:44 kdeinit
11099 nuetzel   47   0  1000 1000   776 R     1.3  0.1   0:00 top
  556 root      45   0  1136 1136   848 S     1.1  0.1   0:14 smpppd
 1678 nuetzel   45   0  7544 7544  4412 S     0.7  1.1   0:12 artsd
  494 root      45   0  3072 3072  1776 S     0.3  0.4   0:18 named
 1451 root      45   0  6860 6852  1416 S     0.1  1.0   0:14 xperfmon++
10871 nuetzel   45   0  111M  14M 11680 S     0.1  2.3   0:14 kdeinit
    1 root      46   0   212  212   176 S     0.0  0.0   0:06 init

/home/nuetzel> procinfo
Linux 2.4.17-O1 (root@SunWave1) (gcc 2.95.3 20010315 ) #1 1CPU [SunWave1.]

Memory:      Total        Used        Free      Shared     Buffers      Cached
Mem:        643064      452988      190076           0      100096      183148
Swap:      1028120        3560     1024560

Bootup: Fri Jan  4 06:21:02 2002    Load average: 0.14 0.32 0.31 
1894046082/90 11460

user  :       0:16:59.19  14.9%  page in :   404106  disk 1:    31792r   
70750w
nice  :       0:00:00.00   0.0%  page out:  2580336  disk 2:      137r     
225w
system:       1:19:41.05  70.0%  swap in :        2  disk 3:        1r       
0w
idle  :       0:17:11.67  15.1%  swap out:      695  disk 4:     1009r       
0w
uptime:       1:53:51.90         context :  2427806

irq  0:    683191 timer                 irq  8:    154583 rtc
irq  1:     12567 keyboard              irq  9:         0 acpi, usb-ohci
irq  2:         0 cascade [4]           irq 10:    103402 aic7xxx
irq  5:      9333 eth1                  irq 11:    310704 eth0, EMU10K1
irq  7:       115 parport0 [3]          irq 12:    136392 PS/2 Mouse

More processes die during second and third boot...
I have some more crash logs if needed.

Preempt plus lock-break is better for now.
latencytest0.42-png crashes the system.

What next?
Maybe a combination of O(1) and preempt?

-- 
Dieter Nützel
Graduate Student, Computer Science

University of Hamburg
Department of Computer Science
@home: Dieter.Nuetzel@hamburg.de


* Re: [announce] [patch] ultra-scalable O(1) SMP and UP scheduler
@ 2002-01-04 11:16 rwhron
  2002-01-04 13:20 ` Ingo Molnar
  0 siblings, 1 reply; 65+ messages in thread
From: rwhron @ 2002-01-04 11:16 UTC (permalink / raw)
  To: mingo; +Cc: linux-kernel

Ingo,

I tried the new scheduler with the kdev_t patch you mentioned and it
went very well for a long time.  dbench 64, 128, 192 all completed,
3 iterations of bonnie++ were fine too.

When I ran the syscalls test from http://ltp.sf.net/ I had a
few problems.

The first time, the machine rebooted.  The last thing in the logfile
was that the pipe11 test was running.  I believe it got past that point.
(The syscall tests run in alphabetical order.)

The second time I ran it, while I was tailing the output the
machine seemed to freeze.  (Usually the syscall tests complete
very quickly.)  I got the oops below on a serial console; it
scrolled for a long time and it didn't seem like the call trace
would ever complete, so I rebooted.

I ran it a third time trying to isolate which test triggered the oops,
but the behavior was different again.  The machine got very very
slow, but tests would eventually print their output.  The test that 
triggered the behavior was apparently between pipe11 and the setrlimit01 
command below.

Here is top in the locked state:

33 processes: 23 sleeping, 4 running, 6 zombie, 0 stopped
CPU states:  0.3% user,  0.0% system,  0.0% nice, 99.6% idle
Mem:   385344K av,   83044K used,  302300K free,       0K shrd,   51976K buff
Swap:  152608K av,       0K used,  152608K free                   16564K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
   50 root      55   0  1620 1620  1404 R     0.1  0.4   0:05 sshd
 8657 root      47   0   948  948   756 R     0.1  0.2   0:00 top
    1 root      46   0   524  524   452 S     0.0  0.1   0:06 init
    2 root      63   0     0    0     0 SW    0.0  0.0   0:00 keventd
    3 root      63  19     0    0     0 RWN   0.0  0.0   0:00 ksoftirqd_CPU0
    4 root      63   0     0    0     0 SW    0.0  0.0   0:00 kswapd
    5 root      63   0     0    0     0 SW    0.0  0.0   0:00 bdflush
    6 root      47   0     0    0     0 SW    0.0  0.0   0:00 kupdated
    7 root      45   0     0    0     0 SW    0.0  0.0   0:00 kreiserfsd
   28 root      45   0   620  620   516 S     0.0  0.1   0:00 syslogd
   31 root      46   0   480  480   404 S     0.0  0.1   0:00 klogd
   35 root      47   0     0    0     0 SW    0.0  0.0   0:00 eth0
   40 root      49   0   816  816   664 S     0.0  0.2   0:00 iplog
   41 root      46   0   816  816   664 S     0.0  0.2   0:00 iplog
   42 root      45   0   816  816   664 S     0.0  0.2   0:00 iplog
   43 root      45   0   816  816   664 S     0.0  0.2   0:00 iplog
   44 root      58   0   816  816   664 S     0.0  0.2   0:01 iplog
   46 root      53   0  1276 1276  1156 S     0.0  0.3   0:00 sshd
   48 root      46   0   472  472   396 S     0.0  0.1   0:00 agetty
   51 hrandoz   49   0  1164 1164   892 S     0.0  0.3   0:00 bash
   59 root      46   0  1184 1184  1028 S     0.0  0.3   0:00 bash
  702 root      45   0  1224 1224  1028 S     0.0  0.3   0:01 bash
 8564 root      63   0  1596 1596  1404 S     0.0  0.4   0:00 sshd
 8602 root      46   0   472  472   396 S     0.0  0.1   0:00 agetty
 8616 hrandoz   63   0  1164 1164   892 S     0.0  0.3   0:00 bash
 8645 root      49   0  1152 1152   888 S     0.0  0.2   0:00 bash
 8647 root      46   0   468  468   388 R     0.0  0.1   0:00 setrlimit01
 8649 root      46   0     0    0     0 Z     0.0  0.0   0:00 setrlimit01 <defunct>
 8650 root      46   0     0    0     0 Z     0.0  0.0   0:00 setrlimit01 <defunct>
 8651 root      46   0     0    0     0 Z     0.0  0.0   0:00 setrlimit01 <defunct>
 8658 root      46   0     0    0     0 Z     0.0  0.0   0:00 setrlimit01 <defunct>
 8659 root      46   0     0    0     0 Z     0.0  0.0   0:00 setrlimit01 <defunct>
 8660 root      46   0     0    0     0 Z     0.0  0.0   0:00 setrlimit01 <defunct>

No modules in ksyms, skipping objects
Warning (read_lsmod): no symbols in lsmod, is /proc/modules a valid lsmod file?
 Unable to handle kernel NULL pointer dereference at virtual address 00000000
*pde = c024be44
Oops: 0002
CPU:    0
EIP:    0010:[<c024c06d>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010046
eax: 00000000   ebx: d6184000   ecx: d6184290   edx: d6184290
esi: c024be3c   edi: d6184001   ebp: d6185f98   esp: c024c06c
ds: 0018   es: 0018   ss: 0018
Process  (pid: 0, stackpage=c024b000)
Stack: c024c06c c024c06c c024c074 c024c074 c024c07c c024c07c c024c084 c024c084
       c024c08c c024c08c c024c094 c024c094 c024c09c c024c09c c024c0a4 c024c0a4
       c024c0ac c024c0ac c024c0b4 c024c0b4 c024c0bc c024c0bc c024c0c4 c024c0c4
Code: c0 24 c0 6c c0 24 c0 74 c0 24 c0 74 c0 24 c0 7c c0 24 c0 7c

>>EIP; c024c06c <rt_array+28c/340>   <=====
Code;  c024c06c <rt_array+28c/340>
00000000 <_EIP>:
Code;  c024c06c <rt_array+28c/340>   <=====
   0:   c0 24 c0 6c               shlb   $0x6c,(%eax,%eax,8)   <=====
Code;  c024c070 <rt_array+290/340>
   4:   c0 24 c0 74               shlb   $0x74,(%eax,%eax,8)
Code;  c024c074 <rt_array+294/340>
   8:   c0 24 c0 74               shlb   $0x74,(%eax,%eax,8)
Code;  c024c078 <rt_array+298/340>
   c:   c0 24 c0 7c               shlb   $0x7c,(%eax,%eax,8)
Code;  c024c07c <rt_array+29c/340>
  10:   c0 24 c0 7c               shlb   $0x7c,(%eax,%eax,8)

CPU:    0
EIP:    0010:[<c01110b5>]    Not tainted
EFLAGS: 00010086
eax: c0000000   ebx: c0248000   ecx: c024ca58   edx: 01400000
esi: 00000000   edi: c0110c48   ebp: c1400000   esp: c0247fb8
ds: 0018   es: 0018   ss: 0018
Process $.#. (pid: 5373, stackpage=c0247000)
Stack: c0248000 00000000 c0110c48 c1400000 00000000 00000000 c0246000 00000000
       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
       00000000 00000000
Call Trace: [<c0110c48>]
Code: f6 04 10 81 0f 84 02 fe ff ff 5b 5e 5f 5d 81 c4 8c 00 00 00

>>EIP; c01110b4 <do_page_fault+46c/488>   <=====
Trace; c0110c48 <do_page_fault+0/488>
Code;  c01110b4 <do_page_fault+46c/488>
00000000 <_EIP>:
Code;  c01110b4 <do_page_fault+46c/488>   <=====
   0:   f6 04 10 81               testb  $0x81,(%eax,%edx,1)   <=====
Code;  c01110b8 <do_page_fault+470/488>
   4:   0f 84 02 fe ff ff         je     fffffe0c <_EIP+0xfffffe0c> c0110ec0 <do_page_fault+278/488>
Code;  c01110be <do_page_fault+476/488>
   a:   5b                        pop    %ebx
Code;  c01110be <do_page_fault+476/488>
   b:   5e                        pop    %esi
Code;  c01110c0 <do_page_fault+478/488>
   c:   5f                        pop    %edi
Code;  c01110c0 <do_page_fault+478/488>
   d:   5d                        pop    %ebp
Code;  c01110c2 <do_page_fault+47a/488>
   e:   81 c4 8c 00 00 00         add    $0x8c,%esp

CPU:    0
EIP:    0010:[<c0180e93>]    Not tainted
EFLAGS: 00010002
eax: c1604120   ebx: 00000000   ecx: c0203cc0   edx: 00000000
esi: 00000000   edi: c0108e1c   ebp: c1400000   esp: c0247f30
ds: 0018   es: 0018   ss: 0018
Process $.#. (pid: 5373, stackpage=c0247000)
Stack: 0000000f 00000000 c0110bfb 00000000 c0108afe 00000000 c0247f84 c01dcccd
       c01dcd1f 00000000 00000001 c0247f84 c0108e70 c01dcd1f c0247f84 00000000
       c0246000 00000000 c0108684 c0247f84 00000000 c0248000 c024ca58 01400000
Call Trace: [<c0110bfb>] [<c0108afe>] [<c0108e70>] [<c0108684>] [<c0110c48>]
   [<c01110b5>] [<c0110c48>]
Code: 80 78 04 00 75 7f 8b 15 a0 90 20 c0 c7 05 44 1c 26 c0 1c 0f

>>EIP; c0180e92 <unblank_screen+4e/d8>   <=====
Trace; c0110bfa <bust_spinlocks+22/4c>
Trace; c0108afe <die+42/50>
Trace; c0108e70 <do_double_fault+54/5c>
Trace; c0108684 <error_code+34/40>
Trace; c0110c48 <do_page_fault+0/488>
Trace; c01110b4 <do_page_fault+46c/488>
Trace; c0110c48 <do_page_fault+0/488>
Code;  c0180e92 <unblank_screen+4e/d8>
00000000 <_EIP>:
Code;  c0180e92 <unblank_screen+4e/d8>   <=====
   0:   80 78 04 00               cmpb   $0x0,0x4(%eax)   <=====
Code;  c0180e96 <unblank_screen+52/d8>
   4:   75 7f                     jne    85 <_EIP+0x85> c0180f16 <unblank_screen+d2/d8>
Code;  c0180e98 <unblank_screen+54/d8>
   6:   8b 15 a0 90 20 c0         mov    0xc02090a0,%edx
Code;  c0180e9e <unblank_screen+5a/d8>
   c:   c7 05 44 1c 26 c0 1c      movl   $0xf1c,0xc0261c44
Code;  c0180ea4 <unblank_screen+60/d8>
  13:   0f 00 00 

CPU:    0
EIP:    0010:[<c0180e93>]    Not tainted
EFLAGS: 00010002
eax: c1604120   ebx: 00000000   ecx: c0203cc0   edx: 00000000
esi: 00000000   edi: c0108e1c   ebp: c1400000   esp: c0247ea8
ds: 0018   es: 0018   ss: 0018
Process $.#. (pid: 5373, stackpage=c0247000)
Stack: 0000000f 00000000 c0110bfb 00000000 c0108afe 00000000 c0247efc c01dcccd
       c01dcd1f 00000000 00000001 c0247efc c0108e70 c01dcd1f c0247efc 00000000
       c0246000 00000000 c0108684 c0247efc 00000000 00000000 c0203cc0 00000000
Call Trace: [<c0110bfb>] [<c0108afe>] [<c0108e70>] [<c0108684>] [<c0108e1c>]
   [<c0180e93>] [<c0110bfb>] [<c0108afe>] [<c0108e70>] [<c0108684>] [<c0110c48>]
   [<c01110b5>] [<c0110c48>]
Code: 80 78 04 00 75 7f 8b 15 a0 90 20 c0 c7 05 44 1c 26 c0 1c 0f


It looks like you already have an idea where the problem is.
Looking forward to the next patch.  :)
-- 
Randy Hron
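
[Editor's note: for readers unfamiliar with the `symbol+offset/size` notation in the decoded trace above (e.g. `c024c06c <rt_array+28c/340>`), ksymoops resolves each raw address against System.map by finding the nearest preceding symbol; the offset is the distance from that symbol and the size is the gap to the next symbol. A minimal sketch of that lookup — the base addresses below are back-computed from the addresses and offsets in this very trace, and the `*_end` entries are hypothetical next-symbol placeholders standing in for whatever follows in System.map:]

```python
import bisect

# Toy symbol table in System.map order (address, name).
# Bases derived from the oops above: c024c06c - 0x28c = c024bde0, etc.
# The *_end entries are hypothetical "next symbol" sentinels.
symtab = [
    (0xc0110c48, "do_page_fault"),
    (0xc01110d0, "do_page_fault_end"),
    (0xc024bde0, "rt_array"),
    (0xc024c120, "rt_array_end"),
]
addrs = [a for a, _ in symtab]

def resolve(addr):
    """Map an address to ksymoops-style symbol+offset/size."""
    i = bisect.bisect_right(addrs, addr) - 1      # nearest preceding symbol
    base, name = symtab[i]
    size = symtab[i + 1][0] - base                # distance to next symbol
    return f"{name}+{addr - base:x}/{size:x}"

print(resolve(0xc024c06c))   # rt_array+28c/340
print(resolve(0xc01110b4))   # do_page_fault+46c/488
```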


[parent not found: <20020104074239.94E016DAA6@mail.elte.hu>]
* Re: [announce] [patch] ultra-scalable O(1) SMP and UP scheduler
@ 2002-01-07 17:34 Rene Rebe
  0 siblings, 0 replies; 65+ messages in thread
From: Rene Rebe @ 2002-01-07 17:34 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel

Hi all.

I tried the sched-O1-2.4.17-C1.patch on a 2.4.17 kernel running on a
UP Athlon-XP at 1466 MHz, 512MB RAM, SiS 735 board and an IBM IDE
disk.

It works (no crashes) including XFree-4.1 and ALSA modules loaded.

But during higher load (normal gcc compilations are enough) my system
gets really unresponsive and my mouse-cursor (USB mouse, XFree-4.1,
Matrox G450) flickers at ~5 fps across the screen ... :-((

I'll retry with the D0 patch ;-)

k33p h4ck1n6
  René Rebe

-- 
René Rebe (Registered Linux user: #248718 <http://counter.li.org>)

eMail:    rene.rebe@gmx.net
          rene@rocklinux.org

Homepage: http://www.tfh-berlin.de/~s712059/index.html

Anyone sending unwanted advertising e-mail to this address will be
charged $25 for network traffic and computing time. By extracting my
address from this message or its header, you agree to these terms.


[parent not found: <20020107.191742.730580837.rene.rebe@gmx.net>]

end of thread, other threads:[~2002-01-07 19:53 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-01-04  2:19 [announce] [patch] ultra-scalable O(1) SMP and UP scheduler Ingo Molnar
2002-01-04  4:27 ` Oliver Xymoron
2002-01-04  5:34 ` Ian Morgan
2002-01-04 10:30 ` Anton Blanchard
2002-01-04 12:53   ` Ingo Molnar
2002-01-04 14:18 ` Thomas Cataldo
2002-01-04 14:46 ` dan kelley
2002-01-04 17:07   ` Ingo Molnar
2002-01-04 15:22     ` Nikita Danilov
2002-01-05  4:33 ` Davide Libenzi
2002-01-05 20:24   ` Ingo Molnar
2002-01-05 19:49     ` Mika Liljeberg
2002-01-05 22:00       ` Ingo Molnar
2002-01-05 21:04         ` Mika Liljeberg
2002-01-05 23:04     ` Davide Libenzi
2002-01-05 23:41       ` Alan Cox
2002-01-05 23:46         ` Davide Libenzi
2002-01-06  0:47       ` Linus Torvalds
2002-01-06  2:57         ` Ingo Molnar
2002-01-06  1:27           ` Linus Torvalds
2002-01-06  1:45             ` Davide Libenzi
2002-01-06  3:55               ` Ingo Molnar
2002-01-06  2:16                 ` Davide Libenzi
2002-01-06  3:41             ` Ingo Molnar
2002-01-06  2:02               ` Davide Libenzi
2002-01-06  2:13                 ` Alan Cox
2002-01-06  2:12                   ` Davide Libenzi
2002-01-06  2:20                     ` Alan Cox
2002-01-06  2:17                       ` Davide Libenzi
2002-01-06  3:30                   ` 2.4.17 kernel without modules...was " Vikram
2002-01-06  4:07                   ` [announce] [patch] ultra-scalable O(1) " Ingo Molnar
2002-01-06  2:23                     ` Davide Libenzi
2002-01-06  2:30                     ` Alan Cox
2002-01-06  4:19                       ` Ingo Molnar
2002-01-07  3:08                   ` Richard Henderson
2002-01-07  3:16                     ` Linus Torvalds
2002-01-07  3:31                       ` Davide Libenzi
2002-01-07  3:34                         ` Linus Torvalds
2002-01-07  3:49                           ` Davide Libenzi
2002-01-06  4:01                 ` Ingo Molnar
2002-01-06  4:08               ` Ingo Molnar
2002-01-06  4:16                 ` Ingo Molnar
2002-01-06  3:55                   ` Luc Van Oostenryck
2002-01-07  8:00             ` Jens Axboe
2002-01-06  2:49       ` Ingo Molnar
2002-01-07  2:58 ` Rusty Russell
  -- strict thread matches above, loose matches on Subject: below --
2002-01-04  3:56 Ed Tomlinson
2002-01-04  7:42 Dieter Nützel
2002-01-04  8:02 ` David Lang
2002-01-04 11:44   ` Ingo Molnar
2002-01-04 11:33     ` David Lang
2002-01-04 13:39       ` Andrea Arcangeli
2002-01-04 14:04       ` Ingo Molnar
2002-01-04 13:36     ` Andrea Arcangeli
2002-01-04 15:44       ` Ingo Molnar
2002-01-04 14:45         ` Andrea Arcangeli
2002-01-04 11:16 rwhron
2002-01-04 13:20 ` Ingo Molnar
     [not found] <20020104074239.94E016DAA6@mail.elte.hu>
2002-01-04 11:42 ` Ingo Molnar
2002-01-05  0:28   ` Roger Larsson
2002-01-05  7:53     ` george anzinger
2002-01-05 16:54       ` Robert Love
2002-01-05 12:42     ` Ingo Molnar
2002-01-07 17:34 Rene Rebe
     [not found] <20020107.191742.730580837.rene.rebe@gmx.net>
     [not found] ` <Pine.LNX.4.33.0201072124380.14212-100000@localhost.localdomain>
2002-01-07 19:53   ` Rene Rebe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox