public inbox for linux-kernel@vger.kernel.org
* RFC for a new Scheduling policy/class in the Linux-kernel
@ 2009-07-10 21:50 Henrik Austad
  2009-07-11 18:28 ` Peter Zijlstra
  0 siblings, 1 reply; 82+ messages in thread
From: Henrik Austad @ 2009-07-10 21:50 UTC (permalink / raw)
  To: LKML; +Cc: Ingo Molnar, Peter Zijlstra, Bill Huey, Linux RT

Hi all!

This is a proposal for a global [1], deadline-driven scheduler for
real-time tasks in the Linux kernel. I thought I should send out an RFC to
gather some feedback instead of wildly hacking away at it.

The proposed scheduler is a variant of Modified Least Laxity First
(MLLF) called Earliest Failure First (EFF), as it orders tasks according
to when they will miss their deadlines, not by the deadlines themselves.

== Motivation and background ==

Deadlines give the developer greater flexibility and expressiveness
when creating real-time applications. Compared to a priority scheme,
this simplifies the process considerably, as it removes the need for
calculating priorities off-line in order to find the priority map
that orders the tasks correctly. Another important argument for
deadlines instead of priorities is systems too dynamic to analyze
(a large application with hundreds of processes/threads, or a system
running more than one rt-application):

* In very large systems, it becomes very difficult to find the correct
  set of priorities, even with sophisticated tools, and a slight change
  will require a re-calculation of all priorities.

* In very dynamic systems, it can be impossible to analyze the system
  off-line, reducing the calculated priorities to best-effort only.

* If more than one application runs at the same time, this becomes even
  worse.


As a final point, a schedule of tasks with their priorities is, in
almost all scenarios, a result of the deadlines of all tasks. This also
goes for non-rt tasks, even though the concept of a deadline is a bit
fuzzier there. The problem is that this mapping is a one-way function:
you cannot determine the deadlines from a set of priorities.

The problem is, how to implement this efficiently in a priority-oriented
kernel, and more importantly, how to extend this to multi-core
systems. For single-core systems, EDF is optimal and can accept tasks up
to 100% utilization and still guarantee that all deadlines are
met. However, on multi-core, this breaks down and the utilization bound
must be set very low if any guarantee should be given (known as "Dhall's
effect").
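The construction behind Dhall's effect can be checked with a few lines of
arithmetic. The sketch below uses the textbook scenario (the task
parameters and function name are mine, for illustration only): m light
tasks and one heavy task on m CPUs, where global EDF misses a deadline
even though total utilization is barely above 1.

```python
# Classic Dhall's-effect scenario: m+1 periodic tasks on m CPUs.
# m "light" tasks: C = 2*eps, T = D = 1; one "heavy" task: C = 1, T = D = 1+eps.
# At t=0 all jobs are released; the light tasks have earlier deadlines, so
# under global EDF they occupy all m CPUs until t = 2*eps.  The heavy task
# then runs [2*eps, 1 + 2*eps] and misses its deadline at 1 + eps, even
# though total utilization approaches 1 (far below the m CPUs available).

def dhall_example(m, eps):
    u_total = m * (2 * eps) / 1.0 + 1.0 / (1 + eps)
    heavy_finish = 2 * eps + 1.0        # earliest completion under global EDF
    heavy_deadline = 1 + eps
    return u_total, heavy_finish > heavy_deadline   # (utilization, miss?)

u, missed = dhall_example(m=4, eps=0.01)
print(u, missed)    # utilization ~1.07 on 4 CPUs, yet a deadline is missed
```

Shrinking eps pushes the utilization toward 1 while the miss persists,
which is why the global-EDF utilization bound must be set so low.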

== Related Work ==

Recently, I've been working on a pfair-based scheduler (PD^2, to be
exact), but this failed for several reasons [2], the most prominent
being its sensitivity to timer inaccuracies and the very frequent task
preemptions. Pfair has several good qualities: it reduces jitter,
scales to many cores and achieves high sustained utilization. However,
the benefits do not outweigh the added overhead and the strict
requirements placed on the timer infrastructure.

This latter point is what haunts EDF on multi-core platforms. Global
EDF-US[1/2] cannot exceed a utilization of (m+1)/2; standard EDF is much
worse. Partitioned EDF can reach higher limits, but is very susceptible
to the bin-packing heuristics. Going fully partitioned also introduces
other issues, like the need for load balancing and more complex
deadline-inheritance logic. However, one clear benefit of EDF is that it
minimizes the number of task switches, clearly something desirable.

== Proposed algorithm ==

So, with this in mind, my motivation was to find a way to extend a
deadline-driven scheduler to battle Dhall's effect. This can be done if
you look at time in a more general sense than just deadlines. What you
must do is look at how the expected computation time needed by a task,
relative to that task's deadline, compares to that of other tasks.

=== Notation ===

- Take a set of tasks with corresponding attributes.  This set and their
  attributes are called the schedule, 'S' and contains *all* tasks for
  the given scheduling class (i.e. all EFF-tasks).

- Consider a multi-core system with 'm' processors.

- Let the i'th task in the schedule be denoted tau_i. [3]

- Each task will run in intervals, each 'round' is called a job. A task
  consists of an infinite sequence of jobs. The k'th job of tau_i is
  called tau_{i,k}

- Each task has a set of (relative) attributes supplied when the task is
  inserted into the scheduler (passed via syscall)
  * Period T_i
  * Deadline D_i
  * WCET C_i

- Each job (tau_{i,k}) has absolute attributes (computed from the relative
  task attributes coupled with physical time).
  * Release-time r_{i,k}
  * Deadline d_{i,k}
  * Allocated time so far for a job, C_a(t, tau_{i,k})
    When C_a equals the WCET, the job's budget is exhausted and it should
    start a new cycle. This is tested (see below) by the scheduler.
  * Remaining time for the job, C_r(t, tau_{i,k}) = C_i - C_a(t, tau_{i,k})

- The acceptance function for EFF screens new tasks on their expected
  utilization. Depending on the mode and implementation, it can be based
  on the period or on the deadline. The latter imposes firmer
  constraints, but may lead to wasted resources.

	U = C_i / T_i		For SRT (bounded deadline tardiness)
	U = C_i / D_i		For HRT

- A relative measure, time to failure, ttf, indicates how much time is
  left before a job must be scheduled to run in order to avoid a
  deadline-miss. This will decrease as time progresses and the job is
  not granted CPU time. For tasks currently running on a CPU, this value
  will be constant.

	Take a job with a WCET of 10 ms that has been allowed to run for
	4 ms so far, with the deadline 8 ms away. It has 6 ms of work
	left, so the job must be scheduled to run within the next 2 ms,
	otherwise it will not be able to finish in time.

- An absolute value, time of failure (tof), can also be computed in a
  static manner. For tasks not running on a CPU, the allocated time is
  static. That means you can take the absolute deadline, subtract the
  remaining time (C_r) and you have the absolute point in time when a
  given job will fail to meet its deadline.
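The acceptance test above boils down to a utilization-sum check. Here is
my reading of it as a Python sketch (function and field names are made
up for illustration, and the interface in the kernel would of course
look nothing like this):

```python
# Sketch of the EFF acceptance test: a new task is admitted only if the
# total utilization of all EFF tasks stays within the number of CPUs m.
# U = C/T for SRT (bounded deadline tardiness), U = C/D for HRT.

def utilization(task, hard_rt):
    C, T, D = task["wcet"], task["period"], task["deadline"]
    return C / D if hard_rt else C / T

def admit(tasks, new_task, m, hard_rt=False):
    total = sum(utilization(t, hard_rt) for t in tasks)
    return total + utilization(new_task, hard_rt) <= m

tasks = [{"wcet": 2, "period": 10, "deadline": 5}]
new = {"wcet": 3, "period": 6, "deadline": 4}
print(admit(tasks, new, m=2))                 # True:  0.2 + 0.5  <= 2
print(admit(tasks, new, m=1, hard_rt=True))   # False: 0.4 + 0.75 >  1
```

The D_i-based test rejects the task on one CPU while the T_i-based test
accepts it on two, which is the "firmer constraints" trade-off mentioned
above.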

=== Outline of scheduler ===

Store tasks in 2 queues. One of size m, containing all the tasks
currently running on the CPUs (queue R). The other will hold all
currently active tasks waiting to execute (queue W).

Queue R is sorted on ttf (time to failure, the relative time left
until a task will miss its deadline). Since a running task approaches
its absolute time of failure at the same rate as C_a increases, its ttf
is constant. R is only a 'map' of tasks to the CPUs; position 0 in R
(i.e. smallest ttf) does not correspond to CPU#0, as the position->CPU
mapping will be quite fluid.

Queue W is sorted on absolute time of failure (tof). Since this is
a fixed point in time and the tasks in W are not running (C_a is
unchanged), this value is constant.
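The ttf/tof values that order these two queues follow directly from the
job attributes defined above. A small sketch (field names are mine, for
illustration):

```python
# ttf/tof as used to order queues R and W.  Remaining work is
# C_r = WCET - C_a; a job must be running no later than its deadline
# minus its remaining work, or it cannot finish in time.

def time_to_failure(job, now):
    """Relative: how long until the job *must* be on a CPU."""
    remaining = job["wcet"] - job["allocated"]        # C_r
    return (job["deadline"] - now) - remaining

def time_of_failure(job):
    """Absolute: fixed while the job is not running (C_a constant)."""
    remaining = job["wcet"] - job["allocated"]
    return job["deadline"] - remaining

# The notation example: WCET 10 ms, 4 ms consumed, deadline at t = 8 ms.
job = {"wcet": 10, "allocated": 4, "deadline": 8}
print(time_to_failure(job, now=0))   # 2: slack left before a miss
print(time_of_failure(job))          # 2: must be running by t = 2
```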

When a task is scheduled to run, a timer is set at the point in time
where it will have exhausted its budget (t_now + WCET - C_a). This is to
ensure that a runaway task does not hog the CPU.

When a new task arrives, it is handled according the following rules:
- The system has one or more CPUs not running EFF-tasks. Pick any of the
  free CPUs and assign the new job there, and set the budget timer as
  described above.
- All CPUs are busy, the new task has greater time to failure than the
  head of W. The task is inserted into W at the appropriate place.

- All CPUs are busy and the new task has smaller time to failure than
  the head of W. The new task is compared to the tail of R (the running
  task with the largest time to failure). If the new task's time to
  failure is larger than the tail's, it is added to the head of W.

- If all CPUs are busy and the new task's time to failure is smaller
  than that of the tail of R, the new task is a candidate for insertion.
  At this point the two tasks must be compared to see if picking one or
  the other will cause a deadline-miss. If each would miss its deadline
  if the other were scheduled, keep the existing task running and place
  the new one at the head of W (as you will have a deadline-miss anyway,
  unless the task is picked up by another CPU soon).

- A task running on a CPU with ttf=0 should *never* be preempted by
  another task. If all tasks in R have ttf=0 and a newly arrived task
  has ttf=0, a deadline-miss is inevitable, and switching tasks will
  only waste resources.

When a task in R finishes (or is stopped because it hit the timer
limit), it is removed from R, and the head of W is moved to R, inserted
at the appropriate place.

There has been some discussion lately (in particular on #linux-rt)
about bandwidth inheritance (BWI) and the proxy execution protocol
(PEP). It should be possible to extend EFF to handle both. As a side
note, if anyone has some good information about PEP, I'd like a copy :)

Based on this, I think the utilization can be set as high as m
(i.e. full utilization of all CPUs), but the jitter can probably be
quite bad, so for jitter-sensitive tasks a short period/deadline should
be used.

There are still some issues left to solve, for instance how best to
handle sporadic tasks, and whether deadline-misses should be allowed
or only 'bounded deadline tardiness'. Either way, EFF should be able to
handle it. Then there are problems concerning the blocking of tasks.
One solution would be BWI or PEP, but I have not had the time to read
properly through those; from what I've gathered, a combination of BWI
and PEP looks promising (anyone with good info about BWI and PEP - feel
free to share! (-: ).



	Henrik


1) Before you freeze at 'global' and get all caught up on "This won't
   ever scale to X", or "He will be haunted by Y" - I do not want to
   extend a global algorithm to 2000 cores. I would like to scale to a
   single *chip* and then we can worry about 2-way and 4-way systems
   later. For the record, I've donned my asbestos suit anyway.

2) http://austad.us/kernel/thesis_henrikau.pdf

3) Anyone want to include LaTeX-notation into an email-rfc?



-- 
 -> henrik

* Re: RFC for a new Scheduling policy/class in the Linux-kernel
@ 2009-07-16 17:54 Raj Rajkumar
  0 siblings, 0 replies; 82+ messages in thread
From: Raj Rajkumar @ 2009-07-16 17:54 UTC (permalink / raw)
  To: linux-kernel; +Cc: Raj Rajkumar

Dear Jim, Ted and others:

Let me first introduce myself.  I work on real-time systems at Carnegie 
Mellon Univ and am one of the authors on the papers that introduced the 
priority inheritance and priority ceiling protocols, and some of their 
follow-ons.  I was also a co-founder of TimeSys in 1996.  My PhD 
student, Karthik Lakshmanan, and an SEI Sr. Staff Member, Dr. Dionisio 
de Niz, and I have begun to revisit multiprocessor scheduling and 
synchronization thanks in part to the surging interest in multicore 
processors.

I just came to know about this message thread.  I hope, as Jim did, that 
I can add something to the ongoing discussion - if not, please bear with me.

First of all, let me agree completely with Jim that simplicity goes a 
very long way.  Two relevant examples follow.  First, fixed-priority 
scheduling has dominated most real-time systems over dynamic-priority 
scheduling, since (a) the practical performance differences are not 
significant, (b) overheads are lower, and (c) perhaps above all, less 
information is required.  Secondly, priority ceiling/ceiling-emulation 
protocols have much better properties than basic priority inheritance 
protocols, but the latter are more widely used.  The reason appears to 
be that PCP requires additional information, which can also change at 
run-time.  Overall, simplicity does win.

Secondly, let me fully agree with Ted in that Linux-RT extensions should 
support bandwidth control & isolation.   My group's work on "reserves" 
and "resource kernels" looked at isolating both temporal and spatial 
resources (please see 
http://www.ece.cmu.edu/~raj/publications.html#ResourceKernels) in the 
context of Linux.  Karthik's work extended Linux/RK to distributed 
systems (Distributed Linux/RK).

Thirdly, I also agree with Jim that non-preemptive locking and FIFO 
queueing are fine for mutexes in the kernel.  The primary reason is that 
critical sections in the kernel are written by experts and tend to be 
short.  And as it turns out, this is exactly how Linux SMP support has 
been for quite a few years.  

It looks to me like Jim and Bjoern name the kernel-mutex locking scheme 
(of non-preemption and FIFO queueing) as FMLP and advocate it for 
user-level mutexes.  Jim: Please correct me if my interpretation is 
incorrect.

I would like to propose a solution with two components.  First, priority 
inheritance protocols not only prevent (potentially) unbounded priority 
inversion but also offer less blocking for higher-priority tasks, which 
typically have tighter timing constraints.  It is also well known that 
non-preemptive execution is an extreme/simplified form of the priority 
ceiling protocol, in which every task is assumed to access every shared 
resource/mutex and hence every priority ceiling is the highest priority 
in the system.  There are cases when this is fine (e.g. when all 
critical sections are *relatively* small, as in the Linux kernel) and 
there are cases when it is not (e.g. when some user-level critical 
sections accessed only by lower-criticality tasks are *relatively* long 
compared to the timing constraints of higher-priority tasks).

Component 1: Let a priority ceiling be associated with each user-level 
mutex.   When the mutex is locked, the corresponding critical section is 
executed at the priority ceiling value.   The developer can choose this 
to be the highest priority in the system in which case it becomes a 
non-preemptive critical section.    In addition, we could allow mutexes 
to either pick basic priority inheritance (desirable for local mutexes?) 
or the priority ceiling version (desirable for global mutexes shared 
across processors/cores).
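For user-level mutexes, POSIX already exposes both choices via
pthread_mutexattr_setprotocol() with PTHREAD_PRIO_INHERIT or
PTHREAD_PRIO_PROTECT (plus pthread_mutexattr_setprioceiling() for the
ceiling value). The ceiling-emulation logic of Component 1 can be
sketched as follows (class and field names are mine, purely for
illustration; contention handling is omitted):

```python
# Sketch of Component 1: while a task holds the mutex it executes at
# the mutex's priority ceiling.  Higher number = higher priority.

class CeilingMutex:
    def __init__(self, ceiling):
        self.ceiling = ceiling
        self.owner = None

    def lock(self, task):
        assert self.owner is None, "sketch ignores contention"
        self.owner = task
        task.saved_prio = task.prio
        task.prio = max(task.prio, self.ceiling)   # boost to the ceiling

    def unlock(self, task):
        task.prio = task.saved_prio                # restore on release
        self.owner = None

class Task:
    def __init__(self, prio):
        self.prio = prio
        self.saved_prio = prio

HIGHEST_CEILING = 99     # a value above every application-level priority
t = Task(prio=10)
mtx = CeilingMutex(ceiling=HIGHEST_CEILING)
mtx.lock(t)
print(t.prio)    # 99: effectively non-preemptive while holding the lock
mtx.unlock(t)
print(t.prio)    # 10
```

Choosing HIGHEST_CEILING reproduces the non-preemptive case; a lower
ceiling only blocks the tasks that can actually contend for the mutex.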

Component 2: The queueing discipline for tasks pending on locked mutexes 
is the second policy under discussion.  Jim argues that it should be 
FIFO.  But imagine a bunch of lower-priority tasks stuck in front of a 
higher-priority task: the "priority inversion" time can be considerable. 
(Some, including me, would argue that FIFO queueing goes against the 
grain of using priority-based scheduling for real-time systems in the 
first place.  Why not just use FIFO scheduling?)  However, a practical 
compromise is easy to reach as well.  Let the queueing policy (FIFO or 
priority-based) on mutexes be an option.  FIFO for a "local" mutex would 
not be very helpful, and priority queueing for a "global" mutex would 
help tasks with tight timing constraints.
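A per-mutex wait-queue discipline along these lines can be sketched as
follows (the constant names follow the proposal; everything else is
illustrative, with higher number = higher priority):

```python
import heapq
from collections import deque

# Sketch of Component 2: MUTEX_FIFO_QUEUEING wakes waiters in arrival
# order; MUTEX_PRIORITY_QUEUEING wakes the highest-priority waiter
# first, with FIFO order among equal priorities.

MUTEX_FIFO_QUEUEING, MUTEX_PRIORITY_QUEUEING = 0, 1

class WaitQueue:
    def __init__(self, policy):
        self.policy = policy
        self.fifo = deque()
        self.heap = []          # (-prio, seq): seq keeps ties in FIFO order
        self.seq = 0

    def enqueue(self, prio):
        if self.policy == MUTEX_FIFO_QUEUEING:
            self.fifo.append(prio)
        else:
            heapq.heappush(self.heap, (-prio, self.seq))
            self.seq += 1

    def wake_next(self):
        if self.policy == MUTEX_FIFO_QUEUEING:
            return self.fifo.popleft()
        return -heapq.heappop(self.heap)[0]

q = WaitQueue(MUTEX_PRIORITY_QUEUEING)
for p in (3, 9, 5):
    q.enqueue(p)
print(q.wake_next())   # 9: the tight-deadline task jumps the queue
```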

Proposal Summary

   1. Associate a priority ceiling (priority at which critical sections 
are executed) with each (global) mutex.   A macro HIGHEST_CEILING could 
represent a value that is higher than any other application-level priority.
   2. Allow the developer to specify the queueing discipline on a 
(global) mutex.   MUTEX_FIFO_QUEUEING and MUTEX_PRIORITY_QUEUEING are 
allowed options.

I would appreciate any comments.  Thanks for reading through all of 
this (if you reached this far :-)

---
Raj
http://www.ece.cmu.edu/~raj


P.S.  Some relevant references from a couple of decades back.

[1] Rajkumar, R., "Real-Time Synchronization Protocols for Shared Memory 
Multiprocessors". The Tenth International Conference on Distributed 
Computing Systems, 1990.
[2] Rajkumar, R., Sha, L., and Lehoczky J.P. "Real-Time Synchronization 
Protocols for Multiprocessors". Proceedings of the
IEEE Real-Time Systems Symposium, December 1988, pp. 259-269.

Global locking is costly - global critical sections could be moved to a 
"global synchronization processor" (as described in the articles above).

