* RFC for a new Scheduling policy/class in the Linux-kernel
@ 2009-07-10 21:50 Henrik Austad
2009-07-11 18:28 ` Peter Zijlstra
0 siblings, 1 reply; 82+ messages in thread
From: Henrik Austad @ 2009-07-10 21:50 UTC (permalink / raw)
To: LKML; +Cc: Ingo Molnar, Peter Zijlstra, Bill Huey, Linux RT
Hi all!
This is a proposal for a global [1], deadline-driven scheduler for
real-time tasks in the Linux kernel. I thought I should send out an
RFC to gather some feedback instead of wildly hacking away at it.
This proposed scheduler is a modified MLLF (modified Least Laxity
First) scheduler called Earliest Failure First (EFF), as it orders
tasks according to when they will miss their deadlines rather than by
the deadlines themselves.
== Motivation and background ==
Deadlines give the developer greater flexibility and expressiveness
when creating real-time applications. Compared to a priority scheme,
this simplifies the process considerably, as it removes the need to
calculate the priorities off-line in order to find the priority-map
that orders the tasks correctly. Yet another important argument for
deadlines over priorities is systems too dynamic to analyze (a large
application with hundreds of processes/threads, or a system running
more than one rt-application).
* In very large systems it becomes very difficult to find the correct
set of priorities, even with sophisticated tools, and a slight change
requires a re-calculation of all priorities.
* In very dynamic systems it can be impossible to analyze the system
off-line, reducing the calculated priorities to best-effort only.
* If more than one application runs at the same time, this becomes even
worse.
As a final point, a schedule of tasks with their priorities is, in
almost all scenarios, a result of the deadlines of all tasks. This
also goes for non-rt tasks, even though the concept of a deadline is
fuzzier there. The problem is that this mapping is a one-way function
-- you cannot determine the deadlines from a set of priorities.
The problem is how to implement this efficiently in a priority-oriented
kernel and, more importantly, how to extend it to multi-core
systems. For single-core systems, EDF is optimal and can accept tasks
up to 100% utilization while still guaranteeing that all deadlines are
met. On multi-core, however, this breaks down, and the utilization
bound must be set very low if any guarantee is to be given (known as
"Dhall's effect").
== Related Work ==
Recently I've been working on a pfair-based scheduler (PD^2 to be
exact), but this failed for several reasons [2], the most prominent
being sensitivity to timer inaccuracies and very frequent task
preemption. pfair has several good qualities: it reduces jitter,
scales to many cores and achieves high sustained utilization. However,
the benefits do not outweigh the added overhead and the strict
requirements placed on the timer infrastructure.
This latter point is what haunts EDF on multi-core platforms. Global
EDF-US[1/2] cannot exceed a utilization of (m+1)/2, and standard
global EDF is much worse. Partitioned EDF can reach higher limits, but
is very susceptible to the bin-packing heuristics. Going fully
partitioned will also introduce other issues, like the need for load
balancing and more complex deadline-inheritance logic. However, one
clear benefit of EDF is that it minimizes the number of task-switches,
clearly something desirable.
== Proposed algorithm ==
So, with this in mind, my motivation was to find a way to extend a
deadline-driven scheduler to battle Dhall's effect. This can be done
if you look at time in a more general sense than just deadlines: what
you must look at is how the computation time a task still needs,
relative to its deadline, compares to that of the other tasks.
=== Notation ===
- Take a set of tasks with corresponding attributes. This set and their
attributes are called the schedule, 'S' and contains *all* tasks for
the given scheduling class (i.e. all EFF-tasks).
- Consider a multi-core system with 'm' processors.
- Let the i'th task in the schedule be denoted tau_i. [3]
- Each task will run in intervals, each 'round' is called a job. A task
consists of an infinite sequence of jobs. The k'th job of tau_i is
called tau_{i,k}
- Each task has a set of (relative) attributes supplied when the task is
inserted into the scheduler (passed via syscall)
* Period T_i
* Deadline D_i
* WCET C_i
- Each job (tau_{i,k}) has absolute attributes (computed from the relative
tasks-attributes coupled with physical time).
* Release-time r_{i,k}
* Deadline d_{i,k}
* Allocated time so far for a job, C_a(t, tau_{i,k})
When C_a equals the WCET, the job's budget is exhausted and it should
start a new cycle. This is tested (see below) by the scheduler.
* Remaining time for the job, C_r(t, tau_{i,k})
- The acceptance function for EFF screens new tasks on their expected
utilization. Depending on the mode and implementation, it can be based
on the period or on the deadline. The latter imposes stricter
constraints, but may lead to wasted resources.
U = C_i / T_i For SRT (bounded deadline tardiness)
U = C_i / D_i For HRT
- A relative measure, time to failure (ttf), indicates how much time is
left before a job must be scheduled to run in order to avoid a
deadline-miss. It decreases as time progresses while the job is not
granted CPU time. For tasks currently running on a CPU, this value
will be constant.
Take a job with a WCET of 10 ms that has been allowed to run for 4
ms so far, and whose deadline is 8 ms away. The job still needs 6 ms
of CPU time, so it must be scheduled to run within the next 2 ms,
otherwise it will not be able to finish in time.
- An absolute value, time of failure (tof), can also be computed
statically. For tasks not running on a CPU, the allocated time does
not change. That means you can take the absolute deadline, subtract
the remaining computation time (C_i - C_a), and you have the absolute
point in time when a given job will fail to meet its deadline.
=== Outline of scheduler ===
Store tasks in two queues: one of size m, containing all the tasks
currently running on the CPUs (queue R), and another holding all
currently active tasks waiting to execute (queue W).
queue R is sorted based on ttf (time to failure, the relative time left
until a task will miss its deadline). As a running task approaches its
absolute time of failure at the same rate as C_a increases, its ttf is
constant. R is only a 'map' of tasks to the CPUs; position 0 in R
(i.e. smallest ttf) does not imply CPU#0, as the position->CPU mapping
will be quite fluid.
queue W is sorted based on absolute time of failure (tof). Since this is
a fixed point in time, and the tasks in W are not running (C_a is
unchanged), this value is constant.
When a task is scheduled to run, a timer is set at the point in time
where it will have exhausted its budget (t_now + WCET - C_a). This
ensures that a runaway task cannot hog the CPU.
When a new task arrives, it is handled according to the following rules:
- The system has one or more CPUs not running EFF-tasks: pick any of the
free CPUs and assign the new job there, arming the budget timer
described above (t_now + WCET - C_a).
- All CPUs are busy and the new task has a greater time to failure than
the head of W: the task is inserted into W at the appropriate place.
- All CPUs are busy and the new task has a smaller time to failure than
the head of W: the new task is compared to the last task in R. If its
time to failure is larger than that of the task at the tail, it is
added to the head of W.
- All CPUs are busy and the new task's time to failure is smaller than
that of the tail of R: the new task is a candidate for insertion. At
this point the two tasks must be compared to see if picking one or the
other will cause a deadline-miss. If each would miss its deadline were
the other scheduled, keep the existing task running and place the new
one at the head of W (as you will have a deadline-miss anyway unless
the task is picked up by another CPU soon).
- A task running on a CPU with ttf=0 should *never* be preempted by
another task. If all tasks in R have ttf=0 and a newly arrived task
has ttf=0, a deadline-miss is inevitable and switching tasks would
only waste resources.
When a task in R finishes (or is stopped because its budget timer
fires), it is removed from R, and the head of W is moved into R,
inserted at the appropriate place.
There has been some discussion lately (in particular on #linux-rt)
about bandwidth inheritance (BWI) and the proxy execution protocol
(PEP). It should be possible to extend EFF to handle both. As a side
note, if anyone has some good information about PEP, I'd like a copy :)
Based on this, I think the utilization can be set as high as m
(i.e. full utilization of all CPUs), but the jitter can probably be
quite bad, so jitter-sensitive tasks should use a short
period/deadline.
There are still some issues left to solve, for instance how best to
handle sporadic tasks, and whether deadline-misses should be allowed
at all or only 'bounded deadline tardiness'. Either way, EFF should be
able to handle it. Then there are problems concerning blocking of
tasks. One solution would be BWI or PEP, but I have not had the time
to read properly through those; from what I've gathered, a combination
of BWI and PEP looks promising (anyone with good info about BWI and
PEP - feel free to share! (-: ).
Henrik
1) Before you freeze at 'global' and get all caught up on "This won't
ever scale to X", or "He will be haunted by Y" - I do not want to
extend a global algorithm to 2000 cores. I would like to scale to a
single *chip* and then we can worry about 2-way and 4-way systems
later. For the record, I've donned my asbestos suit anyway.
2) http://austad.us/kernel/thesis_henrikau.pdf
3) Anyone want to include LaTeX-notation into an email-rfc?
--
-> henrik
^ permalink raw reply [flat|nested] 82+ messages in thread* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-10 21:50 RFC for a new Scheduling policy/class in the Linux-kernel Henrik Austad @ 2009-07-11 18:28 ` Peter Zijlstra 2009-07-12 2:40 ` Douglas Niehaus ` (2 more replies) 0 siblings, 3 replies; 82+ messages in thread From: Peter Zijlstra @ 2009-07-11 18:28 UTC (permalink / raw) To: Henrik Austad Cc: LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Douglas Niehaus, Ted Baker, Dhaval Giani On Fri, 2009-07-10 at 23:50 +0200, Henrik Austad wrote: > Hi all! > > This is a proposal for a global [1], deadline driven scheduler for > real-time tasks in the Linux kernel. I thought I should send out an RFC to > gather some feedback instead of wildy hack away at it. > > This proposed scheduler is a modified MLLF (modified Least Laxity First) > called Earliest Failure First (EFF) as it orders tasks according to when > they will miss their deadlines, not when the actual deadline is. <snip> Everybody agrees we want a deadline scheduler, we'll probably put a user interface into -rt shortly which should work for all the involved research groups so that we can share tests and have better comparisons. The only thing (aside from an utter lack of time to work on things recently) that has been holding us back is a proper solution to the priority inversion issue. I haven't fully read through the proposed algorithm below, and left it in place for the new people on CC. As already mentioned on IRC, the fact that you push the work to the last possible moment indicates that this algorithm will utterly fall apart on overload and would thus be unsuited for soft-rt loads, but I guess we could implement things like EDF-fm and keep this as a hard-rt class. > === Notation === > > - Take a set of tasks with corresponding attributes. 
This set and their > attributes are called the schedule, 'S' and contains *all* tasks for > the given scheduling class (i.e. all EFF-tasks). > > - Consider a multi-core system with 'm' processors. > > - Let the i'th task in the schedule be denoted tau_i. [3] > > - Each task will run in intervals, each 'round' is called a job. A task > consists of an infinite sequence of jobs. The k'th job of tau_i is > called tau_{i,k} > > - Each task has a set of (relative) attributes supplied when the task is > inserted into the scheduler (passed via syscall) > * Period T_i > * Deadline D_i > * WCET C_i > > - Each job (tau_{i,k}) has absolute attributes (computed from the relative > tasks-attributes coupled with physical time). > * Release-time r_{i,k} > * Deadline d_{i,k} > * Allocated time so for a job, C_a(t, tau_{i,k}) > When C_a equals WCET, the jobs budget is exhausted and it should > start a new cycle. This is tested (see below) by the scheduler. > * Remaining time for the job, C_r(t, tau_{i,nk}) > > - The acceptance function for EFF screens new tasks on their expected > utilization. Depending on the mode and implementation, it can be based > on the period, or on the deadline. The latter will cause firmer > restraints, but may lead to wasted resources. > > U = C_i / T_i For SRT (bounded deadline tardiness) > U = C_i / D_i For HRT > > - A relative measure, time to failure, ttf, indicates how much time is > left before a job must be scheduled to run in order to avoid a > deadline-miss. This will decrease as time progresses and the job is > not granted CPU time. For tasks currently running on a CPU, this value > will be constant. > > Take a job with a WCET of 10ms, it has been allowed to run for 4 > ms so far. The deadline is 8 ms away. Then the job must be > scheduled to run within the next 4 ms, otherwise it will not be > able to finish in time. > > - An absolute value, time of failure (tof) can also be computed in a > static manner. 
For tasks not running on a CPU, the allocated time is > static. That means you can take the absolute deadline, subtract the > allocated time and you have the absolute point in time when a given > job will fail to meet its deadline. > > === Outline of scheduler === > > Store tasks in 2 queues. One of size m, containing all the tasks > currently running on the CPUs (queue R). The other will hold all > currently active tasks waiting to execute (queue W). > > queue R is sorted based on ttf (time to failure, the relative time left > until a task will miss it's deadline). As the tasks approaches the > absolute time of failure at the same rate C_a increases, ttf is > constant. R is only a 'map' of tasks to the CPUs. Position 0 in R > (i.e. smallest ttf) does not result in CPU#0, as the position->CPU will > be quite fluent. > > queue W is sorted based on absolute time of failure (tof). Since this is > a fixed point in time, and the tasks in W are not running (C_a is > unchanged), this value is constant. > > When a task is scheduled to run, a timer is set at the point in time > where it has exhausted it's budget (t_now + WCET - C_a). This is to > ensure that a runaway task does not grab the CPU. > > When a new task arrives, it is handled according the following rules: > - The system has one or more CPUs not running EFF-tasks. Pick any of the > free CPUs and assign the new job there. Set a timer to > > - All CPUs are busy, the new task has greater time to failure than the > head of W. The task is inserted into W at the appropriate place. > > - All CPUs are busy and the new task has smaller time to failure than > the head of W. The new task is compared to the last task in Q. If time > to failure is larger than the task at the tail, it is added to the > head of W. > > - If all CPUs are busy, and time to failure is smaller than the tail of > Q, the new task is a candidate for insertion. 
At this point the tasks > must be compared to see if picking one or the other will cause a > deadline-miss. If both will miss the deadline if the other is > scheduled, keep the existing running and place the new at the head of > W (as you will have a deadline-miss anyway unless the the task is > picked up by another CPU soon). > > - A task running on a CPU with ttf=0 should *never* be preempted with > another task. If all tasks in R have ttf=0, and a newly arrived task > has ttf=0, a deadline-miss is inevitable and switching tasks will only > waste resources. > > When a task in R finish (or is stopped due to the timer-limit), it is > removed from R, and the head of W is added to R, inserted at the > appropriate place. > > It has been some discussion lately (in particular on #linux-rt) about > the bandwidth inheritance (BWI) and proxy execution protocol (PEP). It > should be possible to extend EFF to handle both. As a side note, if > anyone has some good information about PEP, I'd like a copy :) > > Based on this, I think the utilization can be set as high as M > (i.e. full utilization of all CPUs), but the jitter can probably be > quite bad, so for jitter-sensitive tasks, a short period/deadline should > be used. > > There are still some issues left to solve, for instance how to best > handle sporadic tasks, and whether or not deadline-miss should be allow, > or just 'bounded deadline tardiness'. Either way, EFF should be able to > handle it. Then, there are problems concerning blocking of tasks. One > solution would be BWI or PEP, but I have not had the time to read > properly through those, but from what I've gathered a combination of BWI > and PEP looks promising (anyone with good info about BWI and PEP - feel > free to share! (-: ). Our SSSUP friends have a BWI paper here: http://retis.sssup.it/~tommaso/publications/OSPERT-2008.pdf The thing we call PEP was christened so by Douglas Niehaus (on CC), I'm not sure if he has any papers on it. 
Also, when talking about it at OSPERT last week Ted Baker (also on CC) said it reminded him of something else of which I seem to have forgotten the name. Thing is, both BWI and PEP seems to work brilliantly on Uni-Processor but SMP leaves things to be desired. Dhaval is currently working on a PEP implementation that will migrate all the blocked tasks to the owner's cpu, basically reducing it to the UP problem. > 1) Before you freeze at 'global' and get all caught up on "This won't > ever scale to X", or "He will be haunted by Y" - I do not want to > extend a global algorithm to 2000 cores. I would like to scale to a > single *chip* and then we can worry about 2-way and 4-way systems > later. For the record, I've donned my asbestos suit anyway. My preferred approach here is to find a distributed algorithm that converges to the global one. > 2) http://austad.us/kernel/thesis_henrikau.pdf > > 3) Anyone want to include LaTeX-notation into an email-rfc? Not unheard of ;-) ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-11 18:28 ` Peter Zijlstra @ 2009-07-12 2:40 ` Douglas Niehaus 2009-07-12 15:31 ` Peter Zijlstra [not found] ` <002301ca0403$47f9d9d0$d7ed8d70$@tlh@comcast.net> 2009-07-12 6:17 ` Henrik Austad 2009-07-13 9:55 ` Raistlin 2 siblings, 2 replies; 82+ messages in thread From: Douglas Niehaus @ 2009-07-12 2:40 UTC (permalink / raw) To: Peter Zijlstra Cc: Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group Peter: Perhaps you could expand on what you meant when you said: Thing is, both BWI and PEP seems to work brilliantly on Uni-Processor but SMP leaves things to be desired. Dhaval is currently working on a PEP implementation that will migrate all the blocked tasks to the owner's cpu, basically reducing it to the UP problem. What is left to be desired with PEP on SMP? I am not saying it is perfect, as I can think of a few things I would like to improve or understand better, but I am curious what you have in mind. Absent a clearer idea of what you had in mind, I can certainly discuss the tradeeoffs Noah and I have considered over time, and which we think motivates our approach. When Noah and I have talked about this topic over the quite extended time, several years, we have been working on it, there have always seemed two choices: 1) Move the proxy (the resource owner) to the CPU with the blocked task 2) Move the "scheduling profile" of the blocked task to the CPU where the proxy is. For Proxy Execution under Group Scheduling we have considered both over time. Consider the situation where thread A on CPU0 is blocked on a resource held by thread B on CPU1. When we considered (1), it has the advantage of ensuring that B will run on CPU0, unblocking A, if A (or B) is still the best choice at the time it has been successfully moved from CPU1 -> CPU0. 
That might not be true after the delay of moving the process. We decided to emphasize (2) because it was more interesting in our view because it was cheaper and seemed no more complicated although its complications are different than (1). Its complication is, of course, that while we have worked out how to add the "avatar" of A to the set considered by the GS hierarchy on CPU1, it depends on the scheduling semantics as configured whether the avatar of A is chosen as "best" and thus how long it will be until B runs long enough to release the resource and unblock A on CPU1. We have always viewed that as complicated, but appropriately so to the problem. It depends inherently on the semantics of threads A, B, and all other threads on CPU1 that are ready to run, among whom the "best" is chosen by the GS hierarchy. We think it is inherently the problem of the scheduling configuration to take this trade-off into account. We have also thought being able to do both (1) and (2) is best, but which is best to use in a given situation depends on the comparative cost of (X) running B on CPU1 long enough to unblock A and (Y) the cost of moving B from CPU1->CPU0 to run long enough to unblock A, and then move it back from CPU0->CPU1 since its designed CPU assigned is on CPU1. Our decision after many hours of discussion over many months has been that the cost of (X) seems a lot more attractive than (Y). Part of our preference is that we are still working with semaphores as resources. Since most critical sections are supposed to be "short", then scheduling semantics taking the proxy execution periods into account on the "foreign" CPUs would actually be easier/better than the double thread movement. Both problems seem quite hard and I do not think I have yet "completely figured it out". While the "mass migration" you say Dhaval is working on would "reduce" the problem to the UP case, I think it would create more complexity for analysis than it eliminates. 
A form of thrashing seems a real danger. In this case, that threads would be moving from CPU to CPU so much it would be a real drain on resources and constraint on system performance. However, Dhaval my well understand the cost implications of thread migration better than I do. The real core of the problem, it seems to me, is that the proxy relationship among threads depends on what resources can be held by them. I think that problem is "relatively easy" in the set of locks associated with a multi-threaded application. When the resources causing blocking can be *any* lock in the kernel associated with *any* system service that might be used by *any* thread is is complicated enough to make my brain hurt. However, we thing the GS framework makes it relatively easy, perhaps that would be better said as "as easy as it can be", to implement any combination of thread migration and avatars desired by a given scheduling semantics. In that sense Noah and I feel that GS is a "complete" framework in that it is possible to configure any semantics desired as easily as it can be done by any framework. Obviously that does not resolve the question of what semantics it is best to desire for a given system which remains quite complicated and highly dependent on the specific application semantics. Noah and I thought the relatively low cost of creating the avatar was quite attractive, and so we decided on a GS configuration using it to experiment with in specifying the scheduling semantics. The first two approaches we want to experiment with are (*) to view the composite scheduling hierarchy for all CPUs as a whole, and let the avatar of A take its chances on CPU1, and (**) to view resolution of blocking as most important system wide, so we have the avatar viewed as "best" long enough for its proxy to release the resource. 
The bottom line, in out view, is that no single semantics will be viewed as either "best" or even acceptable for all applications as is the case with schedulers, so we wanted to emphasize configurability. We have performed basic tests showing the avatars can be chosen and resolve the blocking relationship. More complex tests await the completion of our port of GS and the other KUSP subsystems to 2.6.29. Doug Peter Zijlstra wrote: > On Fri, 2009-07-10 at 23:50 +0200, Henrik Austad wrote: > >> Hi all! >> >> This is a proposal for a global [1], deadline driven scheduler for >> real-time tasks in the Linux kernel. I thought I should send out an RFC to >> gather some feedback instead of wildy hack away at it. >> >> This proposed scheduler is a modified MLLF (modified Least Laxity First) >> called Earliest Failure First (EFF) as it orders tasks according to when >> they will miss their deadlines, not when the actual deadline is. >> > > <snip> > > Everybody agrees we want a deadline scheduler, we'll probably put a user > interface into -rt shortly which should work for all the involved > research groups so that we can share tests and have better comparisons. > > The only thing (aside from an utter lack of time to work on things > recently) that has been holding us back is a proper solution to the > priority inversion issue. > > I haven't fully read through the proposed algorithm below, and left it > in place for the new people on CC. > > As already mentioned on IRC, the fact that you push the work to the last > possible moment indicates that this algorithm will utterly fall apart on > overload and would thus be unsuited for soft-rt loads, but I guess we > could implement things like EDF-fm and keep this as a hard-rt class. > > >> === Notation === >> >> - Take a set of tasks with corresponding attributes. This set and their >> attributes are called the schedule, 'S' and contains *all* tasks for >> the given scheduling class (i.e. all EFF-tasks). 
>> >> - Consider a multi-core system with 'm' processors. >> >> - Let the i'th task in the schedule be denoted tau_i. [3] >> >> - Each task will run in intervals, each 'round' is called a job. A task >> consists of an infinite sequence of jobs. The k'th job of tau_i is >> called tau_{i,k} >> >> - Each task has a set of (relative) attributes supplied when the task is >> inserted into the scheduler (passed via syscall) >> * Period T_i >> * Deadline D_i >> * WCET C_i >> >> - Each job (tau_{i,k}) has absolute attributes (computed from the relative >> tasks-attributes coupled with physical time). >> * Release-time r_{i,k} >> * Deadline d_{i,k} >> * Allocated time so for a job, C_a(t, tau_{i,k}) >> When C_a equals WCET, the jobs budget is exhausted and it should >> start a new cycle. This is tested (see below) by the scheduler. >> * Remaining time for the job, C_r(t, tau_{i,nk}) >> >> - The acceptance function for EFF screens new tasks on their expected >> utilization. Depending on the mode and implementation, it can be based >> on the period, or on the deadline. The latter will cause firmer >> restraints, but may lead to wasted resources. >> >> U = C_i / T_i For SRT (bounded deadline tardiness) >> U = C_i / D_i For HRT >> >> - A relative measure, time to failure, ttf, indicates how much time is >> left before a job must be scheduled to run in order to avoid a >> deadline-miss. This will decrease as time progresses and the job is >> not granted CPU time. For tasks currently running on a CPU, this value >> will be constant. >> >> Take a job with a WCET of 10ms, it has been allowed to run for 4 >> ms so far. The deadline is 8 ms away. Then the job must be >> scheduled to run within the next 4 ms, otherwise it will not be >> able to finish in time. >> >> - An absolute value, time of failure (tof) can also be computed in a >> static manner. For tasks not running on a CPU, the allocated time is >> static. 
That means you can take the absolute deadline, subtract the >> allocated time and you have the absolute point in time when a given >> job will fail to meet its deadline. >> >> === Outline of scheduler === >> >> Store tasks in 2 queues. One of size m, containing all the tasks >> currently running on the CPUs (queue R). The other will hold all >> currently active tasks waiting to execute (queue W). >> >> queue R is sorted based on ttf (time to failure, the relative time left >> until a task will miss it's deadline). As the tasks approaches the >> absolute time of failure at the same rate C_a increases, ttf is >> constant. R is only a 'map' of tasks to the CPUs. Position 0 in R >> (i.e. smallest ttf) does not result in CPU#0, as the position->CPU will >> be quite fluent. >> >> queue W is sorted based on absolute time of failure (tof). Since this is >> a fixed point in time, and the tasks in W are not running (C_a is >> unchanged), this value is constant. >> >> When a task is scheduled to run, a timer is set at the point in time >> where it has exhausted it's budget (t_now + WCET - C_a). This is to >> ensure that a runaway task does not grab the CPU. >> >> When a new task arrives, it is handled according the following rules: >> - The system has one or more CPUs not running EFF-tasks. Pick any of the >> free CPUs and assign the new job there. Set a timer to >> >> - All CPUs are busy, the new task has greater time to failure than the >> head of W. The task is inserted into W at the appropriate place. >> >> - All CPUs are busy and the new task has smaller time to failure than >> the head of W. The new task is compared to the last task in Q. If time >> to failure is larger than the task at the tail, it is added to the >> head of W. >> >> - If all CPUs are busy, and time to failure is smaller than the tail of >> Q, the new task is a candidate for insertion. At this point the tasks >> must be compared to see if picking one or the other will cause a >> deadline-miss. 
If both will miss the deadline if the other is >> scheduled, keep the existing running and place the new at the head of >> W (as you will have a deadline-miss anyway unless the the task is >> picked up by another CPU soon). >> >> - A task running on a CPU with ttf=0 should *never* be preempted with >> another task. If all tasks in R have ttf=0, and a newly arrived task >> has ttf=0, a deadline-miss is inevitable and switching tasks will only >> waste resources. >> >> When a task in R finish (or is stopped due to the timer-limit), it is >> removed from R, and the head of W is added to R, inserted at the >> appropriate place. >> >> It has been some discussion lately (in particular on #linux-rt) about >> the bandwidth inheritance (BWI) and proxy execution protocol (PEP). It >> should be possible to extend EFF to handle both. As a side note, if >> anyone has some good information about PEP, I'd like a copy :) >> >> Based on this, I think the utilization can be set as high as M >> (i.e. full utilization of all CPUs), but the jitter can probably be >> quite bad, so for jitter-sensitive tasks, a short period/deadline should >> be used. >> >> There are still some issues left to solve, for instance how to best >> handle sporadic tasks, and whether or not deadline-miss should be allow, >> or just 'bounded deadline tardiness'. Either way, EFF should be able to >> handle it. Then, there are problems concerning blocking of tasks. One >> solution would be BWI or PEP, but I have not had the time to read >> properly through those, but from what I've gathered a combination of BWI >> and PEP looks promising (anyone with good info about BWI and PEP - feel >> free to share! (-: ). >> > > Our SSSUP friends have a BWI paper here: > > http://retis.sssup.it/~tommaso/publications/OSPERT-2008.pdf > > The thing we call PEP was christened so by Douglas Niehaus (on CC), I'm > not sure if he has any papers on it. 
> > Also, when talking about it at OSPERT last week Ted Baker (also on CC) > said it reminded him of something else of which I seem to have forgotten > the name. > > Thing is, both BWI and PEP seems to work brilliantly on Uni-Processor > but SMP leaves things to be desired. Dhaval is currently working on a > PEP implementation that will migrate all the blocked tasks to the > owner's cpu, basically reducing it to the UP problem. > > >> 1) Before you freeze at 'global' and get all caught up on "This won't >> ever scale to X", or "He will be haunted by Y" - I do not want to >> extend a global algorithm to 2000 cores. I would like to scale to a >> single *chip* and then we can worry about 2-way and 4-way systems >> later. For the record, I've donned my asbestos suit anyway. >> > > My preferred approach here is to find a distributed algorithm that > converges to the global one. > > >> 2) http://austad.us/kernel/thesis_henrikau.pdf >> >> 3) Anyone want to include LaTeX-notation into an email-rfc? >> > > Not unheard of ;-) > > ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-12 2:40 ` Douglas Niehaus @ 2009-07-12 15:31 ` Peter Zijlstra 2009-07-13 15:44 ` Raistlin [not found] ` <002301ca0403$47f9d9d0$d7ed8d70$@tlh@comcast.net> 1 sibling, 1 reply; 82+ messages in thread From: Peter Zijlstra @ 2009-07-12 15:31 UTC (permalink / raw) To: Douglas Niehaus Cc: Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group On Sat, 2009-07-11 at 21:40 -0500, Douglas Niehaus wrote: > Peter: > Perhaps you could expand on what you meant when you said: > > Thing is, both BWI and PEP seems to work brilliantly on Uni-Processor > but SMP leaves things to be desired. Dhaval is currently working on a > PEP implementation that will migrate all the blocked tasks to the > owner's cpu, basically reducing it to the UP problem. > > What is left to be desired with PEP on SMP? I am not saying it is > perfect, as I can think of a few things I would like to improve or > understand better, but I am curious what you have in mind. Right, please don't take this as a criticism against PEP; any scheme I know of has enormous complications on SMP ;-) But the thing that concerns me most is that there seem to be a few O(n) consequences. Suppose that for each resource (or lock) R_i, there is a block graph G_i, which consists of n nodes and would be m deep. Functionally, (generalized) PIP and PEP are identical; their big difference is that PIP uses waitqueues to encode the block graph G, whereas PEP leaves everybody on the runqueue and uses the proxy field to encode the block graph G. The downside of PIP is that the waitqueue needs to re-implement the full schedule function in order to evaluate the highest prio task on the waitqueue. 
Traditionally this was rather easy, since you'd only consider the limited SCHED_FIFO static prio range, leaving you with an O(1) evaluation; when you add more complex scheduling functions things get considerably more involved. Let's call this cost S. So for PIP you get O(m*S) evaluations whenever you get a change to the block graph. Now for PEP, you get an increased O(m) cost on schedule, which can be compared to the PIP cost. However PEP on SMP needs to ensure all n tasks in G_i are on the same cpu, because otherwise we can end up wanting to execute the resource owner on multiple cpus at the same time, which is bad. This can of course be amortized, but you end up having to migrate the task (or an avatar thereof) to the owner's cpu (if you'd want to migrate the owner to the blocker's cpu, then you're quickly into trouble when there's multiple blockers), but any way around this ends up being O(n). Also, when the owner gets blocked on something that doesn't have an owner (io completion, or a traditional semaphore), you have to take all n tasks from the runqueue (and back again when they do become runnable). PIP doesn't suffer this, but does suffer the pain from having to reimplement the full schedule function on the waitqueues, which when you have hierarchical scheduling means you have to replicate the full hierarchy per waitqueue. Furthermore we cannot assume locked sections are short, and we must indeed assume that it can be any resource in the kernel associated with any service which can be used by any thread; worse, it can be any odd userspace resource/thread too, since we expose the block graph to userspace processes through PI-futexes. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-12 15:31 ` Peter Zijlstra @ 2009-07-13 15:44 ` Raistlin 2009-07-13 16:33 ` Chris Friesen ` (2 more replies) 0 siblings, 3 replies; 82+ messages in thread From: Raistlin @ 2009-07-13 15:44 UTC (permalink / raw) To: Peter Zijlstra Cc: Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari [-- Attachment #1: Type: text/plain, Size: 3937 bytes --] On Sun, 2009-07-12 at 17:31 +0200, Peter Zijlstra wrote: > But the thing that concerns me most, there seem to be a few O(n) > consequences. Suppose that for each resource (or lock) R_i, there is a > block graph G_i, which consists of n nodes and would be m deep. > > [...] > > However PEP on SMP needs to ensure all n tasks in G_i are on the same > cpu, because otherwise we can end up wanting to execute the resource > owner on multiple cpus at the same time, which is bad. > That's from where we are trying to start evaluating the various possibilities to extend the idea to SMP. What we would like to achieve is some set of rules that, extending the UP ones, yield a situation which is both: - analyzable from the real-time theorist's point of view... Which is (un?)fortunately what we are :-) - possible to implement... Which is not always (un!)fortunately obvious here :-) I hope you don't mind if I share some thoughts with you about the solution I was wondering about... In case you do, just ignore and excuse me. Very basically: from the analysis point of view one easy and effective solution would be to have the blocked-running tasks --i.e., the tasks blocked on some lock that have been left on the rq to proxy-execute the lock owner-- busy waiting while the lock owner is running. This allows for retaining a lot of nice properties BWI already has, as far as analyzability is concerned. 
On the other hand, from the practical and efficiency point of view, it would not be that difficult to block these otherwise-spinning tasks, in order to avoid wasting CPU time too much... The only important thing is to properly account the budget of the correct server/group (which probably must be the otherwise-spinning task's one), or the analysis is gone again! :-O Also, this will probably imply some amount of block/wakeup overhead which could be of some impact especially in involved, maybe rare, but possible, locking schemes... which we would like to evaluate thoroughly... > This can of course be amortized, but you end up having to migrate the > task (or an avatar thereof) to the owner's cpu (if you'd want to migrate > the owner to the blocker's cpu, then you're quickly into trouble when > there's multiple blockers), but any way around this ends up being O(n). > I agree... Computational complexity is also an issue, and something we have to validate against. > Also, when the owner gets blocked on something that doesn't have an > owner (io completion, or a traditional semaphore), you have to take all > n tasks from the runqueue (and back again when they do become runnable). > > PIP doesn't suffer this, but does suffer the pain from having to > reimplement the full schedule function on the waitqueues, which when you > have hierarchical scheduling means you have to replicate the full > hierarchy per waitqueue. > And, further than this, at least from my point of view, if you have server/group based scheduling, and in general some kind of budgeting or bandwidth enforcing mechanism in place, PIP is far from being a solution... > Furthermore we cannot assume locked sections are short, and we must > indeed assume that it can be any resource in the kernel associated with any > service which can be used by any thread; worse, it can be any odd > userspace resource/thread too, since we expose the block graph to > userspace processes through PI-futexes. > Agree, 100%. 
:-) Again, sorry for bothering you if some of you are not interested... If instead you are, any comment is more than welcome. Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ---------------------------------------------------------------------- Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy) http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-13 15:44 ` Raistlin @ 2009-07-13 16:33 ` Chris Friesen 2009-07-14 10:47 ` Raistlin 2009-07-13 17:25 ` Peter Zijlstra 2009-07-13 17:28 ` Peter Zijlstra 2 siblings, 1 reply; 82+ messages in thread From: Chris Friesen @ 2009-07-13 16:33 UTC (permalink / raw) To: Raistlin Cc: Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari Raistlin wrote: > Very basically: from the analysis point of view one easy and effective > solution would be to have the blocked-running tasks --i.e., the tasks > blocked on some lock that have been left on the rq to proxy-execute the > lock owner-- busy waiting while the lock owner is running. This allows > for retaining a lot of nice properties BWI already has, as far as > analyzability is concerned. > > On the other hand, from the practical end efficiency point of view, it > would be not that difficult to block these otherwise-spinning tasks, in > order to avoid wasting CPU time too much... The only important thing is > to properly account the budget of the correct server/group (which > probably must be the otherwise-spinning task's one), or the analysis is > gone again! :-O Could you elaborate on this "proper accounting"? If task A is blocked waiting for a lock (held by a task B on another cpu) and we run task C instead, how would you propose that the accounting be handled? Chris ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-13 16:33 ` Chris Friesen @ 2009-07-14 10:47 ` Raistlin 2009-07-14 11:03 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 82+ messages in thread From: Raistlin @ 2009-07-14 10:47 UTC (permalink / raw) To: Chris Friesen Cc: Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari [-- Attachment #1: Type: text/plain, Size: 5816 bytes --] On Mon, 2009-07-13 at 10:33 -0600, Chris Friesen wrote: > > On the other hand, from the practical and efficiency point of view, it > > would not be that difficult to block these otherwise-spinning tasks, in > > order to avoid wasting CPU time too much... The only important thing is > > to properly account the budget of the correct server/group (which > > probably must be the otherwise-spinning task's one), or the analysis is > > gone again! :-O > > Could you elaborate on this "proper accounting"? > Well, I can try, but that's exactly the trickiest part! :-O > If task A is blocked waiting for a lock (held by a task B on another > cpu) and we run task C instead, how would you propose that the > accounting be handled? > Ok, here we are. From the point of view of real-time theory, all is about having the (m, on an m-CPU system) highest priority task(s) always running. That's something a real-time theorist could kill to achieve! :-D That is basically --well, as far as I've understood it so far! :-)-- because if you are sure you always have the highest task(s) up and running, the schedulability analysis is "simple", and there are zillions of schedulability tests that can be applied and that will work more or less out of the box! 
The capability of a task to block complicates things in the sense that, if (one of) your highest priority task(s) blocks, you end up executing someone else, invalidating our previous assumption. Moreover, if also priority inversions can happen, well, guys, we are screwed! :-( I think it is quite easy to figure out that blocking protocols, e.g. PI, P(C)P, SRP, etc., act in the direction of mitigating right that kind of issue, agree? Now, if you come to server/group based scheduling[*], this is even nicer. In fact, if you quickly go through the paper Peter pointed out (which is our implementation of BWI --an incarnation of PEP, actually--) or to the more theoretical one that describes the protocol in detail[1], you will find out that having proxying in place completely removes the blocking! How is that possible? Well, when a task A blocks on B, then B executes when and *where* A should: i.e., B is temporarily added to A's group, uses its priority/deadline and *consumes* its budget. This means that A's group is always ready to run whenever it becomes the highest priority group, even if A is blocked, which is why we say there is no more room for blocking. Now, all the above is true on UP setups. Extending to SMP (m CPUs) and considering the first part of my draft proposal, i.e., having the proxying tasks busy waiting (would say "spinning", but that busy wait is interruptible, preemptable, etc.) on some CPU while their proxy is being scheduled, we are back to the case of having the m highest priority entities running... And thus we are happy! This is because, again, given the holding of that assumption, all the existing schedulability analyses automagically start working again, without the need of accounting for any blocking term or blocking time duration. 
What we like most with this is that it means you can decide the bandwidth (e.g., budget/period) you want to assign to each task group, run one of the available scheduling tests --with and only with those values!!--to see if it fits, turn the light on and _have_it_properly_enforced_, without the need of clairvoyance on task blocking patterns and durations. Moreover, if you are a hard real-time guy, and you have the knowledge of the length of your critical sections, you can use them to properly dimension the budget of your server/group... But I don't think this is the Linux case, is it? :-P And so we are done! Well, so and so. In fact, if you want (and we want! :-D) to go a step further, and consider how to remove the --possibly quite long on Linux as Peter said-- wasting of CPU time due to busy waiting, what you can do is to actually block a proxying task, instead of having it "spinning", so that some other task in some other group, which may not be one of the m highest prio ones, can reclaim that bandwidth... But... Ladies and gentlemen, here it is: BLOCKING. Again!! :-( That's where we are right now, trying to find some possible solution to avoid reintroducing from the window what we are trying to kick out from the door, striving for a good compromise between pure theory and practical viability. A draft solution could be to act as said right above, but also pretend that the correct groups/tasks have executed, such that the effects of the blocking would be, let's say, absorbed... So, yes, something similar to what you are arguing: decrease the budget of B (B's group?) when C is being run. I agree, this raises issues as well, e.g., it is another kind of "inheritance" to take care of, also, how many and which task/group's budgets are we going to affect?, and so on... But it's just a first shot in the dark. Ok, I see I've definitely written too much... Sorry about that... I hope I at least managed to make my point a little bit clearer... If not, feel free to ask again. 
As I said, we are still quite near the starting blocks with this thing, but I still am glad to share thoughts with all of you! :-) Regards, Dario [*] but a more real-timish group scheduling than the one present in Linux right at the moment, i.e., something where groups have (fixed or dynamic) priorities. [1] http://feanor.sssup.it/%7Elipari/papers/rtss2001-bwi.ps.gz -- <<This happens because I choose it to happen!>> (Raistlin Majere) ---------------------------------------------------------------------- Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy) http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 10:47 ` Raistlin @ 2009-07-14 11:03 ` Peter Zijlstra 2009-07-14 18:19 ` Raistlin 2009-07-14 14:48 ` Chris Friesen 2009-07-17 13:35 ` Giuseppe Lipari 2 siblings, 1 reply; 82+ messages in thread From: Peter Zijlstra @ 2009-07-14 11:03 UTC (permalink / raw) To: Raistlin Cc: Chris Friesen, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari On Tue, 2009-07-14 at 12:47 +0200, Raistlin wrote: > [1] http://feanor.sssup.it/%7Elipari/papers/rtss2001-bwi.ps.gz That link seems to require a user/pass combination. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 11:03 ` Peter Zijlstra @ 2009-07-14 18:19 ` Raistlin 0 siblings, 0 replies; 82+ messages in thread From: Raistlin @ 2009-07-14 18:19 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Friesen, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, Tommaso Cucinotta, Giuseppe Lipari [-- Attachment #1: Type: text/plain, Size: 990 bytes --] On Tue, 2009-07-14 at 13:03 +0200, Peter Zijlstra wrote: > On Tue, 2009-07-14 at 12:47 +0200, Raistlin wrote: > > [1] http://feanor.sssup.it/%7Elipari/papers/rtss2001-bwi.ps.gz > > That link seems to require a user/pass combination. > Oh, damn, sorry about that! I thought I downloaded the paper once from Peppe's website, but I suppose I'm wrong. Now I see there's also a disclaimer about that IEEE copyright stuff there, but I did not notice it! :-( Anyway, here are two working links to all the BWI details for UPs: http://feanor.sssup.it/~faggioli/lipari_lamastra_abeni_bwi.pdf http://feanor.sssup.it/~faggioli/rtss2001-bwi.ps.gz Sorry again and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ---------------------------------------------------------------------- Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy) http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 10:47 ` Raistlin 2009-07-14 11:03 ` Peter Zijlstra @ 2009-07-14 14:48 ` Chris Friesen 2009-07-14 15:19 ` James H. Anderson ` (2 more replies) 2009-07-17 13:35 ` Giuseppe Lipari 2 siblings, 3 replies; 82+ messages in thread From: Chris Friesen @ 2009-07-14 14:48 UTC (permalink / raw) To: Raistlin Cc: Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari Raistlin wrote: > Now, all the above is true on UP setups. Extending to SMP (m CPUs) and > considering the first part of my draft proposal, i.e., having the > proxying tasks busy waiting (would say "spinning", but that busy wait is > interruptible, preemptable, etc.) on some CPU while their proxy is being > scheduled, we are back to the case of having the m highest priority > entities running... And thus we are happy! > Well, so and so. In fact, if you want (and we want! :-D) to go a step > further, and consider how to remove the --possibly quite long on Linux as > Peter said-- wasting of CPU time due to busy waiting, what you can do is > to actually block a proxying task, instead of having it "spinning", so > that some other task in some other group, which may not be one of the m > highest prio ones, can reclaim that bandwidth... But... Ladies and > gentlemen, here it is: BLOCKING. Again!! :-( Let's call the highest priority task A, while the task holding the lock (that A wants) is called B. Suppose we're on a dual-cpu system. According to your proposal above we would have A busy-waiting while B runs with A's priority. When B gives up the lock it gets downgraded and A acquires it and continues to run. Alternately, we could have A blocked waiting for the lock, a separate task C running, and B runs with A's priority on the other cpu. 
When B gives up the lock it gets downgraded and A acquires it and continues to run. From an overall system perspective we allowed C to make additional forward progress, increasing the throughput. On the other hand, there is a small window where A isn't running and it theoretically should be. If we can bound it, would this window really cause so much difficulty to the schedulability analysis? Chris ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 14:48 ` Chris Friesen @ 2009-07-14 15:19 ` James H. Anderson 2009-07-14 16:31 ` Peter Zijlstra 2009-07-14 16:48 ` Raistlin 2009-07-15 21:45 ` Ted Baker 2 siblings, 1 reply; 82+ messages in thread From: James H. Anderson @ 2009-07-14 15:19 UTC (permalink / raw) To: Chris Friesen Cc: Raistlin, Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg Hi all, I've been reading these email messages for a few days now and wanted to make some comments... maybe what I have to say will be useful to you... maybe not. First, I thought I should introduce myself because I think some of you may not know me. I lead the LITMUS^RT project at UNC (http://www.cs.unc.edu/~anderson/litmus-rt/). I think it is important to mention this because the goals of our LITMUS^RT-related research mean that I'm coming at all of this from a rather different place than (I think) most of you. The main goal of our research has been to identify those real-time scheduling and synchronization algorithms that work best on multicore platforms. We define "work best" in terms of *schedulability* with *real overheads* from *real implementations* considered. There are two important aspects of this that will influence my comments (so take them with a grain of salt): 1. By emphasizing schedulability, we are clearly defining "real-time" to mean predictable, and not real-fast. 2. We use Linux as a platform for implementing the algorithms we test and getting overheads. It has never been among our goals to actually change mainline Linux (although this may become a goal in the future). This means that we haven't focused too much on legacy-related hindrances. I'm using the word "hindrances" with intention. 
I think in particular anything that is POSIX-based is going to be fraught with problems in a multicore world. Such legacy issues seem to be at the heart of many of the problematic scenarios you are discussing. Anyway, maybe these things are a fact of life. The first email in this thread that I was cc'ed on talked about implementing global EDF, but the discussion since has been entirely about synchronization issues. To the best of my knowledge, the only published real-time locking protocol that can be used under global EDF (and many other algorithms) without restricting critical sections (like not allowing certain resources to be accessed in a nested manner) is a protocol we designed called the flexible multiprocessor locking protocol (FMLP). Please see the following papers at http://www.cs.unc.edu/~anderson/papers.html: B. Brandenburg and J. Anderson, "Reader-Writer Synchronization for Shared-Memory Multiprocessor Real-Time Systems", Proceedings of the 21st Euromicro Conference on Real-Time Systems, pp. 184-193, July 2009. B. Brandenburg and J. Anderson, "A Comparison of the M-PCP, D-PCP, and FMLP on LITMUSRT", Proceedings of the 12th International Conference on Principles of Distributed Systems, pp. 105-124, December 2008. B. Brandenburg and J. Anderson, "An Implementation of the PCP, SRP, D-PCP, M-PCP, and FMLP Real-Time Synchronization Protocols in LITMUSRT", Proceedings of the 14th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, pp. 185-194, August 2008. B. Brandenburg, J. Calandrino, A. Block, H. Leontyev, and J. Anderson, "Real-Time Synchronization on Multiprocessors: To Block or Not to Block, to Suspend or Spin?", Proceedings of the 14th IEEE Real-Time and Embedded Technology and Applications Symposium, pp. 342-353, April 2008. A. Block, H. Leontyev, B. Brandenburg, and J. 
Anderson, "A Flexible Real-Time Locking Protocol for Multiprocessors", Proceedings of the 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, pp. 47-57, August 2007. (BTW, most papers on this page with "Bjoern Brandenburg" as a co-author are LITMUS^RT-related. I've added Bjoern to the cc list.) The underlying philosophy of our synchronization-related work is "simplicity wins". Simplicity is obviously a good thing from an implementation perspective, but it is also good from an *analysis* perspective. Correctly computing task blocking terms for multiprocessor real-time locking protocols is *really* hard. There are many mistakes in analysis presented in the literature (in fact, I think a new one was pointed out at ECRTS two weeks ago). It is really hard to nail down all the corner cases, and simple mechanisms have far fewer corner cases. In our work, we first carefully write up all blocking analysis and then have a multi-hour group meeting where 6 to 8 people review and discuss the analysis. The FMLP is really embarrassingly simple. Its main mechanisms are: 1. Require critical sections to execute non-preemptively (I know, non-preemption violates the "real-fast" religion -- I've been in these discussions before :-)). 2. Execute critical sections in FIFO order (I know, FIFO violates the "real-time" religion). 3. Deal with nested lock accesses by requiring a higher-level, coarser lock to be acquired. While this may seem "too dumb", traces we have taken of real systems (including Linux) indicate that two-level nesting is not that common and deep nesting is quite rare. Why design complicated mechanisms that are both hard to implement and analyze to deal with something that is not so common? 
In the FMLP, waiting can be implemented via either spinning or suspension (technically, critical sections are categorized as "short" or "long" and spinning is used for short CSs and suspension for long ones -- this categorization is entirely up to the user). Spinning is done non-preemptively (suspending obviously is not). The beauty of non-preemptive FIFO is that on an m-processor system, blocking behavior is easy to analyze. For example, with spin-based waiting, the blocking per critical-section access is (m-1) times the cost of one critical section. Someone remarked in an earlier email that we're really thinking of m being somewhat small (4, maybe 8). As such, this is a blocking term that is really hard to beat. As I said, simplicity wins. With suspension-based waiting, things are a bit more complicated, but not much. In our experience, the blocking analysis of any multiprocessor protocol that uses priority-inheritance-related mechanisms is a nightmare. The problem is that, to handle all the corner cases, pessimistic assumptions have to be made. This pessimism can really add up and lead to huge blocking terms. This email thread has also touched on group-based scheduling. We proposed a very simple scheme several years ago called "skipping" in the context of Pfair scheduling that makes this easy: P. Holman and J. Anderson, "Locking under Pfair Scheduling", ACM Transactions on Computer Systems, Volume 24, Number 2, pp. 140-174, May 2006. (An earlier version appeared in RTSS 2002.) In this approach, critical-section lengths must be known, and any lock request that occurs when a task doesn't have sufficient budget is simply denied -- the request is done later when that task receives additional budget. This prevents a task in one group from holding a lock while it is preempted (which could severely hurt lock-requesting tasks in other groups). This scheme is really easy to implement in conjunction with the FMLP and it doesn't require complicated budget tracking. 
Its effects on blocking terms are also easy to analyze. Thomas Nolte and colleagues (in Sweden) have written some papers where they've used skip-based locks in hierarchically-scheduled systems. I'll stop rambling now. I guess my overall point is that these legacy issues like POSIX seem to be forcing you into complex solutions. Unless there is a way to shed this baggage, I think it'll be a matter of where you put the complexity -- you'll never be able to eliminate it (and again, I think complexity is *bad*). I hope you find my comments worthwhile (and if not, sorry for sending such a long message). -Jim Anderson ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 15:19 ` James H. Anderson @ 2009-07-14 16:31 ` Peter Zijlstra 2009-07-14 16:54 ` James H. Anderson ` (2 more replies) 0 siblings, 3 replies; 82+ messages in thread From: Peter Zijlstra @ 2009-07-14 16:31 UTC (permalink / raw) To: James H. Anderson Cc: Chris Friesen, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg Hi Jim, On Tue, 2009-07-14 at 11:19 -0400, James H. Anderson wrote: > The first email in this thread that I was cc'ed on talked about > implementing global EDF, Henrik says that it's a modified Modified-Least-Laxity-First, so something like M^2-LLF, and I cc'ed you (and I meant to add Bjorn too, but now see I failed to actually do so) in the hope that you would have some thoughts on his scheduling algorithm, since that is what you do a lot of ;-) It could well be he modified M-LLF into G-EDF, but I'm behind in my reading here, so I'll have to leave that up to others. > but the discussion since has been entirely about synchronization issues. Right, this seems to be a very hot topic. > To the best of my > knowledge, the only published real-time locking protocol that > can be used under global EDF (and many other algorithms) without > restricting critical sections (like not allowing certain resources > to be accessed in a nested manner) is a protocol we designed called > the flexible multiprocessor locking protocol (FMLP). > The FMLP is really embarrassingly simple. Its main mechanisms are: > > 1. Require critical sections to execute non-preemptively (I know, > non-preemption violates the "real-fast" religion -- I've been in > these discussions before :-)). Let's simply state that industry would like to be able to deal with very short deadlines :-) > 2. 
Execute critical sections in FIFO order (I know, FIFO violates > the "real-time" religion). For those locks that are indeed non-preempt busy wait (raw_spinlock_t) we generally use a FIFO-fair implementation (x86 at least, but I thought several other platforms were looking at, or had already implemented, a similar spinlock). > 3. Deal with nested lock accesses by requiring a higher-level, > coarser lock to be acquired. While this may seem "too dumb", traces > we have taken of real systems (including Linux) indicate that > two-level nesting is not that common and deep nesting is quite rare. > Why design complicated mechanisms that are both hard to implement > and analyze to deal with something that is not so common? > > In the FMLP, waiting can be implemented via either spinning or > suspension (technically, critical sections are categorized as > "short" or "long" and spinning is used for short CSs and suspension > for long ones -- this categorization is entirely up to the user). > Spinning is done non-preemptively (suspending obviously is not). > > The beauty of non-preemptive, FIFO is that on an m-processor system, > blocking behavior is easy to analyze. For example, with spin-based > waiting, the blocking per critical-section access is (m-1) times the > cost of one critical section. Someone remarked in an earlier email > that we're really thinking of m being somewhat small (4, maybe 8). > As such, this is a blocking term that is really hard to beat. As I > said, simplicity wins. With suspension-based waiting, things are a > bit more complicated, but not much. Our problems are many (and I think you know about most if not all): - Linux is _huge_: o and while we seem to do a reasonable job of getting people to cope with basic concurrency, asking them to take RT constraints into account when doing so will surely be too much -- worse, we might not ever convince people it's a worthy goal to begin with. o auditing each and every lock in the kernel simply isn't possible.
- Linux needs to provide isolation: o we must assume userspace is hostile and try to minimize the impact of such tasks on others. - POSIX: o we're stuck with the legacy here, but we're looking forward, beyond POSIX. (I'm probably forgetting half) > In our experience, the blocking analysis of any multiprocessor protocol > that uses priority-inheritance-related mechanisms is a nightmare. The > problem is that, to handle all the corner cases, pessimistic assumptions > have to be made. This pessimism can really add up and lead to huge > blocking terms. Right, Ted holds similar opinions. Practically though, active Priority Inheritance has enormous benefits for us simple people who need to get things done :-) It has allowed us to convert this huge mass of code into something that is real-time enough for a lot of practical uses, including industrial robots and the like, without the need to analyze each and every lock out there. Now I fully appreciate the theoretical trouble full PI and its ilk give, so maybe we can do something with lockdep/stat to track and qualify the lock nesting depths and critical section lengths and use those to improve upon the situation. At worst we could use the data to qualify the usability of certain system calls/traps wrt RT applications. Also, can't we employ the PI framework to act as a debug/help-guide in validating/figuring out the PCP levels? On that, how does the priority ceiling/protection protocol extend to deadline schedulers? > This email thread has also touched on group-based scheduling. We > proposed a very simple scheme several years ago called "skipping" > in the context of Pfair scheduling that makes this easy: > In this approach, critical-section lengths must be known, and any > lock request that occurs when a task doesn't have sufficient > budget is simply denied -- the request is done later when that task > receives additional budget.
This prevents a task in one group from > holding a lock while it is preempted (which could severely hurt > lock-requesting tasks in other groups). This scheme is really easy > to implement in conjunction with the FMLP and it doesn't require > complicated budget tracking. Its effects on blocking terms are > also easy to analyze. Thomas Nolte and colleagues (in Sweden) have > written some papers where they've used skip-based locks in > hierarchically-scheduled systems. Not granting locks when the contender doesn't have enough budget to finish them seems like a very attractive solution; however, the cost of having to analyze the critical sections seems prohibitive. That said, we could maybe experiment with this for a few key lock sites. Anyway, our goal (the linux-rt folks) is to make Linux a better RT OS. I'm sure we'll never get hard enough for some people, but I'm sure we can do better than we do today. Furthermore we're limited by the existing legacy (both spec- and code-wise), but I think we have been able to make good progress through tools that help us, such as lockdep and the latency tracer (now ftrace), which help us find trouble in an active manner. And I think discussions such as these help us find new directions to explore, so carry on! ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 16:31 ` Peter Zijlstra @ 2009-07-14 16:54 ` James H. Anderson 2009-07-14 19:28 ` Henrik Austad ` (2 more replies) 2009-07-14 17:16 ` James H. Anderson 2009-07-14 19:54 ` Raistlin 2 siblings, 3 replies; 82+ messages in thread From: James H. Anderson @ 2009-07-14 16:54 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Friesen, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg, James H. Anderson On Tue, 14 Jul 2009, Peter Zijlstra wrote: > > Hi Jim, > > On Tue, 2009-07-14 at 11:19 -0400, James H. Anderson wrote: >> The first email in this thread that I was cc'ed on talked about >> implementing global EDF, > > Henrik says that it's a modified Modified-Least-Laxity-First, so > something like M^2-LLF, and I cc'ed you (and I meant to add Bjorn too, > but now see I failed to actually do so) in the hope that you would have > some thoughts on his scheduling algorithm, since that is what you do a > lot of ;-) > > It could well be he modified M-LLF into G-EDF, but I'm behind in my > reading here, so I'll have to leave that up to others. I meant to say something about this algorithm earlier that may be worthwhile. If I understand Henrik's algorithm correctly, it falls within a class of global schedulers for which bounded deadline tardiness is guaranteed (as long as total utilization doesn't exceed the system's capacity). This follows from results in: H. Leontyev and J. Anderson, "Generalized Tardiness Bounds for Global Multiprocessor Scheduling", Proceedings of the 28th IEEE Real-Time Systems Symposium, pp. 413-422, December 2007. This is a positive thing w.r.t. wanting to support soft real-time workloads.
However, global EDF also has this property and (I would think) is easier to implement (I'm just guessing here -- we haven't actually implemented any algorithm that involves laxities in LITMUS^RT). Ted Baker did some work on an algorithm called EDZL (EDF until zero laxity) that is essentially a hybrid of these two algorithms. In simulations we've done (not based on real implementations), EDZL exhibits really low tardiness (almost non-existent). EDZL may give you the benefits of using laxities where useful and be easier to implement (but that's just my guess). Regarding all these "legacy" issues, I think it would be helpful to write them all down carefully and formally, and then think about what are the *simplest* mechanisms that will do the job in the common case. It is not worthwhile (in my opinion) to introduce tons of complexity to nip at corner cases that aren't that common. Of course, this begs the question of what is "common" and what is not: something that may also be worth trying to find evidence for and writing down. Or better yet, someone needs to define a new standard for multicore... but I'm not sure how that gets done. Gotta go.... -Jim ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 16:54 ` James H. Anderson @ 2009-07-14 19:28 ` Henrik Austad 2009-07-14 19:33 ` James H. Anderson 2009-07-15 21:53 ` Ted Baker 2009-07-15 4:25 ` Bjoern B. Brandenburg 2009-07-15 20:55 ` Ted Baker 2 siblings, 2 replies; 82+ messages in thread From: Henrik Austad @ 2009-07-14 19:28 UTC (permalink / raw) To: James H. Anderson Cc: Peter Zijlstra, Chris Friesen, Raistlin, Douglas Niehaus, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg On Tuesday 14 July 2009 18:54:13 James H. Anderson wrote: > On Tue, 14 Jul 2009, Peter Zijlstra wrote: > > Hi Jim, > > > > On Tue, 2009-07-14 at 11:19 -0400, James H. Anderson wrote: > >> The first email in this thread that I was cc'ed on talked about > >> implementing global EDF, > > > > Henrik says that it's a modified Modified-Least-Laxity-First, so > > something like M^2-LLF, and I cc'ed you (and I meant to add Bjorn too, > > but now see I failed to actually do so) in the hope that you would have > > some thoughts on his scheduling algorithm, since that is what you do a > > lot of ;-) > > > > It could well be he modified M-LLF into G-EDF, but I'm behind in my > > reading here, so I'll have to leave that up to others. Using deadlines as the only measure on multi-core will fail as you must consider time (of deadline-miss) in a different way. In UP, this is given implicitly as you only have a single processor. In MC you need to do this the hard way, namely compute the point in time not when the task misses the deadline, but when it will *eventually* fail a deadline. By doing that, you combine deadline, wcet and granted time in one variable, and you have a *single* variable to compare. By doing this in a global sense, you can avoid bin-packing, load-balancing and a lot of the issues that seem to be keeping the PI-people busy.
But I think I'm going to stay away from that topic now (as I haven't had the time to give those emails their deserved attention). I think M^2-LLF would be as good a name as EFF, but I wanted to emphasize the usage of the two failure-variables (the static and the dynamic). > I meant to say something about this algorithm earlier that may be > worthwhile. If I understand Henrik's algorithm correctly, it falls > within a class of global schedulers for which bounded deadline > tardiness is guaranteed (as long as total utilization doesn't exceed > the system's capacity). The tardiness is either 0 or bounded, depending on what you want (and what you will do with tasks that do miss their deadlines when they block on locks). > This follows from results in: > > H. Leontyev and J. Anderson, "Generalized Tardiness Bounds for Global > Multiprocessor Scheduling", Proceedings of the 28th IEEE Real-Time > Systems Symposium, pp. 413-422, December 2007. Is this available outside locked portals? > This is a positive thing w.r.t. wanting to support soft real-time > workloads. However, global EDF also has this property and (I would > think) is easier to implement (I'm just guessing here -- we haven't > actually implemented any algorithm that involves laxities in > LITMUS^RT). Ted Baker did some work on an algorithm called EDZL > (EDF until zero laxity) that is essentially a hybrid of these two > algorithms. In simulations we've done (not based on real implementations), > EDZL exhibits really low tardiness (almost non-existent). EDZL may > give you the benefits of using laxities where useful and be easier > to implement (but that's just my guess) -- henrik ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 19:28 ` Henrik Austad @ 2009-07-14 19:33 ` James H. Anderson 2009-07-15 21:53 ` Ted Baker 1 sibling, 0 replies; 82+ messages in thread From: James H. Anderson @ 2009-07-14 19:33 UTC (permalink / raw) To: Henrik Austad Cc: Peter Zijlstra, Chris Friesen, Raistlin, Douglas Niehaus, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg >> This follows from results in: >> >> H. Leontyev and J. Anderson, "Generalized Tardiness Bounds for Global >> Multiprocessor Scheduling", Proceedings of the 28th IEEE Real-Time >> Systems Symposium, pp. 413-422, December 2007. > > Is this available outside locked portals? Yes, all my stuff is freely available at: http://www.cs.unc.edu/~anderson/papers.html -Jim ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 19:28 ` Henrik Austad 2009-07-14 19:33 ` James H. Anderson @ 2009-07-15 21:53 ` Ted Baker 2009-07-17 7:40 ` Henrik Austad 1 sibling, 1 reply; 82+ messages in thread From: Ted Baker @ 2009-07-15 21:53 UTC (permalink / raw) To: Henrik Austad Cc: James H. Anderson, Peter Zijlstra, Chris Friesen, Raistlin, Douglas Niehaus, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg On Tue, Jul 14, 2009 at 09:28:47PM +0200, Henrik Austad wrote: > ... In MC you need to do this the hard way, namely compute the > point in time not when the task misses the deadline, but when it > will *eventually* fail a deadline. By doing that, you combine > deadline, wcet and granted time in one variable, and you have a > *single* variable to compare. This is true in a theoretical sense, and is the basis of some "optimal" scheduling algorithms, including the "throwforward scheduling" algorithm. It makes sense in some environments, where you actually know the WCET of the task in advance. However, I don't believe a Linux system can expect all applications to provide this kind of information. In a system programmed using process and threads, the decision to sleep or wake is embedded in the internal logic of the thread, and implemented by system calls. The existing system calls do not convey how long the thread needs to execute before it reaches its next suspension point. Therefore, without a new API you cannot use WCET. If you create a new API for this, you are limiting this form of scheduling to threads that choose to use that API, and are able to provide the needed WCET information. This seems like a small number of cases among the full range of real-time Linux applications. Ted ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-15 21:53 ` Ted Baker @ 2009-07-17 7:40 ` Henrik Austad 2009-07-17 13:37 ` Ted Baker 0 siblings, 1 reply; 82+ messages in thread From: Henrik Austad @ 2009-07-17 7:40 UTC (permalink / raw) To: Ted Baker Cc: James H. Anderson, Peter Zijlstra, Chris Friesen, Raistlin, Douglas Niehaus, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg [-- Attachment #1: Type: text/plain, Size: 2149 bytes --] On Wednesday 15 July 2009 23:53:05 Ted Baker wrote: > On Tue, Jul 14, 2009 at 09:28:47PM +0200, Henrik Austad wrote: > > ... In MC you need to do this the hard way, namely compute the > > point in time not when the task misses the deadline, but when it > > will *eventually* fail a deadline. By doing that, you combine > > deadline, wcet and granted time in one variable, and you have a > > *single* variable to compare. > > This is true in a theoretical sense, and is the basis of some > "optimal" scheduling algorithms, including the "throwforward > scheduling" algorithm. It makes sense in some environments, where > you actually know the WCET of the task in advance. However, I > don't believe a Linux system can expect all applications to > provide this kind of information. Why can't you expect real-time tasks using a deadline scheduler to provide some estimate of the execution cost? How can you ever hope to run a deadline scheduler without this? > In a system programmed using process and threads, the decision to > sleep or wake is embedded in the internal logic of the thread, and > implemented by system calls. The existing system calls do not > convey how long the thread needs to execute before it reaches its > next suspension point. Therefore, without a new API you cannot > use WCET. Yes, you would need to introduce a new set of syscalls. Two, in fact.
When working with PD^2, I added 3 (as reweighting was a special case), but: sched_dl_update(pid, wcet, period, deadline) sched_dl_release(pid, abs_release_time) How can you use deadlines based on priorities? A priority is a one-way mapping of deadlines for a set of tasks. > If you create a new API for this, you are limiting this > form of scheduling to threads that choose to use that API, and are > able to provide the needed WCET information. This seems like a > small number of cases among the full range of real-time Linux > applications. Are we going to place all tasks in the kernel into rt-deadline tasks? I had the impression that we wanted a class for a special set of tasks. > Ted -- henrik [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-17 7:40 ` Henrik Austad @ 2009-07-17 13:37 ` Ted Baker 0 siblings, 0 replies; 82+ messages in thread From: Ted Baker @ 2009-07-17 13:37 UTC (permalink / raw) To: Henrik Austad Cc: James H. Anderson, Peter Zijlstra, Chris Friesen, Raistlin, Douglas Niehaus, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg, stanovic On Fri, Jul 17, 2009 at 09:40:59AM +0200, Henrik Austad wrote: > Why can't you expect real-time tasks using a deadline scheduler to provide > some estimate of the execution cost? How can you ever hope to run a deadline > scheduler without this? This depends on what you mean by a "deadline scheduler". Personally, I don't think it makes much sense for an OS to support scheduling of aperiodic "jobs" on job deadlines. If you have chunks of application work that logically should be viewed as jobs, with approximately known WCET and deadlines, the right way to handle that is to have server thread(s), with associated job queue(s) that order their own queues by job deadline. The servers are provisioned with enough CPU capacity to provide the desired level of service for the anticipated load. If load-shedding is required at times, that is the responsibility of the server, which is part of the application and has the application-specific knowledge needed to make such decisions. (Alternatively, the server could negotiate for a higher CPU share if it starts missing deadlines.) What it makes sense for the OS to provide is what I'll loosely call "CPU shares". Deadline scheduling is a good way to implement this, since the schedulability analysis is relatively clean (including max. tardiness). That is, each server thread has a "period" and is entitled to a certain budgeted amount of time each period, and the period is the deadline for getting that much CPU time.
Constant Bandwidth and Total Bandwidth are two such policies. (I recently reviewed a paper that worked the kinks out of the published Constant Bandwidth in a very nice way. If I find that it has since been published, or if there is a public pre-print technical report, I'll try to send the reference in another e-mail.) With CPU shares, we do have something like a WCET, but it is really a maximum allowed execution time. In this case, I'm not sure it makes much (any?) sense to worry about laxity, though. This should not be a hard-deadline situation (indeed, I don't think it makes sense for an OS as complicated as Linux to talk about truly hard deadlines). It may be enough to know the maximum lag or tardiness. Of course, if you want to put in special support for periodic tasks (say for sensor polling or periodic actuator output), you can do that, but to me the right model is probably not a thread. It would make more sense to me for such tasks to be implemented as event handlers. > How can you use deadlines based on priorities? A priority is a > one-way mapping of deadlines for a set of tasks. I had in mind several different ways. 1) You can preserve a range of nominal "deadlines" that are above and below the range of real deadlines used in scheduling. For example: A) reserve the maximum representable time value for tasks that should never run (suspended). This value is useful for bandwidth-limiting aperiodic server scheduling, if you really want to keep a temporarily out-of-budget server from competing with other tasks. Note that the pure constant-bandwidth server is not enough in this respect, since it would still have a deadline earlier than non-realtime tasks in the B range. B) Reserve a few values below that for tasks that are fixed-priority and lower in priority than all true deadline tasks. Some of these "deadlines" can be used for SCHED_FIFO and SCHED_RR that we want to be below the deadline-scheduler and also for SCHED_OTHER.
C) The main band of deadlines is used for real deadline scheduling. (I don't believe it would technically violate the POSIX standard to have a hidden "hole" between SCHED_FIFO and SCHED_RR values, but if there are serious objections, one could pick a priority in the middle of the RT range and say that these deadline scheduled tasks are executing at that priority.) D) Reserve a few values close to zero for tasks that are fixed-priority and higher in priority than all true deadline tasks. This is useful for SCHED_FIFO and SCHED_RR that we want to be above the deadline-scheduler, and also for hybrid EDZL and EDF scheduling algorithms. EDZL would use such a value when a task reaches zero laxity. Hybrid EDF uses such a value for tasks that have high enough processor utilizations (> 50%), to achieve a higher schedulable system utilization than pure EDF. You have lost nothing in deadline representation, since the values used for these two fixed-priority ranges are useless for real deadlines. 2) Within the range (C) ("real" deadline scheduling), you can also implement something close enough to priority to serve the purposes for which priority is typically used, by using an aperiodic-server type scheduling algorithm and giving "high priority" tasks short replenishment periods (and so shorter relative deadlines). > Are we going to place all tasks in the kernel into rt-deadline > tasks? I had the impression that we wanted a class for a special > set of tasks. I think it could be done. See my scheme above for translating everything into deadlines. Ted ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 16:54 ` James H. Anderson 2009-07-14 19:28 ` Henrik Austad @ 2009-07-15 4:25 ` Bjoern B. Brandenburg 2009-07-15 20:55 ` Ted Baker 2 siblings, 0 replies; 82+ messages in thread From: Bjoern B. Brandenburg @ 2009-07-15 4:25 UTC (permalink / raw) To: James H. Anderson Cc: Peter Zijlstra, Chris Friesen, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg, shinpei, yamasaki Quoting "James H. Anderson" <anderson@cs.unc.edu>: > Ted Baker did some work on an algorithm called EDZL > (EDF until zero laxity) that is essentially a hybrid of these two > algorithms. In simulations we've done (not based on real implementations), > EDZL exhibits really low tardiness (almost non-existent). EDZL may > give you the benefits of using laxities where useful and be easier > to implement (but that's just my guess) EDZL has nice properties but may be hard to implement efficiently because it requires (a lot of) fine-grained timers. However, Shinpei Kato and Nobuyuki Yamasaki (CC'ed) presented an EDZL-derived scheduler called EDCL at RTCSA'08 that has similar performance guarantees but only does priority promotions at the release and completion times of jobs (hence, no additional timers are required). Global EDF-based Scheduling with Efficient Priority Promotion Shinpei Kato and Nobuyuki Yamasaki Proceedings of the 14th International Conference on Embedded and Real-Time Computing Systems and Applications, pages 197-206. IEEE, 2008. http://www.ny.ics.keio.ac.jp/members/shinpei/papers/rtcsa08.pdf I know that Shinpei and friends implemented EDCL on top of LITMUS^RT, but I never got around to incorporating the patches into the "official" LITMUS^RT patch. (Sorry, Shinpei!) 
Regarding the comments about the availability of academic papers (which were also voiced repeatedly at OSPERT'09), I'd like to point out that you can find almost any recent paper by searching for the title with scholar.google.com and clicking on the "All XYZ versions" link... most researchers make PDFs available on their homepages. - Björn ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 16:54 ` James H. Anderson 2009-07-14 19:28 ` Henrik Austad 2009-07-15 4:25 ` Bjoern B. Brandenburg @ 2009-07-15 20:55 ` Ted Baker 2009-07-15 21:53 ` Chris Friesen 2009-07-16 23:29 ` Raistlin 2 siblings, 2 replies; 82+ messages in thread From: Ted Baker @ 2009-07-15 20:55 UTC (permalink / raw) To: James H. Anderson Cc: Peter Zijlstra, Chris Friesen, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg On Tue, Jul 14, 2009 at 12:54:13PM -0400, James H. Anderson wrote: > > ... Ted Baker did some work on an algorithm called EDZL > (EDF until zero laxity) that is essentially a hybrid of these two > algorithms. In simulations we've done (not based on real implementations), > EDZL exhibits really low tardiness (almost non-existent). EDZL may > give you the benefits of using laxities where useful and be easier > to implement (but that's just my guess) Yes, EDZL is a simple extension of EDF. If you have a WCET estimate for a task, you just apply EDF until/unless the task reaches zero laxity. If it reaches zero laxity, you bump it to a fixed priority level that is above all EDF tasks. It seems very good in simulations, in terms of schedulability, number of context switches, and number of task migrations. It never deviates from EDF except in cases where a deadline would otherwise be missed. Therefore all EDF schedulability tests are valid for EDZL. In addition, we have some tighter tests. EDZL never causes more than one extra context switch relative to EDF, when a task goes from positive laxity to zero laxity. This does not happen often. It is much lower overhead than least-laxity-first, since there is no possibility of getting into a time-slicing situation among tasks with equal laxity.
It is also nice because in many cases the scheduler will not know the WCET of a task, and so it is impossible to compute laxity. In those cases, you just use EDF. The two policies (EDF and EDZL) are completely compatible and interoperable. I hope that Linux will not pursue EDF or EDZL in only a narrow sense, as simple priority scheduling algorithms, without providing for bandwidth guarantees and limits. By bandwidth "guarantee" I mean that the task/scheduling entity is guaranteed to be able to at least *compete* at a certain level for a certain percentage of the CPU, if we cannot (better) provide an admission control mechanism that guarantees it will actually get a certain percentage of the CPU. By bandwidth "limit" I mean that there is an enforced bound on the percentage of processor time a task/scheduling entity may consume. For example, in the fixed-priority domain, we have Sporadic Server. This guarantees the server a chance to compete at its top priority for at least sched_ss_init_budget time in every sched_ss_repl_period time, at sched_priority, within some tolerance. It also should not be permitted to use more than sched_ss_init_budget time in every sched_ss_repl_period time, at sched_priority, within some tolerance. In the deadline scheduling domain, we have algorithms like Constant Bandwidth Server (and some improved variants) that provide similar guarantees and limits, in terms of deadlines. (I recall seeing one very good version in a recent paper, which I could seek out and provide to the list if there is interest.) These so-called "aperiodic server scheduling" algorithms are critically needed in a system like Linux that is open (in the sense of being able to host a dynamic mix of uncoordinated applications), and based on a thread programming model (as compared to the job/task models used in scheduling theory).
With the thread abstraction one generally cannot give the scheduler WCET bounds, and sometimes cannot give deadlines, since the threads will wake up and go to sleep according to their own internal logic (all the OS sees is the system calls). However, if the application is guaranteed a chance to at least compete at a high enough priority for a certain bandwidth (and, preferably, actually be guaranteed that bandwidth, by admission control), it is possible to write real-time applications. In order to be able to make such a guarantee, one must be able to limit the CPU usage of all other threads. This means, in effect, that *all* tasks/scheduling entities above a certain priority level must be scheduled according to a bandwidth-limiting algorithm. The present Linux "RT throttling" mechanism seems to be an attempt to achieve something similar (preventing RT tasks from shutting out other work, by reserving some bandwidth for non-RT tasks). However, it is done so grossly as to be unsatisfactory for real-time applications. The better way to do this would be to enforce a bandwidth limit on each task with real-time priority -- using the budget amount and replenishment interval required by that task -- and enforce by admission control that the real-time tasks do not exceed some configurable percent of the total system utilization. Of course, the POSIX standard does not recognize any budget and replenishment interval for policies other than SCHED_SPORADIC, but the same problem exists with respect to the present "RT throttling" mechanism. It would be possible to provide legal implementation-defined extensions for the operations that take a struct sched_param to look at these additional sched_param fields for other policies such as SCHED_FIFO and SCHED_OTHER, and to provide default values when they are not specified. 
The amount of capacity to be reserved for system internal use could be configured, using some combination of kernel compile-time macros, boot parameters, and the /proc interface. User defaults could be configured using extensions to the setrlimit() interface. Ted ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-15 20:55 ` Ted Baker @ 2009-07-15 21:53 ` Chris Friesen 2009-07-15 22:34 ` Ted Baker 2009-07-16 23:29 ` Raistlin 1 sibling, 1 reply; 82+ messages in thread From: Chris Friesen @ 2009-07-15 21:53 UTC (permalink / raw) To: Ted Baker Cc: James H. Anderson, Peter Zijlstra, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg Ted Baker wrote: > The present Linux "RT throttling" mechanism seems to be an attempt > to achieve something similar (preventing RT tasks from shutting > out other work, by reserving some bandwidth for non-RT tasks). > However, it is done so grossly as to be unsatisfactory for > real-time applications. The better way to do this would be to > enforce a bandwidth limit on each task with real-time priority -- > using the budget amount and replenishment interval required by > that task -- and enforce by admission control that the real-time > tasks do not exceed some configurable percent of the total system > utilization. From an API standpoint, the "group scheduling" functionality in Linux allows for the creation of an arbitrary hierarchy of groups, each of which may contain zero or more tasks. (Each task is associated with exactly one group.) There is a distinction between groups containing realtime tasks, and groups containing non-realtime tasks. For realtime groups, each group is allocated a specific amount of cpu time. For non-realtime groups, each group is allocated a specific weight. A realtime group may use up to its specified amount of cpu time. Any cpu time not used by a realtime group is distributed to the non-realtime groups according to their relative weights.
This does add a whole different API to the mix, but allows for controls to be set by the administrator on existing POSIX apps without needing to recompile them. Chris ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-15 21:53 ` Chris Friesen @ 2009-07-15 22:34 ` Ted Baker 2009-07-15 22:39 ` Dhaval Giani 0 siblings, 1 reply; 82+ messages in thread From: Ted Baker @ 2009-07-15 22:34 UTC (permalink / raw) To: Chris Friesen Cc: James H. Anderson, Peter Zijlstra, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg On Wed, Jul 15, 2009 at 03:53:33PM -0600, Chris Friesen wrote: > >From an API standpoint, the "group scheduling" functionality in linux > allows for the creation of an arbitrary hierarchy of groups, each of > which may contain zero or more tasks. (Each task is associated with > exactly one group.) > > There is a distinction between groups containing realtime tasks, and > groups containing non-realtime tasks. For realtime groups, each group > is allocated a specific amount of cpu time. For non-realtime groups, > each group is allocated a specific weight. > > A realtime group may use up to its specified amount of cpu time. Any > cpu time not used by a realtime group is distributed to the non-realtime > groups according to their relative weights. > > This does add a whole different API to the mix, but allows for controls > to be set by the administrator on existing POSIX apps without needing to > recompile them. This is in the right direction, but there is a lot about Linux groups that I either do not understand or which falls short of what is needed. Perhaps you can point me to an up to date detailed explanation of how they work? >From what I've been able to infer from my brief foray into that part of the kernel code (a year ago), there seemed to be several aspects of the group scheduling that did not seem to admit schedulability analysis. (I admit that I may have read it wrong, and these statements are false.) 
1) The priority of a group seemed to be defined by the priority of the highest-priority thread in the group's run-queue, which means it varies dynamically according to which threads in the group are contending.

2) Budget enforcement seemed to only occur at system tick boundaries, which means precision can only be achieved at the cost of frequent clock interrupts.

3) It seemed that a thread could belong to more than one group, and so could distribute charges arbitrarily between groups. If so, budget allocation would seem very difficult.

4) On an SMP, more than one thread could be running against the same budget at the same time, resulting in budget over-charges.

I am particularly concerned about the latter. The published analyses of hierarchical generalizations of bandwidth limiting/guaranteeing aperiodic server scheduling algorithms I have seen so far all seem to require allocating bandwidth/budget to groups on a per-processor basis, so as to avoid concurrent charges to the same budget. Ted ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-15 22:34 ` Ted Baker @ 2009-07-15 22:39 ` Dhaval Giani 2009-07-15 23:16 ` Ted Baker 0 siblings, 1 reply; 82+ messages in thread From: Dhaval Giani @ 2009-07-15 22:39 UTC (permalink / raw) To: Ted Baker Cc: Chris Friesen, James H. Anderson, Peter Zijlstra, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg On Thu, Jul 16, 2009 at 4:04 AM, Ted Baker<baker@cs.fsu.edu> wrote: > On Wed, Jul 15, 2009 at 03:53:33PM -0600, Chris Friesen wrote: > >> >From an API standpoint, the "group scheduling" functionality in linux >> allows for the creation of an arbitrary hierarchy of groups, each of >> which may contain zero or more tasks. (Each task is associated with >> exactly one group.) >> >> There is a distinction between groups containing realtime tasks, and >> groups containing non-realtime tasks. For realtime groups, each group >> is allocated a specific amount of cpu time. For non-realtime groups, >> each group is allocated a specific weight. >> >> A realtime group may use up to its specified amount of cpu time. Any >> cpu time not used by a realtime group is distributed to the non-realtime >> groups according to their relative weights. >> >> This does add a whole different API to the mix, but allows for controls >> to be set by the administrator on existing POSIX apps without needing to >> recompile them. > > This is in the right direction, but there is a lot about Linux groups > that I either do not understand or which falls short of what is needed. > Perhaps you can point me to an up to date detailed explanation of > how they work? > > From what I've been able to infer from my brief foray into that > part of the kernel code (a year ago), there seemed to be several > aspects of the group scheduling that did not seem to admit > schedulability analysis. 
(I admit that I may have read it wrong, > and these statements are false.) > > 1) The priority of a group seemed to be defined by the priority of > the highest-priority thread in the group's run-queue, which means > it varies dynamically according to which threads in the group are > contending. > This is true, but it also ensures that the time allocated to the group is also consumed by group if it wants to. > 2) Budget enforcement seemed to only occur at system tick > boundaries, which means precision can only be achieved at the cost > of frequent clock interrupts. > > 3) It seemed that a thread could belong to more than one group, > and so distributed charges arbitrarily between groups. > If so, budget allocation would seem very difficult. > No. A thread can only be a part of one group at a time. > 4) On an SMP, more than one thread could be running against > the same budget at the same time, resulting in budget over-charges. > The rt group scheduler does split the budget per cpu. On expiring the budget, it tries to borrow from other CPUs if possible. Thanks, Dhaval -- Spike Milligan - "All I ask is the chance to prove that money can't make me happy." - http://www.brainyquote.com/quotes/authors/s/spike_milligan.html ^ permalink raw reply [flat|nested] 82+ messages in thread
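[Editorial aside: Dhaval's description of per-CPU budget splitting and borrowing can be sketched as a toy model. This is illustrative only, not the kernel's actual borrowing code, and it hard-wires the "transfer half of the unused budget, rounding up" behaviour that Peter describes later in the thread.]

```python
def borrow_budget(budgets, cpu):
    """Toy model of per-CPU RT budget borrowing (illustrative only; the
    real kernel logic differs in detail).

    budgets maps cpu -> remaining runtime (ms). An exhausted cpu takes
    half (rounded up) of every other cpu's unused budget in its domain.
    """
    assert budgets[cpu] == 0, "only an exhausted cpu needs to borrow"
    for other in budgets:
        if other == cpu or budgets[other] == 0:
            continue
        share = (budgets[other] + 1) // 2   # half, rounded up
        budgets[other] -= share
        budgets[cpu] += share
    return budgets[cpu]
```

Because of the round-up, repeated borrowing drains the lenders to zero, so a single busy cpu can eventually consume the whole group-wide budget (e.g. 400ms on a 2-cpu domain configured with 200ms per cpu), which is exactly the property Ted questions below with respect to schedulability analysis.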
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-15 22:39 ` Dhaval Giani @ 2009-07-15 23:16 ` Ted Baker 2009-07-16 8:58 ` Peter Zijlstra 2009-07-16 12:17 ` Raistlin 0 siblings, 2 replies; 82+ messages in thread From: Ted Baker @ 2009-07-15 23:16 UTC (permalink / raw) To: Dhaval Giani Cc: Chris Friesen, James H. Anderson, Peter Zijlstra, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg

> > 1) The priority of a group seemed to be defined by the priority of
> > the highest-priority thread in the group's run-queue, which means
> > it varies dynamically according to which threads in the group are
> > contending.
>
> This is true, but it also ensures that the time allocated to the group
> is also consumed by group if it wants to.

I don't see how schedulability analysis can be done with this model, since a single budget is being expended at varying priorities/deadlines.

> > 4) On an SMP, more than one thread could be running against
> > the same budget at the same time, resulting in budget over-charges.
>
> The rt group scheduler does split the budget per cpu. On expiring the
> budget, it tries to borrow from other CPUs if possible.

First, how is the splitting of the budget between CPUs controlled by the application?

Second, I don't see how schedulability analysis could be done if CPUs can "borrow" budget from other CPUs, unless there is some mechanism in place to "pay it back". How do you do the analysis? Ted ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-15 23:16 ` Ted Baker @ 2009-07-16 8:58 ` Peter Zijlstra 2009-07-16 9:11 ` Peter Zijlstra ` (2 more replies) 2009-07-16 12:17 ` Raistlin 1 sibling, 3 replies; 82+ messages in thread From: Peter Zijlstra @ 2009-07-16 8:58 UTC (permalink / raw) To: Ted Baker Cc: Dhaval Giani, Chris Friesen, James H. Anderson, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg On Wed, 2009-07-15 at 19:16 -0400, Ted Baker wrote: > > > 1) The priority of a group seemed to be defined by the priority of > > > the highest-priority thread in the group's run-queue, which means > > > it varies dynamically according to which threads in the group are > > > contending. > > > > > > > This is true, but it also ensures that the time allocated to the group > > is also consumed by group if it wants to. > > I don't see how schedulability analysis can be done with this model, > since a single budget is being expended at varying priorities/deadlines. > > > > 4) On an SMP, more than one thread could be running against > > > the same budget at the same time, resulting in budget over-charges. > > > > > > > The rt group scheduler does split the budget per cpu. On expiring the > > budget, it tries to borrow from other CPUs if possible. > > First, how is the splitting of the budget between CPU's controlled > by the application? > > Second, I don't see how schedulabiliyt analysis could be done if > CPU's can "borrow" budget from other CPUs, unless there is some > mechanism in place to "pay it back". How do you do the analysis? Right so control-groups (cgroups for short) are a form of virtualization. Each controller is specific to a resource. We have memory controllers, namespace controllers and also a scheduler controller. 
If you would apply all controllers to the same task groups you get a result like chroot on steroids, or jails etc. But you can also use them individually to control resources in creative ways.

In order to manage RT resources you want:

- a minimum bandwidth guarantee
- isolation

So ideally you want a CBS server that schedules your RT (FIFO/RR) tasks.

Now, let me first state that the current code is a hack, and I know it's nowhere near proper. But it was the best I could come up with on short notice -- and Fabio is now looking at doing better :-) Furthermore the whole feature is still marked EXPERIMENTAL, basically because I do recognize it for the hack it is -- that said, some people might find it useful.

So a task can only belong to 1 group of any 1 controller (that is, it can belong to multiple groups, but only of different controllers), and since there is only 1 scheduler controller we can, for the purpose of this discussion, say it can only belong to a single group. These groups get assigned a bandwidth through the use of a period and runtime group parameter -- the documentation states that using different periods is currently broken and would require a deadline server. Therefore we can assume all periods are equal for the rest of this description -- and set it to 1s.

So what does it do: it's a hierarchical FIFO scheduler, with the prio of a group being that of the max prio of its children. If a group runs out of quota it will be dequeued. When the period timer comes along and refreshes the quota it will be requeued.

   R
  / \
 A   B
/ \  |\
1 4  C 3
     |
     2

groups in letters, tasks in digits. Assume the tasks are hogs with descending priority relative to their numbering, and that A has 40%, B has 30% and C has 20% bandwidth.

Let's first look at UP. Since 1 would be the highest priority task, A would have the priority of 1, and we'd select A->1 to run. This would continue for 400ms, after which the whole of A will be dequeued.
Next in line would be B->C->2, which can run for 200ms before C gets dequeued, leaving B->3 as the only option, which has the remaining 100ms of B's budget left to run. Once the refresh timer comes along, things get replenished and can run again.

SMP: the cgroup interface specifies bandwidth per single cpu; when more are present in the load-balance domain (see cpusets), the total bandwidth available to the group is the specified amount multiplied by the number of available cpus. The initial distribution is such that each cpu gets equal bandwidth.

Now on 2 cpus, we'd want A->1 and B->C->2 running, since those are the highest prio tasks on the system. Since we have 2 cpus, the budget for C is 200ms per cpu, 400ms total. For the first 200ms we'll run 1 on cpu0 and 2 on cpu1. At that point we'll find that cpu1's budget for C is depleted. It will then look at other cpus in the load-balance domain (cpu0) and transfer half of C's unused budget over to itself (with rounding up so we can indeed leave the other cpus with 0). This way C->2 can get to run for up to 400ms on cpu1. After that C gets dequeued and B->3 will take over as the next highest task.

Now, after 600ms total, B will have depleted its quota and the only tasks left are A->{1,4}. A will have consumed 600 of its 800ms, and will now have to spread this over 2 cpus. [ basically cpu0 gets to transfer less off of cpu1 because 4 will be consuming A's quota on it ]

Locks. Suppose they're not hogs and behave like proper tasks and complete their work within budget. In this case nobody will get throttled and we all live happily together. Now suppose one application, say 3, has a bug and runs out of quota whilst holding a lock. Further suppose that 1 contends on that lock. The implemented behaviour is that we PI boost 3. The group is temporarily placed back on the runqueue and we allow 3 to overcommit on its budget in order to release the lock.
This overcommit is accounted, and only once the budget turns positive again (due to sufficient refresh) will the group be requeued.

Now, why did I do things like this? Because doing a deadline CBS server:

- is non-trivial.
- would mean we'd have to deal with deadline inheritance.
- would significantly complicate the load-balancing.

Is any of that an excuse? No, not really, but it is something useful for some people, esp in the case of normal usage where you'd not hit the limits, and in case you do, you only penalize the one who does. So as a first approximation it seems to work in practice. I'm very glad Fabio is working on improving things. ^ permalink raw reply [flat|nested] 82+ messages in thread
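[Editorial aside: Peter's UP walkthrough can be checked with a toy simulation. This is a sketch under the stated assumptions -- a 1s period, budgets A=400ms, B=300ms, C=200ms, tasks that are hogs with priority 1 > 2 > 3 > 4, and a child's runtime charged to every ancestor group -- not kernel code.]

```python
# Toy simulation of the hierarchical FIFO example above (not kernel code).
# Group A (400ms) holds tasks 1 and 4; group B (300ms) holds task 3 and
# group C; group C (200ms) holds task 2. Running a task for one tick
# charges its whole chain of enclosing groups.

budget = {"A": 400, "B": 300, "C": 200}
chain = {1: ["A"], 4: ["A"], 3: ["B"], 2: ["B", "C"]}
ran = {1: 0, 2: 0, 3: 0, 4: 0}

for _ in range(1000):                      # one 1s period, 1ms ticks
    runnable = [t for t, gs in chain.items()
                if all(budget[g] > 0 for g in gs)]
    if not runnable:
        continue                           # idle until the refresh timer
    task = min(runnable)                   # lower number = higher prio
    ran[task] += 1
    for g in chain[task]:                  # charge the whole group chain
        budget[g] -= 1

print(ran)   # → {1: 400, 2: 200, 3: 100, 4: 0}
```

The result matches the walkthrough: task 1 runs for A's full 400ms, task 2 for C's 200ms (also charged to B), task 3 for B's remaining 100ms, and task 4 never runs because A is already depleted.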
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-16 8:58 ` Peter Zijlstra @ 2009-07-16 9:11 ` Peter Zijlstra 2009-07-17 0:32 ` Raistlin 2009-07-17 0:43 ` Raistlin 2 siblings, 0 replies; 82+ messages in thread From: Peter Zijlstra @ 2009-07-16 9:11 UTC (permalink / raw) To: Ted Baker Cc: Dhaval Giani, Chris Friesen, James H. Anderson, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg On Thu, 2009-07-16 at 10:58 +0200, Peter Zijlstra wrote: > On Wed, 2009-07-15 at 19:16 -0400, Ted Baker wrote: > > > > 1) The priority of a group seemed to be defined by the priority of > > > > the highest-priority thread in the group's run-queue, which means > > > > it varies dynamically according to which threads in the group are > > > > contending. > > > > > > > > > > This is true, but it also ensures that the time allocated to the group > > > is also consumed by group if it wants to. > > > > I don't see how schedulability analysis can be done with this model, > > since a single budget is being expended at varying priorities/deadlines. > > > > > > 4) On an SMP, more than one thread could be running against > > > > the same budget at the same time, resulting in budget over-charges. > > > > > > > > > > The rt group scheduler does split the budget per cpu. On expiring the > > > budget, it tries to borrow from other CPUs if possible. > > > > First, how is the splitting of the budget between CPU's controlled > > by the application? > > > > Second, I don't see how schedulabiliyt analysis could be done if > > CPU's can "borrow" budget from other CPUs, unless there is some > > mechanism in place to "pay it back". How do you do the analysis? > > Right so control-groups (cgroups for short) are a form of > virtualization. Each controller is specific to a resource. 
> We have memory controllers, namespace controllers and also a scheduler
> controller.
>
> If you would apply all controllers to the same task groups you get a
> result like chroot on steroids, or jails etc. But you can also use them
> individually to control resources in creative ways.

To add to this, it is by no means a way of introducing deadline scheduling to tasks; for that you're quite right in that we should extend sched_setscheduler() and struct sched_param. Somewhere before RTLWS we'll add:

struct sched_param2;

sys_sched_setscheduler2(pid_t pid, int policy, struct sched_param2 *param);
sys_sched_setparam2(pid_t pid, struct sched_param2 *param);
sys_sched_getparam2(pid_t pid, struct sched_param2 *param);

SCHED_SOFT_DEADLINE (bounded tardiness)
SCHED_DEADLINE (no tardiness)

and hopefully enough hooks to make people happy :-) The intention is to add these to -rt only for now, so that we can still poke at the interfaces and only then move them to mainline once the dust settles down. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-16 8:58 ` Peter Zijlstra 2009-07-16 9:11 ` Peter Zijlstra @ 2009-07-17 0:32 ` Raistlin 2009-07-17 0:43 ` Raistlin 2 siblings, 0 replies; 82+ messages in thread From: Raistlin @ 2009-07-17 0:32 UTC (permalink / raw) To: Peter Zijlstra Cc: Ted Baker, Dhaval Giani, Chris Friesen, James H. Anderson, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg [-- Attachment #1: Type: text/plain, Size: 3325 bytes --] On Thu, 2009-07-16 at 10:58 +0200, Peter Zijlstra wrote:

> Right so control-groups (cgroups for short) are a form of
> virtualization. Each controller is specific to a resource. We have
> memory controllers, namespace controllers and also a scheduler
> controller.
>
> If you would apply all controllers to the same task groups you get a
> result like chroot on steroids, or jails etc. But you can also use them
> individually to control resources in creative ways.
>
> In order to manage RT resources you want:
>
> - a minimum bandwidth guarantee
> - isolation
>
> So ideally you want a CBS server that schedules your RT (FIFO/RR) tasks.

Or, maybe, a fixed-priority RT scheduling class (for FIFO/RR tasks) with some kind of bandwidth-limiting fixed-priority algorithm for groups, such as a deferrable server (which is basically what you have now) or a sporadic server. What do you think about this?

The only thing you need to have this working is adding the capability of explicitly assigning priorities to groups! I'm sending some code for this in the coming days... It has some (even serious) issues, or at least some features that would need discussing, but it's a start.
Obviously, you can also have EDF in place, maybe as a different scheduling class than sched-rt, able to accommodate EDF tasks, if wanted, or again FIFO and RR task sets ("ghosts"), as in Fabio's proposal. To me, this seems the very best solution both for real-time people (who will get FP and EDF!) and for other users... And it also makes it easier to retain POSIX compatibility (if the system is properly configured) than if we add deadlines to sched-rt groups. But that's only my very humble opinion. :-D

> Furthermore the whole feature is still marked EXPERIMENTAL, basically
> because I do recognize it for the hack it is -- that said, some people
> might find it useful.

Hey, it is very useful for me actually!! Without it I would be sleeping right now... And not here coding and answering mails at this late time in the night!! :-P

> These groups get assigned a bandwidth through the use of a period and
> runtime group parameter -- the documentation states that using different
> periods is currently broken and would require a deadline server.

Or a priority, since we are in the fixed priority scheduling class?

> Therefore we can assume all periods are equal, for the rest of this
> description -- and set it to 1s.
>
> So what does it do, its a hierarchical FIFO scheduler, with the prio of
> a group being that of the max prio of its children. If a group runs out
> of quota it will be dequeued. When the period timer comes along and
> refreshes the quota it will be requeued.
>
>    R
>   / \
>  A   B
> / \  |\
> 1 4  C 3
>      |
>      2
>
> groups in letters, tasks in digits.
>
> [...]

WoW!! Very nice, thorough and clear explanation... Consider adding it to Documentation/!
:-D Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ---------------------------------------------------------------------- Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy) http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-16 8:58 ` Peter Zijlstra 2009-07-16 9:11 ` Peter Zijlstra 2009-07-17 0:32 ` Raistlin @ 2009-07-17 0:43 ` Raistlin 2 siblings, 0 replies; 82+ messages in thread From: Raistlin @ 2009-07-17 0:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Ted Baker, Dhaval Giani, Chris Friesen, James H. Anderson, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg [-- Attachment #1: Type: text/plain, Size: 1684 bytes --] On Thu, 2009-07-16 at 10:58 +0200, Peter Zijlstra wrote:

> Now, let me first state that the current code is a hack, and I know its
> nowhere near proper. But it was the best I could come up with on a short
> notice -- and Fabio is now looking at doing better :-)

If I can say something (more!), there is quite a lot of work in our lab on this stuff, starting from different perspectives and aiming at different goals. I've been involved, and continue to be, in a sporadic server implementation and an EDF implementation in a separate scheduling class (with Michael), while Fabio is doing a lot of (and a lot better!) work on deadlines in sched-rt.

Independently from what concerns priorities or deadlines, the way global scheduling is implemented, i.e., with distributed ready queues, raises a lot of issues with the implementation of _global_ _hierarchical_ scheduling policies of both kinds, EDF and FP. It has been very hard to figure out, and much more to implement, how things should go when you have push, pull, affinity, and so on! :-(

I'll be more precise when I have the code ready, but here the question is: do you think the push and pull approach is "here to stay", or is there room, maybe after trials, errors, experiments and an exhaustive search for the correct data structure, to migrate to something that would make global scheduling easier?
Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ---------------------------------------------------------------------- Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy) http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-15 23:16 ` Ted Baker 2009-07-16 8:58 ` Peter Zijlstra @ 2009-07-16 12:17 ` Raistlin 1 sibling, 0 replies; 82+ messages in thread From: Raistlin @ 2009-07-16 12:17 UTC (permalink / raw) To: Ted Baker Cc: Dhaval Giani, Chris Friesen, James H. Anderson, Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg [-- Attachment #1: Type: text/plain, Size: 1575 bytes --] On Wed, 2009-07-15 at 19:16 -0400, Ted Baker wrote:

> > > 1) The priority of a group seemed to be defined by the priority of
> > > the highest-priority thread in the group's run-queue, which means
> > > it varies dynamically according to which threads in the group are
> > > contending.
> >
> > This is true, but it also ensures that the time allocated to the group
> > is also consumed by group if it wants to.
>
> I don't see how schedulability analysis can be done with this model,
> since a single budget is being expended at varying priorities/deadlines.

Yes, I agree... Right now I'm looking at exactly this, and I have some stub code to provide rt groups with a priority the user can, if he wants, set through the cgroupfs. The main problem is dealing with the "distributed" scheduling with the push- and pull-based migration mechanism.

Equally interesting, to me, is trying to figure out what kind of analysis (if any!) could be inferred for the current implementation, which may be a hack --as Peter says-- but has some features I like... Working on it as well, but I progress slowly; I'm not that good at theoretical stuff yet! :-D

But that's another thread... I'll let you all know, if interested, soon I hope.
:-) Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ---------------------------------------------------------------------- Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy) http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-15 20:55 ` Ted Baker 2009-07-15 21:53 ` Chris Friesen @ 2009-07-16 23:29 ` Raistlin 2009-07-18 20:12 ` Michal Sojka 1 sibling, 1 reply; 82+ messages in thread From: Raistlin @ 2009-07-16 23:29 UTC (permalink / raw) To: Ted Baker Cc: James H. Anderson, Peter Zijlstra, Chris Friesen, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg [-- Attachment #1: Type: text/plain, Size: 2363 bytes --] On Wed, 2009-07-15 at 16:55 -0400, Ted Baker wrote:

> I hope that Linux will not pursue EDF or EDZL in only a narrow
> sense, as simple priority scheduling algorithms, without providing
> for bandwidth guarantees and limits.

Yes, me too...

> By bandwidth "guarantee" I mean that the task/scheduling entity is
> guaranteed to be able to at least *compete* at a certain level for
> a certain percentage of the CPU, if we cannot (better) provide an
> admission control mechanism that guarantees it will actually get a
> certain percentage of the CPU.

Again, I agree... But giving a group an explicit priority assignment, although very simple conceptually, raises a lot of issues when the current implementation of global task scheduling in Linux, with "distributed" (one per CPU) runqueues, is concerned.

> For example, in the fixed-priority domain, we have Sporadic
> Server. This guarantees the server a chance to compete at its top
> priority for at least sched_ss_init_budget time in every
> sched_ss_repl_period time, at sched_priority, within some
> tolerance. It also should not be permitted to use more than
> sched_ss_init_budget time in every sched_ss_repl_period time, at
> sched_priority, within some tolerance.

And that's why I'm trying what I said before...
For the sporadic server policy, e.g., to extend to and be applied to group scheduling, groups need to have priorities for the standard, already existing analysis to work.

> In the deadline scheduling domain, we have algorithms like
> Constant Bandwidth Server (and some improved variants) that
> provide similar guarantees and limits, in terms of deadlines. (I
> recall seeing one very good version in a paper I recently saw,
> which I could seek out and provide to the list if there is
> interest.)

Well, if it does not bother you too much, I'm curious about that solution... When you find some time, even via private mail, if I'm the only one, it would be nice if you could point that paper out to me. Thanks in advance. Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ---------------------------------------------------------------------- Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy) http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
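[Editorial aside: the sporadic-server contract Ted quotes above -- up to sched_ss_init_budget of execution in every sched_ss_repl_period at sched_priority -- can be sketched with toy budget accounting. This is a deliberate simplification (one replenishment per consumed chunk, scheduled one period after the chunk starts); the full POSIX replenishment rules are more involved.]

```python
import heapq

class SporadicServer:
    """Toy POSIX-style sporadic server budget accounting (illustrative
    only). The server may consume up to init_budget; each chunk consumed
    starting at time t is given back at t + repl_period.
    """
    def __init__(self, init_budget, repl_period):
        self.budget = init_budget
        self.period = repl_period
        self.repl = []                      # min-heap of (when, amount)

    def _catch_up(self, now):
        # Apply every replenishment that has come due by 'now'.
        while self.repl and self.repl[0][0] <= now:
            _, amount = heapq.heappop(self.repl)
            self.budget += amount

    def consume(self, now, amount):
        """Charge 'amount' of execution starting at time 'now'."""
        self._catch_up(now)
        assert amount <= self.budget, "would overrun the server budget"
        self.budget -= amount
        heapq.heappush(self.repl, (now + self.period, amount))
```

The invariant this models is the one Ted states: no more than the initial budget can ever be consumed inside any single replenishment period.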
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-16 23:29 ` Raistlin @ 2009-07-18 20:12 ` Michal Sojka 0 siblings, 0 replies; 82+ messages in thread From: Michal Sojka @ 2009-07-18 20:12 UTC (permalink / raw) To: Raistlin Cc: Ted Baker, James H. Anderson, Peter Zijlstra, Chris Friesen, Douglas Niehaus, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg On Friday 17 of July 2009 01:29:11 Raistlin wrote:

> On Wed, 2009-07-15 at 16:55 -0400, Ted Baker wrote:
> > In the deadline scheduling domain, we have algorithms like
> > Constant Bandwidth Server (and some improved variants) that
> > provide similar guarantees and limits, in terms of deadlines. (I
> > recall seeing one very good version in a paper I recently saw,
> > which I could seek out and provide to the list if there is
> > interest.)
>
> Well, if it does not bother you too much, I'm curious about that
> solution... When you find some time, even via private mail, if I'm the
> only one, it would be nice if you can point that paper out to me.

Hi, I'm also interested in this, so please send the pointers publicly, Michal ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 16:31 ` Peter Zijlstra 2009-07-14 16:54 ` James H. Anderson @ 2009-07-14 17:16 ` James H. Anderson 2009-07-15 21:19 ` Ted Baker 2009-07-14 19:54 ` Raistlin 2 siblings, 1 reply; 82+ messages in thread From: James H. Anderson @ 2009-07-14 17:16 UTC (permalink / raw) To: Peter Zijlstra Cc: Chris Friesen, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg, James H. Anderson On Tue, 14 Jul 2009, Peter Zijlstra wrote:

> On that, how does the priority ceiling/protection protocol extend to
> deadline schedulers?

Sorry -- I should have responded to this. These things are pretty easy to extend to deadline scheduling if partitioning is assumed, and this has been done. However, it is not easy to do under global scheduling. Having tasks pinned to processors (partitioning) makes it much easier to compute blocking terms (it's much easier to predict what is happening on each processor at any time). Without this, it is *really* hard to compute reasonable blocking terms. Even doing something as mundane as avoiding deadlocks without a lot of overhead may not be trivial.

At the time we developed the FMLP (a couple of years ago), there was *no* published locking protocol (to my knowledge) that worked under GEDF. To my knowledge, the FMLP is still the only one that works under GEDF. (BTW, I should say that I am not familiar with the PEP protocol that has been discussed in this thread. I assume it doesn't work under GEDF, or you wouldn't have asked the question.)

BTW, FIFO waiting makes blocking terms in the global case much easier to compute: once a lock request is initiated, the set of blocking lock requests (ones initiated earlier) is fixed. (This is actually a bit of an over-simplification if waiting is by suspending.)
With priority-based waiting, higher-priority requests can come later. Determining a reasonable bound on the number of such requests is hard. You can find references to papers by others in our FMLP papers. -Jim ^ permalink raw reply [flat|nested] 82+ messages in thread
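Jim's point about FIFO versus priority-ordered waiting can be illustrated with a small sketch (plain Python pseudocode, not kernel code; the queue layout is invented for illustration): under FIFO, the set of requests that can block a waiter is fixed the moment it enqueues, while under priority ordering a later, higher-priority request can still be inserted ahead of it.

```python
import bisect

# FIFO wait queue: later arrivals can never move ahead of A, so the
# set of requests that can block A is fixed when A enqueues.
fifo = ["A"]                                   # A starts waiting
blockers_at_enqueue = fifo.index("A")          # 0 requests ahead of A
fifo.append("C")                               # C arrives later...
assert fifo.index("A") == blockers_at_enqueue  # ...and cannot overtake A

# Priority-ordered wait queue (smaller number = higher priority): a
# later but higher-priority request is inserted *ahead* of A, so the
# number of requests that can block A keeps growing.
pq = [(3, "A")]
pos_before = [t for _, t in pq].index("A")     # 0 blockers ahead of A
bisect.insort(pq, (1, "C"))                    # later, but higher priority
pos_after = [t for _, t in pq].index("A")      # 1: one more blocker
```

This is exactly why the blocking bound is easy to state for the FIFO case and hard for the priority case.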
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 17:16 ` James H. Anderson @ 2009-07-15 21:19 ` Ted Baker 0 siblings, 0 replies; 82+ messages in thread From: Ted Baker @ 2009-07-15 21:19 UTC (permalink / raw) To: James H. Anderson Cc: Peter Zijlstra, Chris Friesen, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg On Tue, Jul 14, 2009 at 01:16:52PM -0400, James H. Anderson wrote: > ... BTW, I should say that I am not > familiar with the PEP protocol that has been discussed in this thread. > I assume it doesn't work under GEDF, or you wouldn't have asked the > question... I have not seen the definition of PEP, but from the context of this discussion I infer that it refers to an implementation of priority inheritance. As such, with pretty much any global scheduling policy, the set of other tasks whose critical sections could stack up is bounded only by the number of tasks in the system. In case I have misunderstood what PEP is, let me attempt to summarize what I have inferred: A high priority running task that would otherwise become blocked waiting for a lower-priority lock-holding task to release the lock can give up its priority/shot at execution to the lower-priority task that is blocking it. That is, when a task A is "blocked" for a lock it can stay in the run-queue so long as the task B that is (ultimately, transitively) blocking it is in (the same?) run-queue. At any point where the scheduler would choose to execute A it instead finds B, by traversing wait-for links between tasks, and executes B. The priority order of the run-queue can be based on any (partial) ordering relation, including deadlines.
A slight complexity is that if B is allowed to suspend itself while holding a lock, and does so, one must run back and also remove tasks like A from the run-queue, and when B wakes up, one must do the reverse. However, the frequency of deep nesting of wait-for relationships seems small. For GEDF on SMP, a question is how to handle the case where A is blocked on one processor and B may be running on a different one. This seems to require removing A from the run-queue when it is detected. Of course, the current Linux model appears not to fully support GEDF, since run-queues are per-processor, subject to explicit migration. So, as I infer from the preceding messages, the question above transforms into whether to migrate A to B's processor run-queue or to migrate B to A's processor run-queue? Ted ^ permalink raw reply [flat|nested] 82+ messages in thread
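The pick-then-follow mechanism Ted describes above can be sketched in a few lines (illustrative Python pseudocode with an EDF-ordered run-queue; the Task fields and names are invented, not kernel structures):

```python
class Task:
    def __init__(self, name, deadline):
        self.name = name
        self.deadline = deadline
        self.waiting_for = None    # task (transitively) holding the lock we want

def pick_next(runqueue):
    """Pick the earliest-deadline task; if it is blocked, traverse the
    wait-for links and execute the (ultimate) lock holder in its place."""
    chosen = min(runqueue, key=lambda t: t.deadline)
    while chosen.waiting_for is not None:
        chosen = chosen.waiting_for
    return chosen

b = Task("B", deadline=100)   # later deadline, holds the lock
a = Task("A", deadline=10)    # earliest deadline, wants the lock
a.waiting_for = b
runq = [a, b]
print(pick_next(runq).name)   # prints "B": B runs in A's place
```

Note that A stays in the run-queue while blocked, exactly as described; only the choice of what to execute is redirected.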
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 16:31 ` Peter Zijlstra 2009-07-14 16:54 ` James H. Anderson 2009-07-14 17:16 ` James H. Anderson @ 2009-07-14 19:54 ` Raistlin 2 siblings, 0 replies; 82+ messages in thread From: Raistlin @ 2009-07-14 19:54 UTC (permalink / raw) To: Peter Zijlstra Cc: James H. Anderson, Chris Friesen, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern Brandenburg [-- Attachment #1: Type: text/plain, Size: 2840 bytes --] On Tue, 2009-07-14 at 18:31 +0200, Peter Zijlstra wrote: > > but the discussion since has been entirely about synchronization issues. > > Right, this seems to be a very hot topic. > Indeed! :-D > Right, Ted holds similar opinions. > > Practically though, active Priority Inheritance has enormous benefits > for us simple people that need to get things done :-) > > It has allowed us to convert this huge mass of code into something that > is real-time enough for a lot of practical uses, including industrial > robots and the like without the need to analyze each and every lock out > there. > As said to you personally, I put quite a lot of efforts trying to find some code using PI-futexes directly or PTHREAD_PRIO_INHERIT POSIX mutexes, for personal curiosity and research interest... But I did not manage for now. :-( Do you think it would be possible to have some pointers? > > In this approach, critical-section lengths must be known, and any > > lock request that occurs when a task doesn't have sufficient > > budget is simply denied -- the request is done later when that task > > receives additional budget. This avoids a task in one group from > > holding a lock while it is preempted (which could severely hurt > > lock-requesting tasks in other groups). 
This scheme is really easy > > to implement in conjunction with the FMLP and it doesn't require > > complicated budget tracking. Its effects on blocking terms are > > also easy to analyze. Thomas Nolte and colleagues (in Sweden) have > > written some papers where they've used skip-based locks in > > hierarchically-scheduled systems. > > Not granting locks when the contender doesn't have enough budget to > finish them seems like a very attractive solution, however the cost of > having to analyze the critical section seems prohibitive. > I've always thought so... We have some work related to this as well (can give pointers if interested), but I personally like PI/BWI-PEP etc. because such a knowledge is not required at all... Anyway... > That said, we could maybe experiment with this for a few key lock sites. > ... This would be nice! > Furthermore we're limited by the existing legacy (both spec and code > wise), but I think we have been able to make good progress through tools > that help us, such as lockdep and the latency tracer (now ftrace), which > help us find trouble in an active manner. > Well, in my very humble opinion you're definitely doing great job! :-D Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ---------------------------------------------------------------------- Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy) http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
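The skip-based rule Jim describes (quoted above) is trivial to state in code; the cost, as Peter points out, is that the worst-case critical-section length must be known up front. A hypothetical sketch:

```python
def try_lock(lock_free, budget_remaining, cs_length):
    """Skip-based locking, sketched: grant the lock only if the caller has
    enough budget left to finish the critical section; otherwise deny, and
    let the caller retry after its budget is replenished. `cs_length` (the
    worst-case critical-section length) must be known in advance -- that is
    the analysis cost discussed above."""
    return lock_free and budget_remaining >= cs_length

print(try_lock(True, 5, 3))   # enough budget: granted
print(try_lock(True, 2, 3))   # would be preempted mid-section: denied
```

The attraction is that a lock is then never held across a budget exhaustion, so no critical section can be stranded by a preempted holder.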
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 14:48 ` Chris Friesen 2009-07-14 15:19 ` James H. Anderson @ 2009-07-14 16:48 ` Raistlin 2009-07-14 18:24 ` Chris Friesen 2009-07-15 21:45 ` Ted Baker 2 siblings, 1 reply; 82+ messages in thread From: Raistlin @ 2009-07-14 16:48 UTC (permalink / raw) To: Chris Friesen Cc: Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari [-- Attachment #1: Type: text/plain, Size: 2628 bytes --] On Tue, 2009-07-14 at 08:48 -0600, Chris Friesen wrote: > Let's call the highest priority task A, while the task holding the lock > (that A wants) is called B. Suppose we're on a dual-cpu system. > Ok. > According to your proposal above we would have A busy-waiting while B > runs with A's priority. When B gives up the lock it gets downgraded and > A acquires it and continues to run. > Yes. The first part of my proposal --the "theoretically safe" one. > Alternately, we could have A blocked waiting for the lock, a separate > task C running, and B runs with A's priority on the other cpu. When B > gives up the lock it gets downgraded and A acquires it and continues to run. > Right. And this is in my proposal as well. I'm just saying that the analysis gets more complicated, that some of the nice properties may be lost, and suggested a first draft of an idea to make it easy again... With, unfortunately, a lot of flaws in there yet. :-( In fact, I'm suggesting that, in the scenario you described, C should be able to run, but B's budget has to be affected somehow. > From an overall system perspective we allowed C to make additional > forward progress, increasing the throughput. As said, I agree with this from the very beginning of the whole thing! 
:-) > On the other hand, there > is a small window where A isn't running and it theoretically should be. > If we can bound it, would this window really cause so much difficulty > to the schedulability analysis? > Well, I'm not sure right now... I would need to do some math to be sure! Remember that all my points are concerned with budgets, i.e., a scenario where you have some means to limit the capability of a task to ask for CPU time over some kind of period. And here is where the problem comes in, since running C instead of having A busy waiting means: - that A is actually blocked, as said before; - that A's budget is not diminished. And this certainly improves system throughput but may affect its predictability and, in this particular case, tasks' isolation from one another... Anyway, if you are saying that this could be a possible theory-practice compromise, well, it could be... And I'm open and ready to (try to) contribute even in a discussion of such kind. :-) Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ---------------------------------------------------------------------- Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy) http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 16:48 ` Raistlin @ 2009-07-14 18:24 ` Chris Friesen 2009-07-14 19:14 ` Raistlin 2009-07-15 22:14 ` Ted Baker 0 siblings, 2 replies; 82+ messages in thread From: Chris Friesen @ 2009-07-14 18:24 UTC (permalink / raw) To: Raistlin Cc: Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari Raistlin wrote: > Remember that all my points are concerned with budgets, i.e., a scenario > where you have some mean to limit the capability of a task to ask for > CPU time over some kind of period. > And here it is where the problem comes since running C instead of having > A busy waiting means: > - that A is actually blocked, as said before; Why does it make any difference that A is blocked rather than busy waiting? In either case A cannot make forward progress. > - that A's budget is not diminished. If we're running B with A's priority, presumably it will get some amount of cpu time above and beyond what it would normally have gotten during a particular scheduling interval. Perhaps it would make sense to charge B what it would normally have gotten, and charge the excess amount to A? Chris ^ permalink raw reply [flat|nested] 82+ messages in thread
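Chris's charging suggestion can be sketched as a tiny accounting split (hypothetical helper, arbitrary time units): B is billed up to its normal share of the interval, and anything beyond that is billed to A, whose priority made the extra time available.

```python
def charge_inherited_runtime(runtime, b_normal_share):
    """Split the time B ran while boosted to A's priority: charge B what
    it would normally have received in this interval, and charge the
    excess to A, on whose behalf the extra time was consumed."""
    charged_to_b = min(runtime, b_normal_share)
    charged_to_a = runtime - charged_to_b
    return charged_to_b, charged_to_a

print(charge_inherited_runtime(7, 4))   # (4, 3): 3 excess units billed to A
print(charge_inherited_runtime(3, 4))   # (3, 0): no excess, A pays nothing
```
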
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 18:24 ` Chris Friesen @ 2009-07-14 19:14 ` Raistlin 2009-07-15 22:14 ` Ted Baker 1 sibling, 0 replies; 82+ messages in thread From: Raistlin @ 2009-07-14 19:14 UTC (permalink / raw) To: Chris Friesen Cc: Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari [-- Attachment #1: Type: text/plain, Size: 1305 bytes --] On Tue, 2009-07-14 at 12:24 -0600, Chris Friesen wrote: > > - that A is actually blocked, as said before; > > Why does it make any difference that A is blocked rather than busy > waiting? In either case A cannot make forward progress. > I think it's not a problem of A, but of the overall schedule, from a system predictability perspective. Anyway, we are still evaluating what the issues, if any, could be. > > - that A's budget is not diminished. > > If we're running B with A's priority, presumably it will get some amount > of cpu time above and beyond what it would normally have gotten during a > particular scheduling interval. > Right... > Perhaps it would make sense to charge B > what it would normally have gotten, and charge the excess amount to A? > Mmm.. That's right, but I'm not sure I get what happens while executing C... Anyway, it seems to me that we are getting closer to each other's point of view... let's stay in touch! 
:-D Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ---------------------------------------------------------------------- Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy) http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 18:24 ` Chris Friesen 2009-07-14 19:14 ` Raistlin @ 2009-07-15 22:14 ` Ted Baker 2009-07-16 7:17 ` Henrik Austad 2009-07-16 14:46 ` Chris Friesen 1 sibling, 2 replies; 82+ messages in thread From: Ted Baker @ 2009-07-15 22:14 UTC (permalink / raw) To: Chris Friesen Cc: Raistlin, Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari On Tue, Jul 14, 2009 at 12:24:26PM -0600, Chris Friesen wrote: > > - that A's budget is not diminished. > > If we're running B with A's priority, presumably it will get some amount > of cpu time above and beyond what it would normally have gotten during a > particular scheduling interval. Perhaps it would make sense to charge B > what it would normally have gotten, and charge the excess amount to A? First, why will B get any excess time, if it is charged? There will certainly be excess time used in any context switch, including preemptions and blocking/unblocking for locks, but that will come out of some task's budget. Given the realities of the scheduler, the front-end portion of the context-switch will be charged to the preempted or blocking task, and the back-end portion of the context-switch cost will be charged to the task to which the CPU is switched. In a cross-processor proxy situation like the one above we have four switches: (1) from A to C on processor #1; (2) from whatever else (call it D) that was running on processor #2 to B, when B receives A's priority; (3) from B back to D when B releases the lock; (4) from C to A when A gets the lock. A will naturally be charged for the front-end cost of (1) and the back-end cost of (4), and B will naturally be charged for the back-end cost of (2) and the front-end cost of (3). 
The budget of each task must be over-provisioned enough to allow for these additional costs. This is messy, but seems unavoidable, and is an important reason for using scheduling policies that minimize context switches. Back to the original question, of who should be charged for the actual critical section. From the schedulability analysis point of view, B is getting higher priority time than it normally would be allowed to execute, potentially causing priority inversion (a.k.a. "interference" or "blocking") to a higher priority task D (which does not even share a need for the lock that B is holding) that would otherwise run on the same processor as B. Without priority inheritance this kind of interference would not happen. So, we are benefiting A at the expense of D. In the analysis, we can either allow for all such interference in a "blocking term" in the analysis for D, or we might call it "preemption" in the analysis of D and charge it to A (if A has higher priority than D). Is the latter any better? I think not, since we now have to inflate the nominal WCET of A to include all of the critical sections that block it. So, it seems most logical and simplest to leave the charges where they naturally occur, on B. That is, if you allow priority inheritance, you allow tasks to sometimes run at higher priority than they originally were allocated, but not to execute more than originally budgeted. Ted ^ permalink raw reply [flat|nested] 82+ messages in thread
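Ted's four-switch accounting can be tabulated with a toy model (Python sketch; the per-switch costs are made-up numbers, and the billing rule is the one stated above: front-end cost to the task switched out, back-end cost to the task switched in):

```python
# Hypothetical front-end and back-end costs of one context switch
# (arbitrary units); the split between them is what matters, not the values.
FRONT, BACK = 2, 3

charges = {"A": 0, "B": 0, "C": 0, "D": 0}

def switch(out_task, in_task):
    """Bill the front-end of the switch to the outgoing task and the
    back-end (cache misses, etc.) to the incoming task."""
    charges[out_task] += FRONT
    charges[in_task] += BACK

switch("A", "C")   # (1) CPU#1: A blocks on the lock, C runs
switch("D", "B")   # (2) CPU#2: B receives A's priority, displacing D
switch("B", "D")   # (3) CPU#2: B releases the lock, D resumes
switch("C", "A")   # (4) CPU#1: A gets the lock and resumes

print(charges["A"])  # front-end of (1) + back-end of (4) = 5
print(charges["B"])  # back-end of (2) + front-end of (3) = 5
```

As the text says, C and D also pick up their natural halves of switches (1)-(4), which is why every budget must be over-provisioned a little.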
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-15 22:14 ` Ted Baker @ 2009-07-16 7:17 ` Henrik Austad 2009-07-16 23:13 ` Ted Baker 2009-07-16 14:46 ` Chris Friesen 1 sibling, 1 reply; 82+ messages in thread From: Henrik Austad @ 2009-07-16 7:17 UTC (permalink / raw) To: Ted Baker Cc: Chris Friesen, Raistlin, Peter Zijlstra, Douglas Niehaus, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari On Thursday 16 July 2009 00:14:11 Ted Baker wrote: > On Tue, Jul 14, 2009 at 12:24:26PM -0600, Chris Friesen wrote: > > > - that A's budget is not diminished. > > > > If we're running B with A's priority, presumably it will get some amount > > of cpu time above and beyond what it would normally have gotten during a > > particular scheduling interval. Perhaps it would make sense to charge B > > what it would normally have gotten, and charge the excess amount to A? > > First, why will B get any excess time, if it is charged? My understanding of PEP is that when B executes through the A-proxy, B will consume parts of A's resources until the lock is freed. This makes sense when A and B run on different CPUs and B is moved (temporarily) to CPU#A. If B were to use its own budget when running here, once A resumes execution and exhausts its entire budget, you can have over-utilization on that CPU (and under-util on CPU#B). > There will > certainly be excess time used in any context switch, including > preemptions and blocking/unblocking for locks, but that will come > out of some task's budget. AFAIK, there is no such thing as preemption-overhead charging to a task's budget in the kernel today. This time simply vanishes and must be compensated for when running a task through the acceptance-stage (say, only 95% util per CPU or some such). 
> Given the realities of the scheduler, > the front-end portion of the context-switch will be charged to the > preempted or blocking task, and the back-end portion of the > context-switch cost will be charged to the task to which the CPU > is switched. > In a cross-processor proxy situation like the one > above we have four switches: (1) from A to C on processor #1; (2) > from whatever else (call it D) that was running on processor #2 to > B, when B receives A's priority; (3) from B back to D when B > releasse the lock; (4) from C to A when A gets the lock. A will > naturally be charged for the front-end cost of (1) and the > back-end cost of (4), and B will naturally be charged for the > back-end cost of (2) and the front-end cost of (3). > > The budget of each task must be over-provisioned enough to > allow for these additional costs. This is messy, but seems > unavoidable, and is an important reason for using scheduling > policies that minimize context switches. > > Back to the original question, of who should be charged for > the actual critical section. That depends on where you want to run the tasks. If you want to migrate B to CPU#A, A should be charged. If you run B on CPU#B, then B should be charged (for the exact same reasoning A should be charged in the first case). The beauty of PEP, is that enabling B to run is very easy. In the case where B runs on CPU#B, B must be updated statically so that the scheduler will trigger on the new priority. In PEP, this is done automatically when A is picked. One solution to this, would be to migrate A to CPU#B and insert A into the runqueue there. However, then you add more overhead by moving the task around instead of just 'borrowing' the task_struct. > From the schedulability analysis point of view, B is getting > higher priority time than it normally would be allowed to execute, > potentially causing priority inversion (a.k.a. 
"interference" or > "blocking") to a higher priority task D (which does not even share > a need for the lock that B is holding) that would otherwise run on > the same processor as B. Without priority inheritance this kind > of interferfence would not happen. So, we are benefiting A at the > expense of D. In the analysis, we can either allow for all such > interference in a "blocking term" in the analysis for D, or we > might call it "preemption" in the analysis of D and charge it to A > (if A has higher priority than D). Is the latter any better? If D has higher priority than A, then neither A nor B (with the locks held) should be allowed to run before D. > I > think not, since we now have to inflate the nominal WCET of A to > include all of the critical sections that block it. > > So, it seems most logical and simplest to leave the charges where > they naturally occur, on B. That is, if you allow priority > inheritance, you allow tasks to sometimes run at higher priority > than they originally were allocated, but not to execute more > than originally budgeted. Yes, no task should be allowed to run more than the budget, but that requires B to execute *only* on CPU#B. On the other hand, one could say that if you run PEP and B is executed on CPU#A, and A then exhausts its budget, you could blame A as well, as lock-contention is a common problem and it's not only the kernel's fault. Do we need perfect or best-effort lock-resolving? > Ted -- henrik ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-16 7:17 ` Henrik Austad @ 2009-07-16 23:13 ` Ted Baker 2009-07-17 0:19 ` Raistlin 2009-07-17 7:31 ` Henrik Austad 0 siblings, 2 replies; 82+ messages in thread From: Ted Baker @ 2009-07-16 23:13 UTC (permalink / raw) To: Henrik Austad Cc: Chris Friesen, Raistlin, Peter Zijlstra, Douglas Niehaus, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari On Thu, Jul 16, 2009 at 09:17:09AM +0200, Henrik Austad wrote: > My understanding of PEP is that when B executes through the > A-proxy, B will consume parts of A's resources until the lock is > freed. This makes sense when A and B run on different CPUs and > B is moved (temporarily) to CPU#A. If B were to use its own > budget when running here, once A resumes execution and exhausts > its entire budget, you can have over-utilization on that CPU > (and under-util on CPU#B). To be sure we are using A and B the same way here: B is holding a lock. A wants that lock. A grants its priority to B until B releases the lock. How to look at the charges for usage seems not to have a perfect solution. That is, you can't get around the fact that either: (1) B is using some of its time at the wrong priority (priority inversion) or (2) B is using some of A's time to do B's work (the right priority, but the wrong task) The right way to resolve this conflict seems to depend a lot on where B runs, as well as whether you are managing budget per-CPU (partitioned model) or managing it globally (free migration model). 1) In a global scheduling model, it does not matter where B runs. We want to charge B's critical section to B, since our schedulability analysis is based on the processor usage of each task. It would be broken if A could be charged random bits of time for the execution of other tasks. So, we must charge B. 
2) In a partitioned scheduling model, we worry about the CPU utilization of each processor. We have several cases, depending on where B runs when it runs with A's priority: 2.1) If B gets to run on A's processor, we have a problem with processor overload unless we charge the usage to A. This is still a bad thing, though, since we then need to over-provision A to allow for this. 2.2) If B runs on its own processor, we want to use B's budget. 2.3) A third variation is that all the critical sections for A and B run on a separate synchronization processor. In that case, the synchronization processor needs its own budget, and in effect the critical sections of tasks A and B become aperiodic tasks in their own right. > AFAIK, there are no such things as preemption-overhead charging > to a task's budget in the kernel today. This time simply > vanishes and must be compensated for when running a task through > the acceptance-stage (say, only 95% util per CPU or some such). I don't see that anything can or does "vanish". Consider this sequence of events on one processor: 1. task A is running, and calls scheduler (synchronously or maybe via IRQ, say from timer) 2. scheduler computes how much time A has used recently, and charges it to A's budget The overhead of the IRQ handler and scheduler are therefore charged to A. 3. scheduler switches to B 4. B calls scheduler 5. scheduler computes how much time B has used recently, and charges it to B's budget The rest of the overhead of the context switch from A to B, including cache misses and page faults immediately following step 3, are now effectively charged to B. > > Back to the original question, of who should be charged for > > the actual critical section. > That depends on where you want to run the tasks. If you want to > migrate B to CPU#A, A should be charged. If you run B on CPU#B, > then B should be charged (for the exact same reasoning A should > be charged in the first case). 
As I mentioned above, it also depends on whether you are using a partitioned or global scheduling policy for your analysis. > The beauty of PEP, is that enabling B to run is very easy. In > the case where B runs on CPU#B, B must be updated statically so > that the scheduler will trigger on the new priority. In PEP, > this is done automatically when A is picked. One solution to > this, would be to migrate A to CPU#B and insert A into the > runqueue there. However, then you add more overhead by moving > the task around instead of just 'borrowing' the task_struct. > > From the schedulability analysis point of view, B is getting > > higher priority time than it normally would be allowed to execute, > > potentially causing priority inversion (a.k.a. "interference" or > > "blocking") to a higher priority task D (which does not even share > > a need for the lock that B is holding) that would otherwise run on > > the same processor as B. Without priority inheritance this kind > > of interference would not happen. So, we are benefiting A at the > > expense of D. In the analysis, we can either allow for all such > > interference in a "blocking term" in the analysis for D, or we > > might call it "preemption" in the analysis of D and charge it to A > > (if A has higher priority than D). Is the latter any better? > If D has higher priority than A, then neither A nor B (with the > locks held) should be allowed to run before D. Yes, but I only meant D has higher priority than B. A may have higher priority than D, but with multiple processors we can't just focus on the one highest priority task. We need to look at the (number of CPUs) highest priority tasks, or we may end up with our system behaving as if we have only one processor. This is a key difference with multiprocessor scheduling, and also with processor "share" scheduling. Expediting the highest priority task is *not* always the best strategy. 
Assuming that each task has some amount of slack, there is no great benefit in completing a high-priority task early, if that means causing other tasks to miss their deadlines. That is, if by delaying the highest-priority task a little bit on one processor you can keep more processors busy, you may be able to successfully schedule a set of tasks that could not be scheduled to meet deadlines otherwise. (I can give you an example of how this happens if you want, but it would be tedious if you can get my point without the example.) > Yes, no task should be allowed to run more than the budget, but > that requires B to execute *only* on CPU#B. As mentioned above, that depends on whether you are applying a global or partitioned model. With a global scheduling model the budget is not per-CPU. > On the other hand, one could say that if you run PEP and B is > executed on CPU#A, and A then exhausts its budget, you could > blame A as well, as lock-contention is a common problem and it's > not only the kernel's fault. Do we need perfect or best-effort > lock-resolving? Yes. For application locks, with application-managed partitioned scheduling, you can blame the application designer for building in cross-CPU contention. For kernel (hidden from the application) locks, it is a harder problem. I don't think there is a perfect solution for the partitioned scheduling model. As mentioned above, you get to choose between B causing short-term priority inversion for other tasks on B's processor (but not using more than its budget on the average), or B causing budget over-run for A on A's processor. Ted ^ permalink raw reply [flat|nested] 82+ messages in thread
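Ted's step-by-step argument that no time "vanishes" can be condensed into a minimal accounting sketch (illustrative Python, not kernel code; times are arbitrary units): at every scheduler entry, all time since the previous entry -- including IRQ, scheduler, and switch overhead -- is billed to whichever task was nominally current.

```python
class Accounting:
    """Toy model of per-task CPU accounting at scheduler entries."""
    def __init__(self):
        self.last = 0          # time of the previous scheduler entry
        self.current = None    # task nominally running since then
        self.used = {}         # accumulated charge per task

    def tick(self, now, next_task):
        """Scheduler entry at time `now`: bill the whole elapsed interval
        to the outgoing task, then switch to `next_task`."""
        if self.current is not None:
            self.used[self.current] = (
                self.used.get(self.current, 0) + (now - self.last))
        self.last = now
        self.current = next_task

acct = Accounting()
acct.tick(0, "A")     # A starts running
acct.tick(10, "B")    # all of 0..10 billed to A, including the overhead
                      # of the IRQ/scheduler work that ended A's slice
acct.tick(25, "A")    # 10..25 billed to B, including its switch-in costs
print(acct.used)      # {'A': 10, 'B': 15}
```

Nothing is left unattributed; the only question the thread debates is whether the attribution lands on the *right* task.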
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-16 23:13 ` Ted Baker @ 2009-07-17 0:19 ` Raistlin 2009-07-17 7:31 ` Henrik Austad 1 sibling, 0 replies; 82+ messages in thread From: Raistlin @ 2009-07-17 0:19 UTC (permalink / raw) To: Ted Baker Cc: Henrik Austad, Chris Friesen, Peter Zijlstra, Douglas Niehaus, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari [-- Attachment #1: Type: text/plain, Size: 2110 bytes --] On Thu, 2009-07-16 at 19:13 -0400, Ted Baker wrote: > To be sure we are using A and B the same way here: > B is holding a lock. > A wants that lock. > A grants its priority B until B releases the lock. > > How to look at the charges for usage seems not to have a perfect > solution. That is, you can't get around the fact that either: > > [...] > > The right way to resolve this conflict seems to depend a lot on > where B runs, as well as whether you are managing budget per-CPU > (partitioned model) or managing it globally (free migration > model). > > 1) In a global scheduling model, it does not matter where B runs. > We want to charge B's critical section to B, since our > schedulability analysis is based on the processor usage of each > task. It would be broken if A could be charged random bits of > time for the execution of other tasks. So, we must charge B. > Mmm... Why can't we make B able to exploit A's bandwidth to speed up critical section completion, to the benefit of A as well? I mean, it depends on how analysis is being carried out, and it is probably good or bad depending on what you care most, e.g., making all task's deadline or isolating the various components between each other... But I wouldn't say it is "broken". Especially because I implemented it once... Very rough, very experimental, far from being ready for anything different than some experiments... But it worked! 
:-) > 2) In a partitioned scheuling model, we worry about the > CPU utilization of each processor. We have several cases, depending > where B runs when it runs with A's priority: > Ok, I'm not going into this, since I need a little bit more time to figure out the details... I'm concentrating on global scheduling for now! :-D Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ---------------------------------------------------------------------- Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy) http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-16 23:13 ` Ted Baker 2009-07-17 0:19 ` Raistlin @ 2009-07-17 7:31 ` Henrik Austad 1 sibling, 0 replies; 82+ messages in thread From: Henrik Austad @ 2009-07-17 7:31 UTC (permalink / raw) To: Ted Baker Cc: Chris Friesen, Raistlin, Peter Zijlstra, Douglas Niehaus, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, Noah Watkins, Tommaso Cucinotta, Giuseppe Lipari [-- Attachment #1: Type: text/plain, Size: 10080 bytes --] On Friday 17 July 2009 01:13:23 Ted Baker wrote: > On Thu, Jul 16, 2009 at 09:17:09AM +0200, Henrik Austad wrote: > > My understanding of PEP is that when B executes through the > > A-proxy, B will consume parts of A's resources until the lock is > > freed. This makes sense when A and B run on different CPUs and > > B is moved (temporarily) to CPU#A. If B were to use its own > > budget when running here, once A resumes execution and exhausts > > its entire budget, you can have over-utilization on that CPU > > (and under-util on CPU#B). > > To be sure we are using A and B the same way here: > B is holding a lock. > A wants that lock. > A grants its priority to B until B releases the lock. I tried to be consistent, and I still can't see where I messed up, but this is what I meant. With the mixed-in CPU-notation: I got the feeling that partitioned scheduling was preferred (or the easiest to analyze), so I assumed that we had shifted from my initial global motivation to a more mature partitioned view :-) By CPU#A I meant the CPU where A is currently scheduled (in a partitioned setup). > How to look at the charges for usage seems not to have a perfect > solution. 
That is, you can't get around the fact that either: > > (1) B is using some of its time at the wrong priority (priority inversion) > > or > > (2) B is using some of A's time to do B's work (the right > priority, but the wrong task) > > The right way to resolve this conflict seems to depend a lot on > where B runs, as well as whether you are managing budget per-CPU > (partitioned model) or managing it globally (free migration > model). Yes. This has been one of my points all along, but I realize I've been rather bad at actually showing that point. > 1) In a global scheduling model, it does not matter where B runs. > We want to charge B's critical section to B, since our > schedulability analysis is based on the processor usage of each > task. It would be broken if A could be charged random bits of > time for the execution of other tasks. So, we must charge B. Yes, I agree completely. > 2) In a partitioned scheduling model, we worry about the > CPU utilization of each processor. We have several cases, depending > where B runs when it runs with A's priority: > > 2.1) If B gets to run on A's processor, we have a problem > with processor overload unless we charge the usage to A. This > is still a bad thing, though, since we then need to over-provision > A to allow for this. > > 2.2) If B runs on its own processor, we want to use B's budget. > > 2.3) A third variation is that all the critical sections for A and > B run on a separate synchronization processor. In that case, the > synchronization processor needs its own budget, and in effect the > critical sections of tasks A and B become aperiodic tasks in > their own right. An interesting solution. However, this means that the kernel needs to analyze all tasks before they can run, as it not only needs to keep track of the total "raw utilization" (the computational time required), but also of how much time is spent while holding locks. Throw in a few conditional statements, and finding this becomes somewhat of a challenge. 
> > AFAIK, there are no such things as preemption-overhead charging > > to a task's budget in the kernel today. This time simply > > vanishes and must be compensated for when running a task through > > the acceptance-stage (say, only 95% util per CPU or some such). > > I don't see that anything can or does "vanish". Consider this > sequence of events on one processor: > > 1. task A is running, and calls scheduler (synchronously or maybe > via IRQ, say from timer) > > 2. scheduler computes how much time A has used recently, and charges > it to A's budget > > The overhead of the IRQ handler and scheduler are therefore > charged to A. > > 3. scheduler switches to B > > 4. B calls scheduler > > 5. scheduler computes how much time B has used recently, > and charges it to B's budget > > The rest of the overhead of the context switch from A to B, including cache > misses and page faults immediately following 3 are now > effectively charged to B. Good point, it has to be this way. I see a lot of code in kernel/sched_stats.h that I should have been aware of. My apologies. > > > Back to the original question, of who should be charged for > > > the actual critical section. > > > > That depends on where you want to run the tasks. If you want to > > migrate B to CPU#A, A should be charged. If you run B on CPU#B, > > then B should be charged (for the exact same reasoning A should > > be charged in the first case). > > As I mentioned above, it also depends on whether you are using > a partitioned or global scheduling policy for your analysis. Yes. I guess we should split the thread in 2, or shift the topic, as we are discussing locking and how to resolve these. > > The beauty of PEP is that enabling B to run is very easy. In > > the case where B runs on CPU#B, B must be updated statically so > > that the scheduler will trigger on the new priority. In PEP, > > this is done automatically when A is picked. 
One solution to > > this would be to migrate A to CPU#B and insert A into the > > runqueue there. However, then you add more overhead by moving > > the task around instead of just 'borrowing' the task_struct. > > > > > From the schedulability analysis point of view, B is getting > > > higher priority time than it normally would be allowed to execute, > > > potentially causing priority inversion (a.k.a. "interference" or > > > "blocking") to a higher priority task D (which does not even share > > > a need for the lock that B is holding) that would otherwise run on > > > the same processor as B. Without priority inheritance this kind > > > of interference would not happen. So, we are benefiting A at the > > > expense of D. In the analysis, we can either allow for all such > > > interference in a "blocking term" in the analysis for D, or we > > > might call it "preemption" in the analysis of D and charge it to A > > > (if A has higher priority than D). Is the latter any better? > > > > If D has higher priority than A, then neither A nor B (with the > > locks held) should be allowed to run before D. > > Yes, but I only meant D has higher priority than B. A may have > higher priority than D, but with multiple processors we can't just > focus on the one highest priority task. We need to look at the > (number of CPUs) highest priority tasks, or we may end > up with our system behaving as if we have only one processor. Yes, well, that will always be a problem. But on the other hand, B *has* higher priority than D, since B is blocking A, and A has higher priority than D. This becomes, as you say, messy in a partitioned system as a priority does not mean *anything* outside a CPU. The same goes for a deadline, as you must see the deadline wrt the other tasks in the same domain (I specifically chose domain here as we can apply that to all types: partitioned, clustered and global). 
This is what makes me believe that a global algorithm will make several of these problems smaller, or even vanish completely, even if you have problems with caches going cold, high memory-bus traffic at frequent task preemption, strong lock-contention on the rq-locks etc. These are all implementation problems, not small problems, but they can be avoided and mended. A partitioned scheduler is a bit like a broken hammer; you cannot really expect to build a nice house with it :) > This is a key difference with multiprocessor scheduling, > and also with processor "share" scheduling. Yes, you can get into quite a mess if you're not careful. > Expediting the highest priority task is *not* always the > best strategy. Assuming that each task has some amount of slack, > there is no great benefit for completing a high-priority task > early, if that means causing other tasks to miss their deadlines. Which was the whole point of the initial email, to present EFF (or M^2LLF, or whatever is the most appropriate name for it) and get some feedback :-) > That is, if by delaying the highest-priority task a little bit > on one processor you can keep more processors busy, > you may be able to successfully schedule a set of tasks that > could not be scheduled to meet deadlines otherwise. > > (I can give you an example of how this happens if you want, but it > would be tedious if you can get my point without the example.) No, no, I think I understand what you're getting at. Dhall's effect would be one such. > > Yes, no task should be allowed to run more than the budget, but > > that requires B to execute *only* on CPU#B. > > As mentioned above, that depends on whether you are > applying a global or partitioned model. With a global > scheduling model the budget is not per-CPU. 
> > > On the other hand, one could say that if you run PEP and B is > > executed on CPU#A, and A then exhausts its budget, you could > > blame A as well, as lock-contention is a common problem and it's > > not only the kernel's fault. Do we need perfect or best-effort > > lock-resolving? > > Yes. For application locks, with application-managed partitioned > scheduling, you can blame the application designer for building > in cross-CPU contention. > > For kernel (hidden from the application) locks, it is a harder > problem. Which locks would that be? If a syscall results in lock-contention, shouldn't the app be aware of those locks? > I don't think there is a perfect solution for the partitioned > scheduling model. I know! :-) > As mentioned above, you get to choose between B > causing short-term priority inversion for other tasks on B's > processor (but not using more than its budget on the average), or > B causing budget over-run for A on A's processor. > > Ted -- henrik [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
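The "borrowing the task_struct" mechanism discussed above can be sketched like this (hypothetical names only; the real PEP code is considerably more involved): the scheduler still picks the most urgent task, but if that task is blocked it walks the blocking chain and dispatches the lock holder in its place.

```c
#include <stddef.h>
#include <assert.h>

struct toy_task {
    int prio;                     /* higher number = more urgent */
    struct toy_task *blocked_on;  /* holder of the lock we wait for, or NULL */
};

/* Follow the blocking chain from the scheduler's nominal choice to the
 * task that can actually make progress (the proxy). The proxy runs in
 * the chosen task's "slot", so accounting naturally lands on the
 * chosen task's budget.
 * NB: a real implementation must detect deadlock cycles here, or this
 * walk never terminates. */
static struct toy_task *pick_proxy(struct toy_task *chosen)
{
    while (chosen->blocked_on)
        chosen = chosen->blocked_on;
    return chosen;
}
```

The appeal is that no priority or deadline field ever has to be rewritten on either runqueue: the lock holder simply inherits the chosen task's position for as long as the chain exists.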
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-15 22:14 ` Ted Baker 2009-07-16 7:17 ` Henrik Austad @ 2009-07-16 14:46 ` Chris Friesen 2009-07-16 22:34 ` Ted Baker 1 sibling, 1 reply; 82+ messages in thread From: Chris Friesen @ 2009-07-16 14:46 UTC (permalink / raw) To: Ted Baker Cc: Raistlin, Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari Ted Baker wrote: > On Tue, Jul 14, 2009 at 12:24:26PM -0600, Chris Friesen wrote: > >>> - that A's budget is not diminished. >> If we're running B with A's priority, presumably it will get some amount >> of cpu time above and beyond what it would normally have gotten during a >> particular scheduling interval. Perhaps it would make sense to charge B >> what it would normally have gotten, and charge the excess amount to A? > So, it seems most logical and simplest to leave the charges where > they naturally occur, on B. That is, if you allow priority > inheritance, you allow tasks to sometimes run at higher priority > than they originally were allocated, but not to execute more > than originally budgeted. I had considered this myself, as the simplicity is appealing. In this scenario, we're going to disrupt B's scheduling pattern by forcing it to borrow from its future cpu allocation in order to free up the lock. It will then be forced to block for a while to make up for the over-use of its cpu budget. I was worried that this "bursty" behaviour could cause problems, but on reflection it seems like if the application is well-behaved to start with, maybe we can assume that lock-hold times will be small enough that this shouldn't cause significant problems. Chris ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-16 14:46 ` Chris Friesen @ 2009-07-16 22:34 ` Ted Baker 2009-07-16 23:07 ` Raistlin 0 siblings, 1 reply; 82+ messages in thread From: Ted Baker @ 2009-07-16 22:34 UTC (permalink / raw) To: Chris Friesen Cc: Raistlin, Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari On Thu, Jul 16, 2009 at 08:46:52AM -0600, Chris Friesen wrote: > > So, it seems most logical and simplest to leave the charges where > > they naturally occur, on B. That is, if you allow priority > > inheritance, you allow tasks to sometimes run at higher priority > > than they originally were allocated, but not to execute more > > than originally budgeted. > In this scenario, we're going to disrupt B's scheduling pattern by > forcing it to borrow from its future cpu allocation in order to free up > the lock. It will then be forced to block for a while to make up for > the over-use of its cpu budget. I guess by "disrupting" B's scheduling pattern, you mean only in the sense of delayed budget enforcement. That is, finishing the critical sections is something that B would do next in any case, and something that B would need to consume CPU time to do in any case. It is B doing its own normal work, but just getting a chance to complete that work sooner than it might otherwise. In this situation, you do want to allow B's budget to go negative (borrowing against B's future CPU budget to do B's work) rather than charging the task A who loaned B priority, since otherwise B ends up getting a still bigger windfall (charging B's work to A's budget). Ted BTW. I believe I understood what you are saying here, but I noticed that we are using words in different ways, especially "block" and "sleep", and I'm not sure about some of the other messages. 
It would be helpful if we had some common terminology. In particular, it might be useful to agree on distinct names for the following different reasons for a task not to be executing.
1) actively looping on a spin-lock, using CPU cycles
2) waiting for a condition or event, like timer or I/O completion, not using CPU cycles, not allowed to execute until the event or condition occurs
3) waiting for a CPU budget replenishment, not using CPU cycles
4) conceptually allowed to execute, but preempted on all available processors (by higher priority or non-preemptable tasks)
5) conceptually allowed to execute, but prevented from executing by lower-priority tasks that are not currently preemptable
Ted ^ permalink raw reply [flat|nested] 82+ messages in thread
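Purely to fix the vocabulary, the five cases above could be written down as code (illustrative only; these are not a proposal for actual kernel state names):

```c
#include <assert.h>

/* Five distinct reasons for a task not to be executing, matching the
 * numbered list in the message above. */
enum why_not_executing {
    WNE_SPINNING,       /* 1) looping on a spin-lock, consuming CPU cycles */
    WNE_EVENT_WAIT,     /* 2) waiting for a condition/event (timer, I/O),  */
                        /*    not using CPU cycles                         */
    WNE_THROTTLED,      /* 3) waiting for CPU budget replenishment         */
    WNE_PREEMPTED,      /* 4) runnable, but preempted on every available   */
                        /*    CPU by higher-priority/non-preemptable tasks */
    WNE_PRIO_INVERTED,  /* 5) runnable, but held up by lower-priority      */
                        /*    tasks that are not currently preemptable     */
};
```

Having distinct names matters because only case 1 burns cycles, cases 2 and 3 are voluntary or budget-driven sleeps, and cases 4 and 5 are the ones a schedulability analysis has to bound.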
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-16 22:34 ` Ted Baker @ 2009-07-16 23:07 ` Raistlin 0 siblings, 0 replies; 82+ messages in thread From: Raistlin @ 2009-07-16 23:07 UTC (permalink / raw) To: Ted Baker Cc: Chris Friesen, Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari [-- Attachment #1: Type: text/plain, Size: 3030 bytes --] On Thu, 2009-07-16 at 18:34 -0400, Ted Baker wrote: > > In this scenario, we're going to disrupt B's scheduling pattern by > > forcing it to borrow from its future cpu allocation in order to free up > > the lock. It will then be forced to block for a while to make up for > > the over-use of its cpu budget. > > I guess by "disrupting" B's scheduling pattern, you mean only in > the sense of delayed budget enforcement. That is, finishing the > critical sections is something that B would do next in any case, > and something that B would need to consume CPU time to do in any > case. It is B doing its own normal work, but just getting > a chance to complete that work sooner than it might otherwise. > > In this situation, you do want to allow B's budget to go negative > (borrowing against B's future CPU budget to do B's work) rather > than charging the task A who loaned B priority, since otherwise B > ends up getting a still bigger windfall (charging B's work to A's > budget). > When bandwidth and budget are in place, allowing a task to cross these boundaries while executing inside a critical section is another approach I have always liked... Provided you have a means to make it pay back its borrowed bandwidth, as you rightly say. However, I also think this raises an issue: what if something weird happens inside the critical section, such that the task overruns its declared/estimated duration? 
If no enforcement is in place, aren't we, at least temporarily, breaking the isolation? And I think bandwidth isolation between a possibly "failing" task and the rest of the system is exactly what we are trying to achieve when we use bandwidth reservation mechanisms? Thus, in scenarios where component isolation is what we care about most, and especially if good estimations of execution, critical section duration and blocking times are hardly possible, proxy execution/bandwidth inheritance approaches ensure the following: 1. the task executing the critical section gets some "bonus bandwidth" to speed up the completion of that code, but not in an uncontrolled manner: i.e., it can use _only_ the bandwidths of tasks blocked on that cs; 2. whatever goes on inside a task, including critical sections, will _only_ affect the ability to meet their deadlines of: - the task executing the cs; - the tasks interacting with it, i.e., accessing the same cs. Any other component/task in the system will be completely unaware of anything! Obviously, this comes at the price of overhead, tricky implementation, etc. ... I don't know if it is always worth it, probably it is not, but I think it is still something remarkable! :-) Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ---------------------------------------------------------------------- Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy) http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
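A toy model of the deferred enforcement being discussed in this subthread (all names invented): the budget may go negative while the task is inside a critical section, and the overrun is deducted from the next replenishment, so isolation is broken only for a bounded time.

```c
#include <assert.h>

struct toy_task {
    long budget_us;     /* may go negative inside a critical section */
    int  in_cs;         /* nonzero while holding a lock */
    int  throttled;     /* nonzero when enforcement has kicked in */
};

/* (Toy) enforcement tick: charge the time just consumed, but defer
 * throttling while the task is inside a critical section. */
static void tick(struct toy_task *t, long ran_us)
{
    t->budget_us -= ran_us;
    if (t->budget_us <= 0 && !t->in_cs)
        t->throttled = 1;       /* normal enforcement */
    /* inside a c.s.: keep running, budget goes negative */
}

/* Periodic replenishment: a negative balance is paid back here, so the
 * borrowed bandwidth is recovered from the task's own future quota. */
static void replenish(struct toy_task *t, long quantum_us)
{
    t->budget_us += quantum_us;
    if (t->budget_us > 0)
        t->throttled = 0;
}
```

Note that with this rule the overrun is bounded by the longest critical section, which is exactly why unbounded critical sections (the "something weird" case above) remain the hard part.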
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 14:48 ` Chris Friesen 2009-07-14 15:19 ` James H. Anderson 2009-07-14 16:48 ` Raistlin @ 2009-07-15 21:45 ` Ted Baker 2009-07-15 22:12 ` Chris Friesen 2 siblings, 1 reply; 82+ messages in thread From: Ted Baker @ 2009-07-15 21:45 UTC (permalink / raw) To: Chris Friesen Cc: Raistlin, Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari On Tue, Jul 14, 2009 at 08:48:26AM -0600, Chris Friesen wrote: > Let's call the highest priority task A, while the task holding the lock > (that A wants) is called B. Suppose we're on a dual-cpu system. > > According to your proposal above we would have A busy-waiting while B > runs with A's priority. When B gives up the lock it gets downgraded and > A acquires it and continues to run. > > Alternately, we could have A blocked waiting for the lock, a separate > task C running, and B runs with A's priority on the other cpu. When B > gives up the lock it gets downgraded and A acquires it and continues to run. > > From an overall system perspective we allowed C to make additional > forward progress, increasing the throughput. On the other hand, there > is a small window where A isn't running and it theoretically should be. > If we can bound it, would this window really cause so much difficulty > to the schedulability analysis? I have two questions: (1) As Jim Anderson points out in a later comment, is priority inheritance (of any form) what you really want on an SMP system? It does not give a good worst-case blocking time bound. (2) If you do want priority inheritance, how do I implement the mechanics of the unlock operation of B on one processor causing C to be preempted on the other processor, simply and promptly? 
A general solution needs to account for having multiple tasks in the role of A for any given B, and possibly chains of such tasks (and, of course, potential deadlock cycles). For a relatively simple example, A1 (on CPU1) -blocked_by-> B (on CPU2) C (lower priority on CPU1) A2 (preempts C on CPU1) -blocked_by-> B (CPU2) A3 (on CPU3) -blocked_by-> B (on CPU2) B releases the lock that is blocking A1, A2, and A3. Do I need to wake up A1, A2, and A3? Maybe I should only wake up A2 and A3? Can I pick just one to wake up? Then, how do I implement the wake-ups? I and a student of mine implemented something like this on a VME-bus based SMP system around 1990. We decided to only wake up the highest (global) priority task. (In the case above, either A2 or A3, depending on priority.) We did this using compare-and-swap operations, in a way that I recall ended up using (for each lock) something like one global spin-lock variable, one "contending" variable per CPU, and one additional local spinlock variable per CPU to avoid bus contention on the global spin-lock variable. We used a VME-bus interrupt for the lock-releasing CPU to invoke the scheduler of the CPU of the task selected to next receive the lock. The interrupt handler could then do the job of waking up the corresponding contending task on that CPU. It worked, but such global locks had a lot more overhead than other locks, mostly due to the inter-processor interrupt. So, we ended up distinguishing global locks from per-CPU locks to lower the overhead when we did not need it. We were using a partitioned scheduling model, or else this would be a bit more complicated, and I would be talking about the CPU selected to run the task selected to next receive the lock. Is there a more efficient way to do this? Or are you all ready to pay this kind of overhead on every lock in your SMP system? Ted ^ permalink raw reply [flat|nested] 82+ messages in thread
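A much-simplified, single-threaded sketch of that handoff (the 1990 implementation used compare-and-swap and a VME-bus interrupt; here the IPI is a stub and every name is invented): on unlock, scan the per-CPU contending slots, grant the lock directly to the highest-priority waiter, and kick that waiter's CPU.

```c
#include <assert.h>

#define NCPUS 4

struct toy_glock {
    int owner_cpu;              /* -1 when the lock is free */
    int contending[NCPUS];      /* priority of waiter per CPU, 0 = none */
};

/* Stub IPI: just records which CPU's scheduler we would have invoked. */
static int last_ipi_cpu = -1;
static void send_ipi(int cpu) { last_ipi_cpu = cpu; }

/* Release the lock and hand it straight to the best waiter, if any,
 * so only one CPU is woken per release. */
static void glock_release(struct toy_glock *l)
{
    int best = -1, best_prio = 0;
    for (int cpu = 0; cpu < NCPUS; cpu++) {
        if (l->contending[cpu] > best_prio) {
            best_prio = l->contending[cpu];
            best = cpu;
        }
    }
    l->owner_cpu = best;        /* -1 if nobody was waiting */
    if (best >= 0) {
        l->contending[best] = 0;
        send_ipi(best);         /* wake the winner's scheduler */
    }
}
```

The real scheme must of course make the scan and handoff atomic against concurrent lock/unlock attempts, which is where the per-lock global spin-lock variable and the per-CPU variables come in.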
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-15 21:45 ` Ted Baker @ 2009-07-15 22:12 ` Chris Friesen 2009-07-15 22:52 ` Ted Baker 0 siblings, 1 reply; 82+ messages in thread From: Chris Friesen @ 2009-07-15 22:12 UTC (permalink / raw) To: Ted Baker Cc: Raistlin, Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari Ted Baker wrote: > I have two questions: > > (1) As Jim Anderson points out in a later comment, > is priority inheritance (of any form) what you really want > on an SMP system? It does not give a good worst-case blocking > time bound. As Peter mentioned, it's not so much a matter of whether it's desired or not. It's required, at least in terms of supporting priority inheritance for pthread mutexes. I haven't been involved with the linux-rt side of things, but I believe Peter implied that they used PI fairly heavily there to try to minimize priority inversion issues because it was infeasible to analyze all the various locks. The kernel is over 10 million lines of code, so any locking changes pretty much have to fit into the existing framework with minimal changes. > (2) If you do want priority inheritance, how do I implement the > mechanics of the unlock operation of B on one > processor causing C to be preempted on the other processor, simply > and promptly? <snip> > I and a student of mine implemented something like this on a > VME-bus based SMP system around 1990. We decided to only wake up > the highest (global) priority task. (In the case above, either A2 > or A3, depending on priority.) 
We did this using compare-and-swap > operations, in a way that I recall ended up using (for each lock) > something like one global spin-lock variable, one "contending" > variable per CPU, and one additional local spinlock variable per > CPU to avoid bus contention on the global spin-lock variable. We > used a VME-bus interrupt for the lock-releasing CPU to invoke the > scheduler of the CPU of the task selected to next receive the > lock. The interrupt handler could then do the job of waking up > the corresponding contending task on that CPU. > > It worked, but such global locks had a lot more overhead than > other locks, mostly due to the inter-processor interrupt. > So, we ended up distinguishing global locks from per-CPU > locks to lower the overhead when we did not need it. > > We were using a partitioned scheduling model, or else this > would be a bit more complicated, and I would be talking about the > CPU selected to run the task selected to next receive the lock. > > Is there a more efficient way to do this? or are you all > ready to pay this kind of overhead on every lock in your SMP > system? Peter would have a better idea of the low-level implementation than I, but I suspect that the general idea would be similar, unless we were to consider migrating the task chosen to run to the current cpu. That would save the IPI, at a cost of task migration. (Which of course wouldn't be possible in the case of cpu affinity.) As to the additional cost, POSIX distinguishes between PI mutexes and regular mutexes. Any additional penalty due to PI should be limited to the mutexes which actually have PI enabled. If an application can limit itself to "safe" syscalls and has enough knowledge of its own locking, then it could presumably use regular mutexes and avoid the overhead of PI. I'm not sure the same could apply to the kernel in general...Peter would have to address that as I'm not familiar with the linux-rt changes. 
Chris ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-15 22:12 ` Chris Friesen @ 2009-07-15 22:52 ` Ted Baker 0 siblings, 0 replies; 82+ messages in thread From: Ted Baker @ 2009-07-15 22:52 UTC (permalink / raw) To: Chris Friesen Cc: Raistlin, Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari On Wed, Jul 15, 2009 at 04:12:00PM -0600, Chris Friesen wrote: > As Peter mentioned, it's not so much a matter of whether it's desired or > not. It's required, at least in terms of supporting priority > inheritance for pthread mutexes. I haven't been involved with the > linux-rt side of things, but I believe Peter implied that they used PI > fairly heavily there to try to minimize priority inversion issues > because it was infeasible to analyze all the various locks. The kernel > is over 10 million lines of code, so any locking changes pretty much > have to fit into the existing framework with minimal changes. ... > Any additional penalty due to PI should be limited to > the mutexes which actually have PI enabled. If an application can limit > itself to "safe" syscalls and has enough knowledge of its own locking, > then it could presumably use regular mutexes and avoid the overhead of PI. > > I'm not sure the same could apply to the kernel in general...Peter would > have to address that as I'm not familiar with the linux-rt changes. I understand the need to support the (largely broken for SMP) POSIX real-time API, but from what I read of the scheduler it seems that support for priority inheritance on mutexes has greatly complicated the kernel, and that the cost extends to applications that do not use priority inheritance mutexes. I especially don't see why priority inheritance mutexes should ever be used by the kernel itself. 
Is there a way to provide in the kernel whatever minimal hooks are needed to support the API for users who want it, with less impact? My experience on IEEE PASC and the Austin Group, including work with POSIX validation test suites, has shown that many of the POSIX specifications leave a huge amount of room for implementation freedom -- much larger than I thought when I balloted them. I found this out when I wrote test programs that implementors argued went beyond the specs, when I complained as user about POSIX implementations that I thought did not meet the specs, and when I read interpretation requests from other users and implementors. So, I think it would be worth the effort to read the POSIX specification for PTHREAD_PRIO_INHERIT carefully, with a lawyerly attitude, to see if it really requires going to the lengths the Linux kernel currently goes. That is, look for loopholes that permit Linux to do the right thing. I can't do this today, but maybe I will another day. Ted ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 10:47 ` Raistlin 2009-07-14 11:03 ` Peter Zijlstra 2009-07-14 14:48 ` Chris Friesen @ 2009-07-17 13:35 ` Giuseppe Lipari 2 siblings, 0 replies; 82+ messages in thread From: Giuseppe Lipari @ 2009-07-17 13:35 UTC (permalink / raw) To: Raistlin Cc: Chris Friesen, Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari [-- Attachment #1: Type: text/plain, Size: 3973 bytes --] Dear all, a few considerations on the BWI scheme, which was mentioned by Peter and by Raistlin a few messages ago. I do not know the PEP scheme very well; from the discussion until now, the basic idea looks pretty similar to our protocol. The BWI is described in this paper: [1] http://feanor.sssup.it/~lipari/papers/rtss2001-bwi.ps.gz (I just removed the password, feel free to download it, I hope IEEE lawyers will not chase me!). In essence, when a task is blocked on a lock, its budget may be consumed by the task holding the lock. A task can be assigned one or more pairs (budget,deadline), one is its original one, the others are "inherited" when holding a lock on which other tasks are blocked. This scheme has advantages and disadvantages. The advantages are: 1) Isolation. It is easy to see that the budget of a task can be stolen only by tasks that share common locks, directly or indirectly. For the sake of simplicity, consider only non-nested locks. If two tasks A and B do not share any lock, then they cannot influence each other. (In the paper, this property is generalized to nested locks). This property is useful if we want to isolate the behavior of one (group of) task(s) from the others. For example, a "hard real-time" task (group) must be isolated from soft real-time tasks. 
Under EDF, other classical protocols (like PI, SRP) do not have the same property, because letting a task use a critical section for longer than expected can jeopardize the schedulability of any other task. 2) Simpler admission test. With other schemes (PI, SRP), it is necessary to compute blocking times for all tasks, which depend on the length of the critical sections. The blocking times are then used in the admission test formula. However, if one blocking time is wrongly calculated, at run-time any task can miss its deadline. With BWI, instead, the admission test is the simpler \sum U_i \leq 1, and does not depend on the length of the critical section. 3) Under the assumption that tasks do not block inside a critical section, BWI can be implemented in a fairly simple way. 4) It is indeed possible (under certain conditions) to verify the schedulability of a hard real-time task H by knowing the length of all critical sections of all tasks that share some lock with H. The very complex procedure is explained in the paper. However, BWI has some disadvantages: 1) It is only for single processors. 2) From a schedulability point of view, if we want to guarantee that every task always respects its deadline, when computing its budget we must calculate the "interference" of other tasks. Since we have to do this for every "hard" task, we may end up counting the same critical section several times, thus wasting some bandwidth. This is better explained in the paper (section 6.4.1). 3) If a task is allowed to block in the middle of a critical section (for example, due to a page fault), the implementation becomes much more complex. Concerning point 1, as Raistlin pointed out (see his e-mails), he is working on extending BWI to multiprocessors, also from a theoretical point of view. It is possible that the result will be similar to the one obtained by Douglas Niehaus with PEP. We hope he will be able to add some "theoretical spice"! 
Concerning point 2, this cannot be avoided. We believe that BWI is useful for open systems (where tasks are created/killed dynamically) and for soft real-time systems. However, under certain conditions, it allows one to schedule and analyse hard real-time tasks, even when mixed with soft real-time tasks.

Concerning point 3, this is the most difficult point. In fact, achieving an efficient implementation in this case seems very unlikely. However, we believe that blocking inside a critical section is probably a rare event, so it may be possible to afford some extra overhead in such unfortunate cases.

I hope I clarified some obscure issues with BWI!

Regards,

Giuseppe Lipari

[-- Attachment #2: giuseppe_lipari.vcf --]
[-- Type: text/x-vcard, Size: 181 bytes --]

begin:vcard
fn:Giuseppe Lipari
n:Lipari;Giuseppe
email;internet:giuseppe.lipari@sssup.it
tel;work:+39 050882030
tel;fax:+39 050882003
tel;cell:+39 3480718908
version:2.1
end:vcard

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
  2009-07-13 15:44 ` Raistlin
  2009-07-13 16:33 ` Chris Friesen
@ 2009-07-13 17:25 ` Peter Zijlstra
  2009-07-13 18:14 ` Noah Watkins
  2009-07-13 17:28 ` Peter Zijlstra
  2 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2009-07-13 17:25 UTC (permalink / raw)
To: Raistlin
Cc: Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari

On Mon, 2009-07-13 at 17:44 +0200, Raistlin wrote:
> > PIP doesn't suffer this, but does suffer the pain from having to
> > reimplement the full schedule function on the waitqueues, which when you
> > have hierarchical scheduling means you have to replicate the full
> > hierarchy per waitqueue.
>
> And, further than this, at least from my point of view, if you have
> server/group based scheduling, and in general some kind of budgeting or
> bandwidth enforcing mechanism in place, PIP is far from being a
> solution...

I think you can extend PIP to include things like bandwidth inheritance too. Instead of simply propagating the priority through the waitqueue hierarchy, you can pass the actual task around, and having this task you can indeed consume its bandwidth etc..

But sure, hierarchical scheduling and things really complicate the waitqueue implementation.

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
  2009-07-13 17:25 ` Peter Zijlstra
@ 2009-07-13 18:14 ` Noah Watkins
  2009-07-13 20:13 ` Ted Baker
  2009-07-14  9:15 ` Peter Zijlstra
  0 siblings, 2 replies; 82+ messages in thread
From: Noah Watkins @ 2009-07-13 18:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari

> I think you can extend PIP to include things like bandwidth inheritance
> too. Instead of simply propagating the priority through the waitqueue
> hierarchy, you can pass the actual task around, and having this task you
> can indeed consume its bandwidth etc..

I think I understand what you are suggesting by "pass the actual task around". If not, this message might seem out of place.

Consider this locking chain/graph:

   A-->L1-->B-->L2-->C
   D-->L3-->E-->L2

The way I understand what you are suggesting is that tasks A,B,D,E (or some representation of them) can be passed around such that when C executes it consumes some resource associated with the tasks it is blocking (A,B,D,E). Obviously which tasks and what quantities are dependent on scheduling semantics and configuration.

The biggest implementation hurdle we have encountered is that as tasks are passed forward along a locking chain, knowledge about the structure of the locking chain is lost. For example, as C releases L2, C must be able to distinguish between A and D as having been passed to C's location through B or E, respectively.

Keeping a representation of the locking chain as a full graph is the solution we have used. Another solution is to allow A and D to re-block and walk the locking chain again, but obviously some thundering-herd issues arise.

noah

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
  2009-07-13 18:14 ` Noah Watkins
@ 2009-07-13 20:13 ` Ted Baker
  2009-07-13 21:45 ` Chris Friesen
  2009-07-14  9:15 ` Peter Zijlstra
  1 sibling, 1 reply; 82+ messages in thread
From: Ted Baker @ 2009-07-13 20:13 UTC (permalink / raw)
To: Noah Watkins
Cc: Peter Zijlstra, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari

I am happy that Peter has included me in this discussion. I will have more to say after I have seen a few more messages, absorbed the context better, and had some time to think things over. I guess that, being a newcomer to this discussion, I may re-open some issues that have already been discussed. However, I would like to comment on a few things right off.

I would hope that all of the people working on the kernel could agree on some basic goals, principles, and policies, which could be kept stable, and would guide detailed design decisions.

In the real-time domain, I would hope that a goal would be to support application-level analysis of schedulability, in a modular, decomposable way. That is, it should be possible to write the time-critical portions of any application in a way that if the system calls the application makes all succeed, the application gets a guaranteed level of service that enables it to meet its timing constraints, regardless of whatever else is going on in the system.

One part of this is support for a scheduling policy that enables the application to request some guaranteed minimum level of CPU bandwidth, and at the same time imposes an upper limit on how much it can interfere with other tasks. One way this model has been abstracted is in terms of a pair of parameters for each of these two bounds: a level of service, expressed as a fraction of CPU(s), and a lag time.
The lag time is affected by a number of specific factors, including the maximum time that a thread may be blocked due to waiting for a lock. The priority inheritance protocol (PIP) has been proposed as a way of bounding the duration of such blocking in the context of very simple task models (such as fixed-priority preemptive scheduling of periodic tasks), where the blocking is viewed as "priority inversion".

As several messages mentioned, and as I observed in the kernel code (to my disgust), the Linux PIP implementation is a nightmare: extremely heavy weight, involving maintenance of a full wait-for graph, and requiring updates for a range of events, including priority changes and interruptions of wait operations. I recognize that this complexity is a product of the desire to provide an implementation that does the right thing in all cases, but one needs to keep a sense of proportion. When one ends up having to solve a more complex mutual exclusion problem (on the wait-for graph and task priorities) in order to implement a mutual exclusion primitive, you have a case of abstraction inversion -- something is out of whack. It is sad enough that we have to implement PTHREAD_PRIO_INHERIT for POSIX portable applications, but to use this internally within the OS, or to build more complexity on top of it, seems totally wrong.

For schedulability analysis, one just needs a way to bound the duration of priority inversion. Simple non-preemption (Linux spinlock_t) is sufficient for that, and it is easy to implement. You just have to be careful not to voluntarily suspend (give up the processor) while holding a lock. The SRP (Ada Ceiling_Locking and POSIX PTHREAD_PRIO_PROTECT) is a slight refinement, also extremely lightweight to implement with reasonable restrictions (e.g., no blocking while holding a lock, priority changes deferred while a lock is held).
The priority inheritance protocol (PIP) has always been a problem for schedulability analysis, because priority inversion is not bounded to a single critical section; one must account for the worst-case chain of blockings -- even before one considers the distributed overhead (meaning overhead paid by non-users of the feature) and the implementation complexity.

The only selling point for PIP has been the ability of a thread to suspend itself while holding a lock, such as to wait for completion of an I/O operation. I would argue that this practice is generally a sign of poor design, and it certainly throws out the notion of bounding the priority inversion due to blocking on a lock for schedulability analysis -- since now the lock-holding time can depend on I/O completion time, timers, etc.

I do have sympathy for what you seem to be calling the "proxy" implementation model of priority inheritance. That is, the scheduler chooses a process; then, if that process is blocked, it walks down the wait-for chain until it finds a process that is not blocked. A beauty of this approach is that you only need to maintain the one wait-for link, and you only modify it when a task blocks. There is a small overhead on the lock and unlock operations, but no other overhead, since one never traverses these links unless blocking has occurred and one is about to schedule a blocked process. In particular, one gets around a case that I found very nasty if one tries to do all the inheritance-caused priority changes explicitly; that is, when a mutex-wait operation times out you have to recompute all the priorities, and this involves traversal of the reverse wait-for relation, which is many-to-one.

The first case of its use that I saw published was for the real-time Unix variant developed for the Harris SMP computers, which were sold under the name Nighthawk. I don't recall how they handled some of the details I've seen mentioned in these recent messages.
I have not been able to find any on-line publications for that version of Unix (which I think may be what I see cited a few places on the web as CX/RT). I once had a copy of the book Real-time Unix Systems, which I think might have touched on this (http://www.flipkart.com/real-time-unix-systems-borko/0792390997-y6w3fq6ipb), but a student walked away with my copy. I just sent an e-mail to Borko Furht, one of the authors, and will let you know if I can get any information from him.

This model does appear to extend to models of scheduling appropriate for virtual processors, which enforce both upper and lower bounds on processor bandwidth (e.g., sporadic server, constant-bandwidth server): we do want to allow tasks that have reached their current budget allocation to continue, if they are holding a lock, long enough to release the lock. However, that can also be accomplished using non-preemption, or the SRP.

Regarding the notion of charging proxy execution to the budget of the client task, I have grave concerns. It is already hard enough to estimate the amount of budget that a real-time task requires, without this additional complication.

Ted Baker

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
  2009-07-13 20:13 ` Ted Baker
@ 2009-07-13 21:45 ` Chris Friesen
  2009-07-14 11:16 ` Raistlin
  2009-07-15 23:11 ` Ted Baker
  0 siblings, 2 replies; 82+ messages in thread
From: Chris Friesen @ 2009-07-13 21:45 UTC (permalink / raw)
To: Ted Baker
Cc: Noah Watkins, Peter Zijlstra, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari

Ted Baker wrote:
> I recognize that this complexity is a product of the desire to
> provide an implementation that does the right thing in all cases,
> but one needs to keep a sense of proportion. When one ends up having
> to solve a more complex mutual exclusion problem (on the wait-for
> graph and task priorities) in order to implement a mutual
> exclusion primitive, you have a case of abstraction inversion--
> something is out of whack.

Given that the semantics of POSIX PI locking assumes certain scheduler behaviours, is it actually abstraction inversion to have that same dependency expressed in the kernel code that implements it?

> For schedulability analysis, one just needs a way to bound the
> duration of priority inversion. Simple non-preemption (Linux
> spinlock_t) is sufficient for that, and it is easy to implement.
> You just have to be careful not to voluntarily suspend (give up
> the processor) while holding a lock.

The whole point of mutexes (and semaphores) within the linux kernel is that it is possible to block while holding them. I suspect you're going to find it fairly difficult to convince people to switch to spinlocks just to make it possible to provide latency guarantees.

> The only selling point for PIP has been the ability of a thread to
> suspend itself while holding a lock, such as to wait for
> completion of an I/O operation.
You're comparing a full-featured PI implementation with a stripped-down PP (priority protection, aka priority ceiling) approach. In an apples-to-apples comparison, the selling point for PI vs PP is that under PIP the priority of the lock holder is automatically boosted only if necessary, and only as high as necessary. On the other hand, PP requires code analysis to properly set the ceilings for each individual mutex.

> I would argue that this practice
> is generally a sign of poor design, and it certainly throws out
> the notion of bounding the priority inversion due to blocking on a
> lock for schedulability analysis -- since now the lock-holding
> time can depend on I/O completion time, timers, etc.

Certainly if you block waiting for I/O while holding a lock then it impacts the ability to provide latency guarantees for others waiting for that lock. But this has nothing to do with PI vs PP or spinlocks, and everything to do with how the lock is actually used.

> Regarding the notion of charging proxy execution to the budget of
> the client task, I have grave concerns. It is already hard enough
> to estimate the amount of budget that a real-time task requires,
> without this additional complication.

Agreed.

Chris

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
  2009-07-13 21:45 ` Chris Friesen
@ 2009-07-14 11:16 ` Raistlin
  2009-07-15 23:11 ` Ted Baker
  1 sibling, 0 replies; 82+ messages in thread
From: Raistlin @ 2009-07-14 11:16 UTC (permalink / raw)
To: Chris Friesen
Cc: Ted Baker, Noah Watkins, Peter Zijlstra, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari

[-- Attachment #1: Type: text/plain, Size: 2970 bytes --]

On Mon, 2009-07-13 at 15:45 -0600, Chris Friesen wrote:
> > The only selling point for PIP has been the ability of a thread to
> > suspend itself while holding a lock, such as to wait for
> > completion of an I/O operation.
>
> You're comparing a full-featured PI implementation with a stripped-down
> PP (priority protection, aka priority ceiling) approach. In an
> apples-to-apples comparison, the selling point for PI vs PP is that
> under PIP the priority of the lock holder is automatically boosted only
> if necessary, and only as high as necessary. On the other hand, PP
> requires code analysis to properly set the ceilings for each individual
> mutex.
>
I think I agree with this.

Moreover, talking about server/group based scheduling, and considering BWI/PEP, which are natural extensions of PI, you get the very nice property of being independent of any actual knowledge of the blocking time(s): you can run your scheduling test considering only the bandwidth assigned to each component... And this is, at least for now and as far as I know, not possible if you go for approaches that need the blocking times in advance, like P(C)P or SRP, and also for BROE or SIRAP, I think.

I mean, if you only want to be sure to isolate applications and/or components among themselves, without any knowledge of the blocking times and critical section access patterns of the tasks running inside such components, you can do it!
Just pick up the bandwidths and see if they fit with a scheduling test unmodified by any --very difficult to find out, actually-- blocking term. I know this does not cover all the possible situations, and that it is biased toward _soft_ real-time workloads, but I think it's a meaningful use-case, especially considering Linux...

Anyway, it is more than possible that this belief is due to my lack of knowledge about some other already existing solution... If so, please, any pointer to it/them is more than welcome. :-)

> > Regarding the notion of charging proxy execution to the budget of
> > the client task, I have grave concerns. It is already hard enough
> > to estimate the amount of budget that a real-time task requires,
> > without this additional complication.
>
> Agreed.
>
Well, indeed, I agree with this as well... But it yields the, to me, very nice property described above (and in the other e-mail). However, I'm light years away from proposing it as the "solution for all evils"! I know that, for instance, overhead and code twisting are severe issues. All I hope is to be able to find the time to give it a try, and to figure out whether it is worthwhile. :-D

Regards again,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)
http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
  2009-07-13 21:45 ` Chris Friesen
  2009-07-14 11:16 ` Raistlin
@ 2009-07-15 23:11 ` Ted Baker
  2009-07-16  7:58 ` Peter Zijlstra
  2009-07-16 15:17 ` Chris Friesen
  1 sibling, 2 replies; 82+ messages in thread
From: Ted Baker @ 2009-07-15 23:11 UTC (permalink / raw)
To: Chris Friesen
Cc: Noah Watkins, Peter Zijlstra, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari

On Mon, Jul 13, 2009 at 03:45:11PM -0600, Chris Friesen wrote:

> Given that the semantics of POSIX PI locking assumes certain scheduler
> behaviours, is it actually abstraction inversion to have that same
> dependency expressed in the kernel code that implements it?
...
> The whole point of mutexes (and semaphores) within the linux kernel is
> that it is possible to block while holding them. I suspect you're going
> to find it fairly difficult to convince people to switch to spinlocks
> just to make it possible to provide latency guarantees.

The abstraction inversion is when the kernel uses (internally) something as complex as a POSIX PI mutex. So, I'm not arguing that the kernel does not need internal mutexes/semaphores that can be held while a task is suspended/blocked. I'm just arguing that those internal mutexes/semaphores should not be PI ones.

> ... the selling point for PI vs PP is that under PIP the
> priority of the lock holder is automatically boosted only if
> necessary, and only as high as necessary.

The putative benefit of this is disputed, as shown by Jim and Bjorn's work with LITMUS-RT and others. For a difference to be noticed, there must be a lot of contention, and long critical sections. The benefit of less frequent priority boosting and lower priorities can be offset by an increased worst-case number of context switches.
> On the other hand, PP requires code analysis to properly set the
> ceilings for each individual mutex.

Indeed, this is difficult, but no more difficult than estimating worst-case blocking times, which requires more extensive code analysis and requires consideration of more cases with PI than PP.

If determining the exact ceiling is too difficult, one can simply set the ceiling to the maximum priority used by the application.

Again, I don't think that either PP or PI is appropriate for use in a (SMP) kernel. For non-blocking locks, the current no-preemption spinlock mechanism works. For higher-level (blocking) locks, I'm attracted to Jim Anderson's model of non-preemptable critical sections, combined with FIFO queue service.

> Certainly if you block waiting for I/O while holding a lock then it
> impacts the ability to provide latency guarantees for others waiting for
> that lock. But this has nothing to do with PI vs PP or spinlocks, and
> everything to do with how the lock is actually used.

My only point there was with respect to application-level use of POSIX mutexes: if an application needs to suspend while holding a mutex (e.g., for I/O) then the application will have potentially unbounded priority inversion, and so is losing the benefit of priority inheritance. So, if the only benefit of PRIO_INHERIT over PRIO_PROTECT is being able to suspend while holding a lock, there is no real benefit.

Ted

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
  2009-07-15 23:11 ` Ted Baker
@ 2009-07-16  7:58 ` Peter Zijlstra
  2009-07-16  8:52 ` Thomas Gleixner
  ` (2 more replies)
  2009-07-16 15:17 ` Chris Friesen
  1 sibling, 3 replies; 82+ messages in thread
From: Peter Zijlstra @ 2009-07-16 7:58 UTC (permalink / raw)
To: Ted Baker
Cc: Chris Friesen, Noah Watkins, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari

On Wed, 2009-07-15 at 19:11 -0400, Ted Baker wrote:
> On Mon, Jul 13, 2009 at 03:45:11PM -0600, Chris Friesen wrote:
>
> > Given that the semantics of POSIX PI locking assumes certain scheduler
> > behaviours, is it actually abstraction inversion to have that same
> > dependency expressed in the kernel code that implements it?
> ...
> > The whole point of mutexes (and semaphores) within the linux kernel is
> > that it is possible to block while holding them. I suspect you're going
> > to find it fairly difficult to convince people to switch to spinlocks
> > just to make it possible to provide latency guarantees.
>
> The abstraction inversion is when the kernel uses (internally)
> something as complex as a POSIX PI mutex. So, I'm not arguing
> that the kernel does not need internal mutexes/semaphores that
> can be held while a task is suspended/blocked. I'm just arguing
> that those internal mutexes/semaphores should not be PI ones.
>
> > ... the selling point for PI vs PP is that under PIP the
> > priority of the lock holder is automatically boosted only if
> > necessary, and only as high as necessary.
>
> The putative benefit of this is disputed, as shown by Jim and
> Bjorn's work with LITMUS-RT and others. For a difference to be
> noticed, there must be a lot of contention, and long critical
> sections.
> The benefit of less frequent priority boosting and
> lower priorities can be offset by an increased worst-case
> number of context switches.
>
> > On the other hand, PP requires code analysis to properly set the
> > ceilings for each individual mutex.
>
> Indeed, this is difficult, but no more difficult than estimating
> worst-case blocking times, which requires more extensive code
> analysis and requires consideration of more cases with PI than PP.
>
> If determining the exact ceiling is too difficult, one can simply
> set the ceiling to the maximum priority used by the application.
>
> Again, I don't think that either PP or PI is appropriate for use
> in a (SMP) kernel. For non-blocking locks, the current
> no-preemption spinlock mechanism works. For higher-level
> (blocking) locks, I'm attracted to Jim Anderson's model of
> non-preemptable critical sections, combined with FIFO queue
> service.

Right, so there's two points here I think:

A) making most locks preemptible
B) adding PI to all preemptible locks

I think that we can all agree that if you do A, B makes heaps of sense, right?

I just asked Thomas if he could remember any numbers on this, and he said that keeping all the locks non-preemptible made at least an order of magnitude difference in max latencies [ so a 60us (A+B) would turn into 600us (! A) ], which means a proportional decrease in the max frequency of periodic tasks.

This led to the conviction that the PI overheads are worth it, since people actually want high freq tasks.

Of course, when the decreased period is still sufficient for the application at hand, the non-preemptible case allows for better analysis.

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
  2009-07-16  7:58 ` Peter Zijlstra
@ 2009-07-16  8:52 ` Thomas Gleixner
  2009-07-16 12:17 ` Raistlin
  2009-07-16 22:15 ` Ted Baker
  2 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2009-07-16 8:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ted Baker, Chris Friesen, Noah Watkins, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari

On Thu, 16 Jul 2009, Peter Zijlstra wrote:
> On Wed, 2009-07-15 at 19:11 -0400, Ted Baker wrote:
> > Again, I don't think that either PP or PI is appropriate for use
> > in a (SMP) kernel. For non-blocking locks, the current
> > no-preemption spinlock mechanism works. For higher-level
> > (blocking) locks, I'm attracted to Jim Anderson's model of
> > non-preemptable critical sections, combined with FIFO queue
> > service.
>
> Right, so there's two points here I think:
>
> A) making most locks preemptible
> B) adding PI to all preemptible locks
>
> I think that we can all agree that if you do A, B makes heaps of sense,
> right?
>
> I just asked Thomas if he could remember any numbers on this, and he
> said that keeping all the locks non-preemptible made at least an order
> of magnitude difference in max latencies [ so a 60us (A+B) would turn
> into 600us (! A) ], this means a proportional decrease for the max freq
> of periodic tasks.

Those are numbers from about 3 years ago. I need to redo the tests, as we have done a lot of lock breaking and shortening of preempt/irq-disabled sections since then, but when we started preempt-rt there was no real choice, as the limited number of developers simply did not allow us to analyse and fix all the long-lasting critical sections. We were busy enough fixing all the locking problems which were unearthed. :) Also, we did not have the tools to analyse the length of critical sections back then.
It's definitely worthwhile to revisit this, but that's going to be a multi-man-year effort to disentangle complex code paths like the network stack, disk I/O and other known sources of trouble. The next challenge will be how to monitor the code base for regressions and how to keep thousands of developers aware of these requirements.

> This led to the conviction that the PI overheads are worth it, since
> people actually want high freq tasks.
>
> Of course, when the decreased period is still sufficient for the
> application at hand, the non-preemptible case allows for better
> analysis.

Agreed, but the real challenge of providing real-time services in Linux is the wide variety of use cases. We simply need to accept that people want to use it for everything from high-frequency industrial control tasks, while running a GUI, a webserver and a database on the same machine, to heavily threaded enterprise-class Java applications which do not care that much about latencies but need the correctness guarantees, and of course everything in between. I'm well aware that we are trying to create something which does everything except playing the National Anthem, but simply restricting the use cases is not an option, and it would be exceptionally boring as well :)

Thanks,

tglx

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
  2009-07-16  7:58 ` Peter Zijlstra
  2009-07-16  8:52 ` Thomas Gleixner
@ 2009-07-16 12:17 ` Raistlin
  2009-07-16 12:59 ` James H. Anderson
  2009-07-16 22:15 ` Ted Baker
  2 siblings, 1 reply; 82+ messages in thread
From: Raistlin @ 2009-07-16 12:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ted Baker, Chris Friesen, Noah Watkins, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari

[-- Attachment #1: Type: text/plain, Size: 2227 bytes --]

On Thu, 2009-07-16 at 09:58 +0200, Peter Zijlstra wrote:
> > Again, I don't think that either PP or PI is appropriate for use
> > in a (SMP) kernel. For non-blocking locks, the current
> > no-preemption spinlock mechanism works. For higher-level
> > (blocking) locks, I'm attracted to Jim Anderson's model of
> > non-preemptable critical sections, combined with FIFO queue
> > service.
>
But, if I remember well how FMLP works, there is not that much difference between the two... I mean, if you, at any (kernel|user) level, define a short critical section, it is protected by a non-preemptive FIFO "queue lock", which is how Linux implements --at least on x86-- spinlocks! :-O

Also, I'm not sure I can find in the FMLP paper information about the possibility of a task suspending itself (e.g., for I/O completion) while holding a short lock... I assume this is not recommended, but I may be wrong, and, in that case, I hope Prof. Anderson and Bjorn will excuse and correct me. :-)

On the other hand, if by "blocking locks" you meant the long ones, I think suspension and priority inheritance are dealt with correctly there, too.

> Right, so there's two points here I think:
>
> A) making most locks preemptible
> B) adding PI to all preemptible locks
>
> I think that we can all agree that if you do A, B makes heaps of sense,
> right?
>
I don't know about all, but I do... I hope I'm not offending anyone by saying that I like priority inheritance!! :-P

> Of course, when the decreased period is still sufficient for the
> application at hand, the non-preemptible case allows for better
> analysis.
>
Sure! I am impressed as well by the amazing results that approaches like the FMLP give in terms of schedulability analysis, and I think a little bit of spinning would not hurt that much, but not to the point of moving all the locking to spinlocks. :-)

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)
http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
  2009-07-16 12:17 ` Raistlin
@ 2009-07-16 12:59 ` James H. Anderson
  2009-07-16 13:37 ` Peter Zijlstra
  0 siblings, 1 reply; 82+ messages in thread
From: James H. Anderson @ 2009-07-16 12:59 UTC (permalink / raw)
To: Raistlin
Cc: Peter Zijlstra, Ted Baker, Chris Friesen, Noah Watkins, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern B. Brandenburg

Raistlin wrote:
> Also, I'm not sure I can find in the FMLP paper information about the
> possibility of a task to suspend itself (e.g., I/O completion) while
> holding a short lock... I assume this is not recommended, but may be
> wrong, and, in that case, I hope Prof. Anderson and Bjorn will excuse
> and correct me. :-)

This is a really excellent point and something I probably should have mentioned. We developed the FMLP strictly for real-time (only) workloads. We were specifically looking at protecting memory-resident resources (not I/O). The FMLP would have to be significantly extended to work in settings where these assumptions don't hold. Thanks for pointing this out.

-Jim

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
2009-07-16 12:59 ` James H. Anderson
@ 2009-07-16 13:37 ` Peter Zijlstra
0 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2009-07-16 13:37 UTC (permalink / raw)
To: James H. Anderson
Cc: Raistlin, Ted Baker, Chris Friesen, Noah Watkins, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Bjoern B. Brandenburg

On Thu, 2009-07-16 at 08:59 -0400, James H. Anderson wrote:
> Raistlin wrote:
> > Also, I'm not sure I can find in the FMLP paper information about the
> > possibility of a task to suspend itself (e.g., I/O completion) while
> > holding a short lock... I assume this is not recommended, but may be
> > wrong, and, in that case, I hope Prof. Anderson and Bjorn will excuse
> > and correct me. :-)
>
> This is a really excellent point and something I probably should have
> mentioned. We developed the FMLP strictly for real-time (only)
> workloads. We were specifically looking at protecting memory-resident
> resources (not I/O). The FMLP would have to be significantly extended
> to work in settings where these assumptions don't hold.

One thing I've been thinking about is extending lockdep to help verify
things like this. If we were to annotate a syscall/trap with something
like:

  lockdep_assume_rt()

And teach lockdep about non-RT blocking objects, we could validate that
the callchain down from lockdep_assume_rt() would indeed not contain a
non-RT resource, but also that we don't take locks which might, in
another code path, depend on one.
That is, suppose:

sys_foo()
  lockdep_assume_rt()
  mutex_lock(&my_lock)

vs

sys_bar()
  mutex_lock(&my_lock)
  down_read(&mm->mmap_sem)

vs

page-fault
  down_read(&mm->mmap_sem)
  lock_page(page)

Would indeed generate a warning because mmap_sem is known to block on
IO, and there is a dependency (through sys_bar()) between my_lock and
mmap_sem, therefore sys_foo()'s assumption is invalid.

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
2009-07-16 7:58 ` Peter Zijlstra
2009-07-16 8:52 ` Thomas Gleixner
2009-07-16 12:17 ` Raistlin
@ 2009-07-16 22:15 ` Ted Baker
2009-07-16 22:34 ` Karthik Singaram Lakshmanan
2 siblings, 1 reply; 82+ messages in thread
From: Ted Baker @ 2009-07-16 22:15 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Chris Friesen, Noah Watkins, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari

On Thu, Jul 16, 2009 at 09:58:32AM +0200, Peter Zijlstra wrote:
> > Again, I don't think that either PP or PI is appropriate for use
> > in a (SMP) kernel. For non-blocking locks, the current
> > no-preemption spinlock mechanism works. For higher-level
> > (blocking) locks, I'm attracted to Jim Anderson's model of
> > non-preemptable critical sections, combined with FIFO queue
> > service.
>
> Right, so there's two points here I think:
>
> A) making most locks preemptible
> B) adding PI to all preemptible locks
>
> I think that we can all agree that if you do A, B makes heaps of sense,
> right?

Maybe. That depends on how much it costs to provide PI for those
locks, and assumes that everyone (all application and OS tasks) can
agree on a global meaning for priority in the system.

I have always liked the simplicity of a global notion of priority,
which is the core idea of global preemptive SMP scheduling. However,
some authors have pointed out scenarios for *partitioned*
multiprocessor scheduling where expediting the highest priority task
on one processor may result in idling other processors unnecessarily,
ultimately resulting in missed deadlines.

For argument's sake, I'll assume that those scenarios are
pathological, and that a good designer who wants to partition can
arrange the task-to-cpu assignments and priorities in a way that
prevents them.
There are two elements in this discussion that will require resolution
in such a global priority scheme: (1) how to rank EDF deadlines vs.
fixed priorities; (2) what to do about tasks that are scheduled by
dynamic priority schemes. In the latter context, I'm thinking first of
aperiodic server scheduling schemes like SCHED_SPORADIC, but the
problem occurs with any scheme where a task's priority varies
routinely. (You already have a bunch of implementation code complexity
from user-initiated priority changes, like pthread_sched_setpolicy(),
but those kinds of changes don't happen often.)

I think both (1) and (2) are covered by what I think has been termed
here the "PEP" scheme or "proxy" scheduling, i.e., implementing
priority inheritance not by explicitly comparing and adjusting
priorities, but by allowing the system scheduler to use whatever
algorithms it may to choose a task A to execute on each processor, and
then if that task A is blocked by a lock held by another task B,
instead execute B on A's processor if B is not already executing.

However, this still leaves the question of how to choose which of
several blocked tasks to grant a lock to, when the lock is released.
If you want to do that according to priority (which it seems one ought
to, for consistency), it seems you now have to maintain a
priority-ordered lock queue. That means you are back into looking at
explicit representation of inherited priorities again, unless you can
find a way to use the scheduler to choose who to grant the lock to.

Some proponents of the original POSIX mutex scheme intended to solve
this by unblocking all the tasks contending for the mutex, and letting
them re-try locking it. This does get around the explicit priority
problem. Whichever task the scheduler chooses first will get the lock.
The problem is that you have the overhead of unblocking all those
tasks (itself a priority inversion, since you are unblocking some "low
priority" tasks).
On a single processor (or, on an SMP if you are lucky), the lock will
be released again soon, and all these unblocked tasks will get into
their critical sections without having to block again. However, with
bad luck, on an SMP, all but one will bang up against the spin-lock,
and have to be blocked again. This will generate extra context
switches on every CPU.... not a good thing. This scheme also does not
seem to work well for partitioned scheduling, or any scheme with
per-cpu run queues, since the scheduling is being done in an
uncoordinated way on multiple processors.

So, maybe Jim's model of FIFO service in queues is the right one? It's
predictable. Even if it can cause some unnecessary priority inversion,
the priority inversion should be bounded.

I still conceptually prefer the idea of granting locks to contending
tasks in priority order, of course. It is just a question of whether
you want to have to agree (1) that all scheduling is based on
priority, and (2) pay the price for either (2a) dynamically
re-ordering all those queues every time a task gains or loses priority
(due to inheritance, or whatever), or (2b) pay the O(n) price of
scanning the queue for the currently highest-priority task when you
grant the lock. If you go this way, I would favor the latter. In any
system that does not already have poor performance due to excessive
lock contention, the queues should rarely have more than one member.
Assuming integrity of the queue is maintained by the corresponding
lock itself, it is much easier to do this scanning at the point the
lock is released than to support (asynchronous) queue reordering for
every potential priority change.

> I just asked Thomas if he could remember any numbers on this, and he
> said that keeping all the locks non-preemptible had at least an order
> difference in max latencies [ so a 60us (A+B) would turn into 600us (!
> A) ], this means a proportional decrease for the max freq of periodic
> tasks.
Those numbers are convincing for the value of preemptable locks.

> This led to the conviction that the PI overheads are worth it, since
> people actually want high freq tasks.

As argued above, I don't see that they necessarily argue for PI on
those preemptable locks.

> Of course, when the decreased period is still sufficient for the
> application at hand, the non-preemptible case allows for better
> analysis.

Ted

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-16 22:15 ` Ted Baker @ 2009-07-16 22:34 ` Karthik Singaram Lakshmanan 2009-07-16 23:38 ` Ted Baker 0 siblings, 1 reply; 82+ messages in thread From: Karthik Singaram Lakshmanan @ 2009-07-16 22:34 UTC (permalink / raw) To: Ted Baker Cc: Peter Zijlstra, Chris Friesen, Noah Watkins, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Raj Rajkumar, dionisio > I still conceptually prefer the idea of granting locks to > contending tasks in priority order, of course. It is just a > question of whether you want to have to agree (1) that all > scheduling is based on priority, and (2) pay the price for either > (2a) dynamically re-ordering all those queues every time a task > gains or loses priority (due to inheritance, or whatever), or (2b) > pay the O(n) price of scanning the queue for the currently > highest-priority task when you grant the lock. If you go this > way, I would favor the latter. In any system that does not > already have poor performance due to excessive lock contention, > the queues should rarely have more than one member. Assuming > integrity of the queue is maintained by the corresponding lock > itself, it is much easier to do this scanning at the point the > lock is released than to support (asynchronous) queue reordering > for every potential priority change. > Just chiming in that from an implementation perspective, we could use a priority bitmap of active tasks contending for the lock. An implementation similar to the one used by the O(1) scheduler can be of great use here. Hardware support like "find_first_bit" can drastically reduce the time taken to search for the highest-priority task pending on the lock. 
Given realistic values for the number of distinct priority values required by most practical systems, such an implementation could prove effective. Thanks, Karthik ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
2009-07-16 22:34 ` Karthik Singaram Lakshmanan
@ 2009-07-16 23:38 ` Ted Baker
2009-07-17 1:44 ` Karthik Singaram Lakshmanan
0 siblings, 1 reply; 82+ messages in thread
From: Ted Baker @ 2009-07-16 23:38 UTC (permalink / raw)
To: Karthik Singaram Lakshmanan
Cc: Peter Zijlstra, Chris Friesen, Noah Watkins, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Raj Rajkumar, dionisio

On Thu, Jul 16, 2009 at 05:34:25PM -0500, Karthik Singaram Lakshmanan wrote:
> Just chiming in that from an implementation perspective, we could use
> a priority bitmap of active tasks contending for the lock. An
> implementation similar to the one used by the O(1) scheduler can be of
> great use here. Hardware support like "find_first_bit" can drastically
> reduce the time taken to search for the highest-priority task pending
> on the lock. Given realistic values for the number of distinct
> priority values required by most practical systems, such an
> implementation could prove effective.

This does not solve the problem of avoiding queue reordering in
response to dynamic priority changes, since where you insert the task
in the queue (including the bit setting for the priority and the
linking/unlinking) depends on the current priority.

This queue reordering is a huge pain, since you need to do it not only
whenever a priority is explicitly changed by the user; you need to do
it whenever a task gains or loses priority through inheritance. The
latter can happen asynchronously, in response to a time-out or signal,
for example.

By the way, the use of bit-maps is very appealing for the O(1)
operations, including bit-scan, especially if your machine has atomic
set/test bit instructions. We used this structure for some
single-processor RT kernels in the 1980's.
The task scheduling and sleep/wakeup operations were just a few instructions. However the use of bit-maps forced us to set a fixed limit on the number of tasks in the system. We also could not change priorities without doing an ugly (non-atomic) re-shuffling of the structures. You are proposing one bit per priority level, of course, rather than one bit per task. This allows you to use linked lists within priorities, but you lose the one-instruction suspend/unsuspend. It is not immediately obvious how to extend this technique to deadline-based scheduling. There can only be a finite number of distinct deadlines in a system at any one time. So at any point in time a finite number of bits is sufficient. The problem is that the deadlines are advancing, so a fixed mapping of bit positions to deadlines does not work. There is one way this can be used with EDF, at least for a single processor, which is related to the way I came up with the SRP. If you keep a stack of preempting tasks (earliest deadline on top), the positions in the stack do map nicely to a bit vector. You then need some other structure for the tasks with deadlines later than the bottom of your stack. I don't see how this generalizes to SMP. Ted ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
2009-07-16 23:38 ` Ted Baker
@ 2009-07-17 1:44 ` Karthik Singaram Lakshmanan
0 siblings, 0 replies; 82+ messages in thread
From: Karthik Singaram Lakshmanan @ 2009-07-17 1:44 UTC (permalink / raw)
To: Ted Baker
Cc: Peter Zijlstra, Chris Friesen, Noah Watkins, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari, Raj Rajkumar, dionisio

> This does not solve the problem of avoiding queue reordering in
> response to dynamic priority changes, since where you insert the
> task in the queue (including the bit setting for the priority and
> the linking/unlinking) depends on the current priority.
>
I completely agree with this. The bit map needs to be modified for
dynamic priority changes. However, atomic operations (like
Compare-And-Swap) can be used to modify the bitmap. Barring any
(highly unlikely but bounded) contention scenarios from other
processors, this implementation would still be more efficient than
maintaining a priority-ordered queue.

> By the way, the use of bit-maps is very appealing for the O(1)
> operations, including bit-scan, especially if your machine has
> atomic set/test bit instructions. We used this structure for some
> single-processor RT kernels in the 1980's. The task scheduling
> and sleep/wakeup operations were just a few instructions. However
> the use of bit-maps forced us to set a fixed limit on the number
> of tasks in the system. We also could not change priorities
> without doing an ugly (non-atomic) re-shuffling of the structures.
>
> You are proposing one bit per priority level, of course, rather
> than one bit per task. This allows you to use linked lists within
> priorities, but you lose the one-instruction suspend/unsuspend.
> We can still maintain the O(1) instruction-suspend/unsuspend, if we maintain a counter for each priority level. (a) When a task suspends on a lock, we can do an atomic increment of the counter for its priority level and set the bit on the priority map of tasks waiting for the lock. Attaching the task to a per-priority-level linked list queue should take O(1) assuming that there is a tail pointer. (b) When the lock is released, find the highest priority bit set on the lock's priority map, index into the appropriate per-priority-level linked list, and wake up the task at the head of the queue (delete operation with O(1)). Do an atomic decrement of the counter, if it reaches zero unset the bit on the priority map. While there are still contention issues that arise with updating the linked list and counters, these are restricted to tasks within the same priority level (highly unlikely that multiple tasks from the same priority level decide to acquire the same lock within a window of a couple of instructions), and should be much less than the contention arising from all the tasks in the system. > It is not immediately obvious how to extend this technique to > deadline-based scheduling. > > There can only be a finite number of distinct deadlines in a > system at any one time. So at any point in time a finite number > of bits is sufficient. The problem is that the deadlines are > advancing, so a fixed mapping of bit positions to deadlines does > not work. > > There is one way this can be used with EDF, at least for a single > processor, which is related to the way I came up with the SRP. If > you keep a stack of preempting tasks (earliest deadline on top), > the positions in the stack do map nicely to a bit vector. > > You then need some other structure for the tasks with deadlines > later than the bottom of your stack. > Yes. I did not think about EDF based schedulers when I proposed the implementation mechanism. 
I believe we can work on the SRP idea to get a corresponding version for EDF. > I don't see how this generalizes to SMP. > I agree that there needs to be more work to generalize the idea to EDF based schedulers on SMP, however, the idea still applies to fixed-priority scheduling in the SMP context. Many SMP processors support atomic instructions (for example: http://software.intel.com/en-us/articles/implementing-scalable-atomic-locks-for-multi-core-intel-em64t-and-ia32-architectures/). We can utilize these instructions to efficiently implement such locks at least in fixed-priority schedulers (like Deadline Monotonic Schedulers) for SMP systems. Thanks - Karthik ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-15 23:11 ` Ted Baker 2009-07-16 7:58 ` Peter Zijlstra @ 2009-07-16 15:17 ` Chris Friesen 2009-07-16 21:26 ` Ted Baker 1 sibling, 1 reply; 82+ messages in thread From: Chris Friesen @ 2009-07-16 15:17 UTC (permalink / raw) To: Ted Baker Cc: Noah Watkins, Peter Zijlstra, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari Ted Baker wrote: > On Mon, Jul 13, 2009 at 03:45:11PM -0600, Chris Friesen wrote: > >> Given that the semantics of POSIX PI locking assumes certain scheduler >> behaviours, is it actually abstraction inversion to have that same >> dependency expressed in the kernel code that implements it? > ...> >> The whole point of mutexes (and semaphores) within the linux kernel is >> that it is possible to block while holding them. I suspect you're going >> to find it fairly difficult to convince people to spinlocks just to make >> it possible to provide latency guarantees. > > The abstraction inversion is when the kernel uses (internally) > something as complex as a POSIX PI mutex. So, I'm not arguing > that the kernel does not need internal mutexes/semaphores that > can be held while a task is suspended/blocked. I'm just arguing > that those internal mutexes/semaphores should not be PI ones. This ties back to your other message with the comment about implementing userspace PI behaviour via some simpler "loopholes". If the application is already explicitly relying on PI pthread mutexes (possibly because it hasn't got enough knowledge of itself to do PP or to design the priorities in such a way that inversion isn't a problem) then presumably priority inversion in the kernel itself will also be an issue. 
If a high-priority task makes a syscall that requires a lock currently
held by a sleeping low-priority task, and there is a medium priority
task that wants to run, the classic scenario for priority inversion
has been achieved.

>> On the other hand, PP requires code analysis to properly set the
>> ceilings for each individual mutex.
>
> Indeed, this is difficult, but no more difficult than estimating
> worst-case blocking times, which requires more extensive code
> analysis and requires consideration of more cases with PI than PP.

I know of at least one example with millions of lines of code being
ported to linux from another OS. The scheduling requirements are
fairly lax but deadlock due to priority inversion is highly likely.
They compare PI and PP, see that PP requires up-front analysis, so
they enable PI.

I suspect there are other similar cases where deadlock is the real
issue, and hard realtime isn't a concern (but low latency may be
desirable). PI is simple to enable and doesn't require any thought on
the part of the app writer.

>> Certainly if you block waiting for I/O while holding a lock then it
>> impacts the ability to provide latency guarantees for others waiting
>> for that lock. But this has nothing to do with PI vs PP or
>> spinlocks, and everything to do with how the lock is actually used.
>
> My only point there was with respect to application-level use of
> POSIX mutexes, that if an application needs to suspend while
> holding a mutex (e.g., for I/O) then the application will have
> potentially unbounded priority inversion, and so is losing the
> benefit from priority inheritance. So, if the only benefit of
> PRIO_INHERIT over PRIO_PROTECT is being able to suspend while
> holding a lock, there is no real benefit.

At least for POSIX, both PI and PP mutexes can suspend while the lock
is held.
From the user's point of view, the only difference between the two is that PP bumps the lock holder's priority always, while PI bumps the priority only if/when necessary. Chris ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
2009-07-16 15:17 ` Chris Friesen
@ 2009-07-16 21:26 ` Ted Baker
2009-07-16 22:08 ` Chris Friesen
0 siblings, 1 reply; 82+ messages in thread
From: Ted Baker @ 2009-07-16 21:26 UTC (permalink / raw)
To: Chris Friesen
Cc: Noah Watkins, Peter Zijlstra, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari

On Thu, Jul 16, 2009 at 09:17:32AM -0600, Chris Friesen wrote:
> If a high-priority task A makes a syscall that requires a lock currently
> held by a sleeping low-priority task C, and there is a medium priority B
> task that wants to run, the classic scenario for priority inversion has
> been achieved.

I think you don't really mean "sleeping" low-priority task C, since
then the priority inheritance would do no good. I guess you mean that
C has been/is preempted by B (and for global SMP, there is some other
medium priority task B' that is eligible to run on A's processor).
That could be a priority inversion scenario.

BTW, if migration is allowed the probability of this kind of thing
(and hence the payoff for PIP) goes down rapidly with the number of
processors.

> I know of at least one example with millions of lines of code being
> ported to linux from another OS. The scheduling requirements are fairly
> lax but deadlock due to priority inversion is a highly likely. They
> compare PI and PP, see that PP requires up-front analysis, so they
> enable PI.
>
> I suspect there are other similar cases where deadlock is the real
> issue, and hard realtime isn't a concern (but low latency may be
> desirable). PI is simple to enable and doesn't require any thought on
> the part of the app writer.

I'm confused by your reference to deadlock. Priority inheritance does
not prevent deadlock, even on a single processor.
> At least for POSIX, both PI and PP mutexes can suspend while the lock is > held. From the user's point of view, the only difference between the > two is that PP bumps the lock holder's priority always, while PI bumps > the priority only if/when necessary. You are right that POSIX missed the point of priority ceilings, by allowing suspension. However, there is still a difference in context-switching overhead. Worst-case, you have twice as many context switches per critical section with PIP as with PP. In any case, for a multiprocessor, PP is not enough. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
2009-07-16 21:26 ` Ted Baker
@ 2009-07-16 22:08 ` Chris Friesen
2009-07-16 23:54 ` Ted Baker
0 siblings, 1 reply; 82+ messages in thread
From: Chris Friesen @ 2009-07-16 22:08 UTC (permalink / raw)
To: Ted Baker
Cc: Noah Watkins, Peter Zijlstra, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari

Ted Baker wrote:
> On Thu, Jul 16, 2009 at 09:17:32AM -0600, Chris Friesen wrote:
>
>> If a high-priority task A makes a syscall that requires a lock currently
>> held by a sleeping low-priority task C, and there is a medium priority B
>> task that wants to run, the classic scenario for priority inversion has
>> been achieved.
>
> I think you don't really mean "sleeping" low-priority task C,
> since then the priority inheritance would do no good. I guess you
> mean that C has been/is preempted by B (and for global SMP, there
> is some other medium priority task B' that is eligible to run on
> A's processor). That could be a priority inversion scenario.

My terminology is getting sloppy. Yes, I meant preempted.

>> I suspect there are other similar cases where deadlock is the real
>> issue, and hard realtime isn't a concern (but low latency may be
>> desirable). PI is simple to enable and doesn't require any thought on
>> the part of the app writer.
>
> I'm confused by your reference to deadlock. Priority inheritance
> does not prevent deadlock, even on a single processor.

Sloppy terminology again. Priority inversion. If all apps are
soft-realtime and B is a pure cpu hog (which can effectively happen on
heavily loaded server systems) then A will never get to run.

>> At least for POSIX, both PI and PP mutexes can suspend while the lock is
>> held. From the user's point of view, the only difference between the
>> two is that PP bumps the lock holder's priority always, while PI bumps
>> the priority only if/when necessary.
>
> You are right that POSIX missed the point of priority ceilings,
> by allowing suspension.

The vast majority of apps written for Linux are POSIX apps, so for
this discussion we need to bear that behaviour in mind.

> However, there is still a difference in context-switching
> overhead. Worst-case, you have twice as many context switches
> per critical section with PIP as with PP.

On the other hand, with PI the uncontended case can be implemented as
atomic operations in userspace. With PP we need to issue at least two
syscalls per lock/unlock cycle even in the uncontended case (to handle
the priority manipulations).

Chris

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
2009-07-16 22:08 ` Chris Friesen
@ 2009-07-16 23:54 ` Ted Baker
0 siblings, 0 replies; 82+ messages in thread
From: Ted Baker @ 2009-07-16 23:54 UTC (permalink / raw)
To: Chris Friesen
Cc: Noah Watkins, Peter Zijlstra, Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari

On Thu, Jul 16, 2009 at 04:08:47PM -0600, Chris Friesen wrote:
> > However, there is still a difference in context-switching
> > overhead. Worst-case, you have twice as many context switches
> > per critical section with PIP as with PP.
>
> On the other hand, with PI the uncontended case can be implemented as
> atomic operations in userspace. With PP we need to issue at least two
> syscalls per lock/unlock cycle even in the uncontended case (to handle
> the priority manipulations).

Needing syscalls to change the priority of a thread may be an artifact
of system design that might be correctable.

Suppose you put the effective priority of each thread in a per-thread
page that is mapped into a fixed location in the thread's address
space (and different locations in the kernel memory). It is nice to
have such a page for each thread in any case. I don't recall whether
Linux already does this, but it is a well proven technique.

Taking a PP lock then involves:

1) push old priority on the thread's stack
2) overwrite thread's priority with max of the lock priority and the
   thread priority
3) try to grab the lock (test-and-set, etc.)
   ... conditionally queue, etc.

Releasing the PP lock then involves:

1) conditionally find another thread to grant the lock to, call
   scheduler, etc., otherwise
2) give up the actual lock (set bit, etc.)
3) pop the old priority from the stack, and write it back into the
   per-thread location

Of course you also have explicit priority changes.
The way we handled those was to defer the effect until the lock
release point. This means keeping two priority values (the nominal
one, and the effective one). Just as you need conditional code to do
the ugly stuff that requires a kernel trap in the case that the lock
release requires unblocking a task, you need conditional code to copy
the new nominal priority to the effective priority, if that is called
for. We were able to combine these two conditions into a single bit
test, which then branched out to handle each of the cases, if
necessary.

I can't swear there are no complexities in Linux that might break this
scheme, since we were not trying to support all the functionality now
in Linux.

Ted

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
2009-07-13 18:14 ` Noah Watkins
2009-07-13 20:13 ` Ted Baker
@ 2009-07-14 9:15 ` Peter Zijlstra
2009-07-14 19:07 ` Raistlin
1 sibling, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2009-07-14 9:15 UTC (permalink / raw)
To: Noah Watkins
Cc: Raistlin, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari

On Mon, 2009-07-13 at 11:14 -0700, Noah Watkins wrote:
> > I think you can extend PIP to include things like bandwidth
> > inheritance too. Instead of simply propagating the priority through
> > the waitqueue hierarchy, you can pass the actual task around, and
> > having this task you can indeed consume its bandwidth etc..
>
> I think I understand what you are suggesting by "pass the actual task
> around". If not, this message might seem out of place.
>
> Consider this locking chain/graph:
>
>    A-->L1-->B-->L2-->C
>    D-->L3-->E-->L2
>
>         C
>         |
>         L2
>        /  \
>       B    E
>       |    |
>       L1   L3
>       |    |
>       A    D
>
> The way I understand what you are suggesting is that tasks A,B,D,E
> (or some representation of them) can be passed around such that when
> C executes it consumes some resource associated with the tasks it
> is blocking (A,B,D,E). Obviously which tasks and what quantities are
> dependent on scheduling semantics and configuration.
>
> The biggest implementation hurdle we have encountered is that as
> tasks are passed forward along a locking chain knowledge about the
> structure of the locking chain is lost. For example, as C releases
> L2, C must be able to distinguish between A and D as having been
> passed to C's location through B or E, respectively.
>
> Keeping a representation of the locking chain as a full graph is the
Another solution is to allow A and D to re-block and walk the locking chain again, but obviously some thundering-herd issues arise. Right, we too keep the full graph (as Ted has noted with disgust). Douglas mentions that since there is no priority in things like proportional bandwidth schedulers (CFS), priority inheritance cannot work. I would beg to differ in that a straightforward extension of the protocol can indeed work, even for such cases. One needs to change the sorting function used in the waitqueues (Li) to reflect the full scheduler function (which in itself can be expressed as a (partial?) sorting function). One needs to pass along the 'highest' task, not simply the priority. And, one needs to cancel and re-acquire this task's contention whenever the leading task (C) gets preempted. We let the scheduler use and consume the task that is passed up as being the 'highest' in C's stead (it might actually be C). For example, take the above block graph and add a single unrelated task F, all of equal and unit (1) weight. Now traditionally that'd result in 2 runnable tasks of equal weight, such that they both get 50% cpu time (assuming single cpu). However, PEP and the above modified PIP will in fact result in C receiving an effective weight of 5 versus 1 for F. The sorting function for proportional bandwidth schedulers is the typical least service received one -- such that the task with least service will be deemed the 'highest' one. Now suppose that task would be found to be A; we'll run C with A's values until its quantum is consumed and we'd force preempt. Normally that would lead to the selection of F, however if we first cancel and retry A, leading to a re-sorting of the graph, we might well find that E is the 'highest' task in the graph, and also more eligible than F. Furthermore Ted raises the point: > Regarding the notion of charging proxy execution to the budget of > the client task, I have grave concerns. 
It is already hard enough > to estimate the amount of budget that a real-time task requires, > without this additional complication. I hope the above illustrates the use of this, furthermore I think Dario and team did some actual analysis on it wrt deadline schedulers. Dario could you expand? ~ Peter ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-14 9:15 ` Peter Zijlstra @ 2009-07-14 19:07 ` Raistlin 0 siblings, 0 replies; 82+ messages in thread From: Raistlin @ 2009-07-14 19:07 UTC (permalink / raw) To: Peter Zijlstra Cc: Noah Watkins, Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari [-- Attachment #1: Type: text/plain, Size: 3239 bytes --] On Tue, 2009-07-14 at 11:15 +0200, Peter Zijlstra wrote: > > Regarding the notion of charging proxy execution to the budget of > > the client task, I have grave concerns. It is already hard enough > > to estimate the amount of budget that a real-time task requires, > > without this additional complication. > > I hope the above illustrates the use of this, furthermore I think Dario > and team did some actual analysis on it wrt deadline schedulers. Dario > could you expand? > Sure, I can try, again! :-) I think what I said trying to answer Chris is of some help here as well, and that it already clarified things at least a little bit. Does it? Anyway, I think we are looking at the problem from slightly different standpoints. I definitely agree with Prof. Baker that estimating execution times is very difficult, and I think this extends to the estimation of durations of critical sections too. Moreover, there are scenarios where you don't really need strong knowledge of these time intervals because, e.g., you are mainly interested in: - providing applications/components with some kind of quality of service --i.e., not hard real-time-- guarantees; - isolating the applications/components from one another. This is, I think, quite typical for soft real-time workloads that could run on Linux, better if preempt-rt, without many problems. 
With this in mind, what we think is interesting about BWI-like approaches is, indeed, that they do require virtually no knowledge about tasks' behaviour, at least to provide timing isolation... No tasks' wcet, no tasks' accessed mutexes, no tasks' blocking times: just a budget and a period of a server/group each application is enclosed within. This is of some relevance, we think, not only in the real-time world, but also when robustness, availability and dependability start being considered. I think I already said how the thing basically works, and I don't want to bother you all by explaining it again, unless you ask me for some clarification... All I can add is that going this way is, actually, moving toward trying to _remove_ the need of having exact prediction of task execution and blocking. Finally, but I already said this as well, if you need hard behaviour and you have the data about tasks' wcet and blocking times, the paper I pointed to (the last mail, with working links! :-D) contains all that is needed to compute the group's budget properly, taking interference into account... It is, brutally simplifying, something like using as budget for task A's group the sum of A's wcet plus all the blocking times of tasks that block on it. And Peter is right, such analysis is done for EDF on UP configurations. Finally, I unfortunately am not able to provide any clue on how this applies to fair schedulers (e.g., SFQ/CFS), but I don't think it's impossible... :-) Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ---------------------------------------------------------------------- Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy) http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-13 15:44 ` Raistlin 2009-07-13 16:33 ` Chris Friesen 2009-07-13 17:25 ` Peter Zijlstra @ 2009-07-13 17:28 ` Peter Zijlstra 2009-07-14 19:47 ` Raistlin 2 siblings, 1 reply; 82+ messages in thread From: Peter Zijlstra @ 2009-07-13 17:28 UTC (permalink / raw) To: Raistlin Cc: Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari On Mon, 2009-07-13 at 17:44 +0200, Raistlin wrote: > What we would like to achieve is some set of rules that, extending the > UP ones, yield a situation which is both: > - analyzable from the real-time theorist's point of view... Which is > (un?)fortunately what we are :-) > - possible to implement... Which is not always (un!)fortunately obvious > here :-) I would very much like a proper theoretical foundation for whatever we end up with ;-) > Very basically: from the analysis point of view one easy and effective > solution would be to have the blocked-running tasks --i.e., the tasks > blocked on some lock that have been left on the rq to proxy-execute the > lock owner-- busy waiting while the lock owner is running. This allows > for retaining a lot of nice properties BWI already has, as far as > analyzability is concerned. Right, practically we cannot do this, since we expose the block graph to userspace and you could in userspace construct a program that would exploit this spinning to DoS the system. ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-13 17:28 ` Peter Zijlstra @ 2009-07-14 19:47 ` Raistlin 0 siblings, 0 replies; 82+ messages in thread From: Raistlin @ 2009-07-14 19:47 UTC (permalink / raw) To: Peter Zijlstra Cc: Douglas Niehaus, Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Ted Baker, Dhaval Giani, Noah Watkins, KUSP Google Group, Tommaso Cucinotta, Giuseppe Lipari [-- Attachment #1: Type: text/plain, Size: 2000 bytes --] On Mon, 2009-07-13 at 19:28 +0200, Peter Zijlstra wrote: > > - analyzable from the real-time theorist's point of view... Which is > > (un?)fortunately what we are :-) > > - possible to implement... Which is not always (un!)fortunately obvious > > here :-) > > I would very much like a proper theoretical foundation for whatever we > end up with ;-) > Well, thus let's hope to manage in it! :-D > > Very basically: from the analysis point of view one easy and effective > > solution would be to have the blocked-running tasks --i.e., the tasks > > blocked on some lock that have been left on the rq to proxy-execute the > > lock owner-- busy waiting while the lock owner is running. This allows > > for retaining a lot of nice properties BWI already has, as far as > > analyzability is concerned. > > Right, practically we cannot do this, since we expose the block graph to > userspace and you could in userspace construct a program that would > exploit this spinning to DoS the system. > I know and I share this belief with you, as I wrote in my original mail... Even if I must say that it would not be a _real_ spinning, such as raw_spinlock_t, with irq and preemption disabled (like in FMLP), but only a mean to --preemptively-- consume some budget to avoid schedulability to be lost! 
Thus, I'm not sure it could be the basis for a DoS, especially if tasks are, individually, as a group or as a whole class, enclosed within groups (we used to call them reservations as well). Anyway I'm aware that CPU bandwidth would be wasted, and that this is an issue in systems like Linux... That's why we are striving for something better... :-P Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ---------------------------------------------------------------------- Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy) http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 197 bytes --] ^ permalink raw reply [flat|nested] 82+ messages in thread
[parent not found: <002301ca0403$47f9d9d0$d7ed8d70$@tlh@comcast.net>]
* Re: RFC for a new Scheduling policy/class in the Linux-kernel [not found] ` <002301ca0403$47f9d9d0$d7ed8d70$@tlh@comcast.net> @ 2009-07-13 23:47 ` Douglas Niehaus 2009-07-14 7:27 ` Chris Friesen 0 siblings, 1 reply; 82+ messages in thread From: Douglas Niehaus @ 2009-07-13 23:47 UTC (permalink / raw) To: Ted & Syauchen Baker Cc: 'Peter Zijlstra', 'Henrik Austad', 'LKML', 'Ingo Molnar', 'Bill Huey', 'Linux RT', 'Fabio Checconi', 'James H. Anderson', 'Thomas Gleixner', 'Ted Baker', 'Dhaval Giani', 'Noah Watkins', 'KUSP Google Group' Ted has written a lot that I should think about for some time before dealing with too many of the details, but one issue is worth clarifying, I think, at least. The Group Scheduling framework is intended to permit flexible configuration of *any* scheduling semantics, RT or otherwise, that is desired. We have done CPU share, application progress, and others. Further, it is intended to permit applications running under multiple programming models to coexist, although this requires higher levels of the system Group Scheduling hierarchy to describe how conflicts among applications are resolved. In that context, I coined the term "proxy execution" (I have no idea if anyone else had coined it before, but it seems an obvious idea, so they may have) as a *generalization* of Priority Inheritance, because it was a way to think about the issue, and implement it correctly for a mixture of scheduling semantics, if not as cheaply as I might like. Thus, the idea is that if A -blocked-on-> B, then A is still considered by the scheduling hierarchy, under whatever semantics it may be using, priority or otherwise. If everyone is using priority semantics, then the choice of the proxy calculation is the *same as* it would be under Priority Inheritance, but arrived at by a different mechanism. 
If A is under explicit plan scheduling, and B is under CFS or CPU share, then if A would be chosen by the system if it were not blocked, then B is run as its proxy until it gets out of A's way. In this case no priorities exist, so Proxy Execution is not a "way of implementing Priority Inheritance". Ted is correct that for any specific configuration of system semantics supporting RT applications using any of several RT programming models, schedulability is a key feature. I think that it is possible, in many cases, and that GS will support schedulability in many cases when configured correctly. However, some of the many open questions include: (1) Which system services, *if any*, can RT threads use and still support "acceptable" schedulability analysis under a given RT programming model? (1.1) Will the use of system services (system calls) by RT threads need to be explicitly supported by, perhaps, explicitly making the schedulers of *other* threads also using those system calls aware of that and take it into account to make those other tasks non-preemptible while holding system call related locks? (1.2) Can Linux SMP ever support this in an acceptable manner since locks associated with system services are shared across CPU boundaries? Thus, I can isolate a set of RT tasks on a CPU easily, and they are well isolated and can run under strict predictability, *until they use a system call that uses a lock*. Then, the system call is an interaction channel with every other thread on the system using the same system call. (1.2.1) The only alternative to the obvious properties I can think of here is to make each CPU run a separate copy of the OS and use disjoint sets of devices. At which point we have a distributed system in a box. Fundamentally, I think solutions to this problem are simply more expensive than people are yet willing to accept. Alternately, of course, I may just be thinking that to salve my disappointment at not thinking of a cheap solution. 
(2) Is the overhead of various mechanisms for tackling RT scheduling and schedulability analysis going to be a barrier for SMP solutions, and if so for how long, before people may just have to accept that "it really does cost that much". Of more personal concern: (3) Is the generality of GS going to cost more than is feasible for some applications. I can confidently say "yes" to the simplest form of this question. The more subtle form is, "For what percentage of all possible applications will the generality cost of GS be acceptable in return for its configurability?" I think the percentage will be high, or I would not have kept working on it for so long. Whether any of those qualifying applications are RT is less clear. One final observation as I must go for now: The traditional levels of overhead viewed as acceptable for making some of these decisions may not be attainable in SMP systems seeking to support multiple threads running under different programming models. Unfortunately if a consensus on a new "minimum feasible overhead" is required, it could take quite a while, because if I think my approach is at the "new minimum" but others do not, I am faced with proving a negative - that no other approach can do it more cheaply, and that is, of course, notoriously difficult. -Doug Ted & Syauchen Baker wrote: > Before committing to a locking protocol, such as the (1) or (2) below, I > would like to see explanation of how the schedulability analysis would be > done with whatever model is being proposed. Since we are talking about SMP, > this should include (at least) an analysis for the partitioned scheduling > model and the global scheduling model, and preferably also for hybrids > (tasks restricted to various subsets of processors), at least for > fixed-priority scheduling, but preferably also for deadline scheduling, and > for aperiodic server scheduling algorithms of both classes. 
To do less, > would be making a huge commitment on behalf of the kernel support group and > future application programmers, without any assurance that the mechanism > will be workable in the long term. > > Witness the difficulties caused by premature standardization of POSIX > PTHREAD_PRIO_INHERIT, and the bug that is standardized SCHED_SPORADIC, to > see what we want to avoid. > > Ted > > -----Original Message----- > From: Douglas Niehaus [mailto:niehaus@ittc.ku.edu] > Sent: Saturday, July 11, 2009 10:41 PM > To: Peter Zijlstra > Cc: Henrik Austad; LKML; Ingo Molnar; Bill Huey; Linux RT; Fabio Checconi; > James H. Anderson; Thomas Gleixner; Ted Baker; Dhaval Giani; Noah Watkins; > KUSP Google Group > Subject: Re: RFC for a new Scheduling policy/class in the Linux-kernel > > Peter: > Perhaps you could expand on what you meant when you said: > > Thing is, both BWI and PEP seems to work brilliantly on > Uni-Processor > but SMP leaves things to be desired. Dhaval is currently working on > a > PEP implementation that will migrate all the blocked tasks to the > owner's cpu, basically reducing it to the UP problem. > > What is left to be desired with PEP on SMP? I am not saying it is > perfect, as I can think of a few things I would like to improve or > understand better, but I am curious what you have in mind. > > Absent a clearer idea of what you had in mind, I can certainly discuss > the tradeoffs Noah and I have considered over time, and which we think > motivate our approach. > > When Noah and I have talked about this topic over the quite extended > time, several years, we have been working on it, there have always > seemed two choices: > > 1) Move the proxy (the resource owner) to the CPU with the blocked task > 2) Move the "scheduling profile" of the blocked task to the CPU where > the proxy is. > > For Proxy Execution under Group Scheduling we have considered both over > time. 
Consider the situation where thread A on CPU0 is blocked on a > resource held by thread B on CPU1. When we considered (1), it has the > advantage of ensuring that B will run on CPU0, unblocking A, if A (or B) > is still the best choice at the time it has been successfully moved from > CPU1 -> CPU0. That might not be true after the delay of moving the process. > > We decided to emphasize (2) because it was more interesting in our view > because it was cheaper and seemed no more complicated although its > complications are different from (1). Its complication is, of course, > that while we have worked out how to add the "avatar" of A to the set > considered by the GS hierarchy on CPU1, it depends on the scheduling > semantics as configured whether the avatar of A is chosen as "best" and > thus how long it will be until B runs long enough to release the > resource and unblock A on CPU1. > > We have always viewed that as complicated, but appropriately so to the > problem. It depends inherently on the semantics of threads A, B, and all > other threads on CPU1 that are ready to run, among whom the "best" is > chosen by the GS hierarchy. We think it is inherently the problem of the > scheduling configuration to take this trade-off into account. > > We have also thought being able to do both (1) and (2) is best, but > which is best to use in a given situation depends on the comparative > cost of (X) running B on CPU1 long enough to unblock A and (Y) the cost > of moving B from CPU1->CPU0 to run long enough to unblock A, and then > move it back from CPU0->CPU1 since its designated CPU assignment is CPU1. > Our decision after many hours of discussion over many months has been > that the cost of (X) seems a lot more attractive than (Y). > > Part of our preference is that we are still working with semaphores as > resources. 
Since most critical sections are supposed to be "short", > then scheduling semantics taking the proxy execution periods into > account on the "foreign" CPUs would actually be easier/better than the > double thread movement. > > Both problems seem quite hard and I do not think I have yet "completely > figured it out". While the "mass migration" you say Dhaval is working on > would "reduce" the problem to the UP case, I think it would create more > complexity for analysis than it eliminates. A form of thrashing seems a > real danger. In this case, threads would be moving from CPU to CPU > so much it would be a real drain on resources and a constraint on system > performance. However, Dhaval may well understand the cost implications of > thread migration better than I do. > > The real core of the problem, it seems to me, is that the proxy > relationship among threads depends on what resources can be held by > them. I think that problem is "relatively easy" in the set of locks > associated with a multi-threaded application. > > When the resources causing blocking can be *any* lock in the kernel > associated with *any* system service that might be used by *any* thread, > it is complicated enough to make my brain hurt. However, we think the GS > framework makes it relatively easy, perhaps that would be better said as > "as easy as it can be", to implement any combination of thread migration > and avatars desired by a given scheduling semantics. 
> > Noah and I thought the relatively low cost of creating the avatar was > quite attractive, and so we decided on a GS configuration using it to > experiment with in specifying the scheduling semantics. The first two > approaches we want to experiment with are (*) to view the composite > scheduling hierarchy for all CPUs as a whole, and let the avatar of A > take its chances on CPU1, and (**) to view resolution of blocking as > most important system wide, so we have the avatar viewed as "best" long > enough for its proxy to release the resource. > > The bottom line, in our view, is that no single semantics will be viewed > as either "best" or even acceptable for all applications as is the case > with schedulers, so we wanted to emphasize configurability. > > We have performed basic tests showing the avatars can be chosen and > resolve the blocking relationship. More complex tests await the > completion of our port of GS and the other KUSP subsystems to 2.6.29. > > Doug > > Peter Zijlstra wrote: > >> On Fri, 2009-07-10 at 23:50 +0200, Henrik Austad wrote: >> >> >>> Hi all! >>> >>> This is a proposal for a global [1], deadline driven scheduler for >>> real-time tasks in the Linux kernel. I thought I should send out an RFC to >>> gather some feedback instead of wildy hack away at it. >>> >>> This proposed scheduler is a modified MLLF (modified Least Laxity First) >>> called Earliest Failure First (EFF) as it orders tasks according to when >>> they will miss their deadlines, not when the actual deadline is. >>> >>> >> <snip> >> >> Everybody agrees we want a deadline scheduler, we'll probably put a user >> interface into -rt shortly which should work for all the involved >> research groups so that we can share tests and have better comparisons. >> >> The only thing (aside from an utter lack of time to work on things >> recently) that has been holding us back is a proper solution to the >> priority inversion issue. 
>> I haven't fully read through the proposed algorithm below, and left it >> in place for the new people on CC. >> >> As already mentioned on IRC, the fact that you push the work to the last >> possible moment indicates that this algorithm will utterly fall apart on >> overload and would thus be unsuited for soft-rt loads, but I guess we >> could implement things like EDF-fm and keep this as a hard-rt class. >> >> >> >>> === Notation === >>> >>> - Take a set of tasks with corresponding attributes. This set and their >>> attributes are called the schedule, 'S' and contains *all* tasks for >>> the given scheduling class (i.e. all EFF-tasks). >>> >>> - Consider a multi-core system with 'm' processors. >>> >>> - Let the i'th task in the schedule be denoted tau_i. [3] >>> >>> - Each task will run in intervals, each 'round' is called a job. A task >>> consists of an infinite sequence of jobs. The k'th job of tau_i is >>> called tau_{i,k} >>> >>> - Each task has a set of (relative) attributes supplied when the task is >>> inserted into the scheduler (passed via syscall) >>> * Period T_i >>> * Deadline D_i >>> * WCET C_i >>> >>> - Each job (tau_{i,k}) has absolute attributes (computed from the relative >>> tasks-attributes coupled with physical time). >>> * Release-time r_{i,k} >>> * Deadline d_{i,k} >>> * Allocated time so far for a job, C_a(t, tau_{i,k}) >>> When C_a equals WCET, the job's budget is exhausted and it should >>> start a new cycle. This is tested (see below) by the scheduler. >>> * Remaining time for the job, C_r(t, tau_{i,k}) >>> >>> - The acceptance function for EFF screens new tasks on their expected >>> utilization. Depending on the mode and implementation, it can be based >>> on the period, or on the deadline. The latter will cause firmer >>> restraints, but may lead to wasted resources. 
>>> >>> U = C_i / T_i For SRT (bounded deadline tardiness) >>> U = C_i / D_i For HRT >>> >>> - A relative measure, time to failure, ttf, indicates how much time is >>> left before a job must be scheduled to run in order to avoid a >>> deadline-miss. This will decrease as time progresses and the job is >>> not granted CPU time. For tasks currently running on a CPU, this value >>> will be constant. >>> >>> Take a job with a WCET of 10ms, it has been allowed to run for 4 >>> ms so far. The deadline is 8 ms away, and 6 ms of execution remain. Then the job must be >>> scheduled to run within the next 2 ms, otherwise it will not be >>> able to finish in time. >>> >>> - An absolute value, time of failure (tof) can also be computed in a >>> static manner. For tasks not running on a CPU, the allocated time is >>> static. That means you can take the absolute deadline, subtract the >>> remaining time (WCET minus allocated time) and you have the absolute point in time when a given >>> job will fail to meet its deadline. >>> >>> === Outline of scheduler === >>> >>> Store tasks in 2 queues. One of size m, containing all the tasks >>> currently running on the CPUs (queue R). The other will hold all >>> currently active tasks waiting to execute (queue W). >>> >>> queue R is sorted based on ttf (time to failure, the relative time left >>> until a task will miss its deadline). As the task approaches the >>> absolute time of failure at the same rate C_a increases, ttf is >>> constant. R is only a 'map' of tasks to the CPUs. Position 0 in R >>> (i.e. smallest ttf) does not result in CPU#0, as the position->CPU will >>> be quite fluent. >>> >>> queue W is sorted based on absolute time of failure (tof). Since this is >>> a fixed point in time, and the tasks in W are not running (C_a is >>> unchanged), this value is constant. >>> >>> When a task is scheduled to run, a timer is set at the point in time >>> where it has exhausted its budget (t_now + WCET - C_a). This is to >>> ensure that a runaway task does not grab the CPU. 
>>> >>> When a new task arrives, it is handled according to the following rules: >>> - The system has one or more CPUs not running EFF-tasks. Pick any of the >>> free CPUs and assign the new job there. Set a timer to >>> >>> - All CPUs are busy, the new task has greater time to failure than the >>> head of W. The task is inserted into W at the appropriate place. >>> >>> - All CPUs are busy and the new task has smaller time to failure than >>> the head of W. The new task is compared to the last task in R. If time >>> to failure is larger than the task at the tail, it is added to the >>> head of W. >>> >>> - If all CPUs are busy, and time to failure is smaller than the tail of >>> R, the new task is a candidate for insertion. At this point the tasks >>> must be compared to see if picking one or the other will cause a >>> deadline-miss. If both will miss the deadline if the other is >>> scheduled, keep the existing running and place the new at the head of >>> W (as you will have a deadline-miss anyway unless the task is >>> picked up by another CPU soon). >>> >>> - A task running on a CPU with ttf=0 should *never* be preempted by >>> another task. If all tasks in R have ttf=0, and a newly arrived task >>> has ttf=0, a deadline-miss is inevitable and switching tasks will only >>> waste resources. >>> >>> When a task in R finishes (or is stopped due to the timer-limit), it is >>> removed from R, and the head of W is added to R, inserted at the >>> appropriate place. >>> >>> There has been some discussion lately (in particular on #linux-rt) about >>> the bandwidth inheritance (BWI) and proxy execution protocol (PEP). It >>> should be possible to extend EFF to handle both. As a side note, if >>> anyone has some good information about PEP, I'd like a copy :) >>> >>> Based on this, I think the utilization can be set as high as M >>> (i.e. 
full utilization of all CPUs), but the jitter can probably be >>> quite bad, so for jitter-sensitive tasks, a short period/deadline should >>> be used. >>> >>> There are still some issues left to solve, for instance how to best >>> handle sporadic tasks, and whether or not deadline-miss should be allowed, >>> or just 'bounded deadline tardiness'. Either way, EFF should be able to >>> handle it. Then, there are problems concerning blocking of tasks. One >>> solution would be BWI or PEP, but I have not had the time to read >>> properly through those, but from what I've gathered a combination of BWI >>> and PEP looks promising (anyone with good info about BWI and PEP - feel >>> free to share! (-: ). >>> >>> >> Our SSSUP friends have a BWI paper here: >> >> http://retis.sssup.it/~tommaso/publications/OSPERT-2008.pdf >> >> The thing we call PEP was christened so by Douglas Niehaus (on CC), I'm >> not sure if he has any papers on it. >> >> Also, when talking about it at OSPERT last week Ted Baker (also on CC) >> said it reminded him of something else of which I seem to have forgotten >> the name. >> >> Thing is, both BWI and PEP seems to work brilliantly on Uni-Processor >> but SMP leaves things to be desired. Dhaval is currently working on a >> PEP implementation that will migrate all the blocked tasks to the >> owner's cpu, basically reducing it to the UP problem. >> >> >> >>> 1) Before you freeze at 'global' and get all caught up on "This won't >>> ever scale to X", or "He will be haunted by Y" - I do not want to >>> extend a global algorithm to 2000 cores. I would like to scale to a >>> single *chip* and then we can worry about 2-way and 4-way systems >>> later. For the record, I've donned my asbestos suit anyway. >>> >>> >> My preferred approach here is to find a distributed algorithm that >> converges to the global one. >> >> >> >>> 2) http://austad.us/kernel/thesis_henrikau.pdf >>> >>> 3) Anyone want to include LaTeX-notation into an email-rfc? 
>>> >>> >> Not unheard of ;-) >> >> >> > > ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel 2009-07-13 23:47 ` Douglas Niehaus @ 2009-07-14 7:27 ` Chris Friesen 2009-07-14 7:44 ` Douglas Niehaus 0 siblings, 1 reply; 82+ messages in thread From: Chris Friesen @ 2009-07-14 7:27 UTC (permalink / raw) To: Douglas Niehaus Cc: Ted & Syauchen Baker, 'Peter Zijlstra', 'Henrik Austad', 'LKML', 'Ingo Molnar', 'Bill Huey', 'Linux RT', 'Fabio Checconi', 'James H. Anderson', 'Thomas Gleixner', 'Ted Baker', 'Dhaval Giani', 'Noah Watkins', 'KUSP Google Group' Douglas Niehaus wrote: > (1.1) Will the use of system services (system calls) by RT threads > need to be explicitly supported by, perhaps, explicitly making the > schedulers of *other* threads also using those system calls aware of > that and take it into account to make those other tasks non-preemptible > while holding system call related locks. > > (1.2) Can Linux SMP ever support this in an acceptable manner since > locks associated with systems services are shared across CPU boundaries. > THus, I can isolate a set of RT tasks on a CPU easily, and they are well > isolated and can run under strict predictability, *until they use a > system call that uses a lock*. Then, the system call is an interaction > channel with every other thread on the system using the same system call. This may be a terminology issue, but I think it would be more correct to say that the lock is the interaction channel, not the system call, and it is an interaction channel with every other entity on the system using the same lock. Depending on the specific lock, this could be other userspace tasks, kernel threads, or even hardware interrupt handlers. This goes back to your first question of which system services an RT task can use while maintaining schedulability analysis. Unfortunately this may be a moving target, since the exact answer depends on what locks are taken in the underlying kernel implementation. Chris ^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
2009-07-14 7:27 ` Chris Friesen
@ 2009-07-14 7:44 ` Douglas Niehaus
0 siblings, 0 replies; 82+ messages in thread
From: Douglas Niehaus @ 2009-07-14 7:44 UTC (permalink / raw)
To: Chris Friesen
Cc: Ted & Syauchen Baker, 'Peter Zijlstra', 'Henrik Austad', 'LKML', 'Ingo Molnar', 'Bill Huey', 'Linux RT', 'Fabio Checconi', 'James H. Anderson', 'Thomas Gleixner', 'Ted Baker', 'Dhaval Giani', 'Noah Watkins', 'KUSP Google Group'

Chris Friesen wrote:
> Douglas Niehaus wrote:
>> (1.1) Will the use of system services (system calls) by RT threads
>> need to be explicitly supported by, perhaps, explicitly making the
>> schedulers of *other* threads also using those system calls aware of
>> that and taking it into account to make those other tasks
>> non-preemptible while holding system-call-related locks?
>>
>> (1.2) Can Linux SMP ever support this in an acceptable manner, since
>> locks associated with system services are shared across CPU boundaries?
>> Thus, I can isolate a set of RT tasks on a CPU easily, and they are well
>> isolated and can run under strict predictability, *until they use a
>> system call that uses a lock*. Then, the system call is an interaction
>> channel with every other thread on the system using the same system call.
>
> This may be a terminology issue, but I think it would be more correct to
> say that the lock is the interaction channel, not the system call, and
> that it is an interaction channel with every other entity on the system
> using the same lock. Depending on the specific lock, this could be other
> userspace tasks, kernel threads, or even hardware interrupt handlers.

Sorry, sloppiness on my part while typing quickly, resulting in a
terminology problem --- my using the wrong terminology. Yes, the lock is
the interaction channel.
Admittedly, which locks are the interaction channels is correlated with
which system services are used by threads, but sometimes more and
sometimes less strongly correlated. When I explain it to less expert
audiences than this, I tend to talk in terms of the system services,
because they at least know what some of them are, while many have no
idea what the concurrency control in the kernel is. They can fairly
easily see that if RT tasks use a range of services used by generic
Linux tasks, then some interaction between RT and non-RT tasks is the
result. Still, no excuses, only explanation. Sorry to have
over-simplified for this audience. Thanks for clarifying.

> This goes back to your first question of which system services an RT
> task can use while maintaining schedulability analysis. Unfortunately
> this may be a moving target, since the exact answer depends on what
> locks are taken in the underlying kernel implementation.
>
> Chris

Yes. This is true and is also the point I was trying to make. When
talking to people about RT over the years, I have observed that it is
often hard to communicate the full cost of the predictability required
for RT tasks. Extremely detailed information is required, and getting it
can be expensive. This is one reason why supporting RT in Linux proper
is even harder than it first appears.

However, I think that while completely arbitrary use of Linux system
services by RT tasks is extremely complicated, many RT applications can
be happy using only a small subset of services, and so various classes
of applications can be supported successfully with merely extreme
effort, as opposed to completely insane effort. It is a really hard
problem, no doubt, though.

Doug

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
2009-07-11 18:28 ` Peter Zijlstra
2009-07-12 2:40 ` Douglas Niehaus
@ 2009-07-12 6:17 ` Henrik Austad
2009-07-13 9:55 ` Raistlin
2 siblings, 0 replies; 82+ messages in thread
From: Henrik Austad @ 2009-07-12 6:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Douglas Niehaus, Ted Baker, Dhaval Giani

On Saturday 11 July 2009 20:28:11 Peter Zijlstra wrote:
> On Fri, 2009-07-10 at 23:50 +0200, Henrik Austad wrote:
> > Hi all!
> >
> > This is a proposal for a global [1], deadline driven scheduler for
> > real-time tasks in the Linux kernel. I thought I should send out an RFC
> > to gather some feedback instead of wildly hacking away at it.
> >
> > This proposed scheduler is a modified MLLF (modified Least Laxity First)
> > called Earliest Failure First (EFF) as it orders tasks according to when
> > they will miss their deadlines, not when the actual deadline is.
>
> <snip>
>
> Everybody agrees we want a deadline scheduler, we'll probably put a user
> interface into -rt shortly which should work for all the involved
> research groups so that we can share tests and have better comparisons.

My plan is to start working on this with Bill's framework once that is
stable enough and has the expressiveness I need.

> The only thing (aside from an utter lack of time to work on things
> recently) that has been holding us back is a proper solution to the
> priority inversion issue.
>
> I haven't fully read through the proposed algorithm below, and left it
> in place for the new people on CC.
>
> As already mentioned on IRC, the fact that you push the work to the last
> possible moment indicates that this algorithm will utterly fall apart on
> overload and would thus be unsuited for soft-rt loads, but I guess we
> could implement things like EDF-fm and keep this as a hard-rt class.
I'm not pushing the work to the last instant per se; rather, a running
thread is only scheduled out when absolutely necessary, in order to
reduce the number of task preemptions. When a new task (A) arrives, you
compare it to the head of the waiting queue (B), and if A has a 'graver'
need for running than B, you move on to look at the tail of the running
queue (C). If A has a higher need than C, you let A run, but only if
*not* running A would cause a deadline miss for A and keeping B waiting
would not cause one for B.

> > [...]
> > There are still some issues left to solve, for instance how to best
> > handle sporadic tasks, and whether or not deadline-misses should be
> > allowed, or just 'bounded deadline tardiness'. Either way, EFF should
> > be able to handle it. Then, there are problems concerning blocking of
> > tasks. One solution would be BWI or PEP, but I have not had the time
> > to read properly through those, but from what I've gathered a
> > combination of BWI and PEP looks promising (anyone with good info
> > about BWI and PEP - feel free to share! (-: ).
>
> Our SSSUP friends have a BWI paper here:
>
> http://retis.sssup.it/~tommaso/publications/OSPERT-2008.pdf

thanks!

> The thing we call PEP was christened so by Douglas Niehaus (on CC), I'm
> not sure if he has any papers on it.

A general outline would also be great, as I currently treat PEP the way
I *think* it works, based on some discussions on IRC etc.

> Also, when talking about it at OSPERT last week Ted Baker (also on CC)
> said it reminded him of something else of which I seem to have forgotten
> the name.
>
> Thing is, both BWI and PEP seems to work brilliantly on Uni-Processor
> but SMP leaves things to be desired. Dhaval is currently working on a
> PEP implementation that will migrate all the blocked tasks to the
> owner's cpu, basically reducing it to the UP problem.

I think that PEP will work better on MP, at least for the entire system,
but not for a single thread.
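The EFF admission rule described above (compare a new arrival A against the most urgent waiter B and the least urgent runner C, and preempt only when strictly necessary) can be sketched as a small comparison function. This is an illustrative reconstruction under assumptions, not code from the proposed scheduler: the `failure_time` field (the absolute time by which a task must be running to avoid a miss, i.e. deadline minus remaining work) and all names are invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch only: 'failure_time' is the absolute time at
 * which a deadline miss becomes unavoidable for the task if it is
 * still not running.  EFF orders tasks by this value. */
struct eff_task {
    long long failure_time;
};

/* A newly arrived task 'a' is compared first with the head of the
 * waiting queue 'b' (the most urgent waiter) and then with the tail
 * of the running queue 'c' (the least urgent runner).  'a' preempts
 * 'c' only when delaying 'a' any further would cause 'a' to miss,
 * while 'b' can still tolerate waiting. */
static bool eff_preempts(const struct eff_task *a,
                         const struct eff_task *b,
                         const struct eff_task *c,
                         long long now)
{
    if (a->failure_time >= b->failure_time)
        return false;             /* b is at least as urgent: a queues */
    if (a->failure_time >= c->failure_time)
        return false;             /* every runner is more urgent than a */
    if (a->failure_time > now)
        return false;             /* a can still afford to wait: no preemption */
    return b->failure_time > now; /* preempt c only if b is not also failing */
}
```

The point of the last two checks is exactly the "only when absolutely necessary" rule: even an urgent arrival is kept off the CPU while it still has slack, which keeps the preemption count down.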
If you move the task holding the resource to the CPU of the blocked
task, and let the holder consume part of the requestor's budget, it will
be easier to schedule the actual thread for running, as you have
complete control over the current CPU. More importantly, you can ensure
that the total utilization is bounded, not only within the group (if you
use EFF in a clustered setup), but also when the holder belongs to
another group.

If several threads want to run the holder through the proxy, things will
of course get complicated, but the same goes for BWI.

The downside of PEP is that you introduce unpredictability in resource
consumption, as other threads can consume a thread's resources. This
makes it harder to analyze a single thread.

> > 1) Before you freeze at 'global' and get all caught up on "This won't
> >    ever scale to X", or "He will be haunted by Y" - I do not want to
> >    extend a global algorithm to 2000 cores. I would like to scale to a
> >    single *chip* and then we can worry about 2-way and 4-way systems
> >    later. For the record, I've donned my asbestos suit anyway.
>
> My preferred approach here is to find a distributed algorithm that
> converges to the global one.
>
> > 2) http://austad.us/kernel/thesis_henrikau.pdf
> >
> > 3) Anyone want to include LaTeX-notation into an email-rfc?
>
> Not unheard of ;-)

perhaps this 'Web 2.0' will come up with a solution :)

--
henrik

^ permalink raw reply [flat|nested] 82+ messages in thread
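The proxy-execution accounting described in the message above (run the lock holder on the blocked task's CPU and charge the runtime to the blocked task's budget) can be shown as a toy model. Everything here is an assumption made for illustration: the structure, field names, and `proxy_run` are not from Dhaval's PEP implementation or any kernel code.

```c
#include <assert.h>

/* Toy model of PEP-style accounting: while 'blocked' waits on a lock
 * held by 'holder', the holder runs on the blocked task's CPU and the
 * time is charged to the *blocked* task's budget, so the per-CPU
 * utilization the schedulability analysis relies on stays bounded.
 * All names here are illustrative assumptions. */
struct toy_task {
    int cpu;           /* CPU the task is currently placed on */
    long budget_left;  /* remaining budget in this period, in us */
};

/* Run the lock holder for 'delta' microseconds as a proxy for the
 * blocked task: migrate the holder to the blocked task's CPU and
 * charge the time to the blocked task, not the holder. */
static void proxy_run(struct toy_task *holder,
                      struct toy_task *blocked, long delta)
{
    holder->cpu = blocked->cpu;    /* reduces the problem to the UP case */
    blocked->budget_left -= delta; /* the requestor pays for the holder */
}
```

This also shows the downside Henrik mentions: the holder's progress is paid for out of another thread's budget, which is what makes a single thread harder to analyze in isolation.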
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
2009-07-11 18:28 ` Peter Zijlstra
2009-07-12 2:40 ` Douglas Niehaus
2009-07-12 6:17 ` Henrik Austad
@ 2009-07-13 9:55 ` Raistlin
2009-07-13 10:14 ` Peter Zijlstra
2 siblings, 1 reply; 82+ messages in thread
From: Raistlin @ 2009-07-13 9:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Douglas Niehaus, Ted Baker, Dhaval Giani

On Sat, 2009-07-11 at 20:28 +0200, Peter Zijlstra wrote:
> > There are still some issues left to solve, for instance how to best
> > handle sporadic tasks, and whether or not deadline-misses should be
> > allowed, or just 'bounded deadline tardiness'. Either way, EFF should
> > be able to handle it. Then, there are problems concerning blocking of
> > tasks. One solution would be BWI or PEP, but I have not had the time
> > to read properly through those, but from what I've gathered a
> > combination of BWI and PEP looks promising (anyone with good info
> > about BWI and PEP - feel free to share! (-: ).
>
> Our SSSUP friends have a BWI paper here:
>
> http://retis.sssup.it/~tommaso/publications/OSPERT-2008.pdf

And here we are! :-)

The paper Peter pointed you to mainly describes the work I did some
months ago to implement BandWidth Inheritance inside one of our
real-time Linux variants (ReTiS Lab, in Pisa, Italy)... Feel free to ask
anything related to it directly to me.

It is implemented exactly as a "proxy execution" protocol, and things
were easy there, since --for now-- the framework I was talking about is
UP-only! :-( Now we are back at work on it, especially thinking about
how to extend the protocol to SMP architectures...

> Thing is, both BWI and PEP seems to work brilliantly on Uni-Processor
> but SMP leaves things to be desired.
> Dhaval is currently working on a
> PEP implementation that will migrate all the blocked tasks to the
> owner's cpu, basically reducing it to the UP problem.

Nice... Only one question: doesn't this clash with task-affinity
related issues?

regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)
http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
2009-07-13 9:55 ` Raistlin
@ 2009-07-13 10:14 ` Peter Zijlstra
2009-07-13 16:06 ` Raistlin
0 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2009-07-13 10:14 UTC (permalink / raw)
To: Raistlin
Cc: Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Douglas Niehaus, Ted Baker, Dhaval Giani

On Mon, 2009-07-13 at 11:55 +0200, Raistlin wrote:
> > Thing is, both BWI and PEP seems to work brilliantly on Uni-Processor
> > but SMP leaves things to be desired. Dhaval is currently working on a
> > PEP implementation that will migrate all the blocked tasks to the
> > owner's cpu, basically reducing it to the UP problem.
>
> Nice... Only one question: doesn't this clash with task-affinity
> related issues?

No :-), well maybe.

Since the task is blocked and will thus not actually run, we can place
it on any CPU; only once it becomes runnable again should we take care
to place it on a CPU that's part of its affinity mask.

Of course, underlying this is the question of what to do for
'partitioned' tasks sharing a resource. I think we've talked about this
a few times. Since these 'partitioned' tasks do share a resource,
they're not really all that partitioned, and I think the best way to
keep the system going is to simply Inherit/Donate/Proxy right through
the partition.

~ Peter

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
2009-07-13 10:14 ` Peter Zijlstra
@ 2009-07-13 16:06 ` Raistlin
2009-07-14 8:42 ` Peter Zijlstra
0 siblings, 1 reply; 82+ messages in thread
From: Raistlin @ 2009-07-13 16:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Douglas Niehaus, Ted Baker, Dhaval Giani

On Mon, 2009-07-13 at 12:14 +0200, Peter Zijlstra wrote:
> > Nice... Only one question: doesn't this clash with task-affinity
> > related issues?
>
> No :-), well maybe.

:-D

> Since the task is blocked and will thus not actually run, we can place
> it on any CPU; only once it becomes runnable again should we take care
> to place it on a CPU that's part of its affinity mask.

Well, it's difficult to find an argument against this! :-)

Anyway, maybe if, on some architecture, for some kind of application,
the affinity has been set to preserve some kind of actual cache or
memory locality for the task's access pattern, this could be an issue,
couldn't it? :-) I mean, in some cases, being sure of having a task
running on a particular CPU is somehow of paramount importance...

Anyway, I know, I'm just wandering around; I'm not that "affinity
expert" and I'm far from having a real example of the scenario I tried
to describe! :-(

> Of course, underlying this is the question of what to do for
> 'partitioned' tasks sharing a resource. I think we've talked about this
> a few times.

Yes, there is a lot of discussion in the literature about partitioned
task sets where resource sharing crosses the partition...

> Since these 'partitioned' tasks do share a resource, they're not really
> all that partitioned,

... and I definitely agree with you on that: where's the partitioning?
Anyway, I also think that, even if that is true for user-space
locks/resources, there are resources that two tasks _have_ to share even
while partitioned, e.g., when they come to compete, in kernel space, for
a lock on the virtual terminal... And that's the only situation where
such a problem makes sense.

This is just to point out one more reason why we are trying something
different (as explained in the other message), and I am light years away
from saying that the approach you proposed is not a good one! On the
contrary, I think all this makes the problem even more interesting! :-)

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)
http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
2009-07-13 16:06 ` Raistlin
@ 2009-07-14 8:42 ` Peter Zijlstra
2009-07-14 9:36 ` Raistlin
0 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2009-07-14 8:42 UTC (permalink / raw)
To: Raistlin
Cc: Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Douglas Niehaus, Ted Baker, Dhaval Giani

On Mon, 2009-07-13 at 18:06 +0200, Raistlin wrote:
> Anyway, maybe if, on some architecture, for some kind of application,
> the affinity has been set to preserve some kind of actual cache or
> memory locality for the task's access pattern, this could be an issue,
> couldn't it? :-)
> I mean, in some cases, being sure of having a task running on a
> particular CPU is somehow of paramount importance...

Right, and you answered your own question :-); it's _running_ that is
important, so as long as it's blocked (not running), you're free to
place the task on another cpu if that helps out with something.

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
2009-07-14 8:42 ` Peter Zijlstra
@ 2009-07-14 9:36 ` Raistlin
0 siblings, 0 replies; 82+ messages in thread
From: Raistlin @ 2009-07-14 9:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Henrik Austad, LKML, Ingo Molnar, Bill Huey, Linux RT, Fabio Checconi, James H. Anderson, Thomas Gleixner, Douglas Niehaus, Ted Baker, Dhaval Giani

On Tue, 2009-07-14 at 10:42 +0200, Peter Zijlstra wrote:
> On Mon, 2009-07-13 at 18:06 +0200, Raistlin wrote:
> > Anyway, maybe if, on some architecture, for some kind of application,
> > the affinity has been set to preserve some kind of actual cache or
> > memory locality for the task's access pattern, this could be an issue,
> > couldn't it? :-)
> > I mean, in some cases, being sure of having a task running on a
> > particular CPU is somehow of paramount importance...
>
> Right, and you answered your own question :-); it's _running_ that is
> important, so as long as it's blocked (not running), you're free to
> place the task on another cpu if that helps out with something.

Yep! Re-reading both your and my comments, I see I misunderstood your
point! :-( I agree that you have to move some task, and moving the
"blocked" ones would allow the lock owner to continue running in its
place, which sounds good to me too. :-)

Sorry!
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
----------------------------------------------------------------------
Dario Faggioli, ReTiS Lab, Scuola Superiore Sant'Anna, Pisa (Italy)
http://blog.linux.it/raistlin / raistlin@ekiga.net / dario.faggioli@jabber.org

^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: RFC for a new Scheduling policy/class in the Linux-kernel
@ 2009-07-16 17:54 Raj Rajkumar
0 siblings, 0 replies; 82+ messages in thread
From: Raj Rajkumar @ 2009-07-16 17:54 UTC (permalink / raw)
To: linux-kernel; +Cc: Raj Rajkumar

Dear Jim, Ted and others:

Let me first introduce myself. I work on real-time systems at Carnegie
Mellon University and am one of the authors of the papers that
introduced the priority inheritance and priority ceiling protocols, and
some of their follow-ons. I was also a co-founder of TimeSys in 1996. My
PhD student, Karthik Lakshmanan, an SEI Sr. Staff Member, Dr. Dionisio
de Niz, and I have begun to revisit multiprocessor scheduling and
synchronization, thanks in part to the surging interest in multicore
processors. I just came to know about this message thread. I hope, as
Jim did, that I can add something to the ongoing discussion - if not,
please bear with me.

First of all, let me agree completely with Jim that simplicity goes a
very long way. Two relevant examples follow. First, fixed-priority
scheduling has dominated most real-time systems (over dynamic-priority
scheduling) since (a) the practical performance differences are not
significant, (b) overheads are lower, and (c) perhaps above all, less
information is required. Secondly, priority ceiling/ceiling emulation
protocols have much better properties than basic priority inheritance
protocols, but the latter are more widely used. The reason appears to be
that PCP requires additional information, which can also change at
run-time. Overall, simplicity does win.

Secondly, let me fully agree with Ted that Linux-RT extensions should
support bandwidth control & isolation. My group's work on "reserves" and
"resource kernels" looked at isolating both temporal and spatial
resources (please see
http://www.ece.cmu.edu/~raj/publications.html#ResourceKernels) in the
context of Linux. Karthik's work extended Linux/RK to distributed
systems (Distributed Linux/RK).
Thirdly, I also agree with Jim that non-preemptive locking and FIFO
queueing are fine for mutexes in the kernel. The primary reason is that
critical sections in the kernel are written by experts and tend to be
short. And as it turns out, this is exactly how Linux SMP support has
worked for quite a few years. It looks to me like Jim and Bjoern name
the kernel-mutex locking scheme (of non-preemption and FIFO queueing)
FMLP and advocate it for user-level mutexes. Jim: please correct me if
my interpretation is incorrect.

I would like to propose a solution with 2 components. First, priority
inheritance protocols not only prevent (potentially) unbounded priority
inversion but also offer less blocking for higher-priority tasks with
typically tighter timing constraints. It is also well known that
non-preemptive execution is an extreme/simplified form of the priority
ceiling protocol, where every task is assumed to access every shared
resource/mutex and hence every priority ceiling is the highest priority
in the system. There are cases when this is fine (e.g. when all critical
sections are *relatively* small, as in the Linux kernel) and there are
cases when this is not (e.g. when some user-level critical sections
accessed only by lower-criticality tasks are *relatively* long compared
to the timing constraints of higher-priority tasks).

Component 1: Let a priority ceiling be associated with each user-level
mutex. When the mutex is locked, the corresponding critical section is
executed at the priority ceiling value. The developer can choose this to
be the highest priority in the system, in which case it becomes a
non-preemptive critical section. In addition, we could allow mutexes to
pick either basic priority inheritance (desirable for local mutexes?) or
the priority ceiling version (desirable for global mutexes shared across
processors/cores).

Component 2: The queueing discipline for tasks pending on locked mutexes
is the second policy under discussion.
Jim argues that it should be FIFO. But imagine a bunch of lower-priority
tasks queued in front of a higher-priority task: the "priority
inversion" time can be considerable. (Some, including me, would argue
that this goes against the grain of using priority-based scheduling for
real-time systems in the first place. Why not just use FIFO scheduling?)
However, a practical compromise is easy to reach as well. Let the
queueing policy (FIFO or priority-based) on mutexes be an option. FIFO
for a "local" mutex would not be very helpful, and priority queueing for
a "global" mutex would help tasks with tight timing constraints.

Proposal Summary

1. Associate a priority ceiling (the priority at which critical sections
are executed) with each (global) mutex. A macro HIGHEST_CEILING could
represent a value that is higher than any other application-level
priority.

2. Allow the developer to specify the queueing discipline on a (global)
mutex. MUTEX_FIFO_QUEUEING and MUTEX_PRIORITY_QUEUEING are allowed
options.

I would appreciate any comments. Thanks for reading through a lot (if
you reached this far :-)

---
Raj
http://www.ece.cmu.edu/~raj

P.S. Some relevant references from a couple of decades back.

[1] Rajkumar, R., "Real-Time Synchronization Protocols for Shared Memory
Multiprocessors". The Tenth International Conference on Distributed
Computing Systems, 1990.

[2] Rajkumar, R., Sha, L., and Lehoczky, J.P., "Real-Time
Synchronization Protocols for Multiprocessors". Proceedings of the IEEE
Real-Time Systems Symposium, December 1988, pp. 259-269.

Global locking is costly - global critical sections could be moved to a
"global synchronization processor" (as described in one of the articles
above).

^ permalink raw reply [flat|nested] 82+ messages in thread
end of thread, other threads:[~2009-07-18 20:20 UTC | newest]
Thread overview: 82+ messages
-- links below jump to the message on this page --
2009-07-10 21:50 RFC for a new Scheduling policy/class in the Linux-kernel Henrik Austad
2009-07-11 18:28 ` Peter Zijlstra
2009-07-12 2:40 ` Douglas Niehaus
2009-07-12 15:31 ` Peter Zijlstra
2009-07-13 15:44 ` Raistlin
2009-07-13 16:33 ` Chris Friesen
2009-07-14 10:47 ` Raistlin
2009-07-14 11:03 ` Peter Zijlstra
2009-07-14 18:19 ` Raistlin
2009-07-14 14:48 ` Chris Friesen
2009-07-14 15:19 ` James H. Anderson
2009-07-14 16:31 ` Peter Zijlstra
2009-07-14 16:54 ` James H. Anderson
2009-07-14 19:28 ` Henrik Austad
2009-07-14 19:33 ` James H. Anderson
2009-07-15 21:53 ` Ted Baker
2009-07-17 7:40 ` Henrik Austad
2009-07-17 13:37 ` Ted Baker
2009-07-15 4:25 ` Bjoern B. Brandenburg
2009-07-15 20:55 ` Ted Baker
2009-07-15 21:53 ` Chris Friesen
2009-07-15 22:34 ` Ted Baker
2009-07-15 22:39 ` Dhaval Giani
2009-07-15 23:16 ` Ted Baker
2009-07-16 8:58 ` Peter Zijlstra
2009-07-16 9:11 ` Peter Zijlstra
2009-07-17 0:32 ` Raistlin
2009-07-17 0:43 ` Raistlin
2009-07-16 12:17 ` Raistlin
2009-07-16 23:29 ` Raistlin
2009-07-18 20:12 ` Michal Sojka
2009-07-14 17:16 ` James H. Anderson
2009-07-15 21:19 ` Ted Baker
2009-07-14 19:54 ` Raistlin
2009-07-14 16:48 ` Raistlin
2009-07-14 18:24 ` Chris Friesen
2009-07-14 19:14 ` Raistlin
2009-07-15 22:14 ` Ted Baker
2009-07-16 7:17 ` Henrik Austad
2009-07-16 23:13 ` Ted Baker
2009-07-17 0:19 ` Raistlin
2009-07-17 7:31 ` Henrik Austad
2009-07-16 14:46 ` Chris Friesen
2009-07-16 22:34 ` Ted Baker
2009-07-16 23:07 ` Raistlin
2009-07-15 21:45 ` Ted Baker
2009-07-15 22:12 ` Chris Friesen
2009-07-15 22:52 ` Ted Baker
2009-07-17 13:35 ` Giuseppe Lipari
2009-07-13 17:25 ` Peter Zijlstra
2009-07-13 18:14 ` Noah Watkins
2009-07-13 20:13 ` Ted Baker
2009-07-13 21:45 ` Chris Friesen
2009-07-14 11:16 ` Raistlin
2009-07-15 23:11 ` Ted Baker
2009-07-16 7:58 ` Peter Zijlstra
2009-07-16 8:52 ` Thomas Gleixner
2009-07-16 12:17 ` Raistlin
2009-07-16 12:59 ` James H. Anderson
2009-07-16 13:37 ` Peter Zijlstra
2009-07-16 22:15 ` Ted Baker
2009-07-16 22:34 ` Karthik Singaram Lakshmanan
2009-07-16 23:38 ` Ted Baker
2009-07-17 1:44 ` Karthik Singaram Lakshmanan
2009-07-16 15:17 ` Chris Friesen
2009-07-16 21:26 ` Ted Baker
2009-07-16 22:08 ` Chris Friesen
2009-07-16 23:54 ` Ted Baker
2009-07-14 9:15 ` Peter Zijlstra
2009-07-14 19:07 ` Raistlin
2009-07-13 17:28 ` Peter Zijlstra
2009-07-14 19:47 ` Raistlin
[not found] ` <002301ca0403$47f9d9d0$d7ed8d70$@tlh@comcast.net>
2009-07-13 23:47 ` Douglas Niehaus
2009-07-14 7:27 ` Chris Friesen
2009-07-14 7:44 ` Douglas Niehaus
2009-07-12 6:17 ` Henrik Austad
2009-07-13 9:55 ` Raistlin
2009-07-13 10:14 ` Peter Zijlstra
2009-07-13 16:06 ` Raistlin
2009-07-14 8:42 ` Peter Zijlstra
2009-07-14 9:36 ` Raistlin
-- strict thread matches above, loose matches on Subject: below --
2009-07-16 17:54 Raj Rajkumar